Text-to-speech framework supporting inaudible watermarks

Document No.: 617692  Publication date: 2021-05-07

Note: This technology, Text-to-speech framework supporting inaudible watermarks, was designed and created by 平伟, 仲震宇, 程越强, 李幸, and 韦韬 on 2020-06-15. Abstract: According to various embodiments, an end-to-end TTS framework may integrate the watermarking process into the training of the TTS framework, which enables the watermark to be imperceptible within the synthesized/cloned audio segments generated by the TTS framework. A watermark added in this manner is statistically undetectable, which prevents unauthorized removal. According to an exemplary method of training a TTS framework, a TTS neural network model and a watermark neural network model in the TTS framework are trained in an end-to-end manner, wherein the watermark is part of the optimization process of the TTS framework. During training, neuron values of the TTS neural network model are adjusted based on training data to prepare one or more spaces for adding a watermark in a synthesized audio segment to be generated by the TTS framework. In response to the neuron value adjustments in the TTS neural network model, neuron values of the watermark neural network model are adjusted accordingly to add the watermark to the one or more prepared spaces.

1. A computer-implemented method of training a text-to-speech (TTS) framework, the method comprising:

receiving, at a TTS framework, a set of training data for training the TTS framework to generate a watermarked synthesized audio segment, wherein the TTS framework includes a TTS neural network model and a watermark neural network model;

adjusting neuron values of the TTS neural network model to prepare one or more spaces in a synthesized audio segment generated by the TTS framework for adding the watermark; and

adjusting neuron values of the watermark neural network model to add the watermark to the one or more prepared spaces.

2. The method of claim 1, wherein the TTS framework is trained end-to-end using the set of training data, comprising training the TTS neural network model and the watermark neural network model together.

3. The method of claim 1, wherein the watermark neural network model is a reversible neural network that provides a one-to-one mapping between input audio segments and watermarked audio segments.

4. The method of claim 1, wherein the neuron values in each of the TTS neural network model and the watermark neural network model comprise weights, biases, and activation functions.

5. The method of claim 4, wherein the neuron values of the TTS neural network are adjusted during the training of the TTS framework such that the watermark added to the one or more spaces is inaudible in the synthesized audio segment generated by the TTS framework.

6. The method of claim 5, wherein adding the watermark is performed by a plurality of neuron layers associated with weights, biases, and activation functions in the watermark neural network model.

7. The method of claim 1, wherein the TTS framework is trained to generate the synthesized audio segment including one or more speech phrases that overlap with a speech phrase representing the watermark such that the one or more speech phrases overlay the watermark speech phrase.

8. The method of claim 7, wherein one or more physical attributes associated with the one or more speech phrases are altered during the training of the TTS framework to overlay the watermark speech phrase.

9. The method of claim 8, wherein altering the one or more physical attributes of the one or more speech phrases comprises altering a length of each of the one or more speech phrases such that each speech phrase overlays the watermark speech phrase.

10. A computer-implemented method of verifying a watermarked audio segment, comprising:

obtaining an original audio segment from a watermarked audio segment using a neural network model based on proprietary information, wherein the neural network model is part of a synthesis component used to generate the watermarked audio segment;

obtaining an actual watermark embedded in the watermarked audio segment based on a comparison between the watermarked audio segment and the original audio segment; and

determining whether the watermarked audio segment was generated by the synthesis component by comparing the actual watermark to a predetermined watermark used to train the synthesis component.

11. The method of claim 10, wherein the neural network model is a reversible neural network model, and wherein the original audio segment is stripped of any watermark embedded therein.

12. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform the method of any of claims 1 to 9 or the method of any of claims 10 to 11.

13. A data processing system comprising:

one or more processors; and

a non-transitory computer readable medium comprising one or more sets of instructions which, when executed by at least one of the one or more processors, cause performance of the steps of the method of any one of claims 1-9 or cause performance of the steps of the method of any one of claims 10-11.

Technical Field

Embodiments of the present disclosure generally relate to neural network based speech synthesis. More particularly, embodiments of the present disclosure relate to a Text To Speech (TTS) framework for adding an inaudible watermark.

Background

Neural network based speech synthesis (also known as text-to-speech) has achieved human-like, high-fidelity speech and has successfully produced different voices within a single text-to-speech (TTS) model. Because the synthesized voices produced by such models are indistinguishable from actual human voices, the models may be used for malicious purposes, such as synthesizing hate speech.

Some companies have used watermarking techniques to verify whether synthesized audio was generated by a specific TTS model, in order to prevent malicious voice cloning and to enforce their copyrights. However, under existing solutions, watermarks are typically added as part of the post-processing of the synthesized audio sample, which can easily be bypassed or forged. Furthermore, the watermark typically adds extra signal/noise to the synthesized audio sample, which makes the watermarked audio less user friendly.

Disclosure of Invention

In a first aspect, there is provided a computer-implemented method of training a text-to-speech (TTS) framework, the method comprising:

receiving, at a TTS framework, a set of training data for training the TTS framework to generate a watermarked synthesized audio segment, wherein the TTS framework includes a TTS neural network model and a watermark neural network model;

adjusting neuron values of the TTS neural network model to prepare one or more spaces in a synthesized audio segment generated by the TTS framework for adding the watermark; and

adjusting neuron values of the watermark neural network model to add the watermark to the one or more prepared spaces.

In a second aspect, there is provided a computer-implemented method of verifying a watermarked audio segment, comprising:

obtaining an original audio segment from a watermarked audio segment using a neural network model based on proprietary information, wherein the neural network model is part of a synthesis component used to generate the watermarked audio segment;

obtaining an actual watermark embedded in the watermarked audio segment based on a comparison between the watermarked audio segment and the original audio segment; and

determining whether the watermarked audio segment was generated by the synthesis component by comparing the actual watermark to a predetermined watermark used to train the synthesis component.

In a third aspect, there is provided a non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform the method of the first aspect or the method of the second aspect.

In a fourth aspect, there is provided a data processing system comprising:

one or more processors; and

a non-transitory computer readable medium comprising one or more sets of instructions which, when executed by at least one of the one or more processors, cause performance of the steps of the method according to the first aspect or cause performance of the steps of the method according to the second aspect.

According to embodiments of the present disclosure, watermarking may be integrated into the training of the TTS framework, which enables the watermark to be imperceptible within the synthesized/cloned audio segments generated by the TTS framework. A watermark added in this manner is statistically undetectable, which prevents unauthorized removal.

Drawings

Embodiments of the present disclosure are illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements.

Fig. 1 illustrates an example text-to-speech (TTS) framework in accordance with an embodiment.

FIG. 2 illustrates an example system for training a TTS synthesis component, according to an embodiment.

Fig. 3 illustrates an example neural TTS subcomponent, according to an embodiment.

Fig. 4 illustrates an example space in a synthesized audio segment generated by a synthesis component according to an embodiment.

Fig. 5 illustrates a watermark verification component according to an embodiment.

FIG. 6 illustrates an example process of training a TTS synthesis component according to an embodiment.

Fig. 7 illustrates an example process of verifying a synthesized audio segment according to an embodiment.

FIG. 8 shows an example of a data processing system according to one embodiment.

Detailed Description

Various embodiments and aspects of the disclosure will be described with reference to details discussed below, and the accompanying drawings will illustrate the various embodiments. The following description and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure. However, in certain instances, well-known or conventional details are not described in order to provide a concise discussion of embodiments of the present disclosure.

Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the disclosure. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.

According to various embodiments, the end-to-end TTS framework may integrate watermarking into the training of the TTS framework, which enables the watermark to be imperceptible within the synthesized/cloned audio segments generated by the TTS framework. A watermark added in this manner is statistically undetectable, which prevents unauthorized removal.

According to an exemplary method of training a TTS framework, a TTS neural network model and a watermark neural network model in the TTS framework are trained together in an end-to-end manner. During training, neuron values of a TTS neural network model are adjusted based on a set of training data to prepare one or more spaces in a synthesized audio segment to be generated by the TTS framework for watermarking. In response to neuron value adjustments in the TTS neural network model, neuron values of the watermark neural network model are adjusted accordingly to add the watermark to the one or more prepared spaces.

In one embodiment, the watermark neural network model is a reversible neural network that provides a one-to-one mapping between input audio segments and watermarked audio segments. In one embodiment, the neuron values in each of the TTS neural network model and the watermark neural network model comprise weights, biases, and activation functions. Neuron values of the TTS neural network are adjusted during training of the TTS framework such that the watermark added to the one or more spaces is inaudible in the synthesized audio segment generated by the TTS framework. Watermarking is performed by multiple layers of neurons associated with weights, biases, and activation functions in the watermark neural network model.

In one embodiment, the TTS framework may generate a synthesized audio segment that includes one or more speech phrases that overlap with the speech phrase representing the watermark, such that the one or more speech phrases overlay the watermark speech phrase. During training of the TTS framework, one or more physical attributes associated with the speech phrases may be altered so that the speech phrases overlay the watermark speech phrase.

According to another embodiment, a method of verifying a watermarked audio segment may comprise the operations of: receiving the watermarked audio segment and the proprietary information; and obtaining the original audio segment from the watermarked audio segment using a neural network model based on the proprietary information, the neural network model being part of a synthesis component used to generate the watermarked audio segment. The method further comprises an operation of obtaining an actual watermark embedded in the watermarked audio segment based on a comparison between the watermarked audio segment and the original audio segment. By comparing the actual watermark with the predetermined watermark used to train the synthesis component, the method can determine whether the watermarked audio segment was generated by the synthesis component.

Fig. 1 illustrates an example Text-to-Speech (TTS) framework in accordance with an embodiment. As shown in FIG. 1, TTS framework 103 may be provided in cloud environment 101 to end users who may access speech synthesis functions via a set of Application Programming Interfaces (APIs).

A synthesis component 115 in the cloud environment 101 can be invoked via the API to generate synthesized speech from the text, the synthesized speech having one or more predetermined watermarks embedded in the synthesis component 115 during training of the component. The synthesis component 115 may include a neural TTS subcomponent 117 and a watermarking subcomponent 119, each of which may be a trained neural network model.

In one embodiment, the neural TTS subcomponent 117 may be any end-to-end neural network model for speech synthesis, and the watermarking subcomponent 119 may be an invertible neural network that provides a one-to-one mapping between input audio segments and watermarked audio output.
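
To make the one-to-one mapping concrete, below is a minimal sketch, assuming additive coupling layers of the kind used in flow-based models; it shows one way such an invertible network could be realized and is not the patent's actual implementation. The class names (`AdditiveCoupling`, `InvertibleWatermarker`) and all sizes are illustrative.

```python
# Minimal sketch (assumed, not the patent's implementation) of an invertible
# watermarking network built from additive coupling layers: y1 = x1,
# y2 = x2 + f(x1), which is exactly invertible via x2 = y2 - f(y1).
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, half_dim: int, hidden: int = 64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(half_dim, hidden), nn.ReLU(), nn.Linear(hidden, half_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1, x2 + self.f(x1)], dim=-1)

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        y1, y2 = y.chunk(2, dim=-1)
        return torch.cat([y1, y2 - self.f(y1)], dim=-1)

class InvertibleWatermarker(nn.Module):
    """Stack of coupling layers; inverse() exactly recovers the input audio.
    A real flow model would also interleave permutations between layers."""
    def __init__(self, frame_dim: int, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AdditiveCoupling(frame_dim // 2) for _ in range(n_layers)])

    def forward(self, audio_frames: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            audio_frames = layer(audio_frames)
        return audio_frames

    def inverse(self, watermarked: torch.Tensor) -> torch.Tensor:
        for layer in reversed(self.layers):
            watermarked = layer.inverse(watermarked)
        return watermarked

# Round trip: the inverse recovers the original frames up to float error.
wm = InvertibleWatermarker(frame_dim=128)
x = torch.randn(8, 128)
assert torch.allclose(wm.inverse(wm(x)), x, atol=1e-4)
```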

The watermarking subcomponent 119 is trained to add a watermark to the synthesized audio segment. However, the watermarking subcomponent 119 adds the watermark during training of the synthesis component 115, rather than as part of post-processing of the synthesized audio segment; that is, the watermark is part of the optimization process during training of TTS framework 103.

With the above features, the watermarking process can be integrated into the speech synthesis process, which enables the watermark to be imperceptible within the synthesized/cloned audio segment. Watermarks added in this manner are statistically undetectable, which prevents unauthorized removal, and are robust to audio manipulations and other processing operations, such as noise, compression, over-the-air playback, and so forth. As an illustrative example, the watermark in such a synthesized audio segment cannot be removed by playing the audio segment over the air and recording it; the recorded audio segment will still carry the watermark.

Furthermore, the use of the reversible neural network model 121 makes it easy to extract the watermark in order to verify whether a watermarked audio segment was generated by the TTS framework 103, so that copyright ownership can be verified.

FIG. 2 illustrates an example system for training a TTS synthesis component, according to an embodiment. As depicted in fig. 1, each of the neural TTS subcomponent 117 and the watermarking subcomponent 119 may be a neural network model. Neural network models typically include a collection of connected neurons. The neurons may be fully connected, where each neuron in one layer is connected to each neuron in the next layer with parameters (e.g., weights and biases).

During training of a neural network model, gradient descent with backpropagation may be used to determine a set of parameters that minimizes the difference between the expected values and the actual outputs of the neural network model. Gradient descent includes the steps of calculating the gradient of a loss/error function and updating the existing parameters in response to the gradient. This loop may be repeated until a minimum of the loss function is reached.
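
As a reference for the loop just described, here is a generic sketch in PyTorch; `model`, `loss_fn`, and `batches` are placeholders, not names from this disclosure.

```python
# Generic gradient-descent sketch of the loop described above; all names
# (`model`, `loss_fn`, `batches`) are placeholders, not from this disclosure.
import torch

def train(model, loss_fn, batches, lr=1e-3, epochs=10):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, targets in batches:
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)  # expected vs. actual output
            loss.backward()   # backpropagation: compute gradient of the loss
            opt.step()        # update parameters in response to the gradient
    return model
```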

Referring back to fig. 2, instead of the neural TTS subcomponent 117 and the watermarking subcomponent 119 in the synthesis component 115 each being trained independently, the entire synthesis component 115 is trained end-to-end as a single unit.

As shown in fig. 2, there may be continued interaction between the two subcomponents (neural TTS subcomponent 117 and watermarking subcomponent 119) during training of the synthesis component 115. Each subcomponent may have its own loss function. The neural TTS subcomponent 117 may have losses from the decoder and the vocoder used to synthesize high-fidelity sound. The watermarking subcomponent 119, being a reversible neural network, may have a perceptual loss for penalizing deviations from the synthesized high-fidelity sound.
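
A plausible composition of these losses into a single end-to-end objective is sketched below; the specific terms and the 0.1 weighting are assumptions, and a production system would likely use a psychoacoustic perceptual distance rather than the simple MSE stand-in.

```python
# Assumed sketch of an end-to-end objective: decoder and vocoder losses from
# the neural TTS subcomponent plus a perceptual loss penalizing audible
# deviation introduced by the watermarking subcomponent. The simple MSE
# perceptual term and the 0.1 weight are illustrative stand-ins.
import torch.nn.functional as F

def synthesis_loss(mel_pred, mel_target, wave_pred, wave_target,
                   clean_audio, watermarked_audio, perceptual_weight=0.1):
    decoder_loss = F.mse_loss(mel_pred, mel_target)    # spectrogram prediction
    vocoder_loss = F.l1_loss(wave_pred, wave_target)   # waveform reconstruction
    perceptual_loss = F.mse_loss(watermarked_audio, clean_audio)
    return decoder_loss + vocoder_loss + perceptual_weight * perceptual_loss
```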

In one embodiment, the interaction between the two subcomponents 117 and 119 may represent a synergy between the two subcomponents during training, wherein errors in one subcomponent are corrected by the other subcomponent.

During training of the synthesis component, the input data set 203 and the proprietary information 204 are provided as input to the synthesis component 115. The input data set 203 may comprise a plurality of samples, each sample representing a text/audio pair. The proprietary information 204 may include any information related to the watermark to be added to the synthesized audio segment that will be generated by the synthesis component 115 after it is trained.

Each input sample may be provided as an input to the neural TTS subcomponent 117, which includes initial neuron values in its layers. Examples of neuron values include weights, biases, and associated activation functions. As each input sample passes through the neural TTS subcomponent 117, the initial neuron values may be updated accordingly.

In one embodiment, the output of the neural TTS subcomponent 117 may be a set of neuron outputs 205, which may be fed into the watermarking subcomponent 119. Neuron values in each layer of the watermarking subcomponent 119 may also be updated in response to the updated neuron values received from the neural TTS subcomponent 117.

Based on the loss function computed over a collection of input data, gradient values 206 are propagated backward through the layers of the synthesis component 115. Based on the gradient values calculated for each layer, the weights of each layer of the synthesis component 115 are updated accordingly. The above process may be repeated until the loss of the entire synthesis component 115 converges.

From the perspective of the neural network architecture, the watermark is represented by multiple layers of neurons associated with weight parameters and activation functions. Such a representation may be obtained through various transformations, which offer different security levels. Examples include a plain-text token, which has weak protection; a hash token, which also has weak protection; a symmetric or asymmetric cryptographic token, which is a more secure way to protect the watermark from forgery; and a signature token, which protects the watermark from forgery even more securely than a symmetric or asymmetric cryptographic token.
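
The sketch below illustrates the first three transformation levels using Python's standard library; the watermark bytes and secret key are hypothetical.

```python
# Illustrative sketch of watermark-token transformations at increasing
# security levels; the watermark bytes and secret key are hypothetical.
import hashlib
import hmac

watermark = b"company-watermark-v1"
secret_key = b"proprietary-secret"  # held privately by the model owner

plain_token = watermark                          # plain text: weak protection
hash_token = hashlib.sha256(watermark).digest()  # hash: still weak, anyone can recompute
mac_token = hmac.new(secret_key, watermark, hashlib.sha256).digest()  # keyed, harder to forge
# A signature token (the strongest level above) would instead sign the
# watermark with an asymmetric private key, e.g. Ed25519 via a crypto
# library, so holders of the public key can verify but cannot forge it.
```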

Once the synthesis component 115 is trained, input text can be passed through the trained model in a forward pass. The trained model 115 may generate audio segments that include the watermark embedded during the training phase of the synthesis component 115. The watermark is inaudible and imperceptible, and it cannot be removed without using a verification component that implements the same reversible neural network model 121 as the watermarking subcomponent 119.

Fig. 3 illustrates an example neural TTS subcomponent, according to an embodiment. In one embodiment, the example neural TTS subcomponent 117 may include a plurality of networks, such as an encoder network 305, a decoder network 309, an attention network 307, and a vocoder network 311. The neural TTS subcomponent 117 may learn the alignment between the input text 301 and its intermediate representation (e.g., a mel spectrogram) 315 through the attention network 307.

The encoder network 305 encodes character embeddings into a hidden feature representation. The attention network 307 may consume the output of the encoder network 305 to produce a fixed-length context vector for each decoder output step. The decoder network 309 may be an autoregressive recurrent neural network that consumes the output of the attention network 307 and predicts a sequence of spectrogram frames from the hidden feature representation. The vocoder 311, used to synthesize a human voice signal from the spectrogram, may be a deep neural network that generates time-domain waveforms.

As an illustration of the synthesis process, the input text 301 may be converted by the example neural TTS subcomponent 117 into character embeddings, which are numerical representations of the characters. The character embeddings may then be fed into an encoder-attention-decoder architecture, which constitutes a recurrent sequence-to-sequence feature prediction network that maps the character embeddings to a sequence of predicted spectrograms. The spectrograms are then fed to the vocoder 311, which creates a time-domain waveform (i.e., speech) as the output audio segment 313.
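
The skeleton below traces this data flow end to end. It is a drastically simplified stand-in with all sizes assumed: self-attention replaces location-sensitive attention, a linear layer replaces the neural vocoder, and there is no autoregression.

```python
# Shape-level sketch of the character-embedding -> encoder -> attention ->
# decoder -> vocoder pipeline described above; simplified and assumed,
# not the patent's architecture.
import torch
import torch.nn as nn

class NeuralTTSSketch(nn.Module):
    def __init__(self, vocab=256, emb=128, mel_bins=80, hop=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)              # character embeddings
        self.encoder = nn.GRU(emb, emb, batch_first=True)  # hidden feature representation
        self.attn = nn.MultiheadAttention(emb, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(emb, emb, batch_first=True)
        self.to_mel = nn.Linear(emb, mel_bins)             # predicted spectrogram frames
        self.vocoder = nn.Linear(mel_bins, hop)            # stand-in for a vocoder network

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(self.embed(char_ids))  # encode character embeddings
        ctx, _ = self.attn(h, h, h)                # context vectors over encoder output
        d, _ = self.decoder(ctx)
        mel = self.to_mel(d)                       # (batch, frames, mel_bins)
        return self.vocoder(mel).flatten(1)        # time-domain waveform sketch

# e.g. NeuralTTSSketch()(torch.randint(0, 256, (1, 32))) -> (1, 8192) waveform
```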

Fig. 4 illustrates an example space in a synthesized audio segment generated by a synthesis component according to an embodiment. As shown in FIG. 4, once the synthesis component 115 has been trained, it may generate synthesized audio segments carrying a predetermined watermark that was embedded into the trained synthesis component during the training phase. The watermark is inaudible and imperceptible and cannot be removed without authorization.

In one embodiment, the watermark in the synthesized audio segment generated by the synthesis component 115 is inaudible because it is added to spaces where it is covered by speech phrases. The spaces are identified and prepared during the training phase by appropriately adjusting the neuron values of one or more layers of the neural TTS subcomponent 117 and the neuron values of one or more layers of the watermarking subcomponent 119.

As shown in FIG. 4, the watermark 401 may be added to the space occupied by speech phrase A 403 and to another space occupied by speech phrase B 407. Each space is selected based on one or more of its physical attributes, such as frequency band, loudness, or pitch, so that the watermark 401, when added to those spaces, will be inaudible to the normal human ear.

In one embodiment, a speech phrase (e.g., speech phrase B 407) may be intentionally spoken at a slower speed in the audio segment so that it overlaps the watermark, allowing the louder speech phrase to overlay the watermark 401.
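
As a rough numerical illustration of this masking idea (not the trained behavior of the framework), the sketch below scales a watermark signal so its loudness stays a fixed margin below the overlapping speech phrase; the 20 dB margin and the RMS loudness proxy are assumptions.

```python
# Rough sketch of the masking idea: mix the watermark in at a level well
# below the overlapping speech phrase. The RMS loudness proxy and the
# 20 dB margin are illustrative assumptions, not values from this disclosure.
import numpy as np

def mask_watermark(speech: np.ndarray, watermark: np.ndarray,
                   margin_db: float = 20.0) -> np.ndarray:
    speech_rms = np.sqrt(np.mean(speech ** 2) + 1e-12)
    wm_rms = np.sqrt(np.mean(watermark ** 2) + 1e-12)
    target_rms = speech_rms * 10 ** (-margin_db / 20.0)  # margin_db quieter
    return speech + watermark * (target_rms / wm_rms)
```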

Fig. 5 illustrates a watermark verification component according to an embodiment. As discussed above, the watermarking subcomponent 119 includes a reversible neural network model that ensures a one-to-one mapping between the input audio segment and the watermarked audio segment. This property may be used to verify whether a watermarked audio segment was generated by the synthesis component 115.

In the example verification process shown in fig. 5, the input data includes a watermarked audio file 515 and additional proprietary information 513. The additional proprietary information 513 may be any information that was used to generate the watermark in the watermarked audio file 515, for example by a user of the API exposed by the synthesis component 115. This information is generally not disclosed to the public and is used for watermark extraction. For example, such information may include a private key with which the watermark was embedded.

Watermark verification component 501 may include the same reversible neural network model 121 as the watermarking subcomponent 119. In response to receiving the watermarked audio 515, the watermark verification component 501 may run the reversible neural network to extract the watermark from the watermarked audio 515 and obtain the original audio 517 without the watermark. The watermark extraction may be based on the additional proprietary information 513, and the extraction process corresponds to the security level defined in the watermarking subcomponent 119 of the synthesis component 115.

The watermark verification component 501 may compute the difference between the original audio 517 and the input watermarked audio 515 to obtain the actual watermark embedded in the watermarked audio 515 for verification. In one embodiment, the actual watermark may be compared to the watermark embedded in the synthesis component 115 during the training phase to determine whether the watermarked audio 515 was generated by the trained synthesis component 115.
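
Putting the verification flow together, a hedged sketch follows, reusing the hypothetical `InvertibleWatermarker` from the earlier coupling-layer example; the cosine-similarity test and its 0.9 threshold are assumptions about how the comparison might be scored.

```python
# Sketch of the Fig. 5 verification flow, reusing the hypothetical
# InvertibleWatermarker from the coupling-layer example above. The
# cosine-similarity comparison and 0.9 threshold are assumptions.
import torch
import torch.nn.functional as F

def verify(watermarked_audio: torch.Tensor, model,
           expected_watermark: torch.Tensor, threshold: float = 0.9) -> bool:
    original = model.inverse(watermarked_audio)   # strip the embedded watermark
    actual = watermarked_audio - original         # the difference is the watermark
    sim = F.cosine_similarity(actual.flatten(),
                              expected_watermark.flatten(), dim=0)
    return bool(sim > threshold)                  # produced by this synthesis component?
```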

FIG. 6 illustrates an example process 600 for training a TTS synthesis component, according to an embodiment. Process 600 may be performed by processing logic that may comprise software, hardware, or a combination thereof. For example, the processing logic may include the synthesis component 115 described in figs. 1 and 2.

Referring back to fig. 6, in operation 601, the TTS framework receives a set of training data for training the TTS framework to generate a synthesized audio segment with a watermark, and the TTS framework includes a TTS neural network model and a watermark neural network model. In operation 602, neuron values of a TTS neural network model may be adjusted to prepare one or more spaces in a synthesized audio segment to be generated by a TTS framework for watermarking. In operation 603, neuron values of the watermark neural network model may be adjusted to add the watermark to one or more prepared spaces.

Fig. 7 illustrates an example process 700 of verifying a synthesized audio segment according to an embodiment. Process 700 may be performed by processing logic that may comprise software, hardware, or a combination thereof. For example, the processing logic may be implemented by the watermark verification component 501 depicted in fig. 5.

Referring back to fig. 7, in operation 701, the watermarked audio segment and the proprietary information are received at the watermark verification component. In operation 702, the watermark verification component obtains the original audio segment from the watermarked audio segment using a neural network model based on the proprietary information, the neural network model being part of a synthesis component used to generate the watermarked audio segment. In operation 703, the watermark verification component obtains the actual watermark embedded in the watermarked audio segment based on a comparison between the watermarked audio segment and the original audio segment. In operation 704, the watermark verification component determines whether the watermarked audio segment was generated by the synthesis component by comparing the actual watermark to a predetermined watermark used to train the synthesis component.

FIG. 8 is a block diagram illustrating an example of a data processing system that may be used with one embodiment of the invention. For example, system 1500 may represent any of the data processing systems described above, such as a client device or a server (e.g., a cloud server or platform hosting the TTS framework described above), and may perform any of the processes or methods described above.

It should also be noted that system 1500 is intended to illustrate a high-level view of many components of a computer system. However, it is to be understood that additional components may be present in certain embodiments, and further, that a different arrangement of the illustrated components may occur in other embodiments. Further, while only a single machine or system is illustrated, the term "machine" or "system" shall also be taken to include any collection of machines or systems that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

In one embodiment, system 1500 includes a processor 1501, memory 1503, and devices 1505-1508 connected via a bus or interconnect 1510. Processor 1501 may represent a single processor or multiple processors including a single processor core or multiple processor cores therein. Processor 1501 may represent one or more general-purpose processors, such as a microprocessor, Central Processing Unit (CPU), or the like. More particularly, processor 1501 may be a Complex Instruction Set Computing (CISC) microprocessor, Reduced Instruction Set Computing (RISC) microprocessor, Very Long Instruction Word (VLIW) microprocessor, or a processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processor 1501 may also be one or more special-purpose processors, such as an Application Specific Integrated Circuit (ASIC), a cellular or baseband processor, a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), a network processor, a graphics processor, a communications processor, a cryptographic processor, a coprocessor, an embedded processor, or any other type of logic capable of processing instructions.

Processor 1501 may be a low power multi-core processor socket, such as an ultra-low voltage processor, which may act as a main processing unit and central hub for communicating with various components of the system. Such a processor may be implemented as a system on a chip (SoC). The processor 1501 is configured to execute instructions for performing the operations and steps discussed herein. The system 1500 may further include a graphics interface in communication with the optional graphics subsystem 1504, which may include a display controller, a graphics processor, and/or a display device.

Processor 1501 may be in communication with memory 1503, which in one embodiment may be implemented via multiple memory devices to provide a fixed amount of system memory. The memory 1503 may include one or more volatile storage (or memory) devices such as random access memory (RAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), static RAM (SRAM), or other types of storage devices. Memory 1503 may store information including sequences of instructions that are executed by processor 1501 or any other device. For example, executable code and/or data for various operating systems, device drivers, firmware (e.g., a basic input/output system or BIOS), and/or application programs may be loaded into memory 1503 and executed by processor 1501.

System 1500 may also include IO devices, such as devices 1505-1508, including network interface device(s) 1505, optional input device(s) 1506, and other optional IO device(s) 1507. The network interface device 1505 may include a wireless transceiver and/or a Network Interface Card (NIC). The wireless transceiver may be a WiFi transceiver, an infrared transceiver, a Bluetooth transceiver, a WiMax transceiver, a wireless cellular telephone transceiver, a satellite transceiver (e.g., a Global Positioning System (GPS) transceiver), or another Radio Frequency (RF) transceiver, or a combination thereof. The NIC may be an Ethernet card.

The input device 1506 may include a mouse, a touchpad, a touch-sensitive screen (which may be integrated with the display device 1504), a pointer device such as a stylus, and/or a keyboard (e.g., a physical keyboard or a virtual keyboard displayed as part of the touch-sensitive screen). For example, the input device 1506 may include a touch screen controller connected to a touch screen. Touch screens and touch screen controllers can, for example, detect contact and movement or breaks thereof using any of a variety of touch sensitivity technologies, including but not limited to capacitive, resistive, infrared, and surface acoustic wave technologies, as well as other proximity sensor arrays or other elements for determining one or more points of contact with the touch screen.

IO device 1507 may include an audio device. The audio device may include a speaker and/or microphone to facilitate voice-enabled functions, such as voice recognition, voice replication, digital recording, and/or telephony functions. Other IO devices 1507 may also include Universal Serial Bus (USB) ports, parallel ports, serial ports, printers, network interfaces, bus bridges (e.g., PCI-PCI bridges), sensors (e.g., motion sensors such as accelerometers, gyroscopes, magnetometers, light sensors, compasses, proximity sensors, etc.), or combinations thereof.

A mass storage device (not shown) may also be connected to processor 1501 for the purpose of providing persistent storage of information, such as data, applications, one or more operating systems, and so forth. In various embodiments, to enable thinner and lighter system designs and improve system responsiveness, the mass storage device may be implemented via a solid-state drive (SSD). However, in other embodiments, the mass storage device may be implemented primarily using a hard disk drive (HDD), with a smaller amount of SSD storage acting as an SSD cache to enable non-volatile storage of context state and other such information during power-down events, so that a fast power-up may occur upon a restart of system activity. Further, a flash memory device may be connected to processor 1501, e.g., via a serial peripheral interface (SPI). The flash memory device may provide non-volatile storage of system software, including the BIOS and other firmware of the system.

Storage 1508 may include a computer-accessible storage medium 1509 (also referred to as a machine-readable storage medium or a computer-readable medium) on which is stored one or more sets of instructions or software (e.g., modules, units, and/or logic 1528) embodying any one or more of the methodologies or functions described herein. The processing module/unit/logic 1528 may represent any of the above-described components, such as, for example, the watermarking component described above. The processing module/unit/logic 1528 may also reside, completely or at least partially, within the memory 1503 and/or within the processor 1501 during execution thereof by the data processing system 1500, the memory 1503 and the processor 1501 also constituting machine-accessible storage media. The processing module/unit/logic 1528 may also be transmitted or received over a network via the network interface device 1505.

The computer-readable storage medium 1509 may also be used to persistently store some of the software functions described above. While the computer-readable storage medium 1509 is shown in an exemplary embodiment to be a single medium, the term "computer-readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "computer-readable storage medium" shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term "computer-readable storage medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, or any other non-transitory machine-readable medium.

The processing module/unit/logic 1528, components and other features described herein may be implemented as discrete hardware components or integrated in the functionality of hardware components such as ASICs, FPGAs, DSPs or similar devices. Additionally, the processing module/unit/logic 1528 may be implemented as firmware or functional circuitry within a hardware device. Further, the processing module/unit/logic 1528 may be implemented in any combination of hardware devices and software components.

Note that some or all of the components shown and described above may be implemented in software, hardware, or a combination thereof. For example, these components may be implemented as software installed and stored in a persistent storage device, which may be loaded and executed by a processor (not shown) in memory to perform the processes or operations described throughout this application. Alternatively, these components may be implemented as executable code programmed or embedded into special-purpose hardware, such as an integrated circuit (e.g., an application specific IC or ASIC), a Digital Signal Processor (DSP) or a Field Programmable Gate Array (FPGA), which is accessible via corresponding drivers and/or an operating system from an application. Further, these components may be implemented as specific hardware logic within a processor or processor core as part of an instruction set accessible to software components via one or more specific instructions.

Some portions of the preceding detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities.

All of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as those set forth in the appended claims refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

Embodiments of the present disclosure also relate to an apparatus for performing the operations herein. Such an apparatus may be implemented by a computer program stored in a non-transitory computer-readable medium. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g., a computer). For example, a machine-readable (e.g., computer-readable) medium includes a machine (e.g., computer) readable storage medium (e.g., read-only memory ("ROM"), random access memory ("RAM"), magnetic disk storage media, optical storage media, flash memory devices).

The processes or methods described in the foregoing figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), software (e.g., embodied on a non-transitory computer readable medium), or a combination of both. Although the processes or methods are described above in terms of some sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.

Embodiments of the present disclosure are not described with reference to any particular programming language. It will be appreciated that various programming languages may be used to implement the teachings of the embodiments of the disclosure as described herein.

In the foregoing specification, embodiments of the disclosure have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
