Duration-aware network for text-to-speech synthesis

Publication No. 1895152 | Published: 2021-11-26

Reader's note: this technology, "Duration-aware network for text-to-speech synthesis," was designed and created by Yu Chengzhu, Lu Heng, and Yu Dong on 2020-03-05. Its main content comprises: A method and apparatus, comprising: receiving a text input comprising a sequence of text components. A duration model is used to determine respective durations of the text components. A first set of spectra is generated based on the sequence of text components. A second set of spectra is generated based on the first set of spectra and the respective durations of the sequence of text components. A speech spectrogram frame is generated based on the second set of spectra. An audio waveform is generated based on the spectrogram frame. The audio waveform is provided as an output.

1. A method, comprising:

receiving, by a device, a text input comprising a sequence of text components;

determining, by the device and using a duration model, respective durations of the text components;

generating, by the device, a first set of spectra based on the sequence of text components;

generating, by the device, a second set of spectra based on the first set of spectra and respective durations of the sequence of text components;

generating, by the device, a speech spectrogram frame based on the second set of speech spectra;

generating, by the device, an audio waveform based on the speech spectrogram frame; and

providing the audio waveform as an output by the device.

2. The method of claim 1, wherein the text components are phonemes.

3. The method of claim 1, wherein the text components are characters.

4. The method of claim 1, the method further comprising:

copying respective speech spectra in the first set of speech spectra based on respective durations of the text components; and

wherein generating the second set of spectra comprises generating the second set of spectra based on the replicated first set of spectra.

5. The method of claim 1, wherein the second set of spectra comprises mel-frequency cepstral spectra.

6. The method of claim 1, the method further comprising:

training the duration model using a set of predicted frames and training text components.

7. The method of claim 1, the method further comprising:

training the duration model using a hidden Markov model forced alignment technique.

8. An apparatus, comprising:

at least one memory configured to store program code;

at least one processor configured to read the program code and to operate according to instructions of the program code, the program code comprising:

receiving code configured to cause the at least one processor to receive a text input comprising a sequence of text components;

determining code configured to cause the at least one processor to determine respective durations of the text components using a duration model;

generating code configured to cause the at least one processor to:

generate a first set of spectra based on the sequence of text components;

generate a second set of spectra based on the first set of spectra and respective durations of the sequence of text components;

generate a speech spectrogram frame based on the second set of speech spectra;

generate an audio waveform based on the speech spectrogram frame; and

providing code configured to cause the at least one processor to provide the audio waveform as an output.

9. The apparatus of claim 8, wherein the text components are phonemes.

10. The apparatus of claim 8, wherein the text components are characters.

11. The apparatus of claim 8, the apparatus further comprising:

copying code configured to cause the at least one processor to copy respective speech spectra in the first set of speech spectra based on respective durations of the text components; and

wherein the generating code is configured to cause the at least one processor to generate the second set of spectra based on the replicated first set of spectra.

12. The apparatus of claim 8, wherein the second set of spectra comprises mel-frequency cepstral spectra.

13. The apparatus of claim 8, the apparatus further comprising:

training code configured to cause the at least one processor to train the duration model using a set of predicted frames and training text components.

14. The apparatus of claim 8, the apparatus further comprising:

training code configured to cause the at least one processor to train the duration model using a hidden Markov model forced alignment technique.

15. A non-transitory computer-readable medium storing instructions, the instructions comprising one or more instructions that, when executed by one or more processors of a device, cause the one or more processors to:

receive a text input comprising a sequence of text components;

determine respective durations of the text components using a duration model;

generate a first set of spectra based on the sequence of text components;

generate a second set of spectra based on the first set of spectra and respective durations of the sequence of text components;

generate a speech spectrogram frame based on the second set of speech spectra;

generate an audio waveform based on the speech spectrogram frame; and

provide the audio waveform as an output.

16. The non-transitory computer-readable medium of claim 15, wherein the text components are phonemes.

17. The non-transitory computer-readable medium of claim 15, wherein the text components are characters.

18. The non-transitory computer-readable medium of claim 15, wherein the one or more instructions cause the one or more processors to:

copy respective speech spectra in the first set of speech spectra based on respective durations of the text components; and

wherein the one or more instructions that cause the one or more processors to generate the second set of spectra cause the one or more processors to generate the second set of spectra based on the replicated first set of spectra.

19. The non-transitory computer-readable medium of claim 15, wherein the second set of spectra includes mel-frequency cepstral spectra.

20. The non-transitory computer-readable medium of claim 15, wherein the second set of speech spectra includes a different number of speech spectra than the first set of speech spectra.

Background

Recently, Tacotron-based end-to-end speech synthesis systems have shown impressive text-to-speech (TTS) results in terms of the prosody and naturalness of the synthesized speech. However, such systems have a significant drawback: they may skip or repeat certain words of the input text when synthesizing speech. This problem arises from the end-to-end nature of such systems, in which an uncontrollable attention mechanism drives speech generation. The present disclosure addresses these issues by replacing the end-to-end attention mechanism inside the Tacotron system with a duration-informed attention network. The proposed network achieves comparable or improved synthesis performance while solving the problems within a Tacotron system.

Disclosure of Invention

According to some possible implementations, a method includes: receiving, by a device, a text input comprising a sequence of text components; determining, by the device and using a duration model, respective durations of the text components; generating, by the device, a first set of spectra based on the sequence of text components; generating, by the device, a second set of spectra based on the first set of spectra and respective durations of the sequence of text components; generating, by the device, a speech spectrogram frame based on the second set of spectra; generating, by the device, an audio waveform based on the speech spectrogram frame; and providing, by the device, the audio waveform as an output.

According to some possible implementations, an apparatus includes: at least one memory configured to store program code; and at least one processor configured to read the program code and to operate according to instructions of the program code, the program code comprising: receiving code configured to cause the at least one processor to receive a text input comprising a sequence of text components; determining code configured to cause the at least one processor to determine respective durations of the text components using a duration model; generating code configured to cause the at least one processor to: generate a first set of spectra based on the sequence of text components; generate a second set of spectra based on the first set of spectra and respective durations of the sequence of text components; generate a speech spectrogram frame based on the second set of spectra; and generate an audio waveform based on the spectrogram frame; and providing code configured to cause the at least one processor to provide the audio waveform as an output.

According to some possible implementations, a non-transitory computer-readable medium stores instructions, the instructions comprising one or more instructions that, when executed by one or more processors of a device, cause the one or more processors to: receive a text input comprising a sequence of text components; determine respective durations of the text components using a duration model; generate a first set of spectra based on the sequence of text components; generate a second set of spectra based on the first set of spectra and respective durations of the sequence of text components; generate a speech spectrogram frame based on the second set of spectra; generate an audio waveform based on the spectrogram frame; and provide the audio waveform as an output.

Drawings

FIG. 1 is a diagrammatic illustration of an example implementation described herein;

FIG. 2 is a diagram of an example environment in which systems and/or methods described herein may be implemented;

FIG. 3 is a diagram of example components of one or more of the devices of FIG. 2; and

FIG. 4 is a flow diagram of an example process for generating an audio waveform using a duration-informed attention network for text-to-speech synthesis.

Detailed Description

TTS systems have a wide variety of applications. However, most commercially deployed systems are based primarily on parametric synthesis, which leaves a large quality gap relative to natural human speech. Tacotron is a TTS synthesis system that differs significantly from conventional parametric TTS systems and is capable of producing highly natural speech. The whole system can be trained in an end-to-end manner, with the traditional, complex linguistic feature extraction stage replaced by an encoder module built from a convolution bank, highway network, and bidirectional gated recurrent unit (CBHG).

The duration models used in conventional parametric systems are replaced with an end-to-end attention mechanism, in which the alignment between the input text (or phoneme sequence) and the speech signal is learned by the attention model rather than derived from Hidden Markov Model (HMM)-based alignment. Another major difference of the Tacotron system is that it directly predicts mel/linear spectra that can be consumed directly by advanced neural vocoders (e.g., WaveNet and WaveRNN) to synthesize high-quality speech.

Tacotron-based systems are able to generate more accurate and natural-sounding speech. However, the Tacotron system exhibits instabilities, such as skipping and/or repeating parts of the input text, which are inherent drawbacks when synthesizing speech waveforms.

Some implementations herein address the aforementioned input text skipping and repetition problem of a Tacotron-based system while maintaining its excellent synthesis quality. Furthermore, some implementations herein address these instability issues and achieve significantly improved naturalness in synthesized speech.

The instability of Tacotron is caused mainly by its uncontrollable attention mechanism: there is no guarantee that the input text will be synthesized sequentially, without skipping or repetition.

Some implementations herein replace this unstable and uncontrollable attention mechanism with a duration-based attention mechanism in which the input text is guaranteed to be synthesized sequentially, without skipping or repetition. The main reason attention is needed in a Tacotron-based system is the lack of alignment information between the source text and the target spectrogram.

Typically, the input text is much shorter than the generated spectrogram: a single character/phoneme of the input text may generate multiple spectrogram frames, and this alignment information is required to model the input/output relationship with any neural network architecture.

Tacotron-based systems solve this problem mainly with an end-to-end mechanism in which generation of the spectrogram relies on learned attention over the source input text. However, this attention mechanism is fundamentally unstable because it is highly uncontrollable. Some implementations herein replace the end-to-end attention mechanism within a Tacotron system with a duration model that predicts how long each input character and/or phoneme lasts. In other words, alignment between the output spectrogram and the input text is achieved by copying each input character and/or phoneme for its predicted duration. The ground-truth durations of the input text used to train the system are obtained through HMM-based forced alignment. With the predicted durations, each target frame in the spectrogram can be matched to exactly one character/phoneme in the input text. The entire model architecture is depicted in FIG. 1.
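
To make this matching concrete, here is a minimal sketch (illustrative names, not the patent's code) that maps each target spectrogram frame to exactly one input phoneme, given per-phoneme durations expressed in frames:

```python
import numpy as np

def frame_to_phoneme_map(durations_in_frames):
    """Map each spectrogram frame index to the phoneme that produced it.

    durations_in_frames: integer frame count per input phoneme, e.g. from
    HMM-based forced alignment at training time or from the duration
    model at synthesis time.
    """
    phoneme_indices = np.arange(len(durations_in_frames))
    # Repeat phoneme index i exactly durations_in_frames[i] times, so
    # frame t can look up its source phoneme directly.
    return np.repeat(phoneme_indices, durations_in_frames)

# "DH IH S" with durations 3, 2, 4 frames -> 9 target frames
alignment = frame_to_phoneme_map([3, 2, 4])
print(alignment)  # [0 0 0 1 1 2 2 2 2]
```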

FIG. 1 is a diagrammatic illustration of an example implementation described herein. As shown in FIG. 1, referring to reference numeral 110, a platform (e.g., a server) may receive a text input comprising a sequence of text components. As shown, the text input may include a phrase, such as "this is a cat". The text input may include a sequence of text components, shown as the phonemes "DH", "IH", "S", "IH", "Z", "AX", "K", "AE", and "T".

As further shown in FIG. 1, referring to reference numeral 120, the platform may use a duration model to determine respective durations of the text components. The duration model may include a model that receives an input text component and determines a duration of the text component. As an example, the phrase "this is a cat" may include a total duration of one second in audible output. The respective text components of the phrase may include different durations that collectively comprise the total duration.

By way of example, the word "this" may have a duration of 400 milliseconds, the word "is" a duration of 200 milliseconds, the word "a" a duration of 100 milliseconds, and the word "cat" a duration of 300 milliseconds. The duration model may determine the respective constituent durations of the text components.
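
As a worked example of how such durations become frame counts, assuming a hypothetical 12.5-millisecond spectrogram hop size (the patent does not specify one):

```python
HOP_MS = 12.5  # assumed analysis hop size; not specified by the patent

durations_ms = {"this": 400, "is": 200, "a": 100, "cat": 300}
frames = {word: round(ms / HOP_MS) for word, ms in durations_ms.items()}
print(frames)                 # {'this': 32, 'is': 16, 'a': 8, 'cat': 24}
print(sum(frames.values()))   # 80 frames = 1 second of audio
```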

As further shown in FIG. 1, referring to reference numeral 130, the platform may generate a first set of spectra based on the sequence of text components. For example, the platform may input the text component into a model that generates an output speech spectrum based on the input text component. As shown, the first set of spectra may include a respective spectrum for each text component (e.g., shown as "1", "2", "3", "4", "5", "6", "7", "8", and "9").

As further shown in FIG. 1, referring to reference numeral 140, the platform may generate a second set of spectra based on the first set of spectra and the respective durations of the sequence of text components. The platform may generate the second set of spectra by replicating each spectrum in the first set based on its respective duration. By way of example, the spectrum "1" may be copied such that the second set of spectra includes three spectral components corresponding to spectrum "1", and so on. The platform may use the output of the duration model to determine the manner of generating the second set of spectra.
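
A minimal sketch of this replication step, assuming the first set of spectra is held as a NumPy array with one row per text component (the shapes and durations below are illustrative):

```python
import numpy as np

def expand_spectra(first_set, durations_in_frames):
    """Replicate each spectrum row by its predicted duration.

    first_set:           (num_components, spectrum_dim) array, one
                         spectrum per input character/phoneme.
    durations_in_frames: integer duration per component, e.g. from the
                         duration model.
    Returns the second set, with one row per target spectrogram frame.
    """
    return np.repeat(first_set, durations_in_frames, axis=0)

first_set = np.random.randn(9, 80)       # 9 components, 80-dim spectra
durations = [3, 2, 2, 2, 3, 4, 2, 3, 3]  # e.g. duration-model output
second_set = expand_spectra(first_set, durations)
print(second_set.shape)  # (24, 80): spectrum "1" now occupies 3 frames
```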

As further shown in FIG. 1, referring to reference numeral 150, the platform may generate a speech spectrogram frame based on the second set of spectra. The speech spectrogram frame may be formed from the respective constituent spectral components of the second set of spectra. As shown in FIG. 1, the speech spectrogram frame can be aligned with the predicted frame. In other words, the speech spectrogram frames generated by the platform can be precisely aligned with the intended audio output of the text input.

The platform may use various techniques to generate audio waveforms based on the spectrogram frames and provide the audio waveforms as output.

In this manner, some implementations herein allow for more accurate audio output generation associated with text-to-speech synthesis by utilizing a duration model that determines respective durations of the input text components.

FIG. 2 is a diagram of an example environment 200 in which systems and/or methods described herein may be implemented. As shown in FIG. 2, environment 200 may include user device 210, platform 220, and network 230. The devices of environment 200 may be interconnected by wired connections, wireless connections, or a combination of wired and wireless connections.

User device 210 includes one or more devices capable of receiving, generating, storing, processing, and/or providing information associated with platform 220. For example, the user device 210 may include a computing device (e.g., a desktop computer, a laptop computer, a tablet computer, a handheld computer, a smart speaker, a server, etc.), a mobile phone (e.g., a smartphone, a wireless phone, etc.), a wearable device (e.g., a pair of smart glasses or a smart watch), or the like. In some implementations, the user device 210 may receive information from the platform 220 and/or send information to the platform 220.

The platform 220 includes the ability to generate audio waveforms using a duration-informed attention network for text-to-speech synthesis, as described elsewhere herein. In some implementations, the platform 220 may include a cloud server or a group of cloud servers. In some implementations, the platform 220 may be designed as a modular platform such that certain software components may be swapped in and out according to particular needs. Thus, the platform 220 may be easily and/or quickly reconfigured for different uses.

In some implementations, as shown, the platform 220 may be hosted in a cloud computing environment 222. It should be noted that although the implementations described herein describe the platform 220 as being hosted in the cloud computing environment 222, in some implementations, the platform 220 is not cloud-based (i.e., may be implemented outside of the cloud computing environment) or may be partially cloud-based.

Cloud computing environment 222 includes an environment hosting platform 220. The cloud computing environment 222 may provide computing, software, data access, storage, etc. services that do not require an end user (e.g., user device 210) to be aware of the physical location and configuration of the systems and/or devices hosting the platform 220. As shown, the cloud computing environment 222 may include a set of computing resources 224 (collectively referred to as "computing resources 224," with a single computing resource being referred to as "computing resource 224").

Computing resources 224 include one or more personal computers, workstation computers, server devices, or other types of computing and/or communication devices. In some implementations, the computing resources 224 may host the platform 220. Cloud resources may include compute instances running in computing resources 224, storage devices provided in computing resources 224, data transfer devices provided by computing resources 224, and so forth. In some implementations, the computing resources 224 may communicate with other computing resources 224 through wired connections, wireless connections, or a combination of wired and wireless connections.

As further shown in FIG. 2, the computing resources 224 include a set of cloud resources, such as one or more applications ("APP") 224-1, one or more virtual machines ("VM") 224-2, virtualized storage ("VS") 224-3, one or more hypervisors ("HYP") 224-4, and so forth.

The applications 224-1 include one or more software applications that may be provided to or accessed by the user device 210 and/or the platform 220. The application 224-1 may eliminate the need to install and run software applications on the user device 210. For example, the application 224-1 may include software associated with the platform 220 and/or any other software capable of being provided through the cloud computing environment 222. In some implementations, one application 224-1 can send/receive information to/from one or more other applications 224-1 through the virtual machine 224-2.

The virtual machine 224-2 comprises a software implementation of a machine (e.g., a computer) that runs programs like a physical machine. Virtual machine 224-2 may be a system virtual machine or a process virtual machine, depending on the use of, and the degree of correspondence of, the virtual machine 224-2 to any real machine. A system virtual machine may provide a complete system platform that supports the running of a complete operating system ("OS"). A process virtual machine may run a single program and may support a single process. In some implementations, the virtual machine 224-2 may run on behalf of a user (e.g., the user device 210) and may manage infrastructure of the cloud computing environment 222, such as data management, synchronization, or long-duration data transfers.

Virtualized storage 224-3 comprises one or more storage systems and/or one or more devices that use virtualization techniques within the storage systems or devices of computing resources 224. In some implementations, within the context of a storage system, types of virtualization may include block virtualization and file virtualization. Block virtualization may refer to the abstraction (or separation) of logical storage from physical storage so that a storage system may be accessed without regard to physical storage or heterogeneous structure. The separation may allow administrators of the storage system flexibility in how they manage storage for end users. File virtualization may eliminate dependencies between data accessed at the file level and the locations where files are physically stored. This may allow for optimization of storage use, server consolidation, and/or performance of non-disruptive file migrations.

The hypervisor 224-4 may provide hardware virtualization techniques that allow multiple operating systems (e.g., "guest operating systems") to run simultaneously on a host computer, such as the computing resources 224. Hypervisor 224-4 may present a virtual operating platform to the guest operating system and may manage the running of the guest operating system. Multiple instances of each operating system may share virtualized hardware resources.

Network 230 includes one or more wired and/or wireless networks. For example, network 230 may include a cellular network (e.g., a fifth generation (5G) network, a Long Term Evolution (LTE) network, a third generation (3G) network, a Code Division Multiple Access (CDMA) network, etc.), a Public Land Mobile Network (PLMN), a Local Area Network (LAN), a Wide Area Network (WAN), a Metropolitan Area Network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), a private network, an ad hoc network, an intranet, the internet, a fiber-based network, etc., and/or a combination of these or other types of networks.

The number and arrangement of devices and networks shown in fig. 2 are provided as examples. In practice, there may be additional devices and/or networks, fewer devices and/or networks, different devices and/or networks, or devices and/or networks arranged differently than those shown in fig. 2. Further, two or more of the devices shown in fig. 2 may be implemented within a single device, or a single device shown in fig. 2 may be implemented as multiple distributed devices. Additionally or alternatively, a set of devices (e.g., one or more devices) of environment 200 may perform one or more functions described as being performed by another set of devices of environment 200.

Fig. 3 is a diagram of example components of a device 300. The device 300 may correspond to the user device 210 and/or the platform 220. As shown in fig. 3, device 300 may include a bus 310, a processor 320, a memory 330, a storage component 340, an input component 350, an output component 360, and a communication interface 370.

Bus 310 includes components that allow communication among the components of device 300. Processor 320 is implemented in hardware, firmware, or a combination of hardware and software. Processor 320 is a Central Processing Unit (CPU), Graphics Processing Unit (GPU), Accelerated Processing Unit (APU), microprocessor, microcontroller, Digital Signal Processor (DSP), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), or another type of processing component. In some implementations, processor 320 includes one or more processors that can be programmed to perform functions. Memory 330 includes a Random Access Memory (RAM), a Read Only Memory (ROM), and/or another type of dynamic or static storage device (e.g., flash memory, magnetic memory, and/or optical memory) that stores information and/or instructions for use by processor 320.

The storage component 340 stores information and/or software related to the operation and use of the device 300. For example, storage component 340 may include a hard disk (e.g., a magnetic disk, an optical disk, a magneto-optical disk, and/or a solid state disk), a compact disc (CD), a Digital Versatile Disk (DVD), a floppy disk, a cartridge, a magnetic tape, and/or another type of non-transitory computer-readable medium, along with a corresponding drive.

Input components 350 include components that allow device 300 to receive information, such as through user input (e.g., a touch screen display, keyboard, keypad, mouse, buttons, switches, and/or microphone). Additionally or alternatively, input component 350 may include sensors for sensing information (e.g., Global Positioning System (GPS) components, accelerometers, gyroscopes, and/or actuators). Output components 360 include components that provide output information from device 300 (e.g., a display, a speaker, and/or one or more Light Emitting Diodes (LEDs)).

Communication interface 370 includes transceiver-like components (e.g., a transceiver and/or separate receivers and transmitters) that enable device 300 to communicate with other devices, such as by wired connections, wireless connections, or a combination of wired and wireless connections. Communication interface 370 may allow device 300 to receive information from and/or provide information to another device. For example, communication interface 370 may include an ethernet interface, an optical interface, a coaxial interface, an infrared interface, a Radio Frequency (RF) interface, a Universal Serial Bus (USB) interface, a Wi-Fi interface, a cellular network interface, and/or the like.

Device 300 may perform one or more processes described herein. Device 300 may perform these processes in response to processor 320 executing software instructions stored by a non-transitory computer-readable medium, such as memory 330 and/or storage component 340. A computer-readable medium is defined herein as a non-transitory memory device. The memory device includes memory space within a single physical memory device or memory space distributed across multiple physical memory devices.

The software instructions may be read into memory 330 and/or storage component 340 from another computer-readable medium or read into memory 330 and/or storage component 340 from another device via communication interface 370. When executed, software instructions stored in memory 330 and/or storage component 340 may cause processor 320 to perform one or more processes described herein. Additionally or alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement one or more processes described herein. Thus, implementations described herein are not limited to any specific combination of hardware circuitry and software.

The number and arrangement of components shown in fig. 3 are provided as examples. In practice, device 300 may include additional components, fewer components, different components, or components arranged differently than those shown in FIG. 3. Additionally or alternatively, a set of components (e.g., one or more components) of device 300 may perform one or more functions described as being performed by another set of components of device 300.

FIG. 4 is a flow diagram of an example process 400 for generating an audio waveform using a duration-informed attention network for text-to-speech synthesis. In some implementations, one or more of the process blocks of FIG. 4 may be performed by the platform 220. In some implementations, one or more of the process blocks of FIG. 4 may be performed by another device or group of devices (e.g., user device 210) separate from or including the platform 220.

As shown in fig. 4, process 400 may include receiving, by a device, a text input including a sequence of text components (block 410).

For example, the platform 220 may receive text input to be converted into audio output. The text components may include characters, phonemes, n-grams, words, letters, and the like. The sequence of text components may form sentences, phrases, or the like.
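
As an illustration of obtaining phoneme components from raw text, one could use an off-the-shelf grapheme-to-phoneme front end; the sketch below assumes the third-party g2p_en package, which the patent does not name (and whose ARPAbet phone set differs slightly from the labels in FIG. 1):

```python
from g2p_en import G2p  # third-party package: pip install g2p-en

g2p = G2p()
# g2p_en inserts a space token between words; drop those to keep
# only the phoneme components.
phonemes = [p for p in g2p("this is a cat") if p.strip()]
print(phonemes)
# e.g. ['DH', 'IH1', 'S', 'IH1', 'Z', 'AH0', 'K', 'AE1', 'T']
```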

As further shown in FIG. 4, process 400 may include determining, by the device and using a duration model, respective durations of the text components (block 420).

The duration model may include a model that receives an input text component and determines a duration of that text component. The platform 220 may train the duration model. For example, the platform 220 may use machine learning techniques to analyze data (e.g., training data, such as historical data, etc.) and create the duration model. Machine learning techniques may include, for example, supervised techniques and/or unsupervised techniques, such as artificial neural networks, Bayesian statistics, learning automata, hidden Markov models, linear classifiers, quadratic classifiers, decision trees, association rule learning, and the like.

The platform 220 may train the duration model by aligning speech spectrogram frames with sequences of text components of known duration. For example, the platform 220 may use HMM-based forced alignment to determine ground-truth durations for the text components of an input text sequence. The platform 220 may train the duration model using predicted or target speech spectrogram frames of known duration and known input text sequences comprising text components.
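
A minimal training sketch under these assumptions, fitting a small recurrent regressor to log frame counts obtained from forced alignment (the architecture and loss here are illustrative choices, not specified by the patent):

```python
import torch
import torch.nn as nn

class DurationModel(nn.Module):
    """Predict a (log) duration in frames for each input phoneme."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, 1)

    def forward(self, phoneme_ids):               # (batch, seq_len)
        hidden, _ = self.rnn(self.embed(phoneme_ids))
        return self.proj(hidden).squeeze(-1)      # (batch, seq_len)

model = DurationModel(vocab_size=70)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy step; real targets would be log frame counts obtained from
# HMM-based forced alignment of the training audio.
phonemes = torch.randint(0, 70, (8, 20))
target_frames = torch.randint(1, 25, (8, 20)).float()

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(phonemes), target_frames.log())
loss.backward()
optimizer.step()
print(float(loss))
```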

The platform 220 may input the text components into a duration model and determine information identifying or associated with respective durations of the text components based on the output of the model. Information identifying or relating to the respective time duration may be used to generate a second set of spectra, as described below.

As further shown in FIG. 4, process 400 may include determining whether a duration model has been used to determine a respective duration for each text component (block 430).

For example, the platform 220 may iteratively or simultaneously determine respective durations of the text components. The platform 220 may determine whether a duration has been determined for each text component of the input text sequence.

As further shown in FIG. 4, if the duration model is not used to determine the respective duration of each text component (block 430-NO), the process 400 may include returning to block 420.

For example, the platform 220 may enter text components for which a duration has not been determined into the duration model until a duration is determined for each text component.

As further shown in FIG. 4, if the duration model has been used to determine the respective duration of each text component (block 430-YES), process 400 may include generating, by the device, a first set of spectra based on the sequence of text components (block 440).

For example, the platform 220 may generate an output speech spectrum for each text component of the sequence of input text components. The platform 220 may utilize a CBHG module to generate the output speech spectra. The CBHG module may include a bank of one-dimensional (1-D) convolutional filters, a set of highway networks, a bidirectional Gated Recurrent Unit (GRU), a Recurrent Neural Network (RNN), and/or other components.
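
A condensed, hedged sketch of such a CBHG-style module follows (hyperparameters are illustrative, and the max-pooling and convolutional projection stages of the original Tacotron CBHG are omitted for brevity):

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Single highway layer: gated mix of a transform and the input."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * torch.relu(self.transform(x)) + (1 - t) * x

class CBHG(nn.Module):
    """Convolution bank + highway network + bidirectional GRU."""
    def __init__(self, dim=128, bank_size=8, num_highway=4):
        super().__init__()
        # Bank of 1-D convolutions with kernel sizes 1..bank_size.
        self.bank = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2)
            for k in range(1, bank_size + 1))
        self.reduce = nn.Linear(bank_size * dim, dim)
        self.highways = nn.ModuleList(
            Highway(dim) for _ in range(num_highway))
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, x):                     # x: (batch, time, dim)
        c = x.transpose(1, 2)                 # convolve over time axis
        # Truncate each bank output to the original length (even
        # kernels pad one extra frame).
        banks = [torch.relu(conv(c))[:, :, :c.size(2)] for conv in self.bank]
        h = self.reduce(torch.cat(banks, dim=1).transpose(1, 2)) + x
        for hw in self.highways:
            h = hw(h)
        out, _ = self.gru(h)                  # (batch, time, 2 * dim)
        return out

print(CBHG()(torch.randn(2, 50, 128)).shape)  # torch.Size([2, 50, 256])
```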

In some implementations, the output speech spectrum may be a mel-frequency cepstrum (MFC) speech spectrum. The output speech spectrum may include any type of speech spectrum used to generate a speech spectrum frame.

As further shown in fig. 4, process 400 may include generating, by the device, a second set of spectra based on the first set of spectra and respective durations of the sequence of text components (block 450).

For example, the platform 220 may generate the second set of spectra using the first set of spectra and information identifying or associated with respective durations of the text components.

As an example, platform 220 may replicate individual spectra of the first set of spectra based on the respective durations of the underlying text components corresponding to those spectra. In some cases, platform 220 may replicate a spectrum based on a replication factor, a time factor, and/or the like. In other words, the output of the duration model may be used to determine a factor for copying a particular spectrum, generating additional spectra, and so forth.
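
One practical detail if the duration model emits continuous values: rounding each duration independently can drift from the intended total length. The following sketch, an assumption rather than anything specified in the patent, rounds cumulatively so the total frame count is preserved:

```python
import numpy as np

def quantize_durations(durations_in_frames):
    """Round continuous per-component durations to integers while
    keeping the cumulative total (the utterance length) exact."""
    cumulative = np.cumsum(durations_in_frames)
    rounded = np.rint(cumulative).astype(int)
    return np.diff(rounded, prepend=0)

# Naive per-item rounding would give [2, 2, 2, 3] (total 9);
# cumulative rounding keeps the intended total of 10 frames.
print(quantize_durations([2.4, 2.4, 2.4, 2.8]))  # [2 3 2 3]
```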

As further shown in FIG. 4, process 400 may include generating, by the device, a speech spectrogram frame based on the second set of spectra (block 460).

For example, platform 220 may generate a speech spectrogram frame based on the second set of speech spectra. The spectra of the second set together form the speech spectrogram frame. As described elsewhere herein, a speech spectrogram frame generated using a duration model may more closely resemble a target frame or a predicted frame. In this way, some implementations herein improve the accuracy of TTS synthesis, the naturalness of the generated speech, the prosody of the generated speech, and the like.

As further shown in fig. 4, the process 400 may include generating, by the device, an audio waveform based on the spectrogram frame (block 470), and providing, by the device, the audio waveform as output (block 480).

For example, the platform 220 may generate an audio waveform based on the spectrogram frame and provide the audio waveform for output. By way of example, the platform 220 may provide audio waveforms to an output component (e.g., a speaker, etc.), may provide audio waveforms to another device (e.g., the user device 210), may transmit audio waveforms to a server or another terminal, and so forth.
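
While the patent contemplates neural vocoders such as WaveNet and WaveRNN, a classical Griffin-Lim reconstruction serves as a lightweight stand-in for illustration; this sketch assumes a linear-magnitude spectrogram and the librosa and soundfile packages:

```python
import librosa
import numpy as np
import soundfile as sf

def spectrogram_to_wav(magnitude, sr=22050, hop_length=256,
                       path="out.wav"):
    """Invert a linear magnitude spectrogram with Griffin-Lim.

    magnitude: (1 + n_fft // 2, frames) array of generated spectrogram
    frames; a neural vocoder would replace this step in practice.
    """
    wav = librosa.griffinlim(magnitude, n_iter=60, hop_length=hop_length)
    sf.write(path, wav, sr)
    return wav

# Toy input: 80 frames (~0.93 s at sr=22050, hop=256) of a
# 1025-bin magnitude spectrogram.
wav = spectrogram_to_wav(
    np.abs(np.random.randn(1025, 80)).astype(np.float32))
print(wav.shape)
```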

Although fig. 4 shows example blocks of the process 400, in some implementations, the process 400 may include additional blocks, fewer blocks, different blocks, or blocks arranged differently than those depicted in fig. 4. Additionally or alternatively, two or more blocks of process 400 may be performed in parallel.

The foregoing disclosure provides illustration and description, but is not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementations.

As used herein, the term "component" is intended to be broadly interpreted as hardware, firmware, or a combination of hardware and software.

It is to be understood that the systems and/or methods described herein may be implemented in various forms of hardware, firmware, or combinations of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods is not limiting of implementations. Thus, the operation and behavior of the systems and/or methods were described herein without reference to the specific software code-it being understood that software and hardware may be designed to implement the systems and/or methods based on the description herein.

Even if specific combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. Indeed, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may be directly dependent on only one claim, the disclosure of possible implementations includes a combination of each dependent claim with every other claim in the set of claims.

No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. In addition, as used herein, the articles "a" and "an" are intended to include one or more items, and may be used interchangeably with "one or more." Further, as used herein, the term "set" is intended to include one or more items (e.g., related items, unrelated items, combinations of related and unrelated items, etc.), and may be used interchangeably with "one or more." Where only one item is intended, the term "one" or similar language is used. Furthermore, as used herein, the terms "having," "containing," or similar terms are intended to be open-ended terms. Further, the phrase "based on" is intended to mean "based, at least in part, on" unless explicitly stated otherwise.
