Text speech synthesis method, system, device, equipment and storage medium

Document No.: 193302 · Publication date: 2021-11-02

Reading note: This patent, "Text speech synthesis method, system, device, equipment and storage medium", was designed and created by 孙得心 on 2021-06-30. Its main content: the application provides a text speech synthesis method, system, device, equipment, and storage medium. The method includes: obtaining text data to be converted from a data conversion interface included in a client; and converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script. The preset acoustic service module and the preset codec script are deployed locally on the user terminal, or the codec script is deployed on the user terminal while the acoustic service module is configured on the server. A data conversion interface is set in the client, and the speech synthesis service is accessed through that interface. The interface can be set in any client, so any device capable of installing a client can use the speech synthesis service, without purchasing any dedicated equipment or downloading and installing additional applications. The number of applications installed on the user terminal does not increase, the terminal's storage and computing resources are saved, and the user's cost of using the speech synthesis service is reduced.

1. A speech synthesis method for text, applied to a user terminal, comprising the following steps:

obtaining text data to be converted from a data conversion interface included in a client;

converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script, wherein the preset acoustic service module is used for converting text codes corresponding to the text data into voice codes, and the preset codec script is used for converting the voice codes into corresponding voice data.

2. The method of claim 1, wherein the converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script comprises:

establishing full-duplex communication connection with a server, wherein the server comprises the preset acoustic service module;

calling a locally configured preset encoding and decoding script to convert the text data into corresponding text codes;

based on the full-duplex communication connection, sending the text code to the server so that the server converts the text code into a corresponding voice code through the preset acoustic service module;

and receiving the voice codes returned by the server, and converting the voice codes into corresponding voice data through the local preset encoding and decoding script.

3. The method of claim 1, wherein the converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script comprises:

calling a preset encoding and decoding script included in a local plug-in library, and converting the text data into corresponding text codes;

calling a preset acoustic service module included in the local plug-in library, and converting the text code into a corresponding voice code;

and converting the voice codes into corresponding voice data through the preset coding and decoding script.

4. The method of claim 3, wherein said converting the text encoding into corresponding speech encoding comprises:

matching a first audio file corresponding to the text code from a preset voice library;

dividing the first audio file into a plurality of audio frames according to a preset framing rule;

extracting acoustic characteristic information corresponding to each audio frame in parallel;

respectively matching the voice codes corresponding to the audio frames from the preset voice library according to the acoustic characteristic information corresponding to the audio frames;

and splicing the speech codes corresponding to each audio frame into the speech codes corresponding to the text data.

5. The method according to claim 4, wherein the converting the voice codes into corresponding voice data through the preset codec script comprises:

and calling a voice code conversion program of the preset voice library through the preset coding and decoding script, and converting the voice code into corresponding voice data through the voice code conversion program.

6. The method according to any one of claims 1-5, wherein before converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script, the method further comprises:

if it is identified through the preset codec script that the text data contains a preset forbidden word, displaying prompt information prompting the user to input the text data again; and/or,

and if the content of the preset file type is identified in the text data through the preset encoding and decoding script, deleting the content of the preset file type from the text data.

7. The method according to any one of claims 1-5, further comprising:

acquiring voice adjusting parameters set by a user from the data conversion interface, wherein the voice adjusting parameters at least comprise one or more of tone parameters, speech speed parameters, tone parameters and language type parameters;

and converting the voice codes corresponding to the text data into corresponding voice data through the preset coding and decoding script according to the voice adjusting parameters.

8. A speech synthesis method for text, applied to a server, comprising:

receiving a text code corresponding to text data to be converted and sent by a user terminal, wherein the text code is obtained by converting the text data through a local preset encoding and decoding script of the user terminal;

converting the text code into a corresponding voice code through a preset acoustic service module;

and sending the voice code to the user terminal so that the user terminal converts the voice code into corresponding voice data through the local preset coding and decoding script.

9. The method of claim 8, wherein the converting the text encoding into corresponding speech encoding by a preset acoustic service module comprises:

matching a first audio file corresponding to the text code from a preset voice library through a preset acoustic service module;

dividing the first audio file into a plurality of audio frames according to a preset framing rule;

extracting acoustic characteristic information corresponding to each audio frame in parallel;

respectively matching the voice codes corresponding to the audio frames from the preset voice library according to the acoustic characteristic information corresponding to the audio frames;

and splicing the speech codes corresponding to each audio frame into the speech codes corresponding to the text data.

10. The method according to claim 8 or 9, wherein before receiving the text code corresponding to the text data to be converted sent by the user terminal, the method further comprises:

receiving a connection request of a user terminal, establishing full-duplex communication connection with the user terminal, and performing data interaction with the user terminal based on the full-duplex communication connection.

11. A speech synthesis system for text, characterized in that the system comprises a user terminal and a server; the local plug-in library of the user terminal comprises a preset acoustic service module and a preset coding and decoding script, and/or the user terminal is locally configured with the preset coding and decoding script, and the server comprises the preset acoustic service module;

the user terminal is used for obtaining text data to be converted from a data conversion interface included in the client; converting the text data into corresponding text codes through the local preset encoding and decoding script; converting the text code into a corresponding voice code through the local preset acoustic service module or through the preset acoustic service module in the server; converting the voice codes into corresponding voice data through the local preset coding and decoding script;

the server is used for receiving the text code sent by the user terminal; converting the text code into a corresponding voice code through a preset acoustic service module in the server; and sending the voice code to the user terminal.

12. A speech synthesis apparatus for text, applied to a user terminal, comprising:

the acquisition module is used for acquiring text data to be converted from a data conversion interface included by the client;

the conversion module is used for converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script, wherein the preset acoustic service module is used for converting text codes corresponding to the text data into voice codes, and the preset codec script is used for converting the voice codes into corresponding voice data.

13. A speech synthesis apparatus for text, applied to a server, comprising:

the receiving module is used for receiving a text code corresponding to text data to be converted and sent by a user terminal, wherein the text code is obtained by converting the text data through a local preset encoding and decoding script of the user terminal;

the conversion module is used for converting the text code into a corresponding voice code through a preset acoustic service module;

and the sending module is used for sending the voice code to the user terminal so that the user terminal converts the voice code into corresponding voice data through the local preset coding and decoding script.

14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method of any one of claims 1-10.

15. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor to implement the method according to any of claims 1-10.

Technical Field

The present application belongs to the technical field of data processing, and in particular, to a method, a system, an apparatus, a device, and a storage medium for text speech synthesis.

Background

With the development of speech technology, automatic speech synthesis has been widely used in many areas of daily life. Speech synthesis technology can synthesize speech from text, which greatly facilitates people's lives, for example by converting written reading material into audio reading material.

In the related art, dedicated speech synthesis software is usually used to convert text into speech. A user needs to download and install this software, which increases the number of applications installed on the user terminal and occupies a large amount of its storage space and computing resources. Moreover, speech synthesis software provided by different manufacturers can often run only on specific hardware, making such products expensive and inconvenient to carry.

Disclosure of Invention

The application provides a text speech synthesis method, system, device, equipment, and storage medium. A data conversion interface can be set in any client, so any device capable of installing a client can use the speech synthesis service. No dedicated equipment needs to be purchased and no additional application needs to be downloaded and installed, so the number of applications installed on the user terminal does not increase. This saves the terminal's storage and computing resources and lowers the user's cost of using the speech synthesis service.

An embodiment of a first aspect of the present application provides a method for synthesizing text with speech, including:

obtaining text data to be converted from a data conversion interface included in a client;

converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script, wherein the preset acoustic service module is used for converting text codes corresponding to the text data into voice codes, and the preset codec script is used for converting the voice codes into corresponding voice data.

In some embodiments of the present application, the converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script includes:

establishing full-duplex communication connection with a server, wherein the server comprises the preset acoustic service module;

calling a locally configured preset encoding and decoding script to convert the text data into corresponding text codes;

based on the full-duplex communication connection, sending the text code to the server so that the server converts the text code into a corresponding voice code through the preset acoustic service module;

and receiving the voice codes returned by the server, and converting the voice codes into corresponding voice data through the local preset encoding and decoding script.

In some embodiments of the present application, the converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script includes:

calling a preset encoding and decoding script included in a local plug-in library, and converting the text data into corresponding text codes;

calling a preset acoustic service module included in the local plug-in library, and converting the text code into a corresponding voice code;

and converting the voice codes into corresponding voice data through the preset coding and decoding script.

In some embodiments of the present application, said converting said text encoding into a corresponding speech encoding comprises:

matching a first audio file corresponding to the text code from a preset voice library;

dividing the first audio file into a plurality of audio frames according to a preset framing rule;

extracting acoustic characteristic information corresponding to each audio frame in parallel;

respectively matching the voice codes corresponding to the audio frames from the preset voice library according to the acoustic characteristic information corresponding to the audio frames;

and splicing the speech codes corresponding to each audio frame into the speech codes corresponding to the text data.
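The pipeline above (split a matched audio file into frames, extract acoustic features in parallel, look up the voice code for each frame, splice the codes) can be sketched as follows. This is a minimal illustration with toy stand-ins: the framing rule, the feature function, and the `VOICE_LIBRARY` mapping are hypothetical placeholders, not the patent's actual preset voice library, and the initial step of matching the text code to a first audio file is assumed to have already happened.

```python
from concurrent.futures import ThreadPoolExecutor

FRAME_SIZE = 4  # preset framing rule: fixed-length frames (illustrative)

# Hypothetical preset voice library: maps acoustic features to voice codes.
VOICE_LIBRARY = {}


def split_frames(audio, frame_size=FRAME_SIZE):
    """Divide the audio file into frames according to the preset framing rule."""
    return [audio[i:i + frame_size] for i in range(0, len(audio), frame_size)]


def extract_features(frame):
    """Stand-in acoustic feature extraction (real systems would use e.g. MFCCs)."""
    return sum(frame) / len(frame)  # toy feature: mean amplitude


def frames_to_voice_code(audio):
    """Extract per-frame features in parallel, match voice codes, splice them."""
    frames = split_frames(audio)
    # ThreadPoolExecutor.map preserves frame order, so the splice stays aligned.
    with ThreadPoolExecutor() as pool:
        features = list(pool.map(extract_features, frames))
    codes = [VOICE_LIBRARY.get(f, b"?") for f in features]
    return b"".join(codes)
```

The parallel feature extraction mirrors the "in parallel" wording of the claim; everything else is a deliberately simplified scaffold.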

In some embodiments of the present application, the converting the voice codes into corresponding voice data through the preset codec script includes:

and calling a voice code conversion program of the preset voice library through the preset coding and decoding script, and converting the voice code into corresponding voice data through the voice code conversion program.

In some embodiments of the present application, before converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script, the method further includes:

if it is identified through the preset codec script that the text data contains a preset forbidden word, displaying prompt information prompting the user to input the text data again; and/or,

and if the content of the preset file type is identified in the text data through the preset encoding and decoding script, deleting the content of the preset file type from the text data.

In some embodiments of the present application, the method further comprises:

acquiring voice adjusting parameters set by a user from the data conversion interface, wherein the voice adjusting parameters at least comprise one or more of tone parameters, speech speed parameters, tone parameters and language type parameters;

and converting the voice codes corresponding to the text data into corresponding voice data through the preset coding and decoding script according to the voice adjusting parameters.

An embodiment of a second aspect of the present application provides a text speech synthesis method, applied to a server, including:

receiving a text code corresponding to text data to be converted and sent by a user terminal, wherein the text code is obtained by converting the text data through a local preset encoding and decoding script of the user terminal;

converting the text code into a corresponding voice code through a preset acoustic service module;

and sending the voice code to the user terminal so that the user terminal converts the voice code into corresponding voice data through the local preset coding and decoding script.

In some embodiments of the present application, the converting, by the preset acoustic service module, the text coding into a corresponding speech coding includes:

matching a first audio file corresponding to the text code from a preset voice library through a preset acoustic service module; dividing the first audio file into a plurality of audio frames according to a preset framing rule; extracting acoustic characteristic information corresponding to each audio frame in parallel; respectively matching the voice codes corresponding to the audio frames from the preset voice library according to the acoustic characteristic information corresponding to the audio frames; and splicing the speech codes corresponding to each audio frame into the speech codes corresponding to the text data.

In some embodiments of the present application, before receiving a text code corresponding to text data to be converted sent by a user terminal, the method further includes:

receiving a connection request of a user terminal, establishing full-duplex communication connection with the user terminal, and performing data interaction with the user terminal based on the full-duplex communication connection.

The embodiment of the third aspect of the present application provides a speech synthesis system for text, the system includes a user terminal and a server; the local plug-in library of the user terminal comprises a preset acoustic service module and a preset coding and decoding script, and/or the user terminal is locally configured with the preset coding and decoding script, and the server comprises the preset acoustic service module;

the user terminal is used for obtaining text data to be converted from a data conversion interface included in the client; converting the text data into corresponding text codes through the local preset encoding and decoding script; converting the text code into a corresponding voice code through the local preset acoustic service module or through the preset acoustic service module in the server; converting the voice codes into corresponding voice data through the local preset coding and decoding script;

the server is used for receiving the text code sent by the user terminal; converting the text code into a corresponding voice code through a preset acoustic service module in the server; and sending the voice code to the user terminal.

An embodiment of a fourth aspect of the present application provides a text speech synthesis apparatus, applied to a user terminal, including:

the acquisition module is used for acquiring text data to be converted from a data conversion interface included by the client;

the conversion module is used for converting the text data into corresponding voice data through a preset acoustic service module and a preset codec script, wherein the preset acoustic service module is used for converting text codes corresponding to the text data into voice codes, and the preset codec script is used for converting the voice codes into corresponding voice data.

An embodiment of a fifth aspect of the present application provides a text speech synthesis apparatus, applied to a server, including:

the receiving module is used for receiving a text code corresponding to text data to be converted and sent by a user terminal, wherein the text code is obtained by converting the text data through a local preset encoding and decoding script of the user terminal;

the conversion module is used for converting the text code into a corresponding voice code through a preset acoustic service module;

and the sending module is used for sending the voice code to the user terminal so that the user terminal converts the voice code into corresponding voice data through the local preset coding and decoding script.

An embodiment of a sixth aspect of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method of the first aspect or the second aspect.

An embodiment of a seventh aspect of the present application provides a computer-readable storage medium having a computer program stored thereon, the program being executed by a processor to implement the method of the first or second aspect.

The technical scheme provided in the embodiment of the application at least has the following technical effects or advantages:

in the embodiments of the application, the preset acoustic service module and the preset codec script are deployed locally on the user terminal, and/or the preset codec script is deployed locally on the user terminal while the preset acoustic service module is configured on the server. A data conversion interface is set in the client, and the speech synthesis service provided by the preset acoustic service module and the preset codec script is accessed through this interface. The interface can be set in any client, so any device capable of installing a client can use the speech synthesis service. No dedicated equipment needs to be purchased and no additional application needs to be downloaded and installed; the number of applications installed on the user terminal does not increase, the terminal's storage and computing resources are saved, and the user's cost of using the speech synthesis service is reduced.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings.

In the drawings:

FIG. 1 is a flow chart illustrating a method for speech synthesis of text according to an embodiment of the present application;

FIG. 2 illustrates a schematic diagram of a text input interface provided by an embodiment of the present application;

FIG. 3 is another flow chart of a method for speech synthesis of text provided by an embodiment of the present application;

FIG. 4 illustrates another schematic diagram of a text input interface provided by an embodiment of the present application;

fig. 5 is a signaling interaction diagram illustrating a speech synthesis method for text according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a text speech synthesis system according to an embodiment of the present application;

fig. 7 is a schematic structural diagram illustrating a text speech synthesis apparatus according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of another text speech synthesis apparatus provided in an embodiment of the present application;

fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

fig. 10 is a schematic diagram of a storage medium provided in an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.

A method, a system, an apparatus, a device, and a storage medium for speech synthesis of text according to embodiments of the present application are described below with reference to the accompanying drawings.

Currently, in the related art, dedicated speech synthesis software is usually used to convert text into speech. A user needs to download the software and install it on a user terminal such as a mobile phone or computer, which increases the number of applications installed on the terminal and occupies a large amount of its storage space and computing resources. Moreover, speech synthesis software provided by different manufacturers can often run only on specific hardware, making such products expensive and inconvenient to carry and limiting the application scenarios of speech synthesis.

Based on the above problems in the related art, embodiments of the present application provide a text speech synthesis method in which a preset acoustic service module and a preset codec script convert text into speech. The preset codec script may be a Node.js script. When configured locally on the user terminal, such as a user's mobile phone or computer, it performs encoding and decoding operations on data, for example converting text data into text codes or converting voice codes into voice data. The preset codec script may also be configured on the server, where it is used only to transmit encoded data, for example receiving text codes sent by the user terminal or sending voice codes back to it. The preset acoustic service module can be configured either on the server or on the user terminal.

Once the user terminal is provided with the preset acoustic service module and the preset codec script, and/or the user terminal is provided with the preset codec script for encoding and decoding operations while the server hosts the preset acoustic service module and a preset codec script for receiving or sending encoded data, the speech synthesis service they provide can be accessed through a data conversion interface set in any client. The data conversion interface can be set in any client, such as a browser, instant messaging software, game software, or multimedia playing software. The speech synthesis service can thus be accessed through a client already on the user terminal: no additional application needs to be installed, the number of installed applications does not increase, the terminal's storage space and computing resources are saved, and the cost of converting text into speech is reduced. The data conversion interface in the client can be used in any application scenario requiring speech synthesis, which is more convenient, quicker, and more efficient.

Referring to fig. 1, the method specifically includes the following steps:

step 101: the user terminal obtains text data to be converted from a data conversion interface included in the client.

The user terminal has at least one client installed, such as a browser, instant messaging software, or game software. The data conversion interface can be set in any client installed on the user terminal; it may be a link or a button that triggers access to the speech synthesis function. When the user clicks the data conversion interface in the client interface, the user terminal detects the click event and displays a text input interface, which may include a text input box and/or a file upload interface for submitting a text file, as shown in fig. 2. The user can edit the text data to be converted in the text input box, or upload a text file in a format such as word, txt, or pdf through the file upload interface.

If the user terminal detects that the user has entered information in the text input box of the text input interface, it acquires the text data to be converted from the text editing component.

If the user terminal detects an uploading request triggered by a file uploading interface in the text input interface, a local folder browsing component can be displayed, so that a user can select a text file to be uploaded by browsing a local folder directory, the user terminal obtains the text file selected by the user from the file uploading interface, and the text file is used as text data to be converted. Or, the user can directly drag the text file to be uploaded to the file uploading interface, and the user terminal obtains the text file dragged by the user from the file uploading interface and takes the text file as the text data to be converted.

After the text data to be converted is obtained in any of the above manners, the text data to be converted is converted into voice data by the operation of the following step 102.

Step 102: the user terminal converts the text data into corresponding voice data through a preset acoustic service module and a preset coding and decoding script, the preset acoustic service module is used for converting text codes corresponding to the text data into voice codes, and the preset coding and decoding script is used for converting the voice codes into corresponding voice data.

When the preset acoustic service module and the preset codec script are configured in the local plug-in library of the user terminal, the text data to be converted can be converted into voice data entirely through the local plug-in library. When the user terminal is configured with a preset codec script for encoding and decoding operations, the server is configured with the preset acoustic service module and a preset codec script for receiving or sending encoded data, and the user terminal can connect to a network, the text data can be converted into voice data by the user terminal and server working together. The specific processes of the two schemes are described in detail below.
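As a rough sketch of the two deployment schemes just described, the following Python stub shows the dispatch between a fully local plug-in library and a server-hosted acoustic service module. All class and function names are illustrative stand-ins, not part of the patent, and the byte-string "codes" are toy placeholders for real encoded data.

```python
class LocalPlugin:
    """Stand-in for the local plug-in library: codec script + acoustic module."""

    def encode_text(self, text):
        # Codec script: text data -> text code
        return text.encode("utf-8")

    def text_to_voice_code(self, text_code):
        # Acoustic service module: text code -> voice code
        return b"VC:" + text_code

    def decode_voice(self, voice_code):
        # Codec script: voice code -> voice data
        return voice_code + b":PCM"


class RemoteAcousticService:
    """Stand-in for the server-hosted acoustic service module,
    reached over a full-duplex connection in the real scheme."""

    def text_to_voice_code(self, text_code):
        return b"VC:" + text_code


def encode_text_locally(text):
    """Local codec script in the client/server scheme."""
    return text.encode("utf-8")


def decode_voice_locally(voice_code):
    return voice_code + b":PCM"


def synthesize(text, local_plugin=None, server=None):
    """Dispatch between the two schemes described in the text."""
    if local_plugin is not None:
        # Scheme 1: everything runs in the local plug-in library.
        text_code = local_plugin.encode_text(text)
        voice_code = local_plugin.text_to_voice_code(text_code)
        return local_plugin.decode_voice(voice_code)
    if server is not None:
        # Scheme 2: codec script runs locally, acoustic module on the server.
        text_code = encode_text_locally(text)
        voice_code = server.text_to_voice_code(text_code)
        return decode_voice_locally(voice_code)
    raise RuntimeError("no acoustic service module available")
```

Either path yields the same voice data, which matches the system claim's "and/or" deployment wording.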

In the application scenario of performing speech synthesis through the local plug-in library, after the user terminal receives the text data submitted by the user, it calls the preset codec script in the local plug-in library, performs semantic recognition on the word segments included in the text data through the script, and judges whether the text data contains preset forbidden words. The preset forbidden words may include words that violate laws or regulations, and a forbidden-word lexicon containing a number of preset forbidden words is provided. The preset codec script performs word segmentation on the text data and queries whether the forbidden-word lexicon contains each resulting word segment. If the lexicon contains any word segment of the text data, the text data is determined to contain a preset forbidden word; if the lexicon contains none of the word segments, the text data is determined not to contain any preset forbidden word.

If the text data contains the preset forbidden words, prompt information for prompting to input the text data again is displayed. After seeing the prompt information, the user can re-input the text data to be converted in the text input interface.

For text data uploaded through the file uploading interface in the text input interface, the text data may include pictures, CSS style files, and other content that is not easily converted into voice. Therefore, the embodiment of the application also pre-configures preset file types in the preset encoding and decoding script, and the preset file types may include one or more file types such as jpg, png, gif, jpeg, and css. All content included in the text data to be converted is traversed through the preset encoding and decoding script to determine whether the text data contains content of the preset file types, and if so, the content of the preset file types is deleted from the text data. In this way, content that is inconvenient to convert into voice is deleted, the computing resources it would occupy are saved, conversion errors caused by such content are reduced, and the efficiency and accuracy of text speech synthesis are improved.
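The deletion of preset-file-type content can be sketched as follows. The regular expression and the exact matching rule are assumptions for illustration; the embodiment only specifies that content of the preset file types is identified and deleted.

```python
import re

# Preset file types that are not easily converted into voice.
PRESET_FILE_TYPES = ("jpg", "png", "gif", "jpeg", "css")

# Matches any non-whitespace run ending in one of the preset
# extensions, e.g. "logo.png" or "style.css" (assumed rule).
_FILE_PATTERN = re.compile(
    r"\S+\.(?:%s)\b" % "|".join(PRESET_FILE_TYPES), re.IGNORECASE
)

def strip_preset_file_content(text):
    # Delete content of the preset file types from the text data and
    # collapse the whitespace left behind.
    cleaned = _FILE_PATTERN.sub("", text)
    return " ".join(cleaned.split())
```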

The user terminal obtains the text data to be converted through step 101, determines through the preset encoding and decoding script that the text data does not include preset forbidden words, and deletes content of the preset file types by the above method if the text data is determined to include such content. Thereafter, as shown in FIG. 3, speech synthesis is performed through the following steps 1021 to 1023.

Step 1021: and calling a preset encoding and decoding script included in the local plug-in library by the user terminal, and converting the text data into a corresponding text code.

The user terminal converts the text data to be converted into a corresponding digital signal through the preset encoding and decoding script, converting each word in the text data into a corresponding binary code to obtain the text code corresponding to the text data.
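One plausible realization of this text-to-binary conversion is sketched below; the exact code format used by the preset encoding and decoding script is not specified in the embodiment, so UTF-8 bytes rendered as 8-bit binary strings are an assumption.

```python
def text_to_text_code(text):
    # Convert each character of the text data into its binary code:
    # UTF-8 bytes rendered as 8-bit binary strings, joined with
    # spaces, yielding a text code for the whole text.
    return " ".join(format(b, "08b") for b in text.encode("utf-8"))
```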

Step 1022: and the user terminal calls a preset acoustic service module included in the local plug-in library to convert the text code into a corresponding voice code.

The preset acoustic service module is pre-configured with a preset voice library, and the preset voice library stores mapping relations between the text codes corresponding to different words and their audio. For the text code corresponding to each word in the text data, the user terminal matches the audio corresponding to that text code from the preset voice library through the preset acoustic service module, and combines the audio corresponding to all the words included in the text data, in the order in which the words appear in the text data, into a first audio file corresponding to the text code of the text data to be converted.
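The per-word lookup and in-order concatenation can be sketched as follows. The library keys and the byte-string audio representation are illustrative assumptions, not the actual storage format of the preset voice library.

```python
# Illustrative preset voice library: mapping from the text code of a
# word to that word's audio, represented here as byte strings.
PRESET_VOICE_LIBRARY = {
    "code_hello": b"\x01\x02",
    "code_world": b"\x03\x04",
}

def synthesize_first_audio_file(word_text_codes):
    # Match the audio for each word's text code and join the clips in
    # the words' original order to form the first audio file.
    return b"".join(PRESET_VOICE_LIBRARY[c] for c in word_text_codes)
```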

After the first audio file corresponding to the text data is obtained in the above manner, since the preset encoding and decoding script cannot directly identify the audio, the first audio file needs to be converted into a corresponding speech code. Specifically, a preset framing rule is pre-configured in the preset acoustic service module, and the first audio file is divided into a plurality of audio frames according to the preset framing rule. The preset framing rule may specify a preset unit duration for dividing audio frames; that is, according to the duration of the first audio file, one audio frame is divided off every preset unit duration. The preset unit duration may be 5s, 10s, or the like. The embodiment of the application does not limit the specific value of the preset unit duration, which can be set according to requirements in practical applications.

After the first audio file is divided into one or more audio frames in the above manner, division recording information generated in the division process is also recorded, and the division recording information may include the start time and the end time of each audio frame.
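The framing step and the division record it produces can be sketched as follows; representing the division record as a list of (start, end) tuples is an assumption made for illustration.

```python
def divide_into_frames(total_duration, unit_duration=5.0):
    # Divide an audio file of the given duration (in seconds) into
    # audio frames of one preset unit duration each; the final frame
    # may be shorter. The returned list is the division record: the
    # start time and end time of every audio frame.
    record = []
    start = 0.0
    while start < total_duration:
        end = min(start + unit_duration, total_duration)
        record.append((start, end))
        start = end
    return record
```

This division record is later used to splice the per-frame speech codes back together in order.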

After a plurality of audio frames are divided, acoustic characteristic information corresponding to each audio frame is extracted in parallel through a preset acoustic service module. The preset acoustic service module may simultaneously process all audio frames in parallel, or may simultaneously process a preset number of audio frames in parallel, where the preset number may be 4 or 5. The preset acoustic service module processes a plurality of audio frames in parallel through a plurality of processes, and the number of the processes is equal to the number of the audio frames processed in parallel.

Specifically, the preset acoustic service module preloads the digital signal of the audio frame to be processed through a process. After preloading is finished, windowing processing is performed on the digital signal of the audio frame to reduce spectral energy leakage. A Fast Fourier Transform (FFT) is then applied to the windowed digital signal, the transformed signal is filtered, and finally the acoustic feature information of the audio frame is extracted from the processed digital signal through a preset feature extraction algorithm. The preset feature extraction algorithm may be Mel-frequency cepstral coefficients (MFCC), a linear predictive analysis algorithm, a learned or dimensionality-reduction feature extraction algorithm (such as principal component analysis), and the like. The acoustic feature information extracted through Mel-frequency cepstral coefficients comprises the spectral features of an audio frame, and is a frequency-domain voice feature parameter that is based on the auditory characteristics of the human ear and has good robustness.
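The windowing-plus-transform stage can be sketched as below. This is a minimal pure-Python illustration under stated simplifications: a Hamming window followed by a naive DFT magnitude spectrum stands in for a real FFT, and the mel filtering and cepstral steps of full MFCC extraction are omitted.

```python
import cmath
import math

def hamming_window(n):
    # Standard Hamming window coefficients for an n-sample frame,
    # used to reduce spectral energy leakage before the transform.
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))
            for i in range(n)]

def frame_features(samples):
    # Apply the window, then compute a naive DFT magnitude spectrum
    # over the non-redundant bins (0..n/2). A real implementation
    # would use an FFT followed by mel filtering and a DCT.
    n = len(samples)
    windowed = [s * w for s, w in zip(samples, hamming_window(n))]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]
```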

For each audio frame needing parallel processing, the acoustic feature information of each audio frame is extracted in parallel through a plurality of processes according to the method, the duration of extracting the acoustic features of the whole first audio file is greatly shortened through parallel extraction of the features, and the conversion efficiency is improved.

The preset voice library in the preset acoustic service module also stores the mapping relation between the voice codes corresponding to different voices and the acoustic characteristic information. And after obtaining the acoustic characteristic information corresponding to each audio frame, respectively matching the voice codes corresponding to each audio frame from a preset voice library according to the acoustic characteristic information corresponding to each audio frame. And then according to the starting time and the ending time of each audio frame included in the divided recording information, splicing the speech codes corresponding to each audio frame into the speech codes corresponding to the text data to be converted.

The speech code includes a combination of text frames and audio data frames corresponding to the first audio file. The text frames comprise a start parameter frame at the beginning and an end frame at the end, and the text frames are in json format. The audio data frames are the frames between the start parameter frame and the end frame, and they are binary frames.
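The frame layout described above can be sketched as follows; the json field names are assumptions, since the embodiment specifies only that the start parameter frame and end frame are json while the audio data frames are binary.

```python
import json

def build_speech_code_frames(audio_frames):
    # Assemble the speech code as a sequence of frames: a json-format
    # start parameter frame, the binary audio data frames in order,
    # and a json-format end frame. Field names are illustrative.
    start_frame = json.dumps({"type": "start", "frames": len(audio_frames)})
    end_frame = json.dumps({"type": "end"})
    return [start_frame.encode()] + list(audio_frames) + [end_frame.encode()]
```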

Step 1023: and the user terminal converts the voice codes into corresponding voice data through a preset coding and decoding script.

And the preset acoustic service module transmits the obtained voice codes to a preset coding and decoding script included in the local plug-in library. The preset encoding and decoding script is used for decoding the received voice coding so as to obtain corresponding voice data.

The preset acoustic service module provides a calling interface used by the preset encoding and decoding script. The preset encoding and decoding script calls a voice code conversion program of the preset voice library through the calling interface, and the voice code is converted into corresponding voice data through the voice code conversion program. The voice code conversion program is preset program code for decoding the voice code into the corresponding voice.

In the embodiment of the present application, when a user synthesizes voice data corresponding to text data through the voice synthesis function, the user may have personalized requirements on the intonation, speech rate, timbre, language, and the like of the synthesized voice data. For example, the user may want a high pitch or a fast speed, want to convert to male, female, child, or cartoon voices, or want the voice in Chinese, English, German, a dialect, and so on. For these personalized requirements, the embodiment of the present application may obtain the voice adjustment parameters set by the user from the data conversion interface in the client, where the voice adjustment parameters include at least one or more of an intonation parameter, a speech rate parameter, a timbre parameter, and a language type parameter.

When the preset encoding and decoding script converts the voice code into the voice data corresponding to the text data to be converted, the corresponding voice data is generated according to the voice adjustment parameters set by the user. The intonation, speech rate, timbre, language type, and the like of the generated voice data may be adjusted according to the specific values of the voice adjustment parameters using related techniques; the specific adjustment process is not described herein again.
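Handling of the voice adjustment parameters obtained from the data conversion interface can be sketched as follows. The parameter names, defaults, and value conventions are assumptions for illustration; the actual adjustment of the generated voice data is performed with related audio-processing techniques.

```python
# Assumed parameter names and defaults for the voice adjustment
# parameters; the embodiment does not fix a concrete schema.
DEFAULT_PARAMS = {
    "intonation": 1.0,   # relative pitch
    "speech_rate": 1.0,  # relative speed
    "timbre": "female",  # male / female / child / cartoon ...
    "language": "zh",    # zh / en / de / dialect codes ...
}

def merge_voice_params(user_params):
    # Fill unspecified parameters with defaults and ignore unknown
    # keys, yielding the parameter set used during synthesis.
    params = dict(DEFAULT_PARAMS)
    params.update({k: v for k, v in user_params.items()
                   if k in DEFAULT_PARAMS})
    return params
```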

As an example, after the user clicks the data conversion interface in the client, a text input interface shown in fig. 4 is displayed. The text input interface includes a plurality of options for the voice adjustment parameters, such as the options of male voice, female voice, English, and Chinese in fig. 4, and adjustment bars for intonation and speech rate. The user can set the voice adjustment parameters through these options according to the user's own requirements. Fig. 4 is only an example; other combinations of options for setting the voice adjustment parameters are possible in practical applications.

The voice data synthesized based on the text data is adjusted according to the voice adjusting parameters, so that the individual requirements of a user on multiple aspects such as tone, speed, tone, language type and the like can be met, the interestingness of voice synthesis based on the text is increased, and the use experience of the user is improved.

And finally generating voice data corresponding to the text data to be converted by the preset encoding and decoding script according to the mode, and transmitting the voice data to the client by the preset encoding and decoding script. The user terminal can obtain a DOM (Document Object Model) structure of a current interface of the client through the script engine, display a playing plug-in at a preset position of the current interface according to the obtained DOM structure, and play the voice data through the playing plug-in. The preset position may be any preset position, such as left, right, upper, lower, etc. positions of the current interface.

The converted voice data is automatically played at the preset position of the current interface through the playing plug-in, and the input text can be played in real time, so that a user can conveniently hear the synthesized voice data, and the real-time performance of voice synthesis is improved.

As another implementation manner, the user terminal may further store the converted voice data as a document in a preset audio format, and store the document in an application such as an album or a folder of the user terminal. The preset audio format may be mp3, mp4, wma, or the like. The converted voice data is stored as a document in a preset audio format, so that the voice data can be played and listened to by a user at any time.
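Saving the converted voice data as a document in a preset audio format can be sketched as follows; writing raw bytes with the chosen suffix stands in for real container encoding, and the function name is illustrative.

```python
import os

def save_voice_document(voice_data, directory, name, audio_format="mp3"):
    # Store the converted voice data as a document in a preset audio
    # format (mp3 by default) so the user can play it at any time.
    path = os.path.join(directory, "%s.%s" % (name, audio_format))
    with open(path, "wb") as f:
        f.write(voice_data)
    return path
```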

The user terminal can be connected with the server, timely updating of the preset acoustic service module and the preset encoding and decoding script in the local plug-in library is guaranteed, and optimal voice synthesis service is provided for the user through the preset acoustic service module and the preset encoding and decoding script of the latest version.

Several application scenarios of the present application are briefly described below, but the present application is not limited to the following application scenarios in practical applications, and the embodiments of the present application can be applied to any application scenarios requiring speech synthesis.

In a first scenario, text files such as courseware and teaching materials of a teacher can be converted into voice data through the method provided by the embodiment of the application, so that students can conveniently play the voice data at any time and any place for learning.

In a second scenario, the visually impaired people cannot conveniently and directly read the text data, and the text data can be converted into voice data to play the voice data for the visually impaired people to listen to.

And a third scenario, for various books, the books can be converted into audio books by the method provided by the embodiment of the application.

In the embodiment of the application, voice synthesis is performed by calling the preset acoustic service module and the preset encoding and decoding script in the local plug-in library. No network transmission of data is needed, which saves bandwidth, avoids the time occupied by network transmission, shortens the response time of voice synthesis, and improves the response speed. Moreover, as long as the preset acoustic service module and the preset encoding and decoding script are configured locally at the user terminal, a data conversion interface for accessing the voice synthesis service can be set in any client, so any device capable of installing the client can use the voice synthesis service. No specific device needs to be purchased and no extra application needs to be downloaded and installed, so the number of applications installed on the user terminal is not increased, the storage resources and computing resources of the user terminal are saved, and the user's cost of using the voice synthesis service is reduced.

The following describes a process of providing a speech synthesis service through a preset codec script configured in a user terminal for performing a codec operation and a preset acoustic service module configured in a server. The same parts as those in the implementation process by the local plug-in library in the above embodiment are not repeated in this embodiment, and only the differences between them will be described.

The server is also provided with a preset encoding and decoding script, and this preset encoding and decoding script is only used for receiving or transmitting encoded data. The preset acoustic service module and the preset encoding and decoding script can be deployed on the same server or on different servers. The preset encoding and decoding script may be a node.js script.

As shown in fig. 5, after obtaining the text data to be converted through step 101, the user terminal specifically implements a speech synthesis function through the following steps, including:

step 103: the user terminal establishes a full duplex communication connection with the server.

The data conversion interface included in the client is associated with address information of the server, and the server is provided with a preset acoustic service module and a preset coding and decoding script for transmitting and receiving coded data. The address information of the server may include a domain name or an IP address of the server, etc.

The user terminal obtains the address information of the server from a data conversion interface included in the client. And establishing full-duplex communication connection with the server according to the address information. The communication protocol adopted by the full-duplex communication connection can be a websocket protocol.

If the address information includes an IP address of the server, a connection request may be sent to the server according to the IP address. The server responds to the connection request and establishes a full-duplex communication connection between the user terminal and the server.

If the address information does not include the IP address of the server, it includes the domain name of the server. The user terminal sends a domain name resolution request to the domain name server, wherein the domain name resolution request comprises the domain name of the server. The domain name server resolves the domain name of the server to obtain the IP address of the server, and feeds the IP address back to the user terminal. The user terminal obtains the IP address and establishes full duplex communication connection with the server according to the IP address in the mode.
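The address-resolution branch described in steps 103's two cases can be sketched as follows; `socket.getaddrinfo` stands in for the domain name server exchange, and the dictionary keys are assumptions. The full-duplex connection would then be established over the resolved IP address, for example with the websocket protocol.

```python
import socket

def resolve_server_ip(address_info):
    # If the address information already contains the server's IP
    # address, use it directly; otherwise resolve the server's domain
    # name to an IP address (standing in for the domain name server
    # request and response described above).
    if "ip" in address_info:
        return address_info["ip"]
    infos = socket.getaddrinfo(address_info["domain"], None)
    return infos[0][4][0]
```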

Step 104: and the user terminal calls a locally configured preset encoding and decoding script to convert the text data into corresponding text codes.

The user terminal performs semantic recognition on the word segments included in the text data through the local preset encoding and decoding script and judges whether the text data contains preset forbidden words. The manner of determining whether preset forbidden words are included, and the handling after determining that they are included, are the same as the corresponding operations in step 102 in the above embodiments, and are not described herein again.

For the content of the preset file type, such as the picture or the CSS style file, which may be included in the text data, the user terminal deletes the content of the preset file type included in the text data through the local preset encoding and decoding script, and a specific operation process is the same as the operation of deleting the content of the preset file type in step 102 in the foregoing embodiment, and is not described herein again.

Step 105: and the user terminal sends the text code to the server based on the full-duplex communication connection with the server.

The user terminal sends the text code corresponding to the text data to be converted to the server based on the full duplex communication connection with the server, so that the server converts the text code to the corresponding speech code by the operation of the following step 106.

Step 106: the server receives the text code sent by the user terminal, and converts the text code into a corresponding voice code through a preset acoustic service module.

The server is configured with a preset encoding and decoding script for receiving or transmitting encoded data, and this preset encoding and decoding script may be a node.js script. Based on the full-duplex communication connection with the user terminal, the server receives the text code sent by the user terminal through its configured preset encoding and decoding script.

The server then converts the text code into the corresponding voice code through the preset acoustic service module. The specific conversion process is the same as the conversion process executed by the preset acoustic service module local to the user terminal in step 1022 in the foregoing embodiment, and is not described herein again.

Step 107: and the server sends the obtained speech codes to the user terminal.

The server converts the text code into a corresponding speech code and sends the speech code to the user terminal based on the full duplex communication connection with the user terminal.

Step 108: and the user terminal receives the voice code sent by the server and converts the voice code into corresponding voice data through a local preset coding and decoding script.

The operation of converting the voice code into corresponding voice data and the subsequent operation of playing or storing the voice data by the user terminal through the local preset encoding and decoding script are the same as the operation of step 1023 in the above embodiment, and are not described herein again.

In the embodiment of the application, the user terminal sends the text code corresponding to the text data to be converted to the server, and the server sends the converted voice code to the user terminal. Before such data transmission, the data to be transmitted can be protected through a preset algorithm and the resulting signature or ciphertext transmitted along with it, which improves data security during transmission and protects the privacy of the user. The preset algorithm may include a hash algorithm, MD5 (Message-Digest Algorithm 5), and the like.

For example, before the user terminal sends the text code corresponding to the text data to be converted to the server, the signature corresponding to the text code is calculated through the MD5 algorithm, the signature is inserted into the request header of the http request, and the http request is then sent to the server. After receiving the request, the server obtains the signature from the request header, recomputes the MD5 digest of the received text code, and compares the two to verify that the text code has not been tampered with.
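The signing step can be sketched as follows; the header name `X-Signature` is a hypothetical choice, as the embodiment does not specify which request-header field carries the signature.

```python
import hashlib

def sign_text_code(text_code):
    # Compute the MD5 signature (hex digest) of the text code; the
    # server recomputes this digest over the received text code and
    # compares it with the transmitted signature.
    return hashlib.md5(text_code.encode("utf-8")).hexdigest()

def build_request_headers(text_code):
    # Insert the signature into the request header before sending the
    # http request. The header name is illustrative.
    return {"X-Signature": sign_text_code(text_code)}
```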

In the embodiment of the application, the user terminal converts the text data to be converted into the text code through the preset encoding and decoding script and sends the text code to the server. And the server converts the text code into a voice code through a preset acoustic service module and returns the voice code to the user terminal. And the user terminal converts the voice codes into final voice data through a preset coding and decoding script. The text voice synthesis service can be accessed through a data conversion interface arranged in any client, additional application programs do not need to be downloaded and installed, the number of the application programs installed on the user terminal cannot be increased, storage resources and calculation resources of the user terminal are saved, and the cost of the user for using the voice synthesis service is reduced.

The embodiment of the present application further provides a text speech synthesis system, which is configured to execute the text speech synthesis method provided in any of the above embodiments. As shown in fig. 6, the system includes a user terminal and a server; the local plug-in library of the user terminal comprises a preset acoustic service module and a preset coding and decoding script, and/or the user terminal is locally configured with the preset coding and decoding script and the server comprises the preset acoustic service module;

the user terminal is used for obtaining text data to be converted from a data conversion interface included by the client; converting the text data into corresponding text codes through a local preset encoding and decoding script; converting the text code into a corresponding voice code through a local preset acoustic service module or a preset acoustic service module in the server; converting voice codes into corresponding voice data through a local preset coding and decoding script;

the server is used for receiving the text code sent by the user terminal; converting the text code into a corresponding voice code through a preset acoustic service module; and sending the speech code to the user terminal.

The text speech synthesis system provided by the above embodiment of the present application and the text speech synthesis method provided by the embodiment of the present application have the same inventive concept and have the same beneficial effects as the method adopted, run, or implemented by the application program stored in the system.

The embodiment of the present application further provides a text speech synthesis apparatus, where the apparatus is configured to execute operations executed by a user terminal in the text speech synthesis method provided in any of the above embodiments. Referring to fig. 7, the apparatus includes:

an obtaining module 201, configured to obtain text data to be converted from a data conversion interface included in a client;

the conversion module 202 is configured to convert the text data into corresponding voice data through a preset acoustic service module and a preset encoding and decoding script, where the preset acoustic service module is configured to convert text codes corresponding to the text data into voice codes, and the preset encoding and decoding script is configured to convert the voice codes into corresponding voice data.

The conversion module 202 is used for establishing full-duplex communication connection with a server, and the server comprises a preset acoustic service module; calling a locally configured preset encoding and decoding script to convert the text data into corresponding text codes; based on full-duplex communication connection, sending text codes to a server so that the server converts the text codes into corresponding voice codes through a preset acoustic service module; and receiving the voice codes returned by the server, and converting the voice codes into corresponding voice data through a local preset coding and decoding script.

The conversion module 202 is used for calling a preset encoding and decoding script included in the local plug-in library and converting text data into corresponding text codes; calling a preset acoustic service module included in a local plug-in library, and converting text codes into corresponding voice codes; and converting the voice codes into corresponding voice data through a preset coding and decoding script.

The conversion module 202 is configured to match a first audio file corresponding to the text code from a preset voice library; dividing the first audio file into a plurality of audio frames according to a preset framing rule; extracting acoustic characteristic information corresponding to each audio frame in parallel; respectively matching the voice codes corresponding to the audio frames from a preset voice library according to the acoustic characteristic information corresponding to the audio frames; and splicing the speech codes corresponding to each audio frame into the speech codes corresponding to the text data.

The conversion module 202 is configured to call a voice code conversion program of a preset voice library through a preset encoding and decoding script, and convert a voice code into corresponding voice data through the voice code conversion program.

The device also includes: the prompting module is used for displaying prompt information for prompting to input the text data again if the text data is identified to contain preset forbidden words through the preset encoding and decoding script; and/or,

and the deleting module is used for deleting the content of the preset file type from the text data if the content of the preset file type is identified in the text data through the preset coding and decoding script.

The device also includes: the adjusting module is used for acquiring voice adjusting parameters set by a user from the data conversion interface, and the voice adjusting parameters at least comprise one or more of tone parameters, speech speed parameters, tone parameters and language type parameters; and converting the voice codes corresponding to the text data into corresponding voice data through a preset coding and decoding script according to the voice adjusting parameters.

The text speech synthesis device provided by the above embodiment of the present application and the text speech synthesis method provided by the embodiment of the present application have the same inventive concept and have the same beneficial effects as the method adopted, run or implemented by the application program stored in the text speech synthesis device.

The embodiment of the present application further provides a text speech synthesis apparatus, where the apparatus is configured to execute operations executed by a server in the text speech synthesis method provided in any of the above embodiments. Referring to fig. 8, the apparatus includes:

the receiving module 301 is configured to receive a text code corresponding to text data to be converted, where the text code is obtained by converting the text data through a local preset encoding and decoding script of the user terminal;

a conversion module 302, configured to convert the text code into a corresponding speech code through a preset acoustic service module;

the sending module 303 is configured to send the speech code to the user terminal, so that the user terminal converts the speech code into corresponding speech data through a local preset codec script.

The conversion module 302 is configured to match a first audio file corresponding to the text code from a preset speech library through a preset acoustic service module, and divide the first audio file into a plurality of audio frames according to a preset frame division rule; extracting acoustic characteristic information corresponding to each audio frame in parallel; respectively matching the voice codes corresponding to the audio frames from a preset voice library according to the acoustic characteristic information corresponding to the audio frames; and splicing the speech codes corresponding to each audio frame into the speech codes corresponding to the text data.

The device also includes: and the communication connection establishing module is used for receiving a connection request of the user terminal, establishing full-duplex communication connection with the user terminal and performing data interaction with the user terminal based on the full-duplex communication connection.

The text speech synthesis device provided by the above embodiment of the present application and the text speech synthesis method provided by the embodiment of the present application have the same inventive concept and have the same beneficial effects as the method adopted, run or implemented by the application program stored in the text speech synthesis device.

The embodiment of the application also provides an electronic device for executing the text speech synthesis method. Please refer to fig. 9, which illustrates a schematic diagram of an electronic device according to some embodiments of the present application. As shown in fig. 9, the electronic device 9 includes: a processor 900, a memory 901, a bus 902, and a communication interface 903, where the processor 900, the communication interface 903, and the memory 901 are connected through the bus 902; the memory 901 stores a computer program that can be executed on the processor 900, and when the processor 900 runs the computer program, it performs the text speech synthesis method provided in any of the foregoing embodiments of the present application.

The memory 901 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the apparatus and at least one other network element is realized through at least one communication interface 903 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, and the like can be used.

Bus 902 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 901 is used for storing a program, and the processor 900 executes the program after receiving an execution instruction, and the method for synthesizing a text speech disclosed in any of the foregoing embodiments of the present application may be applied to the processor 900, or implemented by the processor 900.

The processor 900 may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware or by instructions in the form of software in the processor 900. The processor 900 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may thus be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 901; the processor 900 reads the information in the memory 901 and completes the steps of the above method in combination with its hardware.

The electronic device provided by the embodiment of the present application shares the same inventive concept as the text speech synthesis method provided by the embodiments of the present application, and provides the same beneficial effects as the method adopted, run, or implemented by it.

Referring to fig. 10, an embodiment of the present application further provides a computer-readable storage medium, shown as an optical disc 30, on which a computer program (i.e., a program product) is stored; when run by a processor, the computer program performs the text speech synthesis method provided in any of the foregoing embodiments.

It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.

The computer-readable storage medium provided by the above embodiment of the present application shares the same inventive concept as the text speech synthesis method provided by the embodiments of the present application, and provides the same beneficial effects as the method adopted, run, or implemented by the application program it stores.

It should be noted that:

in the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.

Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments but not others, combinations of features from different embodiments are meant to be within the scope of the application and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
