Audio information synthesis method and device, computer readable medium and electronic equipment

Document No.: 617689 | Publication date: 2021-05-07

Note: This technology, "Audio information synthesis method and device, computer readable medium and electronic equipment", was designed and created by 林诗伦 (Lin Shilun) on 2020-05-13. Its main content is as follows: The application belongs to the technical field of artificial intelligence and relates to machine learning technology. In particular, it relates to an audio information synthesis method, an audio information synthesis apparatus, a computer-readable medium, and an electronic device. The method comprises: acquiring mixed-language text information comprising at least two language types; performing text coding processing on the mixed-language text information based on the at least two language types to obtain intermediate semantic coding features of the mixed-language text information; acquiring target timbre features corresponding to a target timbre subject, and decoding the intermediate semantic coding features based on the target timbre features to obtain acoustic features; and performing acoustic coding processing on the acoustic features to obtain audio information corresponding to the mixed-language text information. The method solves the timbre-jump problem caused by language differences in existing mixed-language audio synthesis technology, and can stably output natural, fluent mixed-language audio with a uniform timbre.

1. A method for synthesizing audio information, comprising:

acquiring mixed language text information comprising at least two language types;

performing text coding processing on the mixed language text information based on the at least two language types to obtain an intermediate semantic coding feature of the mixed language text information;

acquiring target timbre features corresponding to a target timbre subject, and decoding the intermediate semantic coding features based on the target timbre features to obtain acoustic features;

and performing acoustic coding processing on the acoustic features to obtain audio information corresponding to the mixed language text information.

2. The method according to claim 1, wherein the performing text coding processing on the mixed language text information based on the at least two language types to obtain an intermediate semantic coding feature of the mixed language text information comprises:

performing text coding processing on the mixed language text information respectively through a monolingual text encoder corresponding to each language type to obtain at least two monolingual coding features of the mixed language text information;

performing fusion processing on the at least two monolingual coding features to obtain a mixed language coding feature of the mixed language text information;

and determining the intermediate semantic coding feature of the mixed language text information according to the mixed language coding feature.

3. The method according to claim 2, wherein the performing text coding processing on the mixed language text information respectively through a monolingual text encoder corresponding to each language type to obtain at least two monolingual coding features of the mixed language text information comprises:

performing mapping transformation processing on the mixed language text information respectively through character embedding matrices corresponding to the respective language types to obtain at least two embedded character features of the mixed language text information;

and performing text coding processing on the embedded character features respectively through the monolingual text encoder corresponding to each language type to obtain at least two monolingual coding features of the mixed language text information.

4. The method according to claim 3, wherein the performing text coding processing on the embedded character features respectively through the monolingual text encoder corresponding to each language type to obtain at least two monolingual coding features of the mixed language text information comprises:

performing residual coding on the embedded character features respectively through the monolingual text encoder corresponding to each language type to obtain at least two residual coding features of the mixed language text information;

and fusing the embedded character features with the residual coding features respectively to obtain the at least two monolingual coding features of the mixed language text information.

5. The method according to claim 3, wherein the monolingual coding feature is a residual coding feature obtained by performing residual coding on the embedded character feature; and the performing fusion processing on the at least two monolingual coding features to obtain a mixed language coding feature of the mixed language text information comprises:

fusing the at least two monolingual coding features and the embedded character features to obtain the mixed language coding feature of the mixed language text information.

6. The method according to claim 2, wherein the determining the intermediate semantic coding feature of the mixed language text information according to the mixed language coding feature comprises:

performing mapping transformation processing on the mixed language text information through a language embedding matrix based on the at least two language types to obtain embedded language features of the mixed language text information;

and carrying out fusion processing on the mixed language coding features and the embedded language features to obtain intermediate semantic coding features of the mixed language text information.

7. The method according to claim 1, wherein the performing text coding processing on the mixed language text information based on the at least two language types to obtain an intermediate semantic coding feature of the mixed language text information comprises:

performing text coding processing on each text character in the mixed language text information based on the at least two language types to obtain character coding features corresponding to each text character;

acquiring attention distribution weights corresponding to the text characters;

and performing weighted mapping on the character coding features of the text characters according to the attention distribution weights to obtain the intermediate semantic coding feature of the mixed language text information.

8. The audio information synthesizing method according to claim 7, wherein the acquiring attention distribution weights corresponding to the respective text characters comprises:

acquiring sequence position information of each text character in the mixed language text information;

and determining a position attention distribution weight corresponding to each text character according to the sequence position information.

9. The audio information synthesizing method according to claim 8, wherein the acquiring attention distribution weights corresponding to the respective text characters further comprises:

obtaining language type information of each text character;

determining language attention distribution weights corresponding to the text characters according to the language type information;

and determining multiple attention distribution weights corresponding to the text characters according to the position attention distribution weight and the language attention distribution weight.

10. The audio information synthesizing method according to claim 9, wherein the determining multiple attention distribution weights corresponding to the respective text characters according to the position attention distribution weight and the language attention distribution weight comprises:

acquiring timbre identification information of the target timbre subject corresponding to each text character;

determining timbre attention distribution weights corresponding to the text characters according to the timbre identification information;

and determining the multiple attention distribution weights corresponding to the text characters according to the position attention distribution weight, the language attention distribution weight and the timbre attention distribution weight.

11. The audio information synthesis method according to claim 1, wherein the acquiring target timbre features corresponding to the target timbre subject comprises:

acquiring timbre identification information of the target timbre subject;

and performing mapping transformation processing on the timbre identification information through a timbre embedding matrix to obtain the target timbre features of the target timbre subject.

12. The audio information synthesis method according to claim 1, wherein after the performing acoustic coding processing on the acoustic features to obtain the audio information corresponding to the mixed language text information, the method further comprises:

acquiring a timbre conversion model obtained by training with timbre data samples of the target timbre subject;

and performing timbre conversion processing on the audio information through the timbre conversion model to obtain audio information corresponding to the target timbre subject.

13. An audio information synthesizing apparatus, comprising:

an information acquisition module configured to acquire mixed-language text information including at least two language types;

an information encoding module configured to perform text encoding processing on the mixed-language text information based on the at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information;

an information decoding module configured to acquire a target timbre feature corresponding to a target timbre subject and decode the intermediate semantic coding feature based on the target timbre feature to obtain an acoustic feature;

an acoustic coding module configured to perform acoustic coding processing on the acoustic features to obtain audio information corresponding to the mixed-language text information.

14. A computer-readable medium, on which a computer program is stored which, when executed by a processor, carries out the audio information synthesizing method of any one of claims 1 to 12.

15. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the audio information synthesis method of any one of claims 1 to 12 via execution of the executable instructions.

Technical Field

The present application relates to the technical field of artificial intelligence and to machine learning technology. In particular, it relates to an audio information synthesis method, an audio information synthesis apparatus, a computer-readable medium, and an electronic device.

Background

With the rapid development of artificial intelligence technology and intelligent hardware devices (such as smartphones and smart speakers), voice interaction is increasingly applied as a natural interaction mode. As an important part of voice interaction, speech synthesis technology has also made great progress. Speech synthesis, also called Text-to-Speech (TTS), converts text information generated by a computer itself or input from outside into intelligible, fluent speech that can be played to and understood by a user.

In applications of speech synthesis technology, situations in which multiple language types are mixed are often encountered, for example, a Chinese sentence mixed with an English word or phrase. In such situations, a large timbre difference generally appears in the speech segment where the two languages switch, which makes the synthesized speech sound discontinuous as a whole and affects its playback quality. Therefore, how to overcome the timbre difference caused by mixing multiple language types is a problem to be solved.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.

Disclosure of Invention

The present application is directed to an audio information synthesis method, an audio information synthesis apparatus, a computer-readable medium, and an electronic device, which overcome, at least to some extent, the technical problem of timbre differences between different language types in synthesized audio.

Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.

According to an aspect of an embodiment of the present application, there is provided an audio information synthesizing method, including:

acquiring mixed language text information comprising at least two language types;

performing text coding processing on the mixed language text information based on the at least two language types to obtain an intermediate semantic coding feature of the mixed language text information;

acquiring target timbre features corresponding to a target timbre subject, and decoding the intermediate semantic coding features based on the target timbre features to obtain acoustic features;

and performing acoustic coding processing on the acoustic features to obtain audio information corresponding to the mixed language text information.

According to an aspect of an embodiment of the present application, there is provided an audio information synthesizing apparatus including:

an information acquisition module configured to acquire mixed-language text information including at least two language types;

an information encoding module configured to perform text encoding processing on the mixed-language text information based on the at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information;

an information decoding module configured to acquire a target timbre feature corresponding to a target timbre subject and decode the intermediate semantic coding feature based on the target timbre feature to obtain an acoustic feature;

an acoustic coding module configured to perform acoustic coding processing on the acoustic features to obtain audio information corresponding to the mixed-language text information.

In some embodiments of the present application, based on the above technical solutions, the information encoding module includes:

a monolingual coding unit configured to perform text coding processing on the mixed-language text information through the monolingual text encoders corresponding to the respective language types to obtain at least two monolingual coding features of the mixed-language text information;

an encoding feature fusion unit configured to perform fusion processing on the at least two monolingual coding features to obtain a mixed-language coding feature of the mixed-language text information;

and the coding feature determining unit is configured to determine the intermediate semantic coding feature of the mixed language text information according to the mixed language coding feature.

In some embodiments of the present application, based on the above technical solution, the monolingual coding unit includes:

a character embedding subunit configured to perform mapping transformation processing on the mixed-language text information respectively through character embedding matrices corresponding to the respective language types to obtain at least two embedded character features of the mixed-language text information;

an embedded encoding subunit configured to perform text encoding processing on the embedded character features respectively through monolingual text encoders corresponding to the respective language types to obtain at least two monolingual encoding features of the mixed-language text information.

In some embodiments of the present application, based on the above technical solution, the embedded coding sub-unit includes:

a residual coding subunit configured to perform residual coding on the embedded character features through the monolingual text encoder corresponding to each language type to obtain at least two residual coding features of the mixed-language text information;

and a residual fusion subunit configured to perform fusion processing on the embedded character features and the residual coding features respectively to obtain at least two monolingual coding features of the mixed-language text information.

In some embodiments of the present application, based on the above technical solution, the monolingual coding feature is a residual coding feature obtained by performing residual coding on the embedded character feature; the encoding feature fusion unit includes:

and the encoding characteristic fusion subunit is configured to perform fusion processing on the at least two monolingual encoding characteristics and the embedded character characteristic to obtain a mixed language encoding characteristic of the mixed language text information.

In some embodiments of the present application, based on the above technical solutions, the coding feature determining unit includes:

a language embedding subunit configured to perform mapping transformation processing on the mixed-language text information by a language embedding matrix based on the at least two language types to obtain an embedded language feature of the mixed-language text information;

and the language fusion subunit is configured to perform fusion processing on the mixed language coding feature and the embedded language feature to obtain an intermediate semantic coding feature of the mixed language text information.

In some embodiments of the present application, based on the above technical solutions, the information encoding module includes:

a character encoding unit configured to perform text encoding processing on each text character in the mixed-language text information based on the at least two language types to obtain a character encoding characteristic corresponding to each text character;

a weight acquisition unit configured to acquire attention distribution weights corresponding to the respective text characters;

and the feature weighting unit is configured to perform weighted mapping on the character coding features of the text characters according to the attention distribution weights so as to obtain the intermediate semantic coding features of the mixed language text information.

In some embodiments of the present application, based on the above technical solutions, the weight obtaining unit includes:

a sequence position acquisition subunit configured to acquire sequence position information of each text character in the mixed-language text information;

a first weight determination subunit configured to determine a position attention distribution weight corresponding to each of the text characters according to the sequence position information.

In some embodiments of the present application, based on the above technical solution, the weight obtaining unit further includes:

a language type obtaining subunit configured to obtain language type information of each of the text characters;

a language weight determination subunit configured to determine a language attention distribution weight corresponding to each of the text characters according to the language type information;

a second weight determination subunit configured to determine multiple attention distribution weights corresponding to the respective text characters according to the position attention distribution weight and the language attention distribution weight.

In some embodiments of the present application, based on the above technical solutions, the second weight determining subunit includes:

a timbre identification acquisition subunit configured to acquire timbre identification information of the target timbre subject corresponding to each of the text characters;

a timbre weight determination subunit configured to determine timbre attention distribution weights corresponding to the respective text characters based on the timbre identification information;

a third weight determination subunit configured to determine multiple attention distribution weights corresponding to the respective text characters according to the position attention distribution weight, the language attention distribution weight, and the timbre attention distribution weight.

In some embodiments of the present application, based on the above technical solutions, the information decoding module includes:

a timbre identification acquisition unit configured to acquire timbre identification information of a target timbre subject;

and a timbre identification embedding unit configured to perform mapping transformation processing on the timbre identification information through a timbre embedding matrix to obtain a target timbre feature of the target timbre subject.

In some embodiments of the present application, based on the above technical solutions, the audio information synthesizing apparatus further includes:

a model acquisition module configured to acquire a timbre conversion model obtained by training with timbre data samples of the target timbre subject;

and a timbre conversion module configured to perform timbre conversion processing on the audio information through the timbre conversion model to obtain the audio information corresponding to the target timbre subject.

According to an aspect of the embodiments of the present application, there is provided a computer-readable medium on which a computer program is stored, the computer program, when executed by a processor, implementing the audio information synthesizing method as in the above technical solutions.

According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the audio information synthesis method as in the above technical solution via executing the executable instructions.

In the technical solution provided in the embodiments of the present application, the mixed-language text information is encoded by encoders based on multiple language types, and the encoded information is decoded by a decoder conditioned on a target timbre subject, so that text in multiple language types can be converted into audio information with a single timbre. This solves the timbre-jump problem caused by language differences in existing mixed-language audio synthesis technology, and natural, fluent mixed-language audio with a uniform timbre can be stably output. The embodiments of the present application can be deployed in the cloud to provide a general synthesis service for various devices, and exclusive timbres can be customized according to the requirements of different applications. Mixed synthesis of multiple language types can be realized using monolingual audio databases of different target timbre subjects, which greatly reduces the cost of acquiring training data. Meanwhile, the embodiments are compatible with already-recorded monolingual audio databases, so the available timbres are richer.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:

fig. 1 schematically shows an exemplary system architecture diagram of the present technical solution in an application scenario.

Fig. 2 schematically shows an exemplary system architecture and a customized audio synthesis service flow of the present technical solution in another application scenario.

Fig. 3 schematically shows a flow chart of the steps of an audio information synthesis method provided in an embodiment of the present application.

Fig. 4 schematically shows a flowchart of method steps of an encoding process by a multi-way encoder in an embodiment of the present application.

Fig. 5 schematically shows a flowchart of method steps of an Attention mechanism (Attention) based encoding process in an embodiment of the present application.

Fig. 6 schematically shows a schematic diagram of the principle of implementing audio information synthesis for Chinese-English mixed text based on an embodiment of the present application.

Fig. 7 schematically shows a block diagram of the composition of the audio information synthesizing apparatus in the embodiment of the present application.

Fig. 8 schematically shows a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiment of the present application.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and/or processor devices and/or microcontroller devices.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

Before describing the technical solutions of the audio information synthesis method, the audio information synthesis apparatus, the computer-readable medium, and the electronic device provided in the present application, a brief description will be made of a cloud technology and an artificial intelligence technology related to the technical solutions of the present application.

Cloud computing refers to a mode of delivery and use of IT infrastructure in which required resources are obtained over a network in an on-demand, easily scalable manner. Such services may be IT and software services, internet-related services, or other services. Cloud computing is a product of the development and fusion of traditional computing and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization, and load balancing.

With the diversification of the internet, real-time data streams, and connected devices, and the growing demands of search services, social networks, mobile commerce, open collaboration, and the like, cloud computing has developed rapidly. Unlike earlier parallel and distributed computing, cloud computing conceptually drives revolutionary changes in the overall internet model and in enterprise management models.

An artificial intelligence cloud service is also commonly referred to as AIaaS (AI as a Service). It is a service mode of an artificial intelligence platform: the AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI-themed app store: all developers can access one or more of the platform's artificial intelligence services through an API (application programming interface), and some qualified developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate, and maintain their own dedicated cloud artificial intelligence services.

Artificial Intelligence (AI) is a theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning, and decision making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Its infrastructure generally includes sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.

Key technologies of Speech Technology include automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, in which voice is expected to become one of the most promising interaction modes.

Machine Learning (ML) is a multidisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.

With the research and progress of artificial intelligence technology, it has been developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, and smart customer service.

The technical solution of the present application has wide application scenarios: the mixed-language audio synthesis scheme can be configured as a cloud service and serve as a basic capability for users of that service, or it can be used in personalized scenarios in vertical fields. For example, it can be applied to scenarios such as intelligent reading in reading apps, intelligent customer service, news broadcasting, and smart-device interaction, realizing intelligent audio synthesis in various scenarios.

Fig. 1 schematically shows an exemplary system architecture diagram of the present technical solution in an application scenario.

As shown in fig. 1, system architecture 100 may include a client 110, a network 120, and a server 130. The client 110 may include various terminal devices such as a smart phone, a smart robot, a smart speaker, a tablet computer, a notebook computer, and a desktop computer. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing communication links between clients 110 and servers 130, such as wired communication links, wireless communication links, and so forth.

According to implementation needs, the technical solution provided in the embodiment of the present application may be applied to the client 110, or may be applied to the server 130, or may be implemented by both the client 110 and the server 130, and this application is not particularly limited to this.

For example, various smart devices such as smart robots and smartphones may access a mixed-language audio synthesis service, such as a Chinese-English mixed speech synthesis service, on a cloud server through a wireless network. The client 110 sends the Chinese-English mixed text to be synthesized to the server 130 through the network 120; after the server 130 performs fast synthesis, the corresponding synthesized audio can be returned to the client 110 in a streaming or sentence-by-sentence manner. A complete speech synthesis procedure may include, for example:

the client 110 uploads the Chinese-English mixed text to be synthesized to the server 130, and the server 130 performs corresponding text normalization after receiving it;

the server 130 inputs the normalized text information into a Chinese-English mixed speech synthesis system, quickly synthesizes the audio corresponding to the text, and completes post-processing operations such as audio compression;

the server 130 returns the audio to the client 110 by streaming or sentence-by-sentence return, and after receiving it the client 110 can play the audio as fluent, natural speech.

In the above speech synthesis process, the speech synthesis service provided by the server 130 has low latency, and the client 110 can obtain the returned result almost immediately. The user can hear the required content within a short time with their eyes freed, making the interaction natural and convenient.
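
As a concrete illustration of this request/response exchange, the following is a minimal client-side sketch. The endpoint URL, JSON field names, and response format are hypothetical assumptions, not part of the described system.

```python
# Hypothetical client-side sketch of the synthesis flow above. The endpoint
# URL, JSON fields, and response format are illustrative assumptions only.
import requests

def synthesize(text: str) -> bytes:
    resp = requests.post(
        "https://tts.example.com/synthesize",   # hypothetical endpoint
        json={"text": text, "lang": "zh-en"},   # mixed-language input text
        timeout=10,
    )
    resp.raise_for_status()
    return resp.content  # compressed audio returned by the server

audio_bytes = synthesize("今天天气 nice")
```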

Fig. 2 schematically shows an exemplary system architecture and a customized audio synthesis service flow of the present technical solution in another application scenario. This system architecture and flow are mainly applied in vertical fields such as novel reading and news broadcasting that require customized, exclusive-timbre speech synthesis services.

The process of implementing the customized audio synthesis service under the system architecture mainly comprises the following steps:

the front-end requirement party 210 submits a required voice synthesized tone requirement list of the products, such as the sex of the speaker, the tone type and other requirements.

After receiving the demander's list, the background server 220 collects an audio database according to the required timbre conditions and trains a corresponding audio synthesis model 230.

The background server 220 synthesizes samples using the audio synthesis model 230; after the samples are delivered to the front-end demander 210 for verification and confirmation, the customized audio synthesis model 230 can be deployed online;

an application program (such as a reading APP, a news client, etc.) of the front-end demander 210 sends a text requiring audio synthesis to an audio synthesis model 230 deployed on the background server 220;

the user of the front-end demander 210 can then hear the text content read in the corresponding customized timbre within the application; the specific audio synthesis flow is the same as the online synthesis service used in the system architecture shown in fig. 1.

In this application scenario, after the front-end demander 210 provides its requirements, the background server 220 only needs to collect a speaker audio database of one language type (such as Chinese) that meets the requirements and, in combination with an existing audio database of another language type (such as English) from other speakers, perform customized training of the language-mixing audio synthesis model 230. The model can then perform language-mixed audio synthesis with a timbre meeting the requirements of the front-end demander 210, thereby greatly reducing the cost of the customized audio synthesis service.

The technical solutions provided in the present application are described in detail below with reference to specific embodiments.

Fig. 3 schematically shows a flow chart of the steps of an audio information synthesis method provided in an embodiment of the present application. The execution subject of the audio information synthesis method can be various terminal devices serving as clients, such as a smartphone or a smart speaker, or various server devices serving as the server side, such as a physical server or a cloud server. As shown in fig. 3, the audio information synthesis method may mainly include the following steps S310 to S340:

Step S310, acquiring mixed-language text information comprising at least two language types.

Mixed-language text information is composed of any number of text characters, where the text characters collectively correspond to at least two different language types. For example, the mixed-language text information may be text composed of a mixture of Chinese characters and English characters. This step can acquire, in real time, mixed-language text information entered by the user through an input device, or extract mixed-language text information sentence by sentence or paragraph by paragraph from a stored text file. In addition, this step can also perform speech recognition on voice information input by a user that contains two or more different language types and obtain mixed-language text information containing at least two language types based on the speech recognition result. For example, the received voice information can be processed through a pre-trained speech recognition model to obtain the corresponding text information, and the text information can then be synthesized into audio through the subsequent steps, achieving the overall effect of timbre conversion, that is, uniform-timbre voice-change processing for one or more speakers.
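
As a minimal sketch of the per-character language typing implied above, the following uses a simple Unicode-range heuristic to separate Chinese and English characters; this heuristic and the function name are assumptions for illustration, and a real system would use a full text normalization front end.

```python
# A minimal sketch: tag each character of mixed-language text with a language
# type using a Unicode-range heuristic (an assumption; real systems use a
# proper text normalization front end).
def tag_language(text: str) -> list:
    tags = []
    for ch in text:
        # CJK Unified Ideographs block -> treat as Chinese, else English.
        lang = "zh" if "\u4e00" <= ch <= "\u9fff" else "en"
        tags.append((ch, lang))
    return tags

print(tag_language("今天天气 nice"))
# [('今', 'zh'), ('天', 'zh'), ('天', 'zh'), ('气', 'zh'), (' ', 'en'), ...]
```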

Step S320, performing text coding processing on the mixed-language text information based on the at least two language types to obtain intermediate semantic coding features of the mixed-language text information.

In this step, the mixed-language text information can be text-encoded using one or more pre-trained encoders to obtain intermediate semantic coding features related to the natural semantics of the text information. The number and types of encoders may correspond one-to-one to the language types in the mixed-language text information. For example, if the mixed-language text information contains both Chinese characters and English characters, this step may use two encoders for text encoding to obtain the intermediate semantic coding features; these features can then be decoded in a subsequent step by a decoder corresponding to the encoders, finally forming natural speech in audio form that the user can understand.

The encoder may be a model obtained by training various types of neural networks, such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), or Gated Recurrent Units (GRU). A CNN is a feedforward neural network whose neurons respond to elements within their receptive field; CNNs are generally composed of multiple convolutional layers topped by fully connected layers, and they reduce the number of model parameters through parameter sharing, which makes them widely used in image and speech recognition. An RNN is a recursive neural network that takes sequence data as input, recurses along the evolution direction of the sequence, and connects all nodes (recurrent units) in a chain. An LSTM is a recurrent neural network that adds units for judging whether information is useful; each unit contains an input gate, a forget gate, and an output gate. After information enters the LSTM, its usefulness is judged according to these rules: information that meets the criteria is retained, and information that does not is forgotten through the forget gate. LSTM is suitable for processing and predicting important events with relatively long intervals and delays in a time series. The GRU is also a recurrent neural network and, like the LSTM, was proposed to address long-term memory and back-propagation gradient problems; compared with the LSTM it has one fewer gate and fewer parameters, can achieve an effect comparable to the LSTM in most cases, and effectively reduces computation time.
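
To make the encoder discussion concrete, here is a minimal sketch of one possible monolingual text encoder with a GRU backbone, written in PyTorch. All class names and dimensions are illustrative assumptions; the embodiment does not prescribe a specific architecture.

```python
# A minimal GRU-based monolingual text encoder sketch (illustrative only).
import torch
import torch.nn as nn

class MonolingualTextEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        # Character embedding matrix: maps character ids to dense vectors.
        self.char_embedding = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional GRU yields one encoding vector per input character.
        self.rnn = nn.GRU(embed_dim, hidden_dim // 2,
                          batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.char_embedding(char_ids)   # (batch, seq, embed_dim)
        encoded, _ = self.rnn(embedded)            # (batch, seq, hidden_dim)
        return encoded

# One encoder per language type, e.g. Chinese and English.
zh_encoder = MonolingualTextEncoder(vocab_size=6000)
en_encoder = MonolingualTextEncoder(vocab_size=128)
```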

Step S330, acquiring target timbre features corresponding to the target timbre subject, and decoding the intermediate semantic coding features based on the target timbre features to obtain acoustic features.

The target timbre subject is the subject object used to determine the timbre characteristics of the synthesized audio information; it may be a speaker whose sound samples are collected to form an audio database. In some embodiments, the target timbre subject may be a real physical object, such as a real person with distinctive timbre features or a voice actor; it may also be a virtual object synthesized by computer simulation, such as the virtual singers Hatsune Miku or Luo Tianyi generated with the voice synthesis software VOCALOID.

In this step, the timbre characteristics required by the user, such as a male voice or emotional delivery, can be obtained in advance, and a target timbre subject conforming to those characteristics is then selected. For a determined target timbre subject, target timbre features that reflect and identify its timbre characteristics can be obtained through feature extraction, mapping, or similar means. Then, based on the target timbre features, the intermediate semantic coding features obtained in step S320 are decoded by a pre-trained decoder to obtain the corresponding acoustic features. The acoustic features may be, for example, features carrying timbre characteristics and sound content presented as spectrograms or in other forms. A spectrum is a representation of a time-domain signal in the frequency domain; it can be obtained by performing a Fourier transform on the sound signal, and the result is two plots with amplitude and phase, respectively, on the vertical axis and frequency on the horizontal axis.
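
The following sketch shows one plausible way to condition a decoder on the target timbre feature: look up a speaker embedding and concatenate it with the intermediate semantic coding features before predicting spectrogram frames. The module layout and dimensions are assumptions for illustration, not the embodiment's prescribed design.

```python
# Illustrative timbre-conditioned decoder sketch (all names/dims assumed).
import torch
import torch.nn as nn

class TimbreConditionedDecoder(nn.Module):
    def __init__(self, num_speakers: int, timbre_dim: int = 64,
                 semantic_dim: int = 256, n_mels: int = 80):
        super().__init__()
        # Timbre embedding matrix: timbre identification -> timbre feature.
        self.timbre_embedding = nn.Embedding(num_speakers, timbre_dim)
        self.rnn = nn.GRU(semantic_dim + timbre_dim, 256, batch_first=True)
        self.mel_proj = nn.Linear(256, n_mels)  # mel-spectrogram frames

    def forward(self, semantic: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        timbre = self.timbre_embedding(speaker_id)             # (batch, timbre_dim)
        timbre = timbre.unsqueeze(1).expand(-1, semantic.size(1), -1)
        decoded, _ = self.rnn(torch.cat([semantic, timbre], dim=-1))
        return self.mel_proj(decoded)                           # (batch, seq, n_mels)
```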

Step S340, performing acoustic coding processing on the acoustic features to obtain audio information corresponding to the mixed-language text information.

In this step, the acoustic features may be input to a vocoder, which converts them into audio information that can be output and played through an audio output device such as a speaker. A vocoder, also known as a speech signal analysis and synthesis system, takes its name as a contraction of "voice encoder"; its function is to convert acoustic features into sound.
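
As a minimal vocoder-stage sketch, the Griffin-Lim-based mel inversion shipped with librosa can turn a mel spectrogram into a waveform. This choice is an assumption for illustration; production systems typically use a trained neural vocoder instead.

```python
# Minimal vocoder sketch: invert a (power) mel spectrogram with Griffin-Lim.
import numpy as np
import librosa

def spectrogram_to_audio(mel: np.ndarray, sr: int = 22050) -> np.ndarray:
    # mel: (n_mels, frames) mel spectrogram produced by the decoder.
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)
```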

In the audio information synthesis method provided by the embodiments of the present application, the mixed-language text information is encoded by encoders based on multiple language types, and the encoded information is decoded by a decoder conditioned on the target timbre subject, so that text in multiple language types can be converted into audio information with a single timbre. This solves the timbre-jump problem caused by language differences in existing mixed-language audio synthesis technology, and natural, fluent mixed-language audio with a uniform timbre can be stably output. The embodiments can be deployed in the cloud to provide a general synthesis service for various devices, and exclusive timbres can be customized according to the requirements of different applications. Because Chinese-English mixed synthesis can be realized using the monolingual audio databases of different target timbre subjects, the cost of acquiring training data is greatly reduced. Meanwhile, the embodiments are compatible with already-recorded monolingual audio databases, so the available timbres are richer.

The following describes in detail the implementation of some steps in the above embodiments with reference to fig. 4 to 5.

Fig. 4 schematically shows a flowchart of method steps of an encoding process by a multi-way encoder in an embodiment of the present application. As shown in fig. 4, on the basis of the above embodiment, in step S320, the encoding process is performed on the mixed-language text information based on at least two language types to obtain the intermediate semantic encoding feature of the mixed-language text information, which may include the following steps S410 to S430:

and S410, respectively carrying out text coding processing on the mixed language text information through the monolingual text coders corresponding to the language types to obtain at least two monolingual coding characteristics of the mixed language text information.

In this step, the mixed-language text information can first be mapped and transformed into vector features that the encoder can recognize. For example, the mixed-language text information may be mapped and transformed using the character embedding matrix corresponding to each language type to obtain at least two embedded character features of the mixed-language text information. The number and types of character embedding matrices may correspond one-to-one to the language types. For example, if the mixed-language text information contains both Chinese characters and English characters, this step may map and transform the mixed-language text information through the character embedding matrix corresponding to Chinese to obtain the embedded character features for Chinese, and through the character embedding matrix corresponding to English to obtain the embedded character features for English. The character embedding matrix may first linearly map the mixed-language text information and then apply a nonlinear transformation, using an activation function or other means, to obtain the corresponding embedded character features.

When there are several language types in the mixed-language text information, this step may use a corresponding number of monolingual text encoders. The embedded character features are encoded by the monolingual text encoder corresponding to each language type to obtain at least two monolingual coding features of the mixed-language text information. For example, if the mixed-language text information contains both Chinese characters and English characters, then after the corresponding embedded character features are obtained, they may be encoded by the monolingual text encoder for Chinese to obtain the monolingual coding features for Chinese, and by the monolingual text encoder for English to obtain the monolingual coding features for English.

The monolingual text encoder used in the embodiments of the present application can be an encoder with a residual network structure; residual networks are easy to optimize and can improve accuracy by increasing depth. On this basis, residual coding can be performed on the embedded character features through the monolingual text encoder corresponding to each language type to obtain at least two residual coding features of the mixed-language text information; the embedded character features are then fused with the residual coding features to obtain at least two monolingual coding features of the mixed-language text information.

The residual coding feature is the difference between the encoder's input data and output data, and the output monolingual coding feature can be obtained by fusing the residual coding feature with the input embedded character feature; the fusion may simply be the direct addition of the two. An encoding scheme based on a residual network structure is more sensitive to changes in the encoded output data, and those changes have a larger adjustment effect on the network weights during training, so a better training result can be obtained.
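
A minimal sketch of this residual scheme follows, assuming a simple feed-forward body for the encoder; the monolingual coding feature is the embedded character feature plus the predicted residual. The module name and body are illustrative assumptions.

```python
# Residual encoding sketch: output = input embedding + predicted residual.
import torch
import torch.nn as nn

class ResidualTextEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Illustrative residual body; the embodiment does not fix its form.
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, embedded: torch.Tensor) -> torch.Tensor:
        residual = self.body(embedded)   # residual coding feature
        return embedded + residual       # monolingual coding feature
```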

Step S420, fusing the at least two monolingual coding features to obtain a mixed-language coding feature of the mixed-language text information.

The monolingual coding features output by the monolingual text encoders can be fused to obtain the mixed-language coding feature of the mixed-language text information. For example, two monolingual coding features can be combined by vector calculation, such as direct addition, to obtain the mixed-language coding feature. Alternatively, the two monolingual coding features can be concatenated and then mapped through a fully connected layer or another network structure to obtain the mixed-language coding feature. The embodiments of the present application are not particularly limited in this respect.
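
Both fusion options just mentioned are sketched below with illustrative names: direct element-wise addition, and concatenation followed by a fully connected projection.

```python
# Two fusion options for monolingual coding features (illustrative).
import torch
import torch.nn as nn

def fuse_by_addition(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    return feat_a + feat_b  # direct vector addition

class FuseByConcat(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)  # fully connected mapping

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([feat_a, feat_b], dim=-1))
```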

In some embodiments of the present application, based on the residual network structure, the monolingual text encoders corresponding to the different language types may each fuse their residual coding feature with the embedded character feature to obtain a monolingual coding feature, and the monolingual coding features may then be fused to obtain the mixed-language coding feature of the mixed-language text information.

In other embodiments, based on the residual network structure, the monolingual text encoders corresponding to the different language types may perform only residual coding on each embedded character feature to obtain the residual coding features; that is, the residual coding features are directly used as the monolingual coding features output by each monolingual text encoder, and the monolingual coding features and the embedded character features are then fused together to obtain the mixed-language coding feature of the mixed-language text information.

Step S430, determining the intermediate semantic coding features of the mixed-language text information according to the mixed-language coding feature.

In some embodiments of the present application, the mixed-language coding feature may be directly determined as an intermediate semantic coding feature of the mixed-language text information, or the intermediate semantic coding feature may be obtained by transforming the mixed-language coding feature through a preset function.

In other embodiments of the present application, language-type identification information may be embedded into the mixed-language text information to obtain intermediate semantic coding features of the mixed-language text information.

For example, in this step, the mixed-language text information may be subjected to mapping transformation processing based on language embedding matrices of at least two language types to obtain embedded language features of the mixed-language text information; and then, carrying out fusion processing on the mixed language coding features and the embedded language features to obtain intermediate semantic coding features of the mixed language text information.

The mapping transformation of the mixed-language text information by the language embedding matrix may be a linear mapping of the mixed-language text information according to a preset matrix parameter in the language embedding matrix, and then a nonlinear transformation of the mixed-language text information by an activation function or other means, so as to obtain a corresponding embedded-language feature. For example, the mixed-language text information is a character sequence having a certain number of characters, and the embedded language feature obtained by mapping and transforming the mixed-language text information may be a feature vector having the same sequence length as the character sequence, where each element in the feature vector corresponds to a language type corresponding to each character in the character sequence.

The fusion of the mixed-language coding feature and the embedded language feature may be a vector calculation, for example directly adding the two to obtain the intermediate semantic coding features of the mixed-language text information. Alternatively, the mixed-language coding feature and the embedded language feature can be concatenated and then mapped through a fully connected layer or another network structure to obtain the intermediate semantic coding features.
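
A minimal sketch of the language embedding step follows, assuming each character carries a language-type id that is embedded and added to the mixed-language coding feature; the two-language setup and dimensions are assumptions for illustration.

```python
# Language embedding sketch: add a per-character language feature
# (illustrative; assumes two language types, e.g. 0 = Chinese, 1 = English).
import torch
import torch.nn as nn

NUM_LANGUAGES = 2
lang_embedding = nn.Embedding(NUM_LANGUAGES, 256)  # language embedding matrix

def add_language_feature(mixed_feat: torch.Tensor, lang_ids: torch.Tensor) -> torch.Tensor:
    # mixed_feat: (batch, seq, 256); lang_ids: (batch, seq) language type ids.
    return mixed_feat + lang_embedding(lang_ids)
```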

By executing the steps S410 to S430, the single-language text encoder corresponding to each language type can be used to independently encode the mixed-language text information through the mutually independent symbol sets of different languages, and the intermediate semantic encoding features containing the language type information are obtained after the fusion processing.

Fig. 5 schematically shows a flowchart of method steps of an Attention mechanism (Attention) based encoding process in an embodiment of the present application. As shown in fig. 5, on the basis of the above embodiments, in step S320, the encoding process is performed on the mixed-language text information based on at least two language types to obtain the intermediate semantic encoding feature of the mixed-language text information, which may include the following steps S510 to S530:

step S510, text coding processing is carried out on each text character in the mixed language text information based on at least two language types to obtain character coding characteristics corresponding to each text character.

The mixed-language text information is a character sequence composed of a plurality of text characters. When the text encoding methods provided by the above embodiments are applied to the mixed-language text information, each text character can be encoded in turn to obtain the character coding feature corresponding to that character.

Step S520, attention distribution weights corresponding to the text characters are obtained.

Beyond the semantic differences between the characters themselves, other factors associated with each text character can influence semantic encoding and decoding. This step therefore obtains the attention distribution weight corresponding to each text character according to influence factors of different dimensions.

Step S530, perform weighted mapping on the character coding features of each text character according to the attention distribution weights to obtain the intermediate semantic coding features of the mixed-language text information.

The attention distribution weights determine the semantic importance of each text character during encoding and decoding. Weighting and mapping the character coding features according to these weights therefore improves the semantic expression capability of the resulting intermediate semantic coding features.
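A minimal sketch of steps S510 to S530, written from the decoder's viewpoint with the decoder state as the attention query; dot-product scoring is an assumption, as the text does not fix the scoring function:

```python
import torch
import torch.nn.functional as F

def attention_weighted_mapping(char_encodings: torch.Tensor,
                               query: torch.Tensor) -> torch.Tensor:
    """char_encodings: (batch, seq_len, dim) character coding features (S510).
    query: (batch, dim) state at the current decoding instant."""
    scores = torch.bmm(char_encodings, query.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)  # attention distribution weights (S520)
    # Weighted mapping of the character coding features (S530):
    return torch.bmm(weights.unsqueeze(1), char_encodings).squeeze(1)
```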

In some embodiments of the present application, one attention dimension may be the sequence position information of the individual text characters in the mixed-language text information. For example, the sequence position information of each text character may first be obtained, and the position attention distribution weight corresponding to each character may then be determined from that information.

On this basis, the language type information of each text character may further be obtained, the language attention distribution weight corresponding to each character determined from that information, and the multiple attention distribution weight for each character determined from the position attention distribution weight together with the language attention distribution weight.

Further, the timbre identification information of the target timbre subject corresponding to each text character may be obtained, the timbre attention distribution weight corresponding to each character determined from that information, and the multiple attention distribution weight for each character determined from the position, language, and timbre attention distribution weights together.
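One plausible reading of the multiple attention distribution weight is to combine the per-dimension score terms before normalization; the additive combination below is an assumption, as the text does not fix the combination rule:

```python
import torch
import torch.nn.functional as F

def multiple_attention_weights(pos_scores: torch.Tensor,
                               lang_scores: torch.Tensor,
                               timbre_scores: torch.Tensor) -> torch.Tensor:
    """Combine position, language, and timbre attention score terms,
    each of shape (batch, seq_len), into one distribution per character."""
    return F.softmax(pos_scores + lang_scores + timbre_scores, dim=-1)
```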

By executing steps S510 to S530, encoding based on the attention mechanism is achieved. In particular, the multiple attention mechanism introduces several different influence factors into the encoding of the mixed-language text information, improving the semantic expression capability of the encoding result.

In step S330, a target timbre feature corresponding to the target timbre subject is acquired, and the intermediate semantic coding features are subjected to decoding processing based on the target timbre feature to obtain acoustic features.

In this step, audio databases corresponding to different timbre subjects may be preconfigured, and corresponding timbre identification information may be assigned to them, for example by numbering. The timbre identification information of the target timbre subject is first obtained, and then mapped and transformed through a timbre embedding matrix to obtain the target timbre feature of the target timbre subject. The target timbre feature and the intermediate semantic coding features can then be input jointly into a decoder, which decodes them to obtain acoustic features carrying the timbre of the target timbre subject.
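A hedged sketch of the timbre embedding lookup; the number of timbre subjects and the feature dimension are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TimbreEmbedding(nn.Module):
    """Sketch: map timbre identification numbers (assigned to the
    preconfigured audio databases) to target timbre features."""

    def __init__(self, num_subjects: int = 10, timbre_dim: int = 64):
        super().__init__()
        self.table = nn.Embedding(num_subjects, timbre_dim)  # timbre embedding matrix

    def forward(self, timbre_id: torch.Tensor) -> torch.Tensor:
        return self.table(timbre_id)  # (batch, timbre_dim)

# The resulting target timbre feature is fed to the decoder together with
# the intermediate semantic coding features, e.g. by concatenation at
# every decoding step.
feature = TimbreEmbedding()(torch.tensor([3]))  # timbre ID 3 -> (1, 64)
```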

When decoding, the decoder may use a multiple attention mechanism similar to that of the encoder in the above embodiments. For example, in steps S320 and S330, the encoding and decoding of the mixed-language text information may be implemented with an attention-based RNN network structure as the encoder-decoder model, or with a Transformer as the encoder-decoder model; the Transformer is a network structure based on full attention, which improves the parallelism of the model.
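As a shape-level illustration, torch.nn.Transformer can serve directly as such a full-attention encoder-decoder model; the dimensions below are arbitrary, and the causal masks needed for real autoregressive decoding are omitted from this sketch:

```python
import torch
import torch.nn as nn

# Hedged sketch: a Transformer as the coder-decoder model.
model = nn.Transformer(d_model=256, nhead=4,
                       num_encoder_layers=3, num_decoder_layers=3,
                       batch_first=True)

src = torch.randn(1, 20, 256)  # intermediate semantic coding features
tgt = torch.randn(1, 80, 256)  # acoustic feature frames being decoded
acoustic = model(src, tgt)     # (1, 80, 256)
```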

In step S340, after the acoustic features are acoustically encoded to obtain the audio information corresponding to the mixed-language text information, a timbre conversion model trained on timbre data samples of the target timbre subject may be obtained, and the audio information may be subjected to timbre conversion through this model to obtain audio information corresponding to the target timbre subject.

By training a timbre conversion model and applying it to the output audio information, the timbre of the mixed-language audio can be made more uniform without increasing the data acquisition cost.
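At the interface level this post-processing reduces to a single pass through the trained network; `conversion_model` below is a hypothetical placeholder for whatever conversion network is trained on the target timbre subject's samples:

```python
import torch

def unify_timbre(audio: torch.Tensor,
                 conversion_model: torch.nn.Module) -> torch.Tensor:
    """Hedged sketch: re-voice synthesized audio with a trained timbre
    conversion model (a hypothetical placeholder, not a fixed API)."""
    with torch.no_grad():  # inference-only post-processing
        return conversion_model(audio)
```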

Fig. 6 schematically shows the principle of audio information synthesis on Chinese-English mixed text based on an embodiment of the present application. As shown in fig. 6, the overall system mainly comprises four parts, namely a multi-path residual encoder 610, a language embedding generator 620, a multiple attention mechanism module 630, and a speaker embedding generator 640, together with a decoder 650 and a vocoder 660.

The multi-path residual encoder 610 (Multi-path-Res-Encoder) residual-encodes the input mixed-language text through a two-path Chinese-English encoder and adds the result back to the input to obtain a text encoding representation (Encode Representation), reducing discontinuity at Chinese-English boundaries while enhancing the distinctiveness of the text encoding representation.
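A hedged sketch of the multi-path residual encoder 610, assuming identical convolutional stacks for the Chinese and English paths; the text fixes only the two-path residual-and-add structure, not the inner layers:

```python
import torch
import torch.nn as nn

class MultiPathResEncoder(nn.Module):
    """Sketch: Chinese and English paths each residual-encode the input,
    and their outputs are added back to it to form the text encoding
    representation."""

    def __init__(self, dim: int = 256):
        super().__init__()
        def path() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(dim, dim, kernel_size=5, padding=2))
        self.zh_path, self.en_path = path(), path()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) embedded mixed-language text
        h = x.transpose(1, 2)
        res = self.zh_path(h) + self.en_path(h)  # two-path residual coding
        return x + res.transpose(1, 2)           # add back to the input
```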

The language embedding generator 620 maps and nonlinearly transforms the language category to which each character of the input mixed-language text belongs, via a language embedding (Language Embedding) matrix. Each input character is thus tagged with its corresponding language embedding, which, combined with the text encoding representation, further enhances the distinctiveness of the encoder output.

The multiple attention mechanism module 630 (Multi-Attention) attends to the language embedding in addition to the text encoding representation. The attention mechanism is the bridge between the multi-path residual encoder 610 and the decoder 650; accurately determining which encoder positions need attention at each decoding instant is decisive for the final synthesis quality. By attending to the text encoding representation, the model knows clearly what content is currently being decoded; by also attending to the language embedding, it can clearly judge which language that content belongs to. Combining the two makes decoding more stable and smooth.

The speaker embedding generator 640 (Speaker Embedding) obtains speaker embedding information by mapping and nonlinearly transforming the speaker serial numbers of the different audio databases, and this embedding participates in every decoding instant. Since the decoder 650 converts the text encoding representation into acoustic features, it plays a critical role in the timbre of the finally synthesized audio. Introducing the speaker embedding at each decoding instant effectively controls the characteristics of the audio features output by the decoder 650, so that the timbre of the synthesized audio matches that of the corresponding speaker.

The acoustic features output by the decoder 650 are processed by the vocoder 660 to obtain Chinese-English mixed audio corresponding to the mixed-language text. The system retains the advantages of end-to-end learning, and the careful design of the encoding and decoding ends of the model ensures that the synthesized Chinese-English mixed audio is natural, smooth, and uniform in timbre.
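The text does not name a concrete vocoder 660. As a stand-in, the sketch below inverts a mel spectrogram to a waveform with librosa's Griffin-Lim-based inversion; a neural vocoder would occupy the same place in the pipeline, and all shapes and parameters here are assumptions:

```python
import numpy as np
import librosa

# Acoustic features as a mel spectrogram: (n_mels, frames); random
# placeholder values stand in for real decoder output.
mel = np.random.rand(80, 200).astype(np.float32)
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256)  # waveform, ~51k samples
```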

It should be noted that although the various steps of the methods in this application are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the shown steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

The following describes embodiments of the apparatus of the present application, which can be used to perform the audio information synthesis method in the above-described embodiments of the present application. Fig. 7 schematically shows a block diagram of the composition of the audio information synthesizing apparatus in the embodiment of the present application. As shown in fig. 7, the audio information synthesizing apparatus 700 may mainly include:

an information acquisition module 710 configured to acquire mixed-language text information including at least two language types;

an information encoding module 720, configured to perform text encoding processing on the mixed-language text information based on at least two language types to obtain an intermediate semantic encoding feature of the mixed-language text information;

an information decoding module 730 configured to obtain a target timbre feature corresponding to the target timbre subject, and perform decoding processing on the intermediate semantic coding feature based on the target timbre feature to obtain an acoustic feature;

an acoustic coding module 740 configured to perform an acoustic coding process on the acoustic features to obtain audio information corresponding to the mixed-language text information.

In some embodiments of the present application, based on the above embodiments, the information encoding module 720 includes:

a monolingual coding unit configured to perform text coding processing on the mixed-language text information through monolingual text coders corresponding to respective language types, respectively, to obtain at least two monolingual coding features of the mixed-language text information;

the encoding feature fusion unit is configured to perform fusion processing on the at least two single-language encoding features to obtain mixed-language encoding features of the mixed-language text information;

and the coding feature determining unit is configured to determine the intermediate semantic coding feature of the mixed language text information according to the mixed language coding feature.

In some embodiments of the present application, based on the above embodiments, the monolingual encoding unit includes:

the character embedding subunit is configured to respectively perform mapping transformation processing on the mixed-language text information through the character embedding matrices corresponding to the respective language types to obtain at least two embedded character features of the mixed-language text information;

and the embedded coding subunit is configured to perform text coding processing on the embedded character features through the monolingual text coders corresponding to the various language types respectively to obtain at least two monolingual coding features of the mixed language text information.

In some embodiments of the present application, based on the above embodiments, the embedded coding subunit includes:

a residual coding subunit configured to perform residual coding on the embedded character features by a monolingual text coder corresponding to each language type respectively to obtain at least two residual coding features of the mixed language text information;

and the residual fusion subunit is configured to respectively fuse the embedded character features with the residual coding features to obtain at least two monolingual coding features of the mixed-language text information.

In some embodiments of the present application, based on the above embodiments, the monolingual coding feature is a residual coding feature obtained by performing residual coding on the embedded character feature; the encoding feature fusion unit includes:

and the encoding characteristic fusion subunit is configured to perform fusion processing on the at least two monolingual encoding characteristics and the embedded character characteristics to obtain mixed-language encoding characteristics of the mixed-language text information.

In some embodiments of the present application, based on the above embodiments, the encoding characteristic determination unit includes:

a language embedding subunit configured to perform mapping transformation processing on the mixed-language text information through language embedding matrices of the at least two language types to obtain the embedded language features of the mixed-language text information;

and the language fusion subunit is configured to perform fusion processing on the mixed language coding features and the embedded language features to obtain intermediate semantic coding features of the mixed language text information.

In some embodiments of the present application, based on the above embodiments, the information encoding module 720 includes:

a character encoding unit configured to perform text encoding processing on each text character in the mixed-language text information based on at least two language types to obtain character encoding characteristics corresponding to each text character;

a weight acquisition unit configured to acquire the attention distribution weights corresponding to the respective text characters;

and the feature weighting unit is configured to perform weighted mapping on the character coding features of the text characters according to the attention distribution weights so as to obtain the intermediate semantic coding features of the mixed language text information.

In some embodiments of the present application, based on the above embodiments, the weight obtaining unit includes:

a sequence position acquisition subunit configured to acquire sequence position information of each text character in the mixed-language text information;

a first weight determination subunit configured to determine the position attention distribution weight corresponding to each text character according to the sequence position information.

In some embodiments of the present application, based on the above embodiments, the weight obtaining unit further includes:

a language type acquisition subunit configured to acquire language type information of each text character;

a language weight determination subunit configured to determine the language attention distribution weight corresponding to each text character based on the language type information;

a second weight determination subunit configured to determine the multiple attention distribution weight corresponding to each text character according to the position attention distribution weight and the language attention distribution weight.

In some embodiments of the present application, based on the above embodiments, the second weight determining subunit comprises:

a timbre identification acquisition subunit configured to acquire the timbre identification information of the target timbre subject corresponding to each text character;

a timbre weight determination subunit configured to determine the timbre attention distribution weight corresponding to each text character based on the timbre identification information;

a third weight determination subunit configured to determine the multiple attention distribution weight corresponding to each text character according to the position attention distribution weight, the language attention distribution weight, and the timbre attention distribution weight.

In some embodiments of the present application, based on the above embodiments, the information decoding module 730 includes:

a timbre identification acquisition unit configured to acquire the timbre identification information of the target timbre subject;

and the timbre identification embedding unit is configured to perform mapping transformation processing on the timbre identification information through the timbre embedding matrix to obtain the target timbre feature of the target timbre subject.

In some embodiments of the present application, based on the above embodiments, the audio information synthesizing apparatus 700 further includes:

the model acquisition module is configured to acquire a timbre conversion model trained on timbre data samples of the target timbre subject;

and the timbre conversion module is configured to perform timbre conversion processing on the audio information through the timbre conversion model to obtain audio information corresponding to the target timbre subject.

The specific details of the audio information synthesizing apparatus provided in the embodiments of the present application have been described in detail in the corresponding method embodiments, and are not described herein again.

FIG. 8 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.

It should be noted that the computer system 800 of the electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 8, a computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for system operation are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An Input/Output (I/O) interface 805 is also connected to bus 804.

The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output section 807 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 808 including a hard disk and the like; and a communication section 809 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is mounted on the storage section 808 as necessary.

In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. When the computer program is executed by the Central Processing Unit (CPU) 801, various functions defined in the system of the present application are executed.

It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
