Voice processing method and voice processing device

Document No.: 1906557    Publication date: 2021-11-30

Note: This technology, "Voice processing method and voice processing device" (语音处理方法和语音处理装置), was created by 李楠, 李子涵, 邢文浩, and 张晨 on 2021-09-30. Abstract: A speech processing method and a speech processing apparatus are provided. An audio processing method may include the steps of: acquiring network state information of a current voice transmission network; determining a current code rate for coding input voice according to the network state information; determining voice feature information of the input voice and coding parameters for coding the voice feature information based on the current code rate; and coding the voice feature information according to the coding parameters.

1. An audio processing method, comprising:

acquiring network state information of a current voice transmission network;

determining a current code rate for coding input voice according to the network state information;

determining voice feature information of the input voice and coding parameters for coding the voice feature information based on the current code rate;

and coding the voice feature information according to the coding parameters.

2. The audio processing method of claim 1, wherein determining encoding parameters for encoding the speech feature information based on the current code rate comprises:

determining at least one of a codebook and an inter-frame dependency for encoding the speech feature information based on the current code rate,

wherein the codebook represents the number of coded bits of a single speech frame, and the inter-frame dependency represents the amount of information that a speech frame references from other speech frames in a sequence of speech frames.

3. The audio processing method of claim 2, wherein determining at least one of a codebook and inter-frame dependencies for encoding the speech feature information based on the current code rate comprises:

selecting a codebook used for encoding the voice feature information from a plurality of codebooks stored in advance based on the current code rate; and/or

Selecting an inter-frame dependency for encoding the speech feature information from a plurality of pre-stored inter-frame dependencies based on the current code rate.

4. The audio processing method of claim 3, wherein encoding the speech feature information according to the encoding parameters comprises:

when the inter-frame dependency is selected as a first inter-frame dependency of the plurality of inter-frame dependencies, speech feature information for each speech frame is encoded by a set of vectors in the codebook.

5. The audio processing method of claim 3, wherein encoding the speech feature information according to the encoding parameters comprises:

when the inter-frame dependency is selected as a second inter-frame dependency of the plurality of inter-frame dependencies, the speech feature information of every other speech frame is encoded by a set of vectors in the codebook, and the speech feature information of a speech frame not encoded by the codebook is encoded by an average of the speech feature information of two adjacent encoded speech frames.

6. The audio processing method of claim 3, wherein encoding the speech feature information according to the encoding parameters comprises:

when the inter-frame dependency is selected as a third inter-frame dependency of the plurality of inter-frame dependencies, in every four speech frames, the speech feature information of one speech frame is encoded by a set of vectors in the codebook, the speech feature information of a speech frame separated from that speech frame by one frame is encoded by a set of vectors in a pre-stored differential codebook, and the speech feature information of the other two speech frames is encoded by the average of the speech feature information of their two adjacent speech frames.

7. An audio processing method, comprising:

acquiring network state information of a current voice transmission network;

determining a current code rate for decoding voice feature information of received coded voice according to the network state information;

determining decoding parameters for decoding the voice feature information based on the current code rate;

and decoding the voice feature information according to the decoding parameters.

8. An audio processing apparatus, comprising:

the network state monitoring module is configured to acquire network state information of a current voice transmission network and determine a current code rate for encoding input voice according to the network state information;

an encoding module configured to determine speech feature information of the input speech and encoding parameters for encoding the speech feature information based on the current code rate, and encode the speech feature information according to the encoding parameters.

9. An audio processing apparatus, comprising:

the network state monitoring module is configured to acquire network state information of a current voice transmission network and determine a current code rate for decoding voice feature information of received coded voice according to the network state information;

a decoding module configured to determine a decoding parameter for decoding the voice feature information based on the current code rate, and decode the voice feature information according to the decoding parameter.

10. An electronic device, comprising:

a processor;

a memory for storing instructions executable by the processor,

wherein the processor is configured to execute the instructions to implement the audio processing method of any one of claims 1-7.

Technical Field

The present disclosure relates to the field of audio technologies, and in particular, to a speech processing method and a speech processing apparatus for encoding and decoding speech in speech transmission.

Background

Voice codec technology is important in voice transmission, and has attracted particular attention in Voice over Internet Protocol (VoIP). For example, the speech intelligibility and sound quality of ultra-low-bit-rate VoIP affect the experience of real-time communication users under weak network conditions; especially when the network can only provide a voice transmission bandwidth of 3 kbps, very high demands are placed on the speech coding and decoding capability of VoIP. Meanwhile, high-quality voice under high network speeds is also key to improving the user experience.

Disclosure of Invention

The present disclosure provides a voice processing method and a voice processing apparatus to solve at least the above problems. The technical scheme of the disclosure is as follows:

according to a first aspect of the embodiments of the present disclosure, there is provided a speech processing method, which may include: acquiring network state information of a current voice transmission network; determining a current code rate for coding input voice according to the network state information; determining voice feature information of the input voice and coding parameters for coding the voice feature information based on the current code rate; and coding the voice characteristic information according to the coding parameters.

Optionally, determining the speech feature information of the input speech based on the current code rate may include: determining dimension information for extracting voice features based on the current code rate; and extracting voice feature information corresponding to the dimension information from the input voice according to the dimension information.

Optionally, high-dimensional voice features are extracted in a high-bit-rate network environment, and low-dimensional voice features are extracted in a low-bit-rate environment, where the dimension range of the dimension information is 16 to 64.

Optionally, determining encoding parameters for encoding the speech feature information based on the current code rate may include: determining at least one of a codebook and an inter-frame dependency for encoding the speech feature information based on the current code rate, wherein the codebook represents the number of encoding bits of a single speech frame and the inter-frame dependency represents the amount of information that a speech frame references from other speech frames in a sequence of speech frames.

Optionally, determining at least one of a codebook and an inter-frame dependency for encoding the speech feature information based on the current coding rate may include: selecting a codebook used for encoding the voice feature information from a plurality of codebooks stored in advance based on the current code rate; and/or selecting an inter-frame dependency for encoding the speech feature information from a plurality of pre-stored inter-frame dependencies based on the current code rate.

Optionally, encoding the speech feature information according to the encoding parameter may include: when the inter-frame dependency is selected as a first inter-frame dependency of the plurality of inter-frame dependencies, speech feature information for each speech frame is encoded by a set of vectors in the codebook.

Optionally, encoding the speech feature information according to the encoding parameter may include: when the inter-frame dependency is selected as a second inter-frame dependency of the plurality of inter-frame dependencies, the speech feature information of every other speech frame is encoded by a set of vectors in the codebook, and the speech feature information of a speech frame not encoded by the codebook is encoded by an average of the speech feature information of two adjacent encoded speech frames.

Optionally, encoding the speech feature information according to the encoding parameter may include: when the inter-frame dependency is selected as the third inter-frame dependency among the multiple inter-frame dependencies, in every four speech frames, the speech feature information of one speech frame is encoded by a group of vectors in the codebook, the speech feature information of a speech frame separated from the speech frame by one frame is encoded by a group of vectors in a pre-stored differential codebook, and the speech feature information of the other two speech frames is encoded by the mean value of the speech feature information of two adjacent speech frames.

Optionally, before determining the speech feature information of the input speech, the audio processing method may include: carrying out noise reduction processing on the input voice; and extracting the voice characteristic information from the input voice subjected to noise reduction processing based on the current code rate.

According to a second aspect of the embodiments of the present disclosure, there is provided a speech processing method, which may include: acquiring network state information of a current voice transmission network; determining the current code rate for decoding the voice characteristic information of the received coded voice according to the network state information; determining decoding parameters for decoding the voice feature information based on the current code rate; and decoding the voice characteristic information according to the decoding parameters.

Optionally, determining decoding parameters for decoding the speech feature information based on the current code rate may include: determining at least one of a codebook and an inter-frame dependency for decoding the speech feature information based on the current code rate, wherein the codebook represents the number of coded bits of a single speech frame and the inter-frame dependency represents the amount of information that a speech frame references from other speech frames in a sequence of speech frames.

Optionally, determining at least one of a codebook and an inter-frame dependency for decoding the speech feature information based on the current coding rate may include: selecting a codebook for decoding the voice feature information from a plurality of codebooks stored in advance based on the current code rate; and/or selecting an inter-frame dependency for decoding the speech feature information from a plurality of pre-stored inter-frame dependencies based on the current code rate.

Optionally, decoding the speech feature information according to the decoding parameters may include: when the inter-frame dependency is selected as a first inter-frame dependency of the plurality of inter-frame dependencies, the speech feature information of each speech frame is decoded by a set of vectors in the codebook.

Optionally, decoding the speech feature information according to the decoding parameters may include: when the inter-frame dependency is selected as a second inter-frame dependency of the plurality of inter-frame dependencies, the speech feature information of every other speech frame is decoded by a set of vectors in the codebook, and the speech feature information of a speech frame not encoded by the codebook is decoded by an average of the speech feature information of two encoded speech frames adjacent thereto.

Optionally, decoding the speech feature information according to the decoding parameters may include: when the inter-frame dependency is selected as the third inter-frame dependency among the multiple inter-frame dependencies, in every four speech frames, the speech feature information of one speech frame is decoded by a group of vectors in the codebook, the speech feature information of a speech frame separated from the speech frame by one frame is decoded by a group of vectors in a pre-stored differential codebook, and the speech feature information of the other two speech frames is decoded by the mean value of the speech feature information of two adjacent speech frames.

Optionally, the audio processing method may further include: generating a speech signal corresponding to the encoded speech using a neural network based on the decoded speech feature information.

According to a third aspect of the embodiments of the present disclosure, there is provided a voice processing apparatus, which may include: the network state monitoring module is configured to acquire network state information of a current voice transmission network and determine a current code rate for encoding input voice according to the network state information; an encoding module configured to determine speech feature information of the input speech and encoding parameters for encoding the speech feature information based on the current code rate, and encode the speech feature information according to the encoding parameters.

Optionally, the encoding module may be configured to: determining dimension information for extracting voice features based on the current code rate; and extracting voice feature information corresponding to the dimension information from the input voice according to the dimension information.

Optionally, high-dimensional voice features are extracted in a high-bit-rate network environment, and low-dimensional voice features are extracted in a low-bit-rate environment, where the dimension range of the dimension information is 16 to 64.

Optionally, the encoding module may be configured to: determining at least one of a codebook and an inter-frame dependency for encoding the speech feature information based on the current code rate, wherein the codebook represents the number of encoding bits of a single speech frame and the inter-frame dependency represents the amount of information that a speech frame references from other speech frames in a sequence of speech frames.

Optionally, the encoding module may be configured to: selecting a codebook used for encoding the voice feature information from a plurality of codebooks stored in advance based on the current code rate; and/or selecting an inter-frame dependency for encoding the speech feature information from a plurality of pre-stored inter-frame dependencies based on the current code rate.

Optionally, the encoding module may be configured to: when the inter-frame dependency is selected as a first inter-frame dependency of the plurality of inter-frame dependencies, speech feature information for each speech frame is encoded by a set of vectors in the codebook.

Optionally, the encoding module may be configured to: when the inter-frame dependency is selected as a second inter-frame dependency of the plurality of inter-frame dependencies, the speech feature information of every other speech frame is encoded by a set of vectors in the codebook, and the speech feature information of a speech frame not encoded by the codebook is encoded by an average of the speech feature information of two adjacent encoded speech frames.

Optionally, the encoding module may be configured to: when the inter-frame dependency is selected as the third inter-frame dependency among the multiple inter-frame dependencies, in every four speech frames, the speech feature information of one speech frame is encoded by a group of vectors in the codebook, the speech feature information of a speech frame separated from the speech frame by one frame is encoded by a group of vectors in a pre-stored differential codebook, and the speech feature information of the other two speech frames is encoded by the mean value of the speech feature information of two adjacent speech frames.

Optionally, the audio processing apparatus may further include a noise reduction module configured to: before determining the speech feature information of the input speech, performing noise reduction processing on the input speech, wherein the encoding module may extract the speech feature information from the noise-reduced input speech based on the current code rate.

According to a fourth aspect of the embodiments of the present disclosure, there is provided a voice processing apparatus, which may include: the network state monitoring module is configured to acquire network state information of a current voice transmission network and determine a current code rate for decoding voice feature information of received coded voice according to the network state information; a decoding module configured to determine a decoding parameter for decoding the voice feature information based on the current code rate, and decode the voice feature information according to the decoding parameter.

Optionally, the decoding module may be configured to: determining at least one of a codebook and an inter-frame dependency for decoding the speech feature information based on the current code rate, wherein the codebook represents the number of coded bits of a single speech frame and the inter-frame dependency represents the amount of information that a speech frame references from other speech frames in a sequence of speech frames.

Optionally, the decoding module may be configured to: selecting a codebook for decoding the voice feature information from a plurality of codebooks stored in advance based on the current code rate; and/or selecting an inter-frame dependency for decoding the speech feature information from a plurality of pre-stored inter-frame dependencies based on the current code rate.

Optionally, the decoding module may be configured to: when the inter-frame dependency is selected as a first inter-frame dependency of the plurality of inter-frame dependencies, the speech feature information of each speech frame is decoded by a set of vectors in the codebook.

Optionally, the decoding module may be configured to: when the inter-frame dependency is selected as a second inter-frame dependency of the plurality of inter-frame dependencies, the speech feature information of every other speech frame is decoded by a set of vectors in the codebook, and the speech feature information of a speech frame not encoded by the codebook is decoded by an average of the speech feature information of two encoded speech frames adjacent thereto.

Optionally, the decoding module may be configured to: when the inter-frame dependency is selected as the third inter-frame dependency among the multiple inter-frame dependencies, in every four speech frames, the speech feature information of one speech frame is decoded by a group of vectors in the codebook, the speech feature information of a speech frame separated from the speech frame by one frame is decoded by a group of vectors in a pre-stored differential codebook, and the speech feature information of the other two speech frames is decoded by the mean value of the speech feature information of two adjacent speech frames.

Optionally, the audio processing apparatus may further include a speech generation module configured to: generating a speech signal corresponding to the encoded speech using a neural network based on the decoded speech feature information.

According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the speech processing method as described above.

According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the speech processing method as described above.

According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, instructions of which are executed by at least one processor in an electronic device to perform the speech processing method as described above.

The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:

the disclosed low-complexity, adaptive, scalable-code-rate, high-quality voice coding method can select different code rates according to the network state and, for each code rate, select parameters suitable for encoding or decoding voice over the current network. This ensures good intelligibility and sound quality under weak network conditions and excellent sound quality over a high-speed network, improving the user experience.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is a diagram of an application environment for real-time communication, according to an embodiment of the present disclosure;

FIG. 2 is a flow diagram of a speech processing method for speech encoding according to an embodiment of the present disclosure;

FIG. 3 is a flow diagram of a speech processing method for speech decoding according to an embodiment of the present disclosure;

FIG. 4 is a schematic flow diagram for speech processing according to an embodiment of the present disclosure;

FIG. 5 is a schematic block diagram of a speech processing device according to an embodiment of the present disclosure;

FIG. 6 is a block diagram of a speech processing apparatus for speech encoding according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of a speech processing apparatus for speech decoding according to an embodiment of the present disclosure;

fig. 8 is a block diagram of an electronic device according to an embodiment of the disclosure.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.

The terms and words used in the following description and claims are not limited to the written meaning, but are used only by the inventors to achieve a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

Existing speech codec techniques include waveform coding methods (such as the G.728 and G.729 standards or implementations using the OPUS open source library) and parametric coding methods (such as Lyra speech coding). However, waveform coding methods have difficulty achieving high-quality speech transmission with an 8 kHz effective bandwidth at a code rate of 6 kbps, and in weak network environments that can only provide a speech coding bandwidth below 3 kbps they suffer from frequent stuttering, packet loss, and impaired speech quality, which degrades the VoIP experience. Parametric coding methods, in turn, only support speech coding and decoding with an 8 kHz effective bandwidth under weak network conditions and only support a 3 kbps code rate, so high sound quality is difficult to guarantee; the code rate cannot scale under high network speeds, making a high-quality real-time communication experience hard to ensure; and they suffer from excessive complexity and large delay (for example, 90 ms), which makes practical application difficult.

The present disclosure provides a low-complexity high-quality speech coding and decoding method, which can adaptively perform coding and decoding based on code rate under different network conditions, thereby not only ensuring that speech has higher intelligibility and tone quality under the weak network condition, but also ensuring that a user has good real-time communication experience with excellent tone quality under the high network speed state.

Hereinafter, according to various embodiments of the present disclosure, a method and apparatus of the present disclosure will be described in detail with reference to the accompanying drawings.

Fig. 1 is a diagram of an application environment for real-time communication according to an embodiment of the present disclosure.

Referring to fig. 1, the application environment 100 includes a terminal 110 and a terminal 120.

Terminal 110 and terminal 120 may be terminals where users are located; for example, two users may communicate in real time using terminals 110 and 120, respectively. The terminals 110 and 120 may each be at least one of a smart phone, a tablet computer, a portable computer, a desktop computer, and the like. The terminal 110 may have a target application installed, for example an application for voice communication. Although the present embodiment shows only two terminals for illustration, those skilled in the art will appreciate that the number of terminals may be two or more. The number of terminals and the type of device are not limited in any way in the embodiments of the present disclosure.

The terminals 110 and 120 may communicate over a wireless network such that users of the terminals 110 and 120 may communicate in real time. For example, the network can comprise a Local Area Network (LAN), a Wide Area Network (WAN), a wireless link, an intranet, the internet, a combination thereof, or the like.

The terminal 110 may encode voice input by a user of the terminal 110 and then transmit the encoded voice data to the terminal 120, and the terminal 120 may decode the received encoded data and then restore the decoded data to a voice signal. In addition, the terminal 120 may encode voice input by the user of the terminal 120 and then transmit the encoded voice data to the terminal 110, and the terminal 110 may decode the received encoded data and then restore the decoded data to a voice signal. Terminal 110 and/or terminal 120 may implement encoding and decoding of speech simultaneously.

According to the present disclosure, when the terminal 110 serves as an encoding/transmitting end and the terminal 120 serves as a decoding/receiving end, the terminal 110 may determine a code rate for encoding according to a network state of communication, then determine which encoding mode to use to encode input voice according to the code rate, and then transmit encoded voice data to the terminal 120. The terminal 120 may determine a code rate for decoding according to the network status, and then determine which decoding mode to use for decoding the received encoded data according to the code rate, and then obtain the voice signal.

The method and the device can select a proper coding mode and a proper decoding mode according to the network state, so that under the condition of a weak network, good speech intelligibility and good speech tone quality can be ensured, excellent tone quality can be ensured under a high-speed network, and user experience is improved. How to encode and decode speech in different network states will be described in more detail below with reference to fig. 2 to 4.

Fig. 2 is a flowchart of a speech processing method for speech encoding according to an embodiment of the present disclosure. The speech processing method of fig. 2 may be implemented in any electronic device having an audio coding function. The electronic device may be a device including at least one of, for example, a smart phone, a tablet Personal Computer (PC), a mobile phone, a video phone, an electronic book reader (e-book reader), a desktop PC, a laptop PC, a netbook computer, a workstation, a server, a Personal Digital Assistant (PDA), a Portable Multimedia Player (PMP), a moving picture experts group (MPEG-1 or MPEG-2) audio layer 3(MP3) player, a camera, a wearable device, and the like.

Referring to fig. 2, in step S201, network status information of a current voice transmission network is acquired. For example, the network status information of the current voice transmission network may be acquired by monitoring the network status of the voice transmission in real time.

In step S202, a current code rate for encoding the input speech is determined according to the acquired network status information. For example, the current code rate suitable for encoding may be determined according to the transmission bandwidth of the current network; the better the current network state, the higher the code rate that can be selected.

As an example, by monitoring the current network state in real time, the code rate BitRate(n) that can be used for speech coding at the current time n is determined according to the current network state. For example, BitRate(n) may range from 1.5 kbps to 24 kbps.
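As a purely illustrative sketch (the disclosure does not specify how the monitored network state is mapped to a code rate), the selection might look like the following; the bandwidth estimate and headroom factor are assumptions, and only the 1.5-24 kbps range comes from the text.

```python
def select_bitrate_kbps(estimated_bandwidth_kbps: float,
                        headroom: float = 0.8,
                        min_kbps: float = 1.5,
                        max_kbps: float = 24.0) -> float:
    """Map a monitored bandwidth estimate to a usable coding rate.

    Only the 1.5-24 kbps range is taken from the text; the headroom factor
    and the bandwidth estimate itself are illustrative assumptions.
    """
    usable = estimated_bandwidth_kbps * headroom  # keep margin for jitter and overhead
    return max(min_kbps, min(max_kbps, usable))
```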

In step S203, speech characteristic information of the input speech and encoding parameters for encoding the speech characteristic information are determined based on the current code rate.

First, dimension information for extracting a voice feature may be determined based on the current code rate, and then voice feature information corresponding to the dimension information may be extracted from the input voice according to the determined dimension information. High-dimensional voice features are extracted in a high-code-rate network environment, and low-dimensional voice features are extracted in a low-code-rate environment; in the present disclosure, the dimension range of the dimension information can be 16 to 64.

The speech features may be mel-frequency cepstral coefficients MFCC, mel-frequency spectral features, but the disclosure is not limited thereto. For example, the dimension for extracting the speech feature may be first determined based on the current code rate, the speech feature of the corresponding dimension may be extracted from the input speech according to the determined dimension information, and then the speech feature may be encoded according to the determined encoding parameters.

For example, the speech feature extraction may be performed on the input speech based on the current code rate. The speech features can be extracted using equation (1) below:

feature(n)=FEAT[x(n),BitRate(n)] (1)

where x(n) represents the speech frame at the current time n, feature(n) represents the speech feature output at the current time n, and FEAT[·] represents the feature extraction process, which determines the feature dimension with reference to the current code rate BitRate(n). In a high-rate network environment, high-dimensional features may be extracted, and in a low-rate network environment, low-dimensional features may be extracted; for example, the dimension range may be set to 16 to 64. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
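As a non-authoritative sketch of the FEAT[·] step, assuming log-mel features computed with librosa (the disclosure does not name a feature type or backend) and an illustrative linear mapping from code rate to dimension within the stated 16-64 range:

```python
import numpy as np
import librosa  # assumed feature backend; the disclosure does not name one


def choose_feature_dim(bitrate_kbps: float) -> int:
    """Illustrative linear mapping of BitRate(n) onto the 16-64 dimension range."""
    frac = (bitrate_kbps - 1.5) / (24.0 - 1.5)
    return int(round(16 + max(0.0, min(1.0, frac)) * (64 - 16)))


def extract_features(frame: np.ndarray, sample_rate: int, bitrate_kbps: float) -> np.ndarray:
    """FEAT[x(n), BitRate(n)]: one rate-dependent log-mel vector per speech frame."""
    dim = choose_feature_dim(bitrate_kbps)
    mel = librosa.feature.melspectrogram(y=frame, sr=sample_rate, n_mels=dim)
    return np.log(mel + 1e-6).mean(axis=1)  # shape: (dim,)
```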

At least one of a codebook and an inter-frame dependency for encoding the speech feature information may be determined based on the current code rate, wherein the codebook represents the number of encoding bits of a single speech frame and the inter-frame dependency represents the amount of information that a speech frame references from other speech frames in a sequence of speech frames. The higher the current code rate, the higher the coding quality of the encoding parameters that can be selected.

In the present disclosure, the encoding parameters may include at least one of a codebook and an inter-frame dependency. The codebook determines the number of coded bits for a single frame's features: the larger the codebook, the more bits are used to encode a single speech frame. An inter-frame dependency may be understood as a bit-packing rule that determines the coding scheme between speech frames: the higher the inter-frame dependency, the more a speech frame's encoding relies on information from other speech frames in the sequence.

The multiple codebooks and the multiple inter-frame dependencies can be stored in the electronic equipment serving as the sending end/the encoding end in advance, so that the electronic equipment can select encoding parameters suitable for the current code rate from the stored codebooks and the inter-frame dependencies according to the current code rate.

A codebook used for encoding the input speech may be selected from a plurality of codebooks stored in advance based on the current code rate; for example, the higher the current code rate, the larger the codebook that may be selected. Likewise, the inter-frame dependency used to encode the input speech may be selected from a plurality of pre-stored inter-frame dependencies based on the current code rate; for example, the higher the current code rate, the weaker the inter-frame dependency that may be selected.

As an example, the codebook that needs to be currently used may be determined based on the current code rate using the following equation (2):

codebook(n) = codebook1,  if BitRate(n) < BitRate1
codebook(n) = codebook2,  if BitRate1 ≤ BitRate(n) < BitRate2
...
codebook(n) = codebookM,  if BitRate(n) ≥ BitRate(M-1)    (2)

where codebook(n) represents the codebook selected at the current time n; codebook1, codebook2, codebook3, ..., codebookM represent the M different codebooks; and BitRate1, BitRate2, BitRate3, ..., BitRate(M-1) represent different code rate thresholds, which increase with their index. However, the above example is merely exemplary, and the present disclosure may variously set the codebook contents and the number of codebooks as needed. In addition, the code rate thresholds used to determine the codebook may be set differently.
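A minimal sketch of the codebook selection in equation (2); the threshold values and the number of codebooks M are illustrative assumptions, not values from the disclosure:

```python
# Hypothetical code rate thresholds (kbps); the text only states that they
# increase with their index and that larger codebooks suit higher rates.
CODEBOOK_THRESHOLDS_KBPS = [3.0, 6.0, 12.0]                       # BitRate1 < BitRate2 < BitRate3
CODEBOOKS = ["codebook1", "codebook2", "codebook3", "codebook4"]  # M = 4 in this sketch


def select_codebook(bitrate_kbps: float) -> str:
    """Equation (2): pick the codebook whose rate band contains BitRate(n)."""
    for threshold, codebook in zip(CODEBOOK_THRESHOLDS_KBPS, CODEBOOKS):
        if bitrate_kbps < threshold:
            return codebook
    return CODEBOOKS[-1]  # the highest rates map to the largest codebook
```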

The inter-frame dependency that needs to be used currently can be determined based on the current code rate using equation (3) below:

distribution(n) = distribution_strong,  if BitRate(n) < BitRate_ha
distribution(n) = distribution_normal,  if BitRate_ha ≤ BitRate(n) < BitRate_hb
distribution(n) = distribution_weak,    if BitRate_hb ≤ BitRate(n) ≤ BitRate_hc    (3)

where distribution(n) represents the inter-frame dependency determined for the current time n; BitRate_ha, BitRate_hb, and BitRate_hc represent different code rate thresholds that increase in order; and distribution_strong, distribution_normal, and distribution_weak represent different inter-frame dependencies, namely strong inter-frame dependency (the third inter-frame dependency), medium inter-frame dependency (the second inter-frame dependency), and weak inter-frame dependency (the first inter-frame dependency). The difference between the inter-frame dependencies mainly lies in how much a speech frame in the sequence references the information of the speech frames adjacent to it: the more reference information, the stronger the inter-frame dependency and the lower the code rate. Furthermore, the code rate thresholds used to determine the inter-frame dependency may be set differently.

Although the above example classifies the inter-frame dependencies into only three cases, the present disclosure may classify them into more cases according to different needs.
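Similarly, a sketch of the three-way selection in equation (3), with hypothetical thresholds standing in for BitRate_ha and BitRate_hb:

```python
# Hypothetical thresholds standing in for BitRate_ha and BitRate_hb in equation (3).
STRONG_BELOW_KBPS = 3.0
NORMAL_BELOW_KBPS = 9.0


def select_inter_frame_dependency(bitrate_kbps: float) -> str:
    """Lower code rates get stronger inter-frame dependency (more shared frame info)."""
    if bitrate_kbps < STRONG_BELOW_KBPS:
        return "strong"   # third inter-frame dependency
    if bitrate_kbps < NORMAL_BELOW_KBPS:
        return "normal"   # second inter-frame dependency
    return "weak"         # first inter-frame dependency
```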

In step S204, the speech feature information of the input speech is encoded according to the determined encoding parameters.

As an example, when the inter-frame dependencies are selected to be weak inter-frame dependencies, each speech frame may be encoded using a set of vectors in the selected codebook.

For example, for weak inter-frame dependencies, each speech frame may be represented by a set of vectors in a Vector Quantization (VQ) codebook, which may be represented by equation (4) below:

VQfeature(n)=VQ[feature(n)] (4)

where feature(n) represents the speech feature at the current time n, VQfeature(n) represents the VQ-encoded speech feature at the current time n, and VQ[·] represents the VQ encoding process.
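As an illustration of the VQ[·] operation, a nearest-codeword search over a codebook array (the codebook contents themselves are stored in advance and are not specified here) could be sketched as:

```python
import numpy as np


def vq_encode(feature: np.ndarray, codebook: np.ndarray) -> int:
    """VQ[feature(n)]: index of the codeword nearest to the feature vector.

    codebook has shape (num_codewords, feature_dim); the per-frame bit cost
    is ceil(log2(num_codewords)), so larger codebooks spend more bits.
    """
    distances = np.linalg.norm(codebook - feature, axis=1)
    return int(np.argmin(distances))


def vq_decode(index: int, codebook: np.ndarray) -> np.ndarray:
    """Inverse lookup: VQfeature(n) is simply the selected codeword."""
    return codebook[index]
```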

When inter-frame dependencies are selected to be medium, every other speech frame may be encoded using a set of vectors in the selected codebook, and speech frames not encoded by a codebook may be encoded using the average of the two encoded speech frames adjacent thereto.

For example, for medium inter-frame dependency, every other speech frame may be represented by a set of vectors in the VQ codebook, and one speech frame between two VQ-encoded speech frames may be determined by the mean of its two neighboring frames (i.e., the two VQ-encoded speech frames), which may be represented by the following equation (5):

VQfeature(n) = VQ[feature(n)],                         if mod(n, 2) = 1
VQfeature(n) = [VQfeature(n-1) + VQfeature(n+1)] / 2,  if mod(n, 2) = 0    (5)

where VQfeature(n-1) represents the VQ-encoded speech feature at the previous time, VQfeature(n+1) represents the VQ-encoded speech feature at the subsequent time, feature(n) represents the speech feature at the current time n, VQfeature(n) represents the VQ-encoded speech feature at the current time n, VQ[·] represents the VQ encoding process, mod(n, 2) = 1 represents the case where a speech frame is represented by a set of vectors in the VQ codebook, and mod(n, 2) = 0 represents the case where a speech frame is not represented by a set of vectors in the VQ codebook.
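A sketch of the medium-dependency packing of equation (5), reusing vq_encode from the sketch above; the exact frame positions and data layout are assumptions of this illustration. Frames at odd positions carry VQ indices, while even-position frames are reconstructed at the decoder as the mean of their coded neighbours and therefore consume no bits.

```python
from typing import Dict

import numpy as np


def encode_medium_dependency(features: np.ndarray, codebook: np.ndarray) -> Dict[int, int]:
    """Equation (5): only frames with mod(n, 2) = 1 carry a VQ index."""
    return {n: vq_encode(features[n], codebook)
            for n in range(features.shape[0]) if n % 2 == 1}


def decode_medium_dependency(indices: Dict[int, int], codebook: np.ndarray,
                             num_frames: int) -> np.ndarray:
    """Rebuild all frames; un-coded frames become the mean of their coded neighbours."""
    recon = np.zeros((num_frames, codebook.shape[1]))
    for n, idx in indices.items():
        recon[n] = codebook[idx]
    for n in range(0, num_frames, 2):  # the frames that carried no bits
        left = recon[n - 1] if n - 1 >= 0 else recon[n + 1]
        right = recon[n + 1] if n + 1 < num_frames else recon[n - 1]
        recon[n] = 0.5 * (left + right)
    return recon
```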

When the inter-frame dependency is selected to be strong, in every fourth speech frame, one speech frame may be encoded using a set of vectors in the selected codebook, a speech frame one frame away from the speech frame (the speech frame encoded using the selected codebook) may be encoded using a set of vectors in a differential codebook, and the remaining two speech frames may be encoded using the average of the two speech frames adjacent thereto, respectively.

For example, for strong inter-frame dependency, one speech frame out of every four speech frames may be represented by a set of vectors in the codebook, a speech frame one speech frame apart from the speech frame represented by the set of vectors in the codebook may be represented by a set of vectors in a differential VQ codebook, and the remaining two speech frames out of the four speech frames may be obtained by mean encoding of two adjacent speech frames, which may be represented by the following equation (6):

VQfeature(n) = VQ[feature(n)],                                               if mod(n, 4) = 0
VQfeature(n) = [VQfeature(n-1) + VQfeature(n+1)] / 2,                        if mod(n, 4) = 1 or mod(n, 4) = 3
VQfeature(n) = VQfeature(n-2) + diffVQfeature[feature(n) - VQfeature(n-2)],  if mod(n, 4) = 2    (6)

where the variables that also appear in equation (5) have the same meanings as in equation (5), and diffVQfeature[· - ·] represents the differential VQ process, i.e., a low-rate VQ encoding performed on the difference of its two operands.
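Continuing the same sketch for strong inter-frame dependency, again reusing vq_encode; the mod(n, 4) frame positions mirror equation (6), and the differential codebook is assumed to be a second, lower-rate codebook:

```python
from typing import Dict, Tuple

import numpy as np


def encode_strong_dependency(features: np.ndarray, codebook: np.ndarray,
                             diff_codebook: np.ndarray) -> Dict[int, Tuple[str, int]]:
    """Equation (6): per group of four frames, frame 0 gets a full VQ index,
    frame 2 gets a differential VQ index against the reconstruction of frame 0,
    and frames 1 and 3 carry no bits (they are decoded as neighbour means)."""
    packed: Dict[int, Tuple[str, int]] = {}
    for n in range(features.shape[0]):
        if n % 4 == 0:
            packed[n] = ("vq", vq_encode(features[n], codebook))
        elif n % 4 == 2:
            anchor = codebook[packed[n - 2][1]]                # VQfeature(n - 2)
            packed[n] = ("diff", vq_encode(features[n] - anchor, diff_codebook))
    return packed
```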

Finally, the encoded voice feature information is sent to the receiving end/decoding end. For example, the encoded voice data may be transmitted to the receiving/decoding end through the network.

According to an embodiment of the present disclosure, before encoding the input speech, the input speech may be subjected to noise reduction processing to obtain cleaner speech. The noise reduction may be performed using a CRNN-based neural network noise reduction method, and the speech features of the noise-reduced speech may then be extracted based on the current code rate.

For example, noise reduction processing may be performed on the currently input original speech to obtain noise-removed speech. For example, the denoising process may be performed using the following equation (7):

x(n)=NR[raw(n)] (7)

where raw(n) represents the original input speech frame at the current time n (the length of each speech frame is generally 10 ms to 20 ms), x(n) represents the noise-reduced speech frame at the current time n, and NR[·] represents the noise reduction process.

This low-complexity, high-quality speech coding method can realize adaptive scalable coding over the range from 1.5 kbps to 24 kbps, ensuring high speech intelligibility and sound quality under weak network conditions while also providing a good real-time communication experience with excellent sound quality at high network speeds.

Fig. 3 is a flowchart of a speech processing method for speech decoding according to an embodiment of the present disclosure. The speech processing method of fig. 3 may be implemented in any electronic device having an audio decoding function. The electronic device may be a device including at least one of, for example, a smart phone, a tablet Personal Computer (PC), a mobile phone, a video phone, an electronic book reader (e-book reader), a desktop PC, a laptop PC, a netbook computer, a workstation, a server, a Personal Digital Assistant (PDA), a Portable Multimedia Player (PMP), a moving picture experts group (MPEG-1 or MPEG-2) audio layer 3(MP3) player, a camera, a wearable device, and the like.

Referring to fig. 3, in step S301, network status information of a current voice transmission network is acquired.

In step S302, a current code rate for decoding the speech feature information of the received encoded speech is determined according to the acquired network status information.

For example, the current code rate suitable for decoding may be determined according to the transmission bandwidth of the current network; the better the current network state, the higher the code rate that can be selected.

In step S303, decoding parameters for decoding the speech feature information of the encoded speech are determined based on the current code rate. The higher the current code rate, the higher the decoding quality of the decoding parameters that can be selected. The decoding parameters may include at least one of a codebook and an inter-frame dependency: the larger the codebook, the more bits are used to encode a single speech frame, and the higher the inter-frame dependency, the more a speech frame's encoding relies on information from other speech frames in the sequence.

The multiple codebooks and the multiple inter-frame dependencies can be stored in the electronic equipment serving as the receiving end/the decoding end in advance, so that the electronic equipment can select the decoding parameters suitable for the current code rate from the stored codebooks and the inter-frame dependencies according to the current code rate.

In the present disclosure, the codebooks and inter-frame dependencies pre-stored at the sending end/encoding end are the same as those pre-stored at the receiving end/decoding end, so that after the receiving end/decoding end receives the encoded data sent by the sending end, it can correspondingly decode the encoded data according to the same parameters.

As an example, a codebook for decoding the encoded speech may be selected from a plurality of codebooks stored in advance based on the current code rate; for example, the higher the current code rate, the larger the codebook that may be selected. The inter-frame dependency for decoding the encoded speech may be selected from a plurality of pre-stored inter-frame dependencies based on the current code rate; for example, the higher the current code rate, the weaker the inter-frame dependency that may be selected. The process of selecting the decoding parameters is similar to the process of selecting the encoding parameters, and the decoding parameters may be selected with reference to equations (2) and (3) described above.

In step S304, the speech feature information of the encoded speech is decoded in accordance with the determined decoding parameters. The process of decoding the encoded speech can be considered as the inverse process of encoding the input speech.

When the inter-frame dependencies are selected to be weak inter-frame dependencies, each speech frame is decoded by a set of vectors in the selected codebook.

When the inter-frame dependency is selected to be medium, every other speech frame may be decoded by a set of vectors in the selected codebook, and a speech frame not encoded by a codebook may be decoded by the average of the two encoded speech frames adjacent to it.

When the inter-frame dependency is selected to be strong, in every four speech frames, one speech frame may be decoded by a set of vectors in the selected codebook, a speech frame one frame apart from the speech frame may be decoded by a set of vectors in a pre-stored differential codebook, and the remaining two speech frames may be decoded by an average of two adjacent speech frames, respectively.

According to an embodiment of the present disclosure, a speech feature of the encoded speech may be decoded according to the determined decoding parameter, and then a speech signal corresponding to the encoded speech may be generated based on the decoded speech feature. For example, a neural network may be used to generate the final speech signal based on the decoded speech features.

For example, in the case of receiving VQ-encoded speech features from a transmitting end, a process of decoding the VQ-encoded speech features may be represented by the following equation (8):

DeVQfeature(n)=DeVQ[VQfeature(n),codebook(n),distribution(n)] (8)

where DeVQ[·, codebook(n), distribution(n)] represents the decoding of the speech features based on the VQ codebook and inter-frame dependency obtained from the code rate at the current time n, DeVQfeature(n) represents the speech features decoded at the current time n, and VQfeature(n) represents the VQ-encoded speech features received at the current time n.
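A sketch of the DeVQ[·] step of equation (8), selecting the inverse packing rule from the signalled inter-frame dependency; it reuses decode_medium_dependency and the index formats from the encoding sketches above, which are assumptions of this illustration rather than the disclosure's exact bitstream format.

```python
import numpy as np


def devq(packed, codebook, diff_codebook, dependency: str, num_frames: int):
    """DeVQ[VQfeature(n), codebook(n), distribution(n)] from equation (8)."""
    if dependency == "weak":
        # every frame carries its own VQ index
        return np.stack([codebook[packed[n]] for n in range(num_frames)])
    if dependency == "normal":
        return decode_medium_dependency(packed, codebook, num_frames)
    # strong: invert the mod(n, 4) packing used at the encoder
    recon = np.zeros((num_frames, codebook.shape[1]))
    for n in range(num_frames):
        if n % 4 == 0:
            recon[n] = codebook[packed[n][1]]
        elif n % 4 == 2:
            recon[n] = recon[n - 2] + diff_codebook[packed[n][1]]
    for n in range(num_frames):
        if n % 4 in (1, 3):                                   # frames that carried no bits
            left = recon[n - 1] if n - 1 >= 0 else recon[n + 1]
            right = recon[n + 1] if n + 1 < num_frames else recon[n - 1]
            recon[n] = 0.5 * (left + right)
    return recon
```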

Next, a final speech may be generated based on the decoded speech features. The process of generating speech can be represented by equation (9) below:

y(n)=SpeechGenerate[DeVQfeature(n)] (9)

where y(n) represents the output speech frame at the current time n, and SpeechGenerate[·] represents the speech generation (speech decoding) process, which may generally employ a neural-network-based speech generation model.

As an example, a neural network may be utilized to generate the final speech. For example, the speech generation model may be formed by four serially connected modules: two convolutional neural network (CNN) layers and two gated recurrent unit (GRU) layers. The CNN layers are generally one-dimensional convolutions with a kernel size of 3, which may be adjusted as required; the number of units in each GRU layer is generally 512; and the activation function is generally a tanh function or a softmax function. The training data for the model may be obtained from hundreds of hours of clean speech covering hundreds of speakers of different ages and genders. Speech features are extracted from the clean speech using equation (1) above, these features are then encoded and decoded at different code rates according to equations (4), (5), (6), and (8), and the resulting features are used as the input of the speech generation model; the model outputs predicted speech, and it is trained by comparing the real clean speech with the corresponding predicted speech. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
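The disclosure does not fix a framework or the exact layer ordering; a minimal PyTorch sketch consistent with the description (two one-dimensional CNN layers with kernel size 3 and two GRU layers of 512 units, with tanh/softmax activations) might look like the following, where the channel counts and output head are illustrative assumptions:

```python
import torch
import torch.nn as nn


class SpeechGenerator(nn.Module):
    """Sketch of the CNN + GRU speech generation model; layer ordering,
    channel counts, and the output head are illustrative assumptions."""

    def __init__(self, feature_dim: int = 64, hidden: int = 512, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feature_dim, hidden, kernel_size=3, padding=1),
            nn.Tanh(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.Tanh(),
        )
        self.gru1 = nn.GRU(hidden, hidden, batch_first=True)
        self.gru2 = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, out_dim)  # e.g. quantized-sample classes -> softmax

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, feature_dim) of decoded DeVQfeature(n)
        x = self.conv(features.transpose(1, 2)).transpose(1, 2)
        x, _ = self.gru1(x)
        x, _ = self.gru2(x)
        return torch.log_softmax(self.out(x), dim=-1)
```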

FIG. 4 is a flow diagram for speech processing according to an embodiment of the present disclosure.

Referring to fig. 4, the noise reduction module may remove noise from the speech input at the transmitting/encoding end so as to preserve the speech signal intended for transmission. The feature extraction module may extract representative key information that characterizes the speech from the noise-reduced speech signal. The feature encoding module may encode the speech features using the codebook and bit distribution rule (i.e., inter-frame dependency) determined based on the current code rate, i.e., it represents the speech features with a particular number of bits. The network transmission module may transmit the encoded feature information to the receiving end/decoding end. The network status monitoring module may monitor the network status in real time to determine the bandwidth that can be allocated for speech coding. The feature decoding module may restore the speech features encoded by the transmitting end to an approximation of the original speech features. The speech decoding module may then generate the final output speech from the decoded speech features.

According to the embodiments of the present disclosure, an appropriate coding scheme is selected in view of the network state, so that good speech intelligibility and sound quality can be ensured under weak network conditions, while excellent sound quality can be guaranteed over a high-speed network, improving the user experience.

Fig. 5 is a schematic structural diagram of a speech processing device in a hardware operating environment according to an embodiment of the present disclosure. Here, the speech processing apparatus 500 may implement the encoding function and/or the decoding function described above.

As shown in fig. 5, the speech processing apparatus 500 may include: a processing component 501, a communication bus 502, a network interface 503, an input-output interface 504, a memory 505, and a power component 506. The communication bus 502 is used to implement communication connections between these components. The input-output interface 504 may include a video display (such as a liquid crystal display), a microphone and speakers, and a user-interaction interface (such as a keyboard, mouse, or touch-input device), and optionally the input-output interface 504 may also include a standard wired interface and a wireless interface. The network interface 503 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 505 may be a high-speed random access memory or a stable non-volatile memory. The memory 505 may alternatively be a storage device separate from the processing component 501 described previously.

Those skilled in the art will appreciate that the configuration shown in FIG. 5 does not constitute a limitation of the speech processing apparatus 500, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.

As shown in fig. 5, the memory 505, which is one type of storage medium, may include therein an operating system (such as a MAC operating system), a data storage module, a network communication module, a user interface module, a voice processing program, and a database.

In the voice processing apparatus 500 shown in fig. 5, the network interface 503 is mainly used for data communication with an external apparatus/terminal; the input/output interface 504 is mainly used for data interaction with a user; and the speech processing device 500 executes the speech processing method provided by the embodiments of the present disclosure by having the processing component 501 call the speech processing program stored in the memory 505 and the various APIs provided by the operating system.

The processing component 501 may include at least one processor, and the memory 505 has stored therein a set of computer-executable instructions that, when executed by the at least one processor, perform a method of speech processing according to an embodiment of the disclosure. Further, the processing component 501 may perform encoding operations and decoding operations, etc., as described above. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.

A preset plurality of codebooks and a variety of bit distribution rules (i.e., inter-frame dependencies) may be stored in the memory 505 in advance to select the codebook and the bit distribution rule for encoding or decoding suitable for the current network state. Processing component 501 may encode the input speech or decode the received encoded speech according to the selected codebook and bit distribution rules.

By way of example, the speech processing device 500 may be a PC, a tablet device, a personal digital assistant, a smart phone, or another device capable of executing the set of instructions described above. The speech processing apparatus 500 need not be a single electronic device, but can be any collection of devices or circuits that can individually or jointly execute the instructions (or sets of instructions). The speech processing device 500 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote devices (e.g., via wireless transmission).

In the speech processing apparatus 500, the processing component 501 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processing component 501 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.

The processing component 501 may execute instructions or code stored in a memory, wherein the memory 505 may also store data. Instructions and data may also be sent and received over a network via the network interface 503, where the network interface 503 may employ any known transmission protocol.

The memory 505 may be integral to the processor, e.g., having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 505 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device that may be used by a database system. The memory and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the memory.

Fig. 6 is a block diagram of a speech processing apparatus for speech encoding according to an embodiment of the present disclosure.

Referring to fig. 6, the speech processing apparatus 600 may include a network status monitoring module 601, a noise reduction module 602, a feature extraction module 603, an encoding module 604, and a transmission module 605. Each module in the voice processing apparatus 600 may be implemented by one or more modules, and names of the corresponding modules may vary according to types of the modules. In various embodiments, some modules in the speech processing apparatus 600 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.

The network status monitoring module 601 may monitor the network status of the voice transmission in real time and determine the current code rate for encoding the input voice according to the current network status. The better the current network state, the higher the code rate the network status monitoring module 601 may select.
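
As a hedged example only, the mapping from monitored network state to code rate might resemble the following sketch; the bandwidth and packet-loss thresholds, the code-rate values, and the function name select_code_rate are illustrative assumptions, not values given by the disclosure.

```python
def select_code_rate(bandwidth_kbps: float, packet_loss_rate: float) -> int:
    """Illustrative mapping: the better the monitored network state,
    the higher the code rate (bits per second) selected for encoding."""
    if bandwidth_kbps > 64 and packet_loss_rate < 0.01:
        return 2400   # good network: highest code rate
    if bandwidth_kbps > 32 and packet_loss_rate < 0.05:
        return 1200   # moderate network
    return 600        # poor network: lowest code rate
```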

The feature extraction module 603 may determine dimension information for extracting the speech feature based on the current code rate, and extract speech feature information corresponding to the dimension information from the input speech according to the dimension information.

The encoding module 604 determines encoding parameters for the voice feature information of the input voice based on the current code rate and encodes the voice feature information of the input voice according to those encoding parameters. The higher the current code rate, the higher the encoding quality of the encoding parameters the encoding module 604 may select.

The transmitting module 605 may transmit the encoded voice to a receiving/decoding end.

According to an embodiment of the present disclosure, the encoding parameters may include at least one of a codebook and an inter-frame dependency, wherein a larger codebook means more coding bits for a single speech frame, and a higher inter-frame dependency means that a speech frame references more frame information of other speech frames in the speech frame sequence.

The encoding module 604 may select a codebook to be used for encoding the input speech from a plurality of codebooks stored in advance based on the current code rate. For example, the higher the current code rate, the larger the codebook the encoding module 604 may select.

The encoding module 604 may select an inter-frame dependency for encoding the input speech from a plurality of pre-stored inter-frame dependencies based on the current code rate. For example, the higher the current code rate, the lower the inter-frame dependency the encoding module 604 may select.
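
Combining the two selection rules, a minimal sketch of the encoding-parameter choice could look as follows; the code-rate thresholds and the returned table names ("small"/"weak", etc., referring to the illustrative tables sketched earlier) are assumptions made for this example.

```python
def select_encoding_parameters(code_rate: int):
    """Illustrative rule: a higher code rate selects a larger codebook
    and a weaker (lower) inter-frame dependency; a lower code rate
    selects a smaller codebook and a stronger dependency."""
    if code_rate >= 2400:
        return "large", "weak"
    if code_rate >= 1200:
        return "medium", "medium"
    return "small", "strong"
```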

When a weak inter-frame dependency is selected, the encoding module 604 may encode each speech frame using a set of vectors in the codebook.

When a medium inter-frame dependency is selected, the encoding module 604 may encode every other speech frame using a set of vectors in the codebook, and encode each speech frame not encoded by the codebook using the average of its two adjacent encoded speech frames.

When a strong inter-frame dependency is selected, in every four speech frames, one speech frame may be encoded by a set of vectors in the codebook, the speech frame one frame apart from that speech frame may be encoded by a set of vectors in the differential codebook, and the remaining two speech frames may each be encoded by the mean of the two speech frames adjacent to them.
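
The three bit distribution rules can be sketched in code as follows; the nearest-neighbour quantization, the (kind, index) symbol format, and the function names quantize and encode_features are assumptions chosen for illustration, and the codebook and differential codebook are passed in as NumPy arrays like the hypothetical tables sketched earlier.

```python
import numpy as np

def quantize(vector, codebook):
    """Index of the nearest codebook vector (nearest-neighbour VQ)."""
    return int(np.argmin(np.sum((codebook - vector) ** 2, axis=1)))

def encode_features(features, codebook, diff_codebook, pattern):
    """Encode frame-wise features under one bit distribution rule.

    features: array of shape (num_frames, dim); pattern: e.g.
    ["codebook", "interpolate", "differential", "interpolate"].
    Interpolated frames produce no payload; the decoder reconstructs
    them as the mean of the two adjacent coded frames."""
    symbols, prev_coded = [], None
    for i, frame in enumerate(features):
        kind = pattern[i % len(pattern)]
        if kind == "codebook":
            idx = quantize(frame, codebook)
            prev_coded = codebook[idx]
            symbols.append(("codebook", idx))
        elif kind == "differential":
            idx = quantize(frame - prev_coded, diff_codebook)
            prev_coded = prev_coded + diff_codebook[idx]
            symbols.append(("differential", idx))
        else:  # "interpolate": zero extra bits for this frame
            symbols.append(("interpolate", None))
    return symbols
```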

The encoding module 604 may encode speech characteristics of the input speech.

The feature extraction module 603 may determine a dimension for extracting speech features based on the current code rate, and extract speech features of the corresponding dimension from the input speech. The encoding module 604 may then encode the speech features according to the determined encoding parameters.
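
A hedged sketch of the dimension selection and feature extraction might look like the following; the dimension values and the toy log-spectral feature are assumptions standing in for whatever features the encoder actually uses.

```python
import numpy as np

def select_feature_dimension(code_rate: int) -> int:
    """Illustrative rule: a higher code rate affords a higher-dimensional
    feature vector per speech frame."""
    if code_rate >= 2400:
        return 30
    if code_rate >= 1200:
        return 20
    return 10

def extract_features(frames: np.ndarray, dim: int) -> np.ndarray:
    """Toy feature: log magnitude of the first `dim` FFT bins of each
    frame (frames: shape (num_frames, samples_per_frame))."""
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectra[:, :dim] + 1e-8)
```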

In addition, before encoding the input speech, the noise reduction module 602 may perform noise reduction processing on the input speech, and then the encoding module 604 may encode the input speech subjected to the noise reduction processing according to the determined encoding parameters.

Fig. 7 is a block diagram of a speech processing apparatus at a receiving end according to an embodiment of the present disclosure.

Referring to fig. 7, the speech processing apparatus 700 may include a network status monitoring module 701, a receiving module 702, a decoding module 703, and a speech generating module 704. Each module in the voice processing apparatus 700 may be implemented by one or more modules, and names of the corresponding modules may vary according to types of the modules. In various embodiments, some modules in the speech processing apparatus 700 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.

The network status monitoring module 701 may monitor the network status of the voice transmission in real time and determine the current code rate for decoding the encoded voice according to the current network status. The better the current network state, the higher the code rate the network status monitoring module 701 may select.

The receiving module 702 may receive encoded speech.

The decoding module 703 may determine decoding parameters for the speech feature information of the encoded speech based on the current code rate, and decode the speech feature information of the encoded speech according to the determined decoding parameters. The higher the current code rate, the higher the decoding quality of the decoding parameters the decoding module 703 may select.

The decoding parameters may include at least one of a codebook and an inter-frame dependency, wherein a larger codebook means more coding bits for a single speech frame, and a higher inter-frame dependency means that a speech frame references more frame information of other speech frames in the speech frame sequence.

The decoding module 703 may select a codebook for decoding the encoded speech from a plurality of pre-stored codebooks based on the current code rate. For example, the higher the current code rate, the larger the codebook the decoding module 703 may select.

The decoding module 703 may select an inter-frame dependency for decoding the encoded speech from a plurality of pre-stored inter-frame dependencies based on the current code rate. For example, the higher the current code rate, the lower the inter-frame dependency the decoding module 703 may choose.

When a weak inter-frame dependency is selected, the decoding module 703 may decode each speech frame using a set of vectors in the selected codebook.

When a medium inter-frame dependency is selected, the decoding module 703 may decode every other speech frame using a set of vectors in the selected codebook, and reconstruct each speech frame not encoded by the codebook using the mean of its two adjacent encoded speech frames.

When a strong inter-frame dependency is selected, in every four speech frames, the decoding module 703 may decode one speech frame using a set of vectors in the selected codebook, decode the speech frame one frame apart from that speech frame using a set of vectors in a pre-stored differential codebook, and reconstruct each of the other two speech frames using the mean of its two adjacent speech frames.
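
On the decoding side, the same rules can be mirrored as in the following sketch; the symbol format matches the hypothetical encoder sketch above, and the handling of interpolated frames at the sequence boundaries is an assumption made for the example.

```python
import numpy as np

def decode_features(symbols, codebook, diff_codebook):
    """Reconstruct frame-wise features from (kind, index) symbols.
    Interpolated frames are filled with the mean of the nearest coded
    frames on either side (or copied from one side at the boundaries)."""
    frames, prev_coded = [None] * len(symbols), None
    for i, (kind, idx) in enumerate(symbols):
        if kind == "codebook":
            prev_coded = codebook[idx]
            frames[i] = prev_coded
        elif kind == "differential":
            prev_coded = prev_coded + diff_codebook[idx]
            frames[i] = prev_coded
    # Second pass: fill interpolated frames from their coded neighbours.
    coded = [i for i, f in enumerate(frames) if f is not None]
    for i in range(len(frames)):
        if frames[i] is None:
            left = max((j for j in coded if j < i), default=None)
            right = min((j for j in coded if j > i), default=None)
            if left is not None and right is not None:
                frames[i] = (frames[left] + frames[right]) / 2
            else:
                frames[i] = frames[left if left is not None else right]
    return np.stack(frames)
```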

The decoding module 703 may decode the speech feature of the encoded speech according to the determined decoding parameter, and the speech generating module 704 may generate a speech signal corresponding to the encoded speech based on the decoded speech feature.

The speech generation module 704 may generate a corresponding speech signal using a neural network based on the decoded speech features.
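
The disclosure does not specify the architecture of that neural network; as a purely illustrative stand-in, a tiny PyTorch module mapping each decoded feature vector to one frame of waveform samples might look like this, with the layer sizes, frame length, and feature dimension all being assumptions.

```python
import torch
import torch.nn as nn

class ToyVocoder(nn.Module):
    """Illustrative vocoder stub: one decoded feature vector in,
    one frame of waveform samples out."""
    def __init__(self, feature_dim: int = 20, frame_samples: int = 160):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256),
            nn.Tanh(),
            nn.Linear(256, frame_samples),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (num_frames, feature_dim) -> concatenated waveform samples
        return self.net(features).reshape(-1)

# Example usage (decoded_features: array of shape (num_frames, 20)):
# waveform = ToyVocoder()(torch.as_tensor(decoded_features, dtype=torch.float32))
```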

According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 8 is a block diagram of an electronic device according to an embodiment of the disclosure. The electronic device 1000 may include at least one memory 1002 and at least one processor 1001; the at least one memory 1002 stores a set of computer-executable instructions that, when executed by the at least one processor 1001, perform a speech processing method (i.e., a speech encoding method and/or a speech decoding method) according to an embodiment of the disclosure.

The processor 1001 may include a Central Processing Unit (CPU), an audio processor, a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 1001 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, or the like.

The memory 1002, which is a kind of storage medium, may include an operating system (e.g., a MAC operating system), a data storage module, a network communication module, a user interface module, an audio processing program, an audio codec program, and a database.

The memory 1002 may be integrated with the processor 1001, for example, RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. Further, memory 1002 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 1002 and the processor 1001 may be operatively coupled, or may communicate with each other, such as through I/O ports, network connections, etc., so that the processor 1001 can read files stored in the memory 1002.

In addition, the electronic device 1000 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 1000 may be connected to each other via a bus and/or a network.

By way of example, the electronic device 1000 may be a PC computer, tablet device, personal digital assistant, smartphone, or other device capable of executing the set of instructions described above. The electronic device 1000 need not be a single electronic device; it can be any collection of devices or circuits that can individually or jointly execute the above instructions (or sets of instructions). The electronic device 1000 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with a local or remote system (e.g., via wireless transmission).

Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a speech processing method (speech encoding method and/or speech decoding method) according to the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can be run in an environment deployed on a computer device such as a client, a host, a proxy device, or a server; further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.

According to an embodiment of the present disclosure, a computer program product may also be provided, in which instructions are executable by a processor of a computer device to perform the above-described speech processing method (speech encoding method and/or speech decoding method).

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
