Audio data processing method, device, equipment and storage medium

Document No.: 719640 · Publication date: 2021-04-16

Reading note: this technology, "Audio data processing method, device, equipment and storage medium", was designed and created by 袁俊, 陈昌滨, 王俊超 and 聂志朋 on 2020-12-09. The application discloses an audio data processing method, apparatus, device, and storage medium, relating to artificial intelligence fields such as speech technology and deep learning. The specific implementation scheme is as follows: acquire an original feature tensor of the audio data to be processed; acquire a feature tensor to be processed and a key feature tensor according to the original feature tensor and a learnable weight tensor; perform dimension transformation on the target dimensions in the feature tensor to be processed and the key feature tensor, respectively, to obtain a feature tensor to be compressed and a candidate key feature tensor; acquire a weight matrix according to the feature tensor to be compressed and the candidate key feature tensor; acquire a target feature tensor according to the weight matrix and the candidate key feature tensor, and process the target feature tensor to obtain a compressed feature tensor; and input the compressed feature tensor into a neural network for processing to obtain a processing result of the audio data to be processed. Thus, information compression efficiency is improved while information compression quality is ensured, and the subsequent speech processing effect is improved.

1. An audio data processing method, comprising:

acquiring an original feature tensor of audio data to be processed, and acquiring a feature tensor to be processed and a key feature tensor according to the original feature tensor and a learnable weight tensor;

performing dimension transformation on the target dimensions in the feature tensor to be processed and the key feature tensor, respectively, to obtain a feature tensor to be compressed and a candidate key feature tensor;

acquiring a weight matrix, and acquiring a target feature tensor according to the weight matrix and the candidate key feature tensor;

and processing the target feature tensor to obtain a compressed feature tensor, inputting the compressed feature tensor into a neural network for processing, and obtaining a processing result of the audio data to be processed.

2. The audio data processing method according to claim 1, wherein the obtaining a to-be-processed feature tensor and a key feature tensor according to the original feature tensor and the learnable weight tensor includes:

performing matrix multiplication on the original feature tensor and a learnable first weight tensor to obtain the feature tensor to be processed;

and performing matrix multiplication on the original feature tensor and a learnable second weight tensor to obtain the key feature tensor.

3. The audio data processing method of claim 1, wherein the obtaining a weight matrix comprises:

and acquiring a weight matrix according to the feature tensor to be compressed and the candidate key feature tensor.

4. The audio data processing method according to claim 1, wherein the performing dimension transformation on the target dimension in the feature tensor to be processed and the key feature tensor to obtain the feature tensor to be compressed and the candidate key feature tensor respectively comprises:

inserting a target dimension in front of the target dimension of the feature tensor to be processed to obtain the feature tensor to be compressed;

and splitting the target dimension of the key feature tensor to obtain the candidate key feature tensor.

5. The audio data processing method according to claim 3, wherein the obtaining a weight matrix according to the feature tensor to be compressed and the candidate key feature tensor comprises:

performing matrix multiplication on the target dimensions of the feature tensor to be compressed and the candidate key feature tensor;

and processing the data corresponding to the target dimension in the feature tensor obtained by the matrix multiplication to obtain the weight matrix.

6. The audio data processing method of claim 1, wherein the obtaining a target feature tensor according to the weight matrix and the candidate key feature tensor comprises:

transposing the target dimension of the candidate key feature tensor to obtain a transposed key feature tensor;

and performing matrix multiplication on the weight matrix and the transposed key feature tensor to obtain the target feature tensor.

7. An audio data processing apparatus comprising:

the first acquisition module is used for acquiring an original characteristic tensor of the audio data to be processed;

the second acquisition module is used for acquiring a to-be-processed feature tensor and a key feature tensor according to the original feature tensor and the learnable weight tensor;

a third acquisition module, configured to perform dimension transformation on the target dimensions in the feature tensor to be processed and the key feature tensor, respectively, to obtain a feature tensor to be compressed and a candidate key feature tensor;

a fourth acquisition module, configured to acquire a weight matrix;

a fifth acquisition module, configured to acquire a target feature tensor according to the weight matrix and the candidate key feature tensor;

and a processing module, configured to process the target feature tensor to obtain a compressed feature tensor, input the compressed feature tensor into a neural network for processing, and obtain a processing result of the audio data to be processed.

8. The audio data processing apparatus of claim 7, wherein the second acquisition module is specifically configured to:

perform matrix multiplication on the original feature tensor and a learnable first weight tensor to obtain the feature tensor to be processed;

and perform matrix multiplication on the original feature tensor and a learnable second weight tensor to obtain the key feature tensor.

9. The audio data processing apparatus of claim 7, wherein the fourth acquisition module is specifically configured to:

and acquiring a weight matrix according to the feature tensor to be compressed and the candidate key feature tensor.

10. The audio data processing apparatus of claim 7, wherein the third acquisition module is specifically configured to:

insert a target dimension in front of the target dimension of the feature tensor to be processed to obtain the feature tensor to be compressed;

and split the target dimension of the key feature tensor to obtain the candidate key feature tensor.

11. The audio data processing apparatus of claim 9, wherein the fourth acquisition module is specifically configured to:

perform matrix multiplication on the target dimensions of the feature tensor to be compressed and the candidate key feature tensor;

and process the data corresponding to the target dimension in the feature tensor obtained by the matrix multiplication to obtain the weight matrix.

12. The audio data processing apparatus of claim 7, wherein the fifth acquisition module is specifically configured to:

transpose the target dimension of the candidate key feature tensor to obtain a transposed key feature tensor;

and performing matrix multiplication on the weight matrix and the transposed key feature tensor to obtain the target feature tensor.

13. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the audio data processing method of any of claims 1-6.

14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the audio data processing method according to any one of claims 1 to 6.

15. A computer program product, characterized in that instructions in the computer program product, when executed by a processor, implement the audio data processing method of any of claims 1-6.

Technical Field

The present application relates to the field of artificial intelligence technologies such as voice technology and deep learning in the field of data processing technologies, and in particular, to an audio data processing method, apparatus, device, and storage medium.

Background

Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), spanning both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.

In general, in an artificial neural network, feature dimensions need to be compressed through an information bottleneck structure to retain the main features and remove unnecessary information. For example, an information bottleneck is often designed into a voice conversion neural network: by compressing the channel dimension, the original timbre is squeezed out while the original linguistic content and style are retained, so as to achieve a better voice conversion effect.

In the related art, dimensions are compressed in an extreme manner through pooling layers. For example, max pooling takes only the maximum value and discards finer details, while average pooling smooths everything and thereby weakens the main information, so both the compression efficiency and the compression effect are poor.
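The two pooling extremes can be illustrated with a minimal sketch (illustrative values only, not part of the scheme described here): max pooling keeps only the peak and discards the details, while average pooling dilutes the dominant feature.

```python
# Minimal illustration of why extreme pooling loses information.
def max_pool(values):
    # Keeps only the largest activation; all other details vanish.
    return max(values)

def avg_pool(values):
    # Averages everything; a dominant activation is heavily diluted.
    return sum(values) / len(values)

channel = [0.1, 0.2, 9.0, 0.3]  # one dominant feature plus fine details

print(max_pool(channel))  # 9.0 -> the details 0.1, 0.2, 0.3 are discarded
print(avg_pool(channel))  # 2.4 -> the dominant 9.0 is heavily diluted
```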

Disclosure of Invention

The present disclosure provides a method, apparatus, device, and storage medium for audio data processing.

According to an aspect of the present disclosure, there is provided an audio data processing method including:

acquiring an original feature tensor of audio data to be processed, and acquiring a feature tensor to be processed and a key feature tensor according to the original feature tensor and a learnable weight tensor;

performing dimension transformation on the target dimensions in the feature tensor to be processed and the key feature tensor, respectively, to obtain a feature tensor to be compressed and a candidate key feature tensor;

acquiring a weight matrix, and acquiring a target feature tensor according to the weight matrix and the candidate key feature tensor;

and processing the target feature tensor to obtain a compressed feature tensor, inputting the compressed feature tensor into a neural network for processing, and obtaining a processing result of the audio data to be processed.

According to another aspect of the present disclosure, there is provided an audio data processing apparatus including:

the first acquisition module is used for acquiring an original characteristic tensor of the audio data to be processed;

the second acquisition module is used for acquiring a to-be-processed feature tensor and a key feature tensor according to the original feature tensor and the learnable weight tensor;

a third acquisition module, configured to perform dimension transformation on the target dimensions in the feature tensor to be processed and the key feature tensor, respectively, to obtain a feature tensor to be compressed and a candidate key feature tensor;

a fourth acquisition module, configured to acquire a weight matrix;

a fifth acquisition module, configured to acquire a target feature tensor according to the weight matrix and the candidate key feature tensor;

and a processing module, configured to process the target feature tensor to obtain a compressed feature tensor, and to input the compressed feature tensor into a neural network for processing to obtain a processing result of the audio data to be processed.

According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the audio data processing method described in the above embodiments.

According to a fourth aspect, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the audio data processing method described in the above embodiments.

According to a fifth aspect, there is provided a computer program product whose instructions, when executed by a processor, enable a server to perform the audio data processing method of the first aspect.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

fig. 1 is a flowchart of an audio data processing method according to a first embodiment of the present application;

FIG. 2 is a flow chart of a method of audio data processing according to a second embodiment of the present application;

fig. 3 is a schematic configuration diagram of an audio data processing apparatus according to a third embodiment of the present application;

fig. 4 is a block diagram of an electronic device for implementing an audio data processing method according to an embodiment of the present application.

Detailed Description

The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are likewise omitted for clarity and conciseness.

In practical applications such as voice conversion, the original timbre is squeezed out by compressing the channel dimension while the original linguistic content and style are retained, so as to achieve a better voice conversion effect. In related compression technologies, however, the proportion of primary and secondary information cannot be adjusted well, and the compression effect and efficiency are poor.

In view of the above problems, the present application provides an audio data processing method that not only explicitly exploits the inherent associations of audio features to compress information more effectively, but also makes the prominence of primary versus secondary information adjustable.

More specifically: acquire an original feature tensor of the audio data to be processed; acquire a feature tensor to be processed and a key feature tensor according to the original feature tensor and a learnable weight tensor; perform dimension transformation on the target dimensions in the feature tensor to be processed and the key feature tensor, respectively, to obtain a feature tensor to be compressed and a candidate key feature tensor; acquire a weight matrix, acquire a target feature tensor according to the weight matrix and the candidate key feature tensor, and process the target feature tensor to obtain a compressed feature tensor; and input the compressed feature tensor into a neural network for processing to obtain a processing result of the audio data to be processed. Thus, information compression efficiency is improved while information compression quality is ensured, and the subsequent speech processing effect is improved.

Specifically, fig. 1 is a flowchart of an audio data processing method according to a first embodiment of the present application, where the audio data processing method is used in an electronic device, where the electronic device may be any device with computing capability, for example, a Personal Computer (PC), a mobile terminal, and the like, and the mobile terminal may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, an in-vehicle device, and other hardware devices with various operating systems, touch screens, and/or display screens, such as a smart television, a smart refrigerator, and the like.

As shown in fig. 1, the method includes:

step 101, acquiring an original feature tensor of audio data to be processed, and acquiring the feature tensor to be processed and a key feature tensor according to the original feature tensor and a learnable weight tensor.

In this embodiment of the application, the audio data to be processed may be audio data collected by a microphone array of the electronic device or audio data sent by other electronic devices, and is specifically selected and set according to an application scenario.

In the embodiment of the present application, there are many ways to obtain the original feature tensor of the audio data to be processed, and the setting may be selected according to an application scenario, which is described as follows.

In a first example, audio features of the audio data to be processed, such as acoustic features, lexical features, prosodic information, channel information, short-time energy, and short-time zero-crossing rate, are extracted through a convolutional neural network, and an N-dimensional original feature tensor is generated from these audio features, where N is a positive integer; a tensor of three or more dimensions is generally used.

In a second example, audio data to be processed is sampled, and one or more audio features of the sampled audio data are extracted through different layers of neural networks to obtain an original feature tensor.

In the embodiment of the present application, the learnable weight tensor can be understood as an updatable weight tensor. For example, in a fully-connected layer of a neural network, Y = WX + B, both the weight tensor W and the bias tensor B are updatable; that is, the weight tensor can be updated from given training data by the error back-propagation algorithm according to a set loss function.
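As a scalar sketch of what "updatable" means here (illustrative values only, not the actual network of this application), the parameters W and B of Y = W·X + B can be updated by one gradient-descent step on a squared-error loss, the simplest case of error back-propagation:

```python
# One-parameter sketch of a learnable weight updated by gradient descent.
W, B = 0.5, 0.0           # learnable parameters (toy initial values)
x, target = 2.0, 3.0      # one training sample
lr = 0.1                  # learning rate

y = W * x + B             # forward pass: Y = WX + B
grad_y = 2 * (y - target) # d(loss)/dy for squared-error loss (y - target)^2
W -= lr * grad_y * x      # dL/dW = dL/dy * x
B -= lr * grad_y          # dL/dB = dL/dy

print(W, B)               # updated weights after one step
```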

In this embodiment of the application, the original feature tensor can be understood as a feature tensor that needs to be compressed, and further, according to the original feature tensor and the learnable weight tensor, there are many ways to obtain the feature tensor to be processed and the key feature tensor, which are described as follows.

As a possible implementation manner, the original feature tensor and the learnable first weight tensor are subjected to matrix multiplication to obtain a feature tensor to be processed, and the original feature tensor and the learnable second weight tensor are subjected to matrix multiplication to obtain a key feature tensor.

As another possible implementation manner, a learnable first weight tensor is obtained, a second weight tensor is derived from it by transposition or other processing, and the original feature tensor is matrix-multiplied with the first weight tensor and the second weight tensor, respectively, to obtain the feature tensor to be processed and the key feature tensor.

In the embodiment of the present application, the feature tensor to be processed can be understood as a pre-compression of the original feature tensor; it is the "query" tensor in the self-attention mechanism and is used to index the main information within the full information. The key feature tensor can be understood as the one-to-one correspondence of keys and attribute values in the attention mechanism.
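The two projections can be sketched in pure Python (the names X, Wq, Wk follow this document, but the sizes and entries are illustrative assumptions, with one time step and no batch dimension):

```python
# Toy sketch of producing the to-be-processed tensor Q and key tensor K
# by matrix multiplication with two learnable weight matrices.
def matmul(a, b):
    """Multiply a (m x k) by b (k x n) given as nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

Ci, Co = 4, 2                 # input channels, compressed channels
X  = [[1.0, 0.0, 2.0, 1.0]]   # one time step with Ci features
Wq = [[1, 0], [0, 1], [1, 0], [0, 1]]   # Ci x Co "query" weights (toy)
Wk = [[0, 1], [1, 0], [0, 1], [1, 0]]   # Ci x Co "key" weights (toy)

Q = matmul(X, Wq)   # feature tensor to be processed, 1 x Co
K = matmul(X, Wk)   # key feature tensor, 1 x Co
print(Q, K)
```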

And 102, performing dimensionality transformation on target dimensionalities in the feature tensor to be processed and the key feature tensor to obtain the feature tensor to be compressed and the candidate key feature tensor.

In the embodiment of the present application, dimension transformation is performed on target dimensions in the feature tensor to be processed and the key feature tensor, and there are many ways to obtain the feature tensor to be compressed and the candidate key feature tensor, which can be selectively set according to application scenario needs, as illustrated below.

In the first example, a target dimension is inserted in front of the target dimension of the feature tensor to be processed to obtain the feature tensor to be compressed, and the target dimension of the key feature tensor is split to obtain the candidate key feature tensor.

In the second example, a target dimension is inserted in front of the target dimension of the feature tensor to be processed to obtain the feature tensor to be compressed, and a target dimension is inserted in front of the target dimension of the key feature tensor to obtain the candidate key feature tensor.

It should be noted that the target dimension may be set according to the application context. Generally, to improve computational efficiency, the target dimension is set to the last dimension; if the dimension to be compressed is not the target dimension, it can first be moved to the target dimension for subsequent processing.
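Moving the dimension to be compressed to the last axis is an ordinary transposition; a minimal 2-D sketch (illustrative shapes only):

```python
# Move the channel dimension from first to last position via transposition.
def transpose(m):
    return [list(row) for row in zip(*m)]

x = [[1, 2, 3],
     [4, 5, 6]]        # shape (C, T): channel dimension comes first

x_t = transpose(x)     # shape (T, C): channel dimension is now last
print(x_t)             # [[1, 4], [2, 5], [3, 6]]
```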

And 103, acquiring a weight matrix, and acquiring a target feature tensor according to the weight matrix and the candidate key feature tensor.

And 104, processing the target feature tensor to obtain a compressed feature tensor, inputting the compressed feature tensor into a neural network for processing, and obtaining a processing result of the audio data to be processed.

In the embodiment of the present application, there are many ways to obtain the weight matrix, and the setting may be selected according to an application scenario, which is described as follows.

In a first example, a weight matrix is obtained according to the feature tensor to be compressed and the candidate key feature tensor.

There are many ways to obtain a weight matrix from the feature tensor to be compressed and the candidate key feature tensor. For example, matrix multiplication is performed on the target dimensions of the feature tensor to be compressed and the candidate key feature tensor, and the data corresponding to the target dimension in the resulting feature tensor is processed to obtain the weight matrix. As another example, the feature tensor to be compressed and the candidate key feature tensor are directly matrix-multiplied, and the data corresponding to each dimension in the resulting feature tensor is processed to obtain the weight matrix.

In a second example, a suitable weight matrix is determined by analyzing the weight matrices in a plurality of historical compression processes.

In the embodiment of the present application, there are many ways to obtain the target feature tensor according to the weight matrix and the candidate key feature tensor, and the setting may be selected according to an application scenario, which is exemplified as follows.

In the first example, the transposition processing is performed on the target dimension of the candidate key feature tensor to obtain a transposed key feature tensor, and the weight matrix and the transposed key feature tensor are subjected to matrix multiplication to obtain a target feature tensor.

In the second example, the candidate key feature tensor is directly transposed to obtain the transposed key feature tensor, and the weight matrix and the transposed key feature tensor are subjected to matrix multiplication to obtain the target feature tensor.

Further, the target feature tensor is processed to obtain a compressed feature tensor, for example, the compressed feature tensor is obtained by removing a specific dimension after performing dimension transformation on the target feature tensor.
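The "remove a specific dimension" step amounts to squeezing out a size-1 axis such as the one inserted earlier; a toy sketch (illustrative shapes only):

```python
# Squeeze out the size-1 middle axis of a toy target tensor.
o = [[[3.0, 1.0]]]            # toy target tensor, shape (T=1, 1, Co=2)
y = [row[0] for row in o]     # drop the size-1 middle axis -> shape (T, Co)
print(y)                      # [[3.0, 1.0]]
```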

Finally, the compressed feature tensor is input into the neural network for processing to obtain the processing result of the audio data to be processed; for example, it may be input into a trained voice conversion neural network to obtain a voice conversion result.

According to the audio data processing method, the original feature tensor of the audio data to be processed is obtained, the feature tensor to be processed and the key feature tensor are obtained according to the original feature tensor and the learnable weight tensor, dimension transformation is carried out on target dimensions in the feature tensor to be processed and the key feature tensor respectively, the feature tensor to be compressed and the candidate key feature tensor are obtained, the weight matrix is obtained, the target feature tensor is obtained according to the weight matrix and the candidate key feature tensor, the target feature tensor is processed, the compressed feature tensor is obtained, the compressed feature tensor is input into a neural network to be processed, and a processing result of the audio data to be processed is obtained. Therefore, the primary and secondary information weights are adjusted through the learnable weight tensor, the information compression quality is guaranteed, the information compression efficiency is improved, and the subsequent voice processing effect is improved.

Based on the above embodiments, different ways may be selected for information compression according to application scenarios, and detailed description is given below with reference to fig. 2 by using a specific example.

Fig. 2 is a flowchart of an audio data processing method according to a second embodiment of the present application.

As shown in fig. 2, the method includes:

step 201, obtaining an original feature tensor of the audio data to be processed, performing matrix multiplication on the original feature tensor and a first learnable weight tensor to obtain a feature tensor to be processed, and performing matrix multiplication on the original feature tensor and a second learnable weight tensor to obtain a key feature tensor.

In this embodiment of the application, the audio data to be processed may be audio data collected by a microphone array of the electronic device or audio data sent by other electronic devices, and is specifically selected and set according to an application scenario.

In the embodiment of the present application, there are many ways to obtain the original feature tensor of the audio data to be processed, and the setting may be selected according to an application scenario, which is described as follows.

In a first example, audio features of the audio data to be processed, such as acoustic features, lexical features, prosodic information, channel information, short-time energy, and short-time zero-crossing rate, are extracted through a convolutional neural network, and an N-dimensional original feature tensor is generated from these audio features, where N is a positive integer; a tensor of three or more dimensions is generally used.

In a second example, audio data to be processed is sampled, and one or more audio features of the sampled audio data are extracted through different layers of neural networks to obtain an original feature tensor.

For example, given an input feature tensor X (dimension N × T × Ci), the goal is a dimension-compressed feature tensor Y (dimension N × T × Co), where Ci = n·Co and n is a positive integer with n ≥ 1, used to control the information bottleneck width, i.e., the compression ratio.

Specifically, X is matrix-multiplied with the first weight tensor Wq (dimension Ci × Co) and with the second weight tensor Wk (dimension Ci × Cj), i.e., Q = X·Wq and K = X·Wk, obtaining the feature tensor to be processed Q (dimension N × T × Co) and the key feature tensor K (dimension N × T × Cj); the first and second weight tensors can be set according to application requirements.

The feature tensor to be processed Q is a pre-compression of the input feature tensor X; it is the "query" tensor in the self-attention mechanism and is used to index the main information within the full information. The key feature tensor K provides the keys and attribute values in the self-attention mechanism.

Step 202, inserting a target dimension matrix in front of a target dimension of the feature tensor to be processed to obtain the feature tensor to be compressed, splitting the target dimension of the key feature tensor to obtain a candidate key feature tensor.

And 203, performing matrix multiplication on the target dimensions of the feature tensor to be compressed and the candidate key feature tensor, and processing data corresponding to the target dimensions in the feature tensor acquired by the matrix multiplication to acquire a weight matrix.

And 204, transposing the target dimension of the candidate key feature tensor to obtain a transposed key feature tensor, and performing matrix multiplication on the weight matrix and the transposed key feature tensor to obtain a target feature tensor.

Continuing with the example above, dimension transformation is applied to the feature tensor to be processed Q (dimension N × T × Co) and the key feature tensor K (dimension N × T × Cj), respectively: a dimension is inserted in front of the last dimension of Q to obtain the feature tensor to be compressed Q′ (dimension N × T × 1 × Co), and the last dimension Cj of K is split into two dimensions to obtain the candidate key feature tensor K′ (dimension N × T × Co × n). The candidate key feature tensor K′ thus contains n different kinds of compressed information.

Matrix multiplication is performed on the last two dimensions of the feature tensor to be compressed Q′ and the candidate key feature tensor K′, followed by a normalization operation along the last dimension, to obtain the weight matrix A (dimension N × T × 1 × n), i.e., A = softmax(Q′K′ / t), where softmax() is the normalization function.

The hyperparameter t is a real number greater than 0 and can be used to continuously adjust the prominence of primary versus secondary information: the larger the value of t, the more balanced the weights between primary and secondary information; the smaller the value of t, the more the main information is emphasized and the more details are ignored.
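The effect of t can be seen in a small sketch of the normalization A = softmax(scores / t) (toy scores, illustrative only): a small t sharpens the weights toward the dominant score, a large t flattens them toward uniform.

```python
# Temperature-controlled softmax: t adjusts primary/secondary prominence.
import math

def softmax(scores, t):
    exps = [math.exp(s / t) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]          # toy correlation scores
sharp = softmax(scores, t=0.5)    # small t: dominant score stands out
flat  = softmax(scores, t=5.0)    # large t: weights become nearly uniform

print(max(sharp), max(flat))      # the sharp maximum exceeds the flat one
```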

Specifically, the matrix multiplication of the feature tensor to be compressed Q′ and the candidate key feature tensor K′ can be regarded as a correlation operation between the main information of X and the n different kinds of compressed information, yielding the correlations between them: the stronger the correlation, the larger the corresponding weight in A; the weaker the correlation, the smaller the corresponding weight in A.

Further, in the last two dimensions, the weight matrix A and the candidate key feature tensor K′ are matrix-multiplied to obtain the target feature tensor O (dimension N × T × 1 × C0), i.e., O = AK′ᵀ.

The superscript T indicates that matrix transposition is performed on the last two dimensions of the candidate key feature tensor K′. The target feature tensor O is the fusion of the primary and secondary information, and the fusion weights are contained in the weight matrix A. Through the weight tensors Wq and Wk, self-learning and self-adaptation of the primary and secondary information weights are realized, and the prominence of the primary and secondary information can be manually adjusted through the hyperparameter t.

Further, dimension transformation is performed on the target feature tensor O, for example by removing the 3rd dimension, to obtain the compressed feature tensor Y (dimension N × T × C0).
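The fusion and squeeze steps above can be sketched as follows; A and K′ are filled with placeholder values here (a random row-normalized A stands in for the softmax output), so the sketch only illustrates the shape arithmetic, not the learned weights:

```python
import numpy as np

N, T, C0, n = 2, 5, 8, 4

# Placeholder weight matrix A (N, T, 1, n), row-normalized like a softmax output
A = np.random.rand(N, T, 1, n)
A /= A.sum(axis=-1, keepdims=True)

K_prime = np.random.randn(N, T, C0, n)  # candidate key feature tensor

# Transpose the last two dimensions of K' -> (N, T, n, C0)
K_t = np.swapaxes(K_prime, -1, -2)

# O = A K'^T over the last two dims: (N, T, 1, n) @ (N, T, n, C0) -> (N, T, 1, C0)
O = A @ K_t

# Remove the inserted 3rd dimension -> compressed feature tensor Y (N, T, C0)
Y = O.squeeze(axis=2)
print(Y.shape)  # (2, 5, 8)
```

Each row of Y is thus a convex combination of the n kinds of compressed information, weighted by the corresponding row of A.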

And step 205, inputting the compressed feature tensor into the neural network for processing, and acquiring a processing result of the audio data to be processed.

Finally, the compressed feature tensor is input into the neural network for processing to obtain the processing result of the audio data to be processed; for example, it may be input into a timbre conversion neural network to obtain a timbre conversion result. Efficient information compression is thus achieved: the inherent association of the features is explicitly exploited to compress information more effectively, and the weights between primary and secondary information are made adjustable.

In the audio data processing method of the embodiment of the present application, the original feature tensor of the audio data to be processed is obtained. Matrix multiplication is performed on the original feature tensor and a learnable first weight tensor to obtain the feature tensor to be processed, and on the original feature tensor and a learnable second weight tensor to obtain the key feature tensor. A target dimension matrix is inserted in front of the target dimension of the feature tensor to be processed to obtain the feature tensor to be compressed, and the target dimension of the key feature tensor is split to obtain the candidate key feature tensor. Matrix multiplication is performed on the target dimensions of the feature tensor to be compressed and the candidate key feature tensor, and the data corresponding to the target dimension in the resulting feature tensor is processed to obtain the weight matrix. The target dimension of the candidate key feature tensor is transposed to obtain the transposed key feature tensor, which is matrix-multiplied with the weight matrix to obtain the target feature tensor. The target feature tensor is processed to obtain the compressed feature tensor, which is input into the neural network for processing to obtain the processing result of the audio data to be processed. Therefore, the primary and secondary information weights are adjusted through the learnable weight tensors, the information compression quality is guaranteed, the information compression efficiency is improved, and the subsequent speech processing effect is improved.
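Putting the whole method together, a minimal end-to-end sketch of steps 201–205 might look as follows. This is an illustrative reconstruction, not the embodiment's implementation: the weight tensors Wq and Wk are shown as plain random matrices (in practice they would be learned), and all shapes are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax normalization along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress(X, Wq, Wk, n, t=1.0):
    """Compress the original feature tensor X (N, T, C) into (N, T, C0)
    using weight tensors Wq (C, C0) and Wk (C, C0*n) and hyperparameter t."""
    N, T, C = X.shape
    C0 = Wq.shape[1]
    Q = X @ Wq                          # feature tensor to be processed (N, T, C0)
    K = X @ Wk                          # key feature tensor (N, T, C0*n)
    Qp = Q[:, :, np.newaxis, :]         # feature tensor to be compressed (N, T, 1, C0)
    Kp = K.reshape(N, T, C0, n)         # candidate key feature tensor (N, T, C0, n)
    A = softmax(Qp @ Kp / t, axis=-1)   # weight matrix (N, T, 1, n)
    O = A @ np.swapaxes(Kp, -1, -2)     # target feature tensor (N, T, 1, C0)
    return O.squeeze(axis=2)            # compressed feature tensor Y (N, T, C0)

# Hypothetical original feature tensor and (untrained) weight tensors
X = np.random.randn(2, 10, 16)
Wq = np.random.randn(16, 8) * 0.1
Wk = np.random.randn(16, 8 * 4) * 0.1
Y = compress(X, Wq, Wk, n=4, t=1.0)
print(Y.shape)  # (2, 10, 8)
```

The resulting Y would then be fed to the downstream neural network in place of the larger original features; the time dimension T is preserved while the per-frame feature size is reduced from C to C0.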

In order to implement the above embodiments, the present application also provides an audio data processing apparatus. Fig. 3 is a schematic structural diagram of an audio data processing apparatus according to a third embodiment of the present application, which includes, as shown in fig. 3: a first obtaining module 301, a second obtaining module 302, a third obtaining module 303, a fourth obtaining module 304, a fifth obtaining module 305 and a processing module 306.

The first obtaining module 301 is configured to obtain an original feature tensor of the audio data to be processed.

A second obtaining module 302, configured to obtain a to-be-processed feature tensor and a key feature tensor according to the original feature tensor and the learnable weight tensor.

The third obtaining module 303 is configured to perform dimension transformation on target dimensions in the feature tensor to be processed and the key feature tensor respectively, and obtain the feature tensor to be compressed and the candidate key feature tensor.

A fourth obtaining module 304, configured to obtain a weight matrix;

a fifth obtaining module 305, configured to obtain a target feature tensor according to the weight matrix and the candidate key feature tensor.

The processing module 306 is configured to process the target feature tensor to obtain a compressed feature tensor, input the compressed feature tensor to the neural network for processing, and obtain a processing result of the audio data to be processed.

In an embodiment of the present application, the second obtaining module 302 is specifically configured to: matrix multiplication is carried out on the original characteristic tensor and a first weight tensor which can be learned, and a characteristic tensor to be processed is obtained; and matrix multiplication is carried out on the original characteristic tensor and the learnable second weight tensor to obtain the key characteristic tensor.

In an embodiment of the present application, the fourth obtaining module 304 is specifically configured to: and acquiring a weight matrix according to the feature tensor to be compressed and the candidate key feature tensor.

In an embodiment of the present application, the third obtaining module 303 is specifically configured to: inserting a target dimension matrix in front of a target dimension of the feature tensor to be processed to obtain the feature tensor to be compressed; and splitting the target dimension of the key characteristic tensor to obtain a candidate key characteristic tensor.

In an embodiment of the present application, the fourth obtaining module 304 is specifically configured to: matrix multiplication is carried out on the target dimensions of the feature tensor to be compressed and the candidate key feature tensor; and processing data corresponding to the target dimension in the feature tensor acquired by matrix multiplication to acquire a weight matrix.

In an embodiment of the present application, the fifth obtaining module 305 is specifically configured to: transposing the target dimensionality of the candidate key feature tensor to obtain a transposed key feature tensor; and performing matrix multiplication on the weight matrix and the transposed key feature tensor to obtain a target feature tensor.

It should be noted that the foregoing explanation of the audio data processing method is also applicable to the audio data processing apparatus of the embodiment of the present application; the implementation principle is similar and is not repeated here.

In the audio data processing apparatus of the embodiment of the present application, the original feature tensor of the audio data to be processed is obtained, and the feature tensor to be processed and the key feature tensor are obtained according to the original feature tensor and the learnable weight tensor. Dimension transformation is performed on the target dimensions in the feature tensor to be processed and the key feature tensor respectively to obtain the feature tensor to be compressed and the candidate key feature tensor. The weight matrix is obtained, and the target feature tensor is obtained according to the weight matrix and the candidate key feature tensor. The target feature tensor is processed to obtain the compressed feature tensor, which is input into the neural network for processing to obtain the processing result of the audio data to be processed. Therefore, the primary and secondary information weights are adjusted through the learnable weight tensor, the information compression quality is guaranteed, the information compression efficiency is improved, and the subsequent speech processing effect is improved.

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

As shown in fig. 4, it is a block diagram of an electronic device of an audio data processing method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 4, the electronic apparatus includes: one or more processors 401, memory 402, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 401 is taken as an example.

Memory 402 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the audio data processing method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the audio data processing method provided by the present application.

The memory 402, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the audio data processing method in the embodiment of the present application (e.g., the first obtaining module 301, the second obtaining module 302, the third obtaining module 303, the fourth obtaining module 304, the fifth obtaining module 305, and the processing module 306 shown in fig. 3). The processor 401 executes various functional applications of the server and data processing, i.e., implements the audio data processing method in the above-described method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 402.

The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device for audio data processing, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 402 optionally includes memory located remotely from processor 401, which may be connected to audio data processing electronics over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device of the audio data processing method may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.

The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the audio data processing electronic apparatus, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 404 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service extensibility in traditional physical hosts and VPS ("virtual private server") services; the server may also be a server of a distributed system or a server combined with a blockchain.

According to the technical solution of the embodiment of the present application, the original feature tensor of the audio data to be processed is obtained, and the feature tensor to be processed and the key feature tensor are obtained according to the original feature tensor and the learnable weight tensor. Dimension transformation is performed on the target dimensions in the feature tensor to be processed and the key feature tensor respectively to obtain the feature tensor to be compressed and the candidate key feature tensor. The weight matrix is obtained according to the feature tensor to be compressed and the candidate key feature tensor, and the target feature tensor is obtained according to the weight matrix and the candidate key feature tensor. The target feature tensor is processed to obtain the compressed feature tensor, which is input into the neural network for processing to obtain the processing result of the audio data to be processed. Therefore, the primary and secondary information weights are adjusted through the learnable weight tensor, the information compression quality is guaranteed, the information compression efficiency is improved, and the subsequent speech processing effect is improved.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; the present application is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
