Audio generation method and equipment

Document No.: 513260  Publication date: 2021-05-28  Views: 2  Original language: Chinese

Reading note: This technology, "Audio generation method and equipment" (一种音频生成方法及设备), was designed and created by Yan Zhenhai (闫震海) on 2021-02-27. Its main content is as follows: the embodiments of this application disclose an audio generation method and device, wherein the method comprises: receiving an audio generation instruction input by a user, the audio generation instruction being used for indicating a two-dimensional image that the user wants to embed in the generated target audio; in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image; converting the grayscale data of each pixel in the target grayscale image into frequency-domain data of each pixel in a spectrogram to obtain a target spectrogram; and generating, by using the target spectrogram, the target audio corresponding to the target spectrogram. In this way image information is embedded in the audio, so that the image gains a sound-production function while the audio carries the image information, greatly strengthening the association between audio and image.

1. A method of audio generation, comprising:

receiving an audio generation instruction input by a user, wherein the audio generation instruction is used for indicating a two-dimensional image that the user wants to embed in the generated target audio;

in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image;

converting the grayscale data of each pixel in the target grayscale image into frequency-domain data of each pixel in a spectrogram to obtain a target spectrogram;

and generating, by using the target spectrogram, the target audio corresponding to the target spectrogram.

2. The method of claim 1, further comprising:

receiving an audio selection instruction input by a user, wherein the audio selection instruction is used for indicating an original audio required for generating the target audio, and in response to the audio selection instruction, obtaining an original spectrogram corresponding to the original audio;

wherein converting the grayscale data of each pixel in the target grayscale image into frequency-domain data of each pixel in a spectrogram to obtain the target spectrogram comprises:

processing the frequency-domain data of each pixel in the original spectrogram by using the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram.

3. The method of claim 2, wherein the grayscale data of the pixels form a grayscale data matrix, and wherein processing the frequency-domain data of each pixel in the original spectrogram by using the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram comprises:

flipping the grayscale data matrix vertically;

and using the flipped grayscale data matrix as a weighting factor to weight the frequency-domain data of each pixel in the original spectrogram, obtaining the target spectrogram.

4. The method of claim 2, wherein the grayscale data of the pixels form a grayscale data matrix, and wherein processing the frequency-domain data of each pixel in the original spectrogram by using the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram comprises:

flipping the grayscale data matrix vertically, and down-sampling the flipped grayscale data matrix;

and using the down-sampled grayscale data matrix as a weighting factor to weight the frequency-domain data of each pixel in the original spectrogram, obtaining the target spectrogram.

5. The method of claim 1, wherein the grayscale data of the pixels form a grayscale data matrix, and wherein converting the grayscale data of each pixel in the target grayscale image into frequency-domain data of each pixel in a spectrogram to obtain the target spectrogram comprises:

flipping the grayscale data matrix vertically, and using the flipped grayscale data matrix as the frequency-domain data of each pixel in the spectrogram to obtain the target spectrogram.

6. The method according to any one of claims 1-5, wherein generating the target audio corresponding to the target spectrogram by using the target spectrogram comprises:

flipping each frame of frequency-domain data of the target spectrogram vertically, and conjugating the complex values of the flipped frequency-domain data;

and performing an inverse Fourier transform on each conjugated frame of frequency-domain data to obtain a time-domain signal corresponding to each frame of frequency-domain data, and synthesizing the per-frame time-domain signals into the target audio.

7. The method according to any one of claims 1-5, wherein acquiring the target grayscale image of the two-dimensional image comprises:

acquiring an original grayscale image of the two-dimensional image, and proportionally scaling the original grayscale image to obtain a proportionally scaled grayscale image;

and normalizing the proportionally scaled grayscale image to obtain the target grayscale image of the two-dimensional image.

8. The method according to any one of claims 1-5, wherein the two-dimensional image comprises a plurality of two-dimensional images capturing changes in a user's motion; and wherein acquiring the target grayscale image of the two-dimensional image comprises:

respectively calculating grayscale differences between two-dimensional images that are adjacent in capture time among the plurality of two-dimensional images, obtaining a plurality of grayscale differences;

and arranging the grayscale differences according to their corresponding capture times to obtain the target grayscale image.

9. The method of any one of claims 1-5, further comprising:

receiving an audio playing instruction input by a user;

and in response to the audio playing instruction, playing the target audio and, according to the playing progress of the target audio, displaying the region of the target spectrogram corresponding to the playing progress.

10. An audio generating device, characterized in that the device comprises:

a processor, a memory, and an input device, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method of any one of claims 1-9.

Technical Field

The present application relates to the field of audio processing technologies, and in particular, to an audio generating method and device.

Background

Some scenarios already associate a picture with audio: for example, the picture is used directly as the cover of an audio file, and the picture and audio are stored together in a new file format, so that the picture is shown when the user plays the audio. In this approach the picture serves only as cover art for the audio; the correlation between picture and audio is low, and the practicability is poor.

Disclosure of Invention

The embodiments of the present application provide an audio generation method and device based on image processing, which achieve the purpose of embedding image information in audio, so that the image gains a sound-production function while the audio carries the image information, greatly improving the relevance between the audio and the image.

In one aspect, an embodiment of the present application discloses an audio generation method, including:

receiving an audio generation instruction input by a user, wherein the audio generation instruction is used for indicating a two-dimensional image that the user wants to embed in the generated target audio;

in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image;

converting the grayscale data of each pixel in the target grayscale image into frequency-domain data of each pixel in a spectrogram to obtain a target spectrogram;

and generating, by using the target spectrogram, the target audio corresponding to the target spectrogram.

In another aspect, an embodiment of the present application provides an audio generating apparatus, where the apparatus includes:

a processor and a memory, the processor and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions and perform the following steps:

receiving an audio generation instruction input by a user, wherein the audio generation instruction is used for indicating a two-dimensional image that the user wants to embed in the generated target audio;

in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image;

converting the grayscale data of each pixel in the target grayscale image into frequency-domain data of each pixel in a spectrogram to obtain a target spectrogram;

and generating, by using the target spectrogram, the target audio corresponding to the target spectrogram.

In yet another aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the following steps:

receiving an audio generation instruction input by a user, wherein the audio generation instruction is used for indicating a two-dimensional image that the user wants to embed in the generated target audio;

in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image;

converting the grayscale data of each pixel in the target grayscale image into frequency-domain data of each pixel in a spectrogram to obtain a target spectrogram;

and generating, by using the target spectrogram, the target audio corresponding to the target spectrogram.

When an audio generation instruction is received, the device responds by obtaining a target grayscale image of the two-dimensional image that the user wants to embed in the generated target audio, and converts the grayscale data of each pixel in the target grayscale image into frequency-domain data of each pixel in a spectrogram to obtain a target spectrogram; that is, the two-dimensional image is associated with the target spectrogram of the target audio. The target spectrogram is then used to generate the corresponding target audio, so the target audio is generated from the two-dimensional image. In this way the purpose of embedding image information into audio is achieved: the image gains a sound-production function while the audio carries the image information, greatly improving the relevance between the audio and the image.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of an audio generation method provided in an embodiment of the present application;

Fig. 2 is a schematic flowchart of a process for obtaining a target grayscale image provided in an embodiment of the present application;

Fig. 3 is a schematic diagram illustrating the effect of an image processing process provided in an embodiment of the present application;

Fig. 4 is a schematic flowchart of synthesizing audio from a target spectrogram provided in an embodiment of the present application;

Fig. 5 is a schematic flowchart of another audio generation method provided in an embodiment of the present application;

Fig. 6 is a schematic flowchart of a method for acquiring an original spectrogram provided in an embodiment of the present application;

Fig. 7a is a schematic diagram illustrating the effect of a target spectrogram provided in an embodiment of the present application;

Fig. 7b is a schematic diagram illustrating the effect of another target spectrogram provided in an embodiment of the present application;

Fig. 8a is a diagram of an example of a target spectrogram provided in an embodiment of the present application;

Fig. 8b is a diagram of another example of a target spectrogram provided in an embodiment of the present application;

Fig. 9 is a schematic flowchart of a further audio generation method provided in an embodiment of the present application;

Fig. 10 is a schematic structural diagram of an audio generating apparatus provided in an embodiment of the present application;

Fig. 11 is a schematic structural diagram of an audio generating device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

According to the embodiments of the present application, image information can be embedded into audio: for example, a spectrogram is reconstructed or constructed from the image information, so as to obtain audio carrying the image information. This improves the correlation between image and audio, and lets the user intuitively perceive audio that contains image information.

In this application, a spectrogram refers to a speech spectrogram. Its abscissa may be time and its ordinate frequency; the value at each coordinate point represents the energy of the speech data, and the column of data at each time point represents the frequency-domain data of one frame of the audio signal. The energy value is usually represented by color intensity, where a darker color may represent a larger energy value; the energy value may also be represented in other ways, which this application does not limit.
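As a concrete illustration (not part of the application's disclosure), a magnitude spectrogram of the kind described above, with frequency bins down one axis and one column of frequency-domain data per time frame, can be sketched in a few lines of NumPy; the frame length, hop size, window, and test tone below are arbitrary illustrative choices:

```python
import numpy as np

def spectrogram(signal, frame_len=1024, hop=512):
    """Magnitude spectrogram: one row per frequency bin, one column per time frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)], axis=1)
    # rfft keeps only the non-redundant half of the spectrum: frame_len//2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=0))

# A pure 1 kHz tone sampled at 8 kHz: energy should concentrate in one frequency row.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(np.argmax(spec[:, 0]))
```

With a 1024-point FFT at 8 kHz, the bin spacing is 7.8125 Hz, so the 1 kHz tone peaks in row 128, matching the description that each coordinate value represents the energy of the speech data at that time-frequency point.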

The audio generation scheme can be used in an audio generation device, and specifically in various kinds of audio software installed on the audio generation device, including but not limited to music playing software, audio editing software, and audio conversion software. The audio generation device may be a terminal, a server, or another kind of device; this application does not limit it. Optionally, the terminal here may include, but is not limited to, smart phones, tablets, laptops, and desktops.

Based on the above description, the audio generation method, apparatus, device, and medium provided in the embodiments of the present application obtain audio carrying image information by using the image information to transform or construct a spectrogram. This achieves the purpose of embedding image information in audio: the image gains a sound-production function while the audio carries the image information, greatly improving the relevance of audio and image. The details are described below.

Referring to fig. 1, fig. 1 is a schematic flowchart of an audio generation method according to an embodiment of the present disclosure. The flow shown in fig. 1 may include the following steps S101-S104.

S101, receiving an audio generation instruction input by a user.

The audio generation instruction indicates the two-dimensional image that the user wants to embed in the generated target audio. The two-dimensional image may be an image stored in a picture format, content created in a temporary creation area, or a plurality of two-dimensional images capturing changes in the user's motion. If the content that the user wants to embed in the generated target audio is a file in a non-picture format, such as text or a table, the file can be converted into a picture format, and the converted image can be embedded in the target audio. The picture format may be any still-image file format, such as jpg, png, bmp, or jpeg, which is not limited here. For example, the file to be embedded in the target audio is acquired and its suffix is examined; if the file is not in a picture format, e.g. a vsd, xls, or doc file, it is converted into a picture format.

S102, in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image.

The target grayscale image may be obtained by acquiring the two-dimensional image and processing it, or a processed grayscale image may be fetched directly from memory as the target grayscale image; this application does not limit it. Optionally, the target grayscale image may also be referred to as target grayscale information, a target grayscale matrix, etc.; it may take the form of a grayscale data matrix, a block diagram with pixel values, etc.; and the value at each position in it may be called a grayscale value, a pixel value, etc., without limitation.

In a possible implementation, processing the two-dimensional image may include: obtaining an original grayscale image of the two-dimensional image, proportionally scaling the original grayscale image, applying histogram equalization to it, normalizing it, and so on. For example, as shown in fig. 2, acquiring the target grayscale image of a two-dimensional image may include the following steps S201-S202.

S201, obtaining an original grayscale image of the two-dimensional image, and proportionally scaling the original grayscale image to obtain a proportionally scaled grayscale image.

The original grayscale image of the two-dimensional image follows the usual gray-map concept in image processing: each pixel takes one of 256 gray levels, where 255 represents pure white and 0 pure black. For example, the original grayscale image obtained by graying a certain two-dimensional image is (0,100,123; 215,124,165; 255,65,98). For ease of understanding, the original grayscale image of the two-dimensional image is denoted GrayP1, and the height of the picture is denoted H1.

In one possible implementation, the scaling may be proportional scaling by a scale factor. The purpose of the proportional scaling is to adjust the height H1 of the original grayscale image GrayP1 of the two-dimensional image; for ease of understanding, the scaled grayscale image is denoted GrayP2 and its height H2. Note that the height H2 of the scaled grayscale image is a preset value, and the scale factor can be computed from H2 and the height H1 of the original grayscale image, e.g. scale = H2/H1. Once the scale factor is determined, the original grayscale image can be scaled to a suitable size with it, so that the finally generated target grayscale image can reconstruct or construct the original audio to obtain the target audio. Optionally, the height H2 of the scaled grayscale image GrayP2 may be 2^N + 1, where N is a preset positive integer. H2 may be determined according to the height of the target spectrogram of the target audio to be generated as the user requires, according to the frequency-domain data of an original spectrogram, according to the size and/or resolution of the device screen, or by other means; this application does not limit it.
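The scaling step above can be sketched as follows. The patent does not prescribe an interpolation method; nearest-neighbour indexing is used here purely for illustration (real code might call cv2.resize instead), and the input image dimensions are invented:

```python
import numpy as np

def resize_to_height(img, n=10):
    """Proportionally scale a grayscale image so its height becomes H2 = 2^N + 1.

    Nearest-neighbour sketch: scale = H2 / H1, and the width scales by the
    same factor so the aspect ratio is preserved.
    """
    h1, w1 = img.shape
    h2 = 2 ** n + 1                      # e.g. N = 10 -> H2 = 1025
    scale = h2 / h1                      # scale = H2 / H1
    w2 = max(1, round(w1 * scale))
    rows = (np.arange(h2) * h1 // h2).astype(int)   # nearest source row per output row
    cols = (np.arange(w2) * w1 // w2).astype(int)   # nearest source column per output column
    return img[np.ix_(rows, cols)], scale

# Hypothetical 512 x 256 original grayscale image GrayP1.
gray_p1 = np.arange(512 * 256, dtype=float).reshape(512, 256)
gray_p2, scale = resize_to_height(gray_p1, n=10)
```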

S202, normalizing the proportionally scaled grayscale image to obtain the target grayscale image of the two-dimensional image.

To perform the normalization, traverse all values of the proportionally scaled grayscale image GrayP2, find its maximum max(GrayP2), and normalize all the data by it to obtain the target grayscale image of the two-dimensional image. For ease of understanding, the target grayscale image is denoted GrayP3; GrayP3 is a grayscale data matrix, namely:

GrayP3 = GrayP2 / max(GrayP2).

For example, if GrayP2 is (20,30,40; 50,60,70; 80,90,100), then after normalization the target grayscale image GrayP3 is (0.2,0.3,0.4; 0.5,0.6,0.7; 0.8,0.9,1). Through steps S201-S202, the grayscale data matrix GrayP3 of the target grayscale image of the two-dimensional image is obtained, with all of its data between 0 and 1. Fig. 3 shows the effect of the image processing pipeline: a color picture is converted to grayscale to obtain the original grayscale image of the two-dimensional image, the original grayscale image is proportionally scaled to obtain the scaled grayscale image, and the scaled grayscale image is normalized to obtain the target grayscale image of the two-dimensional image.
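The normalization GrayP3 = GrayP2 / max(GrayP2) from the worked example can be reproduced directly:

```python
import numpy as np

# GrayP2: the proportionally scaled grayscale image from the worked example above.
gray_p2 = np.array([[20, 30, 40],
                    [50, 60, 70],
                    [80, 90, 100]], dtype=float)

# GrayP3 = GrayP2 / max(GrayP2): every value ends up in [0, 1].
gray_p3 = gray_p2 / gray_p2.max()
```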

In a possible implementation, histogram equalization may further be applied to the scaled grayscale image GrayP2 to enhance the contrast between data at different positions in GrayP2 and improve picture quality. In specific embodiments, a library function can be called directly for this, such as the histeq function in MATLAB or the equalizeHist function in OpenCV. The histogram-equalized grayscale image can then be normalized to obtain the target grayscale image of the two-dimensional image.
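As a plain NumPy sketch of the classical histogram-equalization mapping that histeq and equalizeHist implement (the low-contrast test image below is an invented example):

```python
import numpy as np

def equalize_hist(img):
    """Histogram equalization for an 8-bit grayscale image.

    Each gray level is remapped through the normalized cumulative histogram,
    stretching the used gray range toward the full 0..255 span.
    """
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[np.nonzero(cdf)[0][0]]          # first occupied gray level
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]

# A low-contrast image whose gray values sit in the narrow band 100..139.
low_contrast = np.tile(np.arange(100, 140, dtype=np.uint8), (4, 1))
stretched = equalize_hist(low_contrast)
```

After equalization the narrow band is stretched to cover the full 0-255 range, which is the contrast enhancement described above.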

If the two-dimensional image already satisfies the criteria that the processing of steps S201-S202 is meant to produce, it is used directly as the target grayscale image, without performing the operations of steps S201-S202 on it.

In one embodiment, the two-dimensional image may comprise a plurality of two-dimensional images capturing changes in the user's motion, which may be gesture changes, facial-expression changes, and the like, without limitation. Acquiring the target grayscale image of the two-dimensional image may then comprise: respectively calculating grayscale differences between images that are adjacent in capture time among the plurality of two-dimensional images, obtaining a plurality of grayscale differences; and arranging the grayscale differences by their corresponding capture times to obtain the target grayscale image. The plurality of two-dimensional images may come from a video shot in real time, a video stored on the audio generation device (e.g. a terminal) or another storage device, or a burst of continuously shot pictures, and the like, without limitation. A grayscale difference may be the difference between the target grayscale images of two images adjacent in capture time. For example, one two-dimensional image is sampled from a video at each of the time points t1, t2, and t3, giving three images P1, P2, and P3; their target grayscale images are obtained per steps S201-S202; the grayscale difference between P1 and P2 and that between P2 and P3 are calculated; and the two differences are arranged by capture time, the P1-P2 difference before (to the left of) the P2-P3 difference, yielding the target grayscale image corresponding to the two-dimensional images that capture the user's motion changes.
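The frame-difference construction just described can be sketched as follows. The three constant "frames" are hypothetical stand-ins for the target grayscale images of P1, P2, and P3, and horizontal concatenation is one plausible reading of "arranging the differences by capture time":

```python
import numpy as np

# Hypothetical frames captured at t1, t2, t3, already taken through
# steps S201-S202 so their values lie in [0, 1].
p1 = np.full((4, 4), 0.2)
p2 = np.full((4, 4), 0.5)
p3 = np.full((4, 4), 0.9)

# Grayscale difference between each pair of frames adjacent in capture time.
diffs = [np.abs(b - a) for a, b in zip([p1, p2], [p2, p3])]

# Arrange the differences left-to-right by capture time to form the
# target grayscale image.
motion_image = np.hstack(diffs)
```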

S103, converting the grayscale data of each pixel in the target grayscale image into frequency-domain data of each pixel in a spectrogram to obtain the target spectrogram.

In one implementation, the original spectrogram of an original audio is modified (reconstructed) based on the target grayscale image of the two-dimensional image to obtain the target spectrogram: for example, the target grayscale image, i.e. the grayscale data matrix GrayP3, may be used as a weighting factor to weight the original spectrogram of the original audio. In another implementation, the target spectrogram is constructed directly from the target grayscale image, so the target audio is obtained from the target grayscale image alone: for example, the grayscale data matrix GrayP3 may be used directly as the frequency-domain data of the target spectrogram. In either case the spectrogram of the audio is reconstructed or constructed so as to obtain audio carrying the image information, tying the image closely to the audio and greatly improving the relevance of audio and image.
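Both variants of S103 can be sketched in NumPy. The shapes, the random data, and the assumption that the grayscale matrix has already been resized to match the spectrogram are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original spectrogram: 5 frequency rows x 8 time frames of complex frequency-domain data.
orig_spec = rng.standard_normal((5, 8)) + 1j * rng.standard_normal((5, 8))

# Target grayscale image of the same shape, values in [0, 1].
gray = rng.uniform(0.0, 1.0, size=(5, 8))

# Variant 1 (reconstruction): flip the gray matrix vertically -- image row 0 is
# the top, while spectrogram row 0 is the lowest frequency -- and use it as a
# per-bin weighting factor on the original spectrogram.
target_spec = np.flipud(gray) * orig_spec

# Variant 2 (construction): use the flipped gray matrix directly as the
# frequency-domain data of the target spectrogram.
constructed_spec = np.flipud(gray).astype(complex)
```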

S104, generating the target audio corresponding to the target spectrogram by using the target spectrogram.

The target audio is the generated audio in which the image information, i.e. the information of the two-dimensional image, is embedded. Optionally, generating the target audio corresponding to the target spectrogram by using the target spectrogram may include the following steps: acquiring the time-domain signal corresponding to each frame of frequency-domain data of the target spectrogram; and obtaining the target audio from the per-frame time-domain signals. For example, each frame of frequency-domain data of the target spectrogram may be flipped vertically and the complex values of the flipped frequency-domain data conjugated; an inverse Fourier transform is then applied to each conjugated frame of frequency-domain data to obtain the corresponding time-domain signal, and the per-frame time-domain signals are synthesized into the target audio.

In a possible implementation, because the frequency-domain data of the target spectrogram is conjugate-symmetric, when synthesizing it into a time-domain signal, if each frame of frequency-domain data of the spectrogram has 2^N + 1 values, only the 2nd through the (2^N)/2-th values need to be flipped vertically and have their complex values conjugated, where N is a positive integer. For example, if each frame of frequency-domain data of the target spectrogram has 1025 values, only the 2nd through the 512th values need to be flipped and conjugated. An inverse Fourier transform is then applied to each conjugated frame of frequency-domain data to obtain the corresponding time-domain signal, converting each frame of frequency-domain data of the target spectrogram into a time-domain signal.
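As a hedged sketch of the frame reconstruction: the standard way to rebuild a real time-domain frame from 2^N + 1 frequency bins is to mirror and conjugate all the interior bins (here the 2nd through 1024th of 1025) and append them, which np.fft.irfft performs implicitly. This mirrors more bins than the count given in the paragraph above, so it should be read as the conventional conjugate-symmetric construction rather than the patent's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(1)

# One frame of target-spectrogram data: 2^N + 1 = 1025 frequency bins (N = 10).
half_spectrum = rng.standard_normal(1025) + 1j * rng.standard_normal(1025)
half_spectrum[0] = half_spectrum[0].real      # DC bin of a real signal is real
half_spectrum[-1] = half_spectrum[-1].real    # Nyquist bin is real too

# Rebuild the full 2048-point spectrum: flip the interior bins top-to-bottom,
# conjugate them, and append them after the half spectrum.
mirrored = np.conj(half_spectrum[1:-1][::-1])
full_spectrum = np.concatenate([half_spectrum, mirrored])

# The inverse FFT of a conjugate-symmetric spectrum is a real time-domain frame.
frame = np.fft.ifft(full_spectrum)
```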

After the time-domain signal corresponding to each frame of frequency-domain data of the target spectrogram is obtained, the per-frame time-domain signals can be overlapped and spliced at a certain overlap rate to obtain a complete audio signal. To distinguish it from other audio, the audio represented by this signal is called the target audio. Because the image information is embedded in the target audio, the user can intuitively perceive how the image information changes the original audio, or the distinctive sound formed directly from the image information. The process of step S104 is shown in fig. 4: the target spectrogram consists of multiple frames of frequency-domain data; each frame is converted into a corresponding time-domain signal; and the multiple frames of time-domain signals are overlapped and spliced into the audio signal.
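The overlap-and-splice step can be sketched as a plain overlap-add; the frame length, hop size, and all-ones frames below are illustrative:

```python
import numpy as np

def overlap_add(frames, hop):
    """Splice per-frame time-domain signals into one signal.

    hop < frame length means adjacent frames overlap, and overlapping
    samples are summed.
    """
    frame_len, n_frames = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i in range(n_frames):
        out[i * hop : i * hop + frame_len] += frames[:, i]
    return out

# Four frames of 8 samples each at 50% overlap -> 3*4 + 8 = 20 output samples.
frames = np.ones((8, 4))
audio = overlap_add(frames, hop=4)
```

In the overlap regions the all-ones frames sum to 2, while the first and last hop remain 1, which makes the splicing visible.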

In a possible implementation, after the target audio is obtained, an audio playing instruction input by the user is received; in response to it, the target audio is played and, according to its playing progress, the region of the target spectrogram corresponding to that progress is displayed. Thus, while the target audio plays, the embedded image is revealed bit by bit along with the playback progress. For example, when a play instruction for the target audio is received, the target audio is played; when playback reaches time point t1, the region of the target spectrogram between 0 and t1 is shown; when playback reaches time point t2, the region between 0 and t2 is shown; and when playback finishes, the complete target spectrogram is shown. Optionally, when a sharing instruction for the target audio is received, the target audio can be shared with a target object, which may be a contact or a functional module in application software; no limitation is imposed here.

With the method shown in fig. 1, target audio carrying image information is obtained. When the target audio is played, its target spectrogram can be revealed gradually as the music plays, so the user visually sees the embedded image information, and the resulting target audio can be shared with other users.

For example, in some music playing software, a user imports a picture a and a piece of audio b from the terminal; through the processing of this embodiment, an audio c with picture a embedded is obtained. When audio c is played, its spectrogram is gradually revealed along with the music, so the user can intuitively see the embedded image information.

For another example, in some music playing software, a user shoots a dynamically changing video with the terminal's camera. Through the processing of this embodiment, a plurality of two-dimensional images representing changes in the user's motion are captured from the video, and an audio d is obtained after the two-dimensional images are processed; audio d presents the sound effect produced by the dynamic changes.

The technical scheme of the embodiments of the present application has been introduced above as a whole. The method for obtaining audio from image information can be divided into two approaches, whose main difference lies in how the target spectrogram is obtained: the first obtains the audio by modifying an existing spectrogram with the target grayscale image; the second obtains the audio by constructing a spectrogram from the target grayscale image. In either case, the target audio is obtained by modifying or constructing a spectrogram, so that image information is embedded in the audio and tightly combined with it: the image gains a sound production function, while the sound contains the image information, i.e., the spectrogram of the audio contains the image information. Through the embodiments of the present application, the purpose of embedding image information in audio can be achieved, so that the image has a sound production function while the audio contains the image information, which greatly improves the relevance between audio and image and makes the operation process highly flexible and interesting.

Please refer to fig. 5, which is a flowchart illustrating another audio generating method according to an embodiment of the present application. As shown in fig. 5, the audio generating method modifies a spectrogram of an audio based on a target grayscale image of the two-dimensional image to obtain a target spectrogram, and further obtains a target audio, including the following steps S501 to S504.

S501, receiving an audio generation instruction input by a user, and responding to the audio generation instruction to acquire a target gray image of the two-dimensional image.

This step is described with reference to steps S101-S102 and is not repeated here.

In this embodiment of the application, the spectrogram of the original audio may be modified based on the target grayscale image of the two-dimensional image to obtain the target spectrogram. Accordingly, when the original grayscale image of the two-dimensional image is proportionally scaled, its height may be scaled to be the same as the height of the original spectrogram.

S502, receiving an audio selection instruction input by a user, and responding to the audio selection instruction to obtain an original spectrogram corresponding to the original audio.

Wherein the audio selection instruction is used for indicating original audio required for generating the target audio. Optionally, the original audio may be an audio file stored locally, or may also be an audio file temporarily downloaded on another storage device, and the content of the audio file may be music, talk content, noise, and the like, which is not limited in this application.

In a specific implementation, the process of obtaining the original spectrogram from the original audio may be as shown in fig. 6. For example, the time-domain signal of the original audio may be subjected to framing processing to obtain a multi-frame time-domain signal. The frame length is the time length of each frame, and the frame shift is the offset between the start times of two adjacent frames, so that two adjacent frames overlap by the frame length minus the frame shift. For example, if the start time of the k-th frame time-domain signal is t and its end time is t + E, while the start time of the (k+1)-th frame is t + L and its end time is t + E + L, then the frame length is E and the frame shift is L. Each frame of the time-domain signal is then windowed, where the length of the window function is consistent with the frame length; the window function may be a Hanning window, a rectangular window, a triangular window, a Hamming window, a Gaussian window, or the like. Fast Fourier Transform (FFT) is performed on each frame of the multi-frame windowed time-domain signal to obtain multi-frame frequency-domain data, and each frame of frequency-domain data is arranged as a column vector to obtain the original spectrogram. For example, when the data are arranged, the frequency-domain data of each frame increase in frequency from bottom to top, and the frames are placed side by side in time order to obtain the original spectrogram: the horizontal axis of the original spectrogram is time, the vertical axis is frequency, the value at each coordinate point is an energy value, and the magnitude of the energy value is represented by the color depth.
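As an illustration only, not part of the claimed implementation, the framing, windowing, and FFT pipeline described above can be sketched in Python with NumPy; the helper name `spectrogram`, the sample rate, and the frame parameters are assumptions chosen for the example:

```python
import numpy as np

def spectrogram(signal, frame_len, frame_shift):
    """Frame, window, and FFT a time-domain signal; columns are frames."""
    window = np.hanning(frame_len)          # window length equals frame length
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    cols = []
    for k in range(n_frames):
        frame = signal[k * frame_shift : k * frame_shift + frame_len]
        cols.append(np.fft.rfft(frame * window))   # one-sided spectrum
    # stack frames as column vectors: rows = frequency (low to high), cols = time
    return np.stack(cols, axis=1)

# toy usage: 1 s of a 440 Hz tone at 8 kHz, 256-sample frames with 50% frame shift
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t), frame_len=256, frame_shift=128)
print(spec.shape)  # (129, 61): 256 // 2 + 1 frequency bins per frame
```

Displaying the magnitude of `spec` with frequency increasing from bottom to top gives the layout described above.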

Optionally, when performing the Fast Fourier Transform (FFT) on each frame of the windowed time-domain signal to obtain the multi-frame frequency-domain data, if each frame of the windowed time-domain signal has 2^K sample points, the time complexity of the Fourier transform can be reduced, so that the operation efficiency of the Fourier transform is improved. Accordingly, the frequency-domain data corresponding to each frame of the time-domain signal has (2^K/2) + 1 values, where K is a positive integer. Equivalently, if each frame of the time-domain signal has 2^(N+1) values, the obtained frequency-domain data corresponding to each frame has 2^N + 1 values, where N is an integer greater than or equal to 0.
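This count relation can be checked quickly in NumPy (a sketch, using `np.fft.rfft`, which returns only the non-redundant half of the spectrum of a real signal):

```python
import numpy as np

# For a frame of 2^(N+1) real samples, the one-sided FFT has 2^N + 1 unique
# values, because the spectrum of a real signal is conjugate-symmetric.
N = 9
frame = np.random.randn(2 ** (N + 1))      # 1024 samples
spectrum = np.fft.rfft(frame)
print(len(spectrum))                       # 2^9 + 1 = 513
assert len(spectrum) == 2 ** N + 1
```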

S503, processing the frequency domain data of each pixel point in the original spectrogram by using the gray data of each pixel point in the target gray image to obtain a target spectrogram.

The gray data of each pixel point in the target gray image can be represented by a gray data matrix, and each numerical value in the gray data matrix represents the value of the pixel point at the corresponding position in the target gray image.

In a possible implementation manner, the processing of the frequency domain data of each pixel point in the original spectrogram by using the gray data of each pixel point in the target gray image to obtain the target spectrogram may include the following operations: carrying out up-down turning processing on the gray data matrix; and taking the turned gray data matrix as a weighting factor, and weighting the frequency domain data of each pixel point in the original spectrogram to obtain the target spectrogram.

In a specific implementation process, the up-down flipping process may be understood as reversing the row order of the grayscale data matrix, i.e., mirroring it along the vertical direction. For example, if the grayscale data matrix is (0.1, 0.2, 0.3; 0.4, 0.5, 0.6; 0.7, 0.8, 0.9), the grayscale data matrix after the up-down flipping processing is (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3).

Optionally, when the weighting factor is used to weight the frequency-domain data of each pixel point in the original spectrogram, all of the frequency-domain data should in principle be weighted. However, the frequency-domain data of the original spectrogram has conjugate symmetry: if each frame of frequency-domain data has 2^N + 1 values, only the 2nd to the (2^N/2 + 1)-th values need to be weighted to achieve the effect of weighting all of the frequency-domain data. The resulting target spectrogram is shown in fig. 7a: the part enclosed by the dotted-line frame is the embedded two-dimensional image, and the part outside the dotted line is the frequency-domain data of the original spectrogram. The horizontal axis of the spectrogram is time, the vertical axis is frequency, and the shade of the color represents the magnitude of the energy value at the corresponding coordinate point. It can be seen that the height of the embedded two-dimensional image equals the height of the original spectrogram, since step S501 scales the height of the original grayscale image to be equal to the height of the original spectrogram.
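A minimal sketch of this weighting step, assuming NumPy and that the grayscale matrix has already been scaled to the spectrogram's size as in step S501; `weight_spectrogram` is a hypothetical helper name, not part of the patent:

```python
import numpy as np

def weight_spectrogram(orig_spec, gray):
    """Embed a grayscale image by weighting spectrogram values.

    orig_spec: complex spectrogram, shape (n_bins, n_frames)
    gray: grayscale matrix in [0, 1] of the same shape (its height
          already scaled to the spectrogram height)
    """
    flipped = np.flipud(gray)      # image row 0 maps to the highest frequency
    return orig_spec * flipped     # element-wise weighting factor

rng = np.random.default_rng(0)
spec = rng.standard_normal((513, 100)) + 1j * rng.standard_normal((513, 100))
gray = rng.random((513, 100))
target = weight_spectrogram(spec, gray)
print(target.shape)  # (513, 100)
```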

Optionally, the flipped grayscale data matrix may be down-sampled to reduce its size, and the down-sampled grayscale data matrix may then be used as the weighting factor to weight part of the frequency-domain data of the original spectrogram to obtain the target spectrogram; in this way, the two-dimensional image can be embedded at a local position of the original spectrogram. For example, suppose each frame of frequency-domain data has 2^N + 1 values and the grayscale data matrix is 2^N + 1 pixels high. After down-sampling with a factor of 1/2, the height of the grayscale data matrix becomes 2^N/2 + 1, and the M-th to (M + 2^N/2)-th values of the frequency-domain data can be weighted, so that the obtained target spectrogram contains image information only in that range, where M and N are positive integers. The target spectrogram obtained by this step can be as shown in fig. 7b: the part enclosed by the dotted line is the embedded image, the part outside the dotted line is the frequency-domain data of the original spectrogram, the horizontal axis is time, the vertical axis is frequency, and the shade of the color represents the magnitude of the energy value at the corresponding coordinate point. It can be seen that the height of the embedded two-dimensional image is smaller than the height of the original spectrogram, and the embedded image exists only at a local position of the original spectrogram. In addition, if the weighting factor is scaled to small values, its influence on the original audio after weighting is very small, so the synthesized target audio sounds substantially the same as the original audio, and the image information can be embedded into the target audio covertly.
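A sketch of the local embedding, again assuming NumPy; `embed_local` is a hypothetical helper, and the crude slicing down-sampler stands in for whatever resampling an implementation would actually use:

```python
import numpy as np

def embed_local(orig_spec, gray, m):
    """Weight only a local band of frequency bins with a down-sampled image."""
    small = np.flipud(gray)[::2, ::2]   # crude 1/2 down-sampling by slicing
    h, w = small.shape
    target = orig_spec.copy()
    target[m:m + h, :w] *= small        # image occupies bins m .. m+h-1 only
    return target

rng = np.random.default_rng(1)
spec = rng.standard_normal((1025, 200)).astype(complex)
gray = rng.random((1025, 400))          # 1025-pixel-high grayscale matrix
out = embed_local(spec, gray, m=100)    # down-sampled image is 513 rows high
print(out.shape)  # (1025, 200)
```

Bins outside the band `m .. m+h-1` are left untouched, which is what keeps the rest of the original audio intact.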

S504, generating a target audio corresponding to the target spectrogram by using the target spectrogram.

The description of this step may refer to step S104. When synthesizing each frame of the time-domain signal into audio, the aliasing (overlap) ratio may be determined from the frame shift and frame length used in the framing processing of step S502, so that the unweighted portions of the original spectrogram are synthesized back into the original audio. For example, if the frame length during framing is 2W and the frame shift is W, the aliasing ratio should be W/2W, i.e., 50%. Each frame of the time-domain signal is overlapped and spliced together to obtain a complete audio signal, namely the target audio.
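The 50% overlap-add splicing can be sketched as follows; this is an illustration only, with `overlap_add` as a hypothetical helper name, and the all-ones frames chosen so the overlap pattern is visible in the output:

```python
import numpy as np

def overlap_add(frames, frame_shift):
    """Splice per-frame time-domain signals by overlap-add."""
    frame_len, n_frames = frames.shape
    out = np.zeros(frame_shift * (n_frames - 1) + frame_len)
    for k in range(n_frames):
        # each frame starts frame_shift samples after the previous one
        out[k * frame_shift : k * frame_shift + frame_len] += frames[:, k]
    return out

frames = np.ones((4, 3))                    # 3 frames of length 4
audio = overlap_add(frames, frame_shift=2)  # 50% aliasing ratio
print(audio)  # [1. 1. 2. 2. 2. 2. 1. 1.]
```

With windowed analysis frames, the analysis window and overlap are chosen so that the overlapped window contributions sum to a constant, which is why the 50% ratio matches the 2W frame length and W frame shift above.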

In summary, when the target spectrogram is obtained in this embodiment, the grayscale data matrix is used as a weighting factor to weight the frequency-domain data of the original spectrogram; inverse Fourier transform is then performed on each frame of frequency-domain data of the target spectrogram to obtain time-domain signals, and the time-domain signals are overlapped and spliced to finally obtain the target audio. That is, the audio is obtained by modifying the original spectrogram. In this way, the target audio obtained by transforming the original spectrogram carries the image embedded in the audio, so that the image has a sound production function while the audio contains the image information, greatly improving the relevance between the audio and the image.

The method set forth in the embodiments of the present application is described below by taking as an example its application in a certain music playing software, where a user creates an image and an original spectrogram is modified to obtain new audio. The music playing software may run on, but is not limited to, a mobile phone, a computer, and the like. The software provides a temporary creation area; the user creates content in this area, the created content is stored in a picture format, and meanwhile the user selects the audio file to be modified. The processing of step S501 is performed on the created image to obtain a target grayscale image whose height is scaled to 2^10 + 1 = 1025 pixels, the same as the height of the original spectrogram. Meanwhile, the original spectrogram of the audio file is obtained according to step S502: when framing the original audio, the frame length is 30 ms and the frame shift is 15 ms; when windowing, the window function is a 30 ms Hanning window, the same length as the frame. The operation of step S503 is then performed on the grayscale data matrix and the original spectrogram: each frame of frequency-domain data of the original spectrogram has 1025 values, and only the 2nd to 513th values of each frame need to be weighted to achieve the effect of weighting the entire frequency-domain data, yielding the target spectrogram. The 2nd to 512th values of each frame of frequency-domain data of the target spectrogram are flipped up and down, and the complex values of the flipped frequency-domain data are conjugated; inverse Fourier transform is performed on each frame of the conjugated frequency-domain data to obtain the time-domain signal corresponding to each frame; each frame of the time-domain signal is then synthesized into the target audio at an aliasing ratio of 15 ms/30 ms, i.e., 50%, the ratio of the frame shift to the frame length. The finally generated target audio file contains the content authored in the creation area, and the height of the target spectrogram of the target audio is consistent with that of the target grayscale image of the embedded two-dimensional image. Viewing the target spectrogram of the obtained target audio in audio software gives example effect graphs as shown in figures 8a and 8b: the two-dimensional image is part of the target spectrogram, its height spans the entire frequency axis of the target spectrogram, and the energy values of the target spectrogram correspond to the grayscale data of each pixel point of the target grayscale image of the two-dimensional image. The generated target audio can be shared with other users, so that the audio effect after embedding the image can be shared with friends.

For another example, in a music playing software, a user selects an image to be embedded in audio and, at the same time, selects the original audio file to be modified. The image is processed in step S501 to obtain a target grayscale image whose height is scaled to 2^10 + 1 = 1025 pixels. Meanwhile, the original spectrogram of the original audio file is obtained according to step S502: when framing the original audio, the frame length is 40 ms and the frame shift is 20 ms; when windowing, the window function is a 40 ms Hanning window, the same length as the frame. The operation of step S503 is performed on the grayscale data matrix and the original spectrogram: if the original grayscale data matrix has a size of 1025 × 1025, after down-sampling it becomes 513 × 513; each frame of frequency-domain data of the original spectrogram has 1025 values, and only part of the frequency-domain data is weighted. For example, with the down-sampled 513 × 513 grayscale data matrix, the 100th to 612th values of the frequency-domain data can be weighted to obtain the target spectrogram, in which only the 100th to 612th values of the frequency-domain data contain image information; other continuous ranges of frequency-domain data, such as the 200th to 712th values or the 313th to 825th values, can be used instead.
The target spectrogram is then processed according to step S504. Since a real signal has conjugate symmetry, the 2nd to 512th values of each frame of frequency-domain data of the target spectrogram are flipped up and down, and the complex values of the flipped frequency-domain data are conjugated; inverse Fourier transform is performed on each frame of the conjugated frequency-domain data to obtain the time-domain signal corresponding to each frame; each frame of the time-domain signal is then synthesized into the target audio at an aliasing ratio of 20 ms/40 ms, i.e., 50%, the ratio of the frame shift to the frame length. The finally generated target audio file contains the information of the imported image, but the height of the target spectrogram of the target audio is not consistent with the height of the embedded image. Viewing the target spectrogram of the obtained target audio in audio software, the image is part of the target spectrogram and occupies only part of its height on the frequency axis, and the energy values of the target spectrogram correspond to the grayscale data of each pixel point of the image. The generated target audio can be shared with other users, so that the audio effect after embedding the image can be shared with friends.

For another example, by the method in the embodiment of the present application, a plurality of two-dimensional images (e.g., a plurality of two-dimensional images in a video, or a plurality of gesture images captured in real time, etc.) may be obtained as the two-dimensional image that needs to be embedded in the original audio. Specifically, gray level differences between two-dimensional images adjacent to each other in acquisition time in the two-dimensional images can be respectively calculated to obtain a plurality of gray level differences; arranging the multiple gray level difference values according to the acquisition time corresponding to the gray level difference values to obtain a target gray level image; and then, processing the frequency domain data of each pixel point in the original spectrogram corresponding to the original audio by using the gray data of each pixel point in the target gray image to obtain the target spectrogram. For example, taking the acquisition of three two-dimensional images as an example, the target grayscale images M1, M2, and M3 corresponding to the three two-dimensional images are obtained according to the operation in step S102, and the target grayscale images of two-dimensional images acquired at adjacent time intervals are subtracted to obtain two grayscale difference values: M2-M1 and M3-M2, and arranging the two gray level difference values in time sequence to obtain target gray level images corresponding to the multiple two-dimensional images. And then, an original spectrogram of the original audio is obtained according to the step S502, the target gray-scale image is used as a weighting factor to weight the frequency domain data of the original spectrogram according to the operation of the step S503, so that a target spectrogram is obtained, and then the target audio is obtained according to the target spectrogram. 
According to the method, the original audio can be modified through a plurality of two-dimensional images, so that the original audio has the information of the change of the images in the video.
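The difference-and-arrange step for a sequence of grayscale images might look like the following sketch; NumPy is assumed, `difference_sequence` is a name chosen for the example, and arranging the differences side by side along the time axis is one plausible reading of "arranged according to acquisition time":

```python
import numpy as np

def difference_sequence(gray_images):
    """Build a target grayscale image from differences of consecutive frames."""
    diffs = [b - a for a, b in zip(gray_images, gray_images[1:])]
    return np.hstack(diffs)          # arrange differences in acquisition order

m1 = np.full((3, 3), 0.1)
m2 = np.full((3, 3), 0.3)
m3 = np.full((3, 3), 0.6)
target = difference_sequence([m1, m2, m3])  # columns: M2 - M1, then M3 - M2
print(target.shape)  # (3, 6)
```

The resulting matrix would then be used as the weighting factor of step S503.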

Please refer to fig. 9, which is a flowchart illustrating another audio generating method according to an embodiment of the present application. As shown in fig. 9, this audio generating method constructs a target spectrogram directly from the target grayscale image of the two-dimensional image, thereby obtaining the target audio, and includes the following steps S901 to S903.

S901, receiving an audio generation instruction input by a user, and responding to the audio generation instruction to acquire a target gray image of the two-dimensional image.

The description of this step can refer to the description related to the above steps S101-S102, which is not repeated herein.

And S902, carrying out up-down turning processing on the gray data matrix, and taking the turned gray data matrix as frequency domain data of each pixel point in the spectrogram to obtain the target spectrogram.

The process of turning the gray data matrix up and down can represent that the gray data matrix is turned up and down according to the Y-axis direction. For example, the grayscale data matrix is (0.1,0.2, 0.3; 0.4,0.5, 0.6; 0.7,0.8,0.9), and the grayscale data matrix after the up-down flip processing is (0.7,0.8, 0.9; 0.4,0.5, 0.6; 0.1,0.2, 0.3).

In an embodiment, the flipped grayscale data matrix is used as the frequency-domain data of each pixel point in the target spectrogram; in other words, the values of the grayscale data matrix are used as the pixel data at the corresponding positions of the target spectrogram, i.e., the energy value corresponding to each pixel point in the target spectrogram. The energy value may be represented by color in the target spectrogram: for example, different energy values may be represented by the shade of one color, or by different hues, which is not limited here. Optionally, when the grayscale data matrix is used as frequency-domain data, a larger grayscale value may map to a larger energy value in the target spectrogram. For example, suppose it is preset that a larger energy value corresponds to a darker color, and the grayscale data matrix GrayP3 is (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3). After 0.9 is taken as the frequency-domain data at the corresponding position of the spectrogram, its energy value is larger than that of the positions whose grayscale values are smaller than 0.9, so the color at the position corresponding to 0.9 in the obtained target spectrogram is darker than the colors at the positions of the other values, and the embedded two-dimensional image can thus be represented in the target spectrogram through this color-depth relationship.
Alternatively, when the grayscale data matrix is used as frequency-domain data, a smaller grayscale value may map to a larger energy value in the target spectrogram. For example, again presetting that a larger energy value corresponds to a darker color, if the grayscale data matrix GrayP3 is (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3), then according to the formula 1 - GrayP3 the matrix (0.3, 0.2, 0.1; 0.6, 0.5, 0.4; 0.9, 0.8, 0.7) is obtained. After 0.9 is mapped in this way, its energy value becomes smaller than that of the positions whose grayscale values are below 0.9, so the color at the position corresponding to 0.9 in the obtained target spectrogram is lighter than the colors at the positions of the other values, and the embedded two-dimensional image can likewise be represented in the target spectrogram through the color-depth relationship.

Optionally, the energy of the obtained target spectrogram may be adjusted by scaling the grayscale data matrix with a scale factor. For example, if the flipped grayscale data matrix GrayP3 is (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3) and the scale factor is 1.1, the grayscale data matrix becomes (0.77, 0.88, 0.99; 0.44, 0.55, 0.66; 0.11, 0.22, 0.33).
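Step S902's mapping, including the optional inversion and the scale factor above, can be sketched as follows (NumPy assumed; the helper name `construct_spectrogram` is hypothetical):

```python
import numpy as np

def construct_spectrogram(gray, scale=1.0, invert=False):
    """Use a flipped grayscale matrix directly as spectrogram energy values."""
    data = np.flipud(gray)      # up-down flip, as in step S902
    if invert:                  # smaller grayscale value -> larger energy
        data = 1.0 - data
    return scale * data         # scale factor adjusts the overall energy

gray_p3 = np.array([[0.1, 0.2, 0.3],
                    [0.4, 0.5, 0.6],
                    [0.7, 0.8, 0.9]])
spec = construct_spectrogram(gray_p3, scale=1.1)
print(spec[0])  # [0.77 0.88 0.99]  (bottom image row, scaled by 1.1)
```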

And S903, generating a target audio corresponding to the target spectrogram by using the target spectrogram.

The description of this step may refer to step S104. The difference is that the target spectrogram in this embodiment is obtained by directly using the grayscale data matrix as the frequency-domain data of the target spectrogram, rather than by weighting an original spectrogram with the grayscale data matrix. When splicing each frame of the time-domain signal by overlap, the aliasing ratio only needs to be selected from the range 0-100% (excluding 100%) to obtain a complete audio signal, which is the target audio.
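For the per-frame inverse transform, it may help to note that NumPy's `irfft` already performs the mirror-and-conjugate reconstruction described in these embodiments, as this round-trip sketch shows (illustration only; `frame_to_time` is a hypothetical name):

```python
import numpy as np

def frame_to_time(one_sided):
    """Recover a real time-domain frame from one-sided frequency data.

    np.fft.irfft rebuilds the conjugate-symmetric half internally, which
    matches flipping and conjugating the interior bins by hand.
    """
    return np.fft.irfft(one_sided)

spectrum = np.fft.rfft(np.arange(8.0))     # round-trip check on a toy frame
frame = frame_to_time(spectrum)
print(np.allclose(frame, np.arange(8.0)))  # True
```

The recovered frames would then be overlap-added at the chosen aliasing ratio to form the target audio.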

In summary, when the target spectrogram is obtained in this embodiment, the grayscale data matrix of the target grayscale image is used directly as the frequency-domain data of the target spectrogram; inverse Fourier transform is performed on each frame of frequency-domain data of the target spectrogram to obtain time-domain signals, and the time-domain signals are overlapped and spliced to finally obtain the target audio. That is, the target audio is obtained by constructing the target spectrogram. If the embedded two-dimensional images are a plurality of two-dimensional images capturing changes in user motion, a sound effect due to the changing features of the plurality of two-dimensional images can be obtained. In this way, the target audio is obtained by constructing a spectrogram, the purpose of embedding image information in audio is achieved, the image has a sound production function while the audio contains the image information, and the relevance between the audio and the image is greatly improved.

The following describes the method provided by the embodiments of the present application by taking as an example its application in music playing software, where the embedded images are gesture images that change continuously in a video stream. In the music playing software, a user shoots with a camera at a fixed position and waves a finger in front of the camera at will, so that the video stream contains a plurality of gesture images. A first gesture image and a second gesture image are collected at an interval of 100 ms and processed in step S901 to obtain their corresponding target grayscale images; the difference between the target grayscale images of the first and second gesture images is then calculated, and the target grayscale image corresponding to the plurality of gesture images is determined from the grayscale difference. For example, if the grayscale data matrix of the first gesture image is (0.1, 0.2, 0.3; 0.4, 0.5, 0.6; 0.7, 0.8, 0.9) and the grayscale data matrix of the second gesture image is (0.11, 0.23, 0.34; 0.48, 0.56, 0.64; 0.78, 0.89, 0.92), the grayscale difference is (0.01, 0.03, 0.04; 0.08, 0.06, 0.04; 0.08, 0.09, 0.02). The grayscale data matrix is flipped up and down, and the flipped grayscale data matrix is used as the frequency-domain data of the target spectrogram, where a larger value of the grayscale data matrix corresponds to a larger energy in the target spectrogram. Meanwhile, a scale factor of 1.1 is used to adjust the values of the grayscale data matrix, so that the difference matrix becomes (0.011, 0.033, 0.044; 0.088, 0.066, 0.044; 0.088, 0.099, 0.022). Therefore, the energy values of the target spectrogram can be adjusted by adjusting the grayscale data matrix.
The operation of step S903 is then performed on the target spectrogram, and each frame of the time-domain signal is spliced at an aliasing ratio of 60% to obtain the target audio.

Alternatively, the above operations may be performed repeatedly over the video stream, so that multiple gesture transformations can be perceived in the synthesized target audio. For example, a plurality of gesture images are acquired from the video stream at intervals of 100 ms and processed in step S901 to obtain the grayscale data matrices T1, T2, T3, and T4, giving the grayscale differences T2 - T1 = T12, T3 - T2 = T23, and T4 - T3 = T34. T12, T23, and T34 are arranged in time order and mapped to a target spectrogram, thereby synthesizing a segment of continuous audio resulting from the gesture transformations. With this method, the obtained audio embodies the sound effect brought about by the changes of the dynamic images in the video, and the generated audio can be shared with other users to share the distinctive sound effect brought about by the dynamic transformation with friends.

It is to be understood that the above embodiments of the method are all illustrations of the audio generation method of the present application, and the descriptions of the embodiments have respective emphasis, and reference may be made to relevant descriptions of other embodiments for parts that are not described in detail in a certain embodiment.

Based on the description of the embodiment of the audio generation method, the embodiment of the invention also discloses an audio generation device. Alternatively, the audio generating means may be a computer program (comprising program code/program instructions) running in an audio generating device, such as a terminal. For example, the audio generating device may perform the methods of fig. 1, 5, 9. Referring to fig. 10, the audio generating apparatus may operate the following modules:

an obtaining module 1001, configured to receive an audio generation instruction input by a user, where the audio generation instruction is used to indicate a two-dimensional image that the user wants to embed in a generated target audio;

the obtaining module 1001 is further configured to obtain a target grayscale image of the two-dimensional image in response to the audio generation instruction;

the processing module 1002 is configured to convert the grayscale data of each pixel in the target grayscale image into frequency domain data of each pixel in a spectrogram, so as to obtain a target spectrogram;

the processing module 1002 is further configured to generate a target audio corresponding to the target spectrogram by using the target spectrogram.

In one embodiment, the processing module 1002 is further configured to receive an audio selection instruction input by a user, where the audio selection instruction is used to indicate the original audio required for generating the target audio, and to obtain, in response to the audio selection instruction, an original spectrogram corresponding to the original audio. When converting the grayscale data of each pixel point in the target grayscale image into the frequency domain data of each pixel point in a spectrogram to obtain the target spectrogram, the processing module 1002 may be specifically configured to: process the frequency domain data of each pixel point in the original spectrogram by using the grayscale data of each pixel point in the target grayscale image to obtain the target spectrogram.

In another embodiment, the grayscale data of each pixel is a grayscale data matrix, and the processing module 1002 is specifically configured to, when processing the frequency domain data of each pixel in the original spectrogram by using the grayscale data of each pixel in the target grayscale image to obtain a target spectrogram: carrying out up-down turning processing on the gray data matrix; and taking the turned gray data matrix as a weighting factor, and weighting the frequency domain data of each pixel point in the original spectrogram to obtain the target spectrogram.

In another embodiment, the grayscale data of each pixel is a grayscale data matrix, and the processing module 1002 is specifically configured to, when processing the frequency domain data of each pixel in the original spectrogram by using the grayscale data of each pixel in the target grayscale image to obtain a target spectrogram: carrying out up-and-down overturning processing on the gray data matrix, and carrying out down-sampling processing on the gray data matrix after overturning processing; and taking the gray data matrix subjected to the down-sampling processing as a weighting factor, and weighting partial frequency domain data of the original spectrogram to obtain the target spectrogram.

In another embodiment, the grayscale data of the pixels forms a grayscale data matrix, and the processing module 1002, when converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain the target spectrogram, is specifically configured to: flip the grayscale data matrix vertically and use the flipped matrix directly as the frequency domain data of each pixel in the spectrogram, obtaining the target spectrogram.
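In code, this direct mapping reduces to a single flip (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def image_to_spectrogram(gray):
    """Use the vertically flipped grayscale matrix directly as the
    spectrogram's frequency domain data: after the flip, the top row of
    the image lands in the highest frequency bin, so the image is
    upright when the spectrogram is displayed."""
    return np.flipud(np.asarray(gray, dtype=float))
```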

In another embodiment, when generating the target audio corresponding to the target spectrogram from the target spectrogram, the processing module 1002 is specifically configured to: flip each frame of frequency domain data of the target spectrogram vertically and take the complex conjugate of the flipped frequency domain data; then apply an inverse Fourier transform to each conjugated frame of frequency domain data to obtain the time domain signal corresponding to that frame, and synthesize the time domain frames into the target audio.
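The frame-by-frame reconstruction can be sketched as follows (an assumption-laden illustration: NumPy is assumed, each column of `spec` is taken to hold one frame's half-spectrum, `np.fft.irfft` supplies the conjugate symmetry needed for a real-valued signal, and the simple concatenation stands in for whatever overlap-add synthesis an implementation would actually use):

```python
import numpy as np

def spectrogram_to_audio(spec):
    """Flip each frame's frequency domain data vertically, take the complex
    conjugate, inverse-Fourier-transform the frame into a time domain
    signal, and join the frames into the target audio."""
    frames = []
    for col in spec.T:                           # one column per frame
        flipped = col[::-1]                      # undo the display-orientation flip
        frame = np.fft.irfft(np.conj(flipped))   # real time domain signal
        frames.append(frame)
    return np.concatenate(frames)
```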

In another embodiment, when obtaining the target grayscale image of the two-dimensional image, the processing module 1002 is specifically configured to: obtain an original grayscale image of the two-dimensional image and scale it proportionally (preserving its aspect ratio) to obtain a proportionally scaled grayscale image; and normalize the scaled grayscale image to obtain the target grayscale image of the two-dimensional image.
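A sketch of the scaling and normalization step (NumPy only; the nearest-neighbour resampling, the luminosity grayscale conversion, and the `target_height` parameter are illustrative assumptions, not details from the patent):

```python
import numpy as np

def prepare_gray_image(rgb, target_height=256):
    """Convert an RGB image to grayscale, scale it proportionally so its
    height matches target_height, and normalize the result to [0, 1]."""
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    h, w = gray.shape
    scale = target_height / h
    new_w = max(1, int(round(w * scale)))        # keep the aspect ratio
    rows = np.minimum((np.arange(target_height) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    scaled = gray[np.ix_(rows, cols)]            # nearest-neighbour resampling
    lo, hi = scaled.min(), scaled.max()
    return (scaled - lo) / (hi - lo) if hi > lo else np.zeros_like(scaled)
```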

In another embodiment, the two-dimensional image comprises a plurality of two-dimensional images that capture changes in a user's motion. When obtaining the grayscale image of the two-dimensional images, the processing module 1002 is specifically configured to: compute the grayscale difference between each pair of two-dimensional images adjacent in acquisition time, obtaining a plurality of grayscale differences; and arrange the grayscale differences according to their corresponding acquisition times to obtain the target grayscale image.
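One plausible reading of this embodiment in code (NumPy assumed; placing the successive difference matrices side by side in capture order is an assumption, since the patent only says the differences are arranged by acquisition time):

```python
import numpy as np

def motion_gray_image(frames):
    """frames: grayscale images ordered by capture time. Compute the
    absolute grayscale difference between each temporally adjacent pair,
    then arrange the differences in capture order to form one image."""
    diffs = [np.abs(b.astype(float) - a.astype(float))
             for a, b in zip(frames, frames[1:])]
    return np.hstack(diffs)  # differences arranged by acquisition time
```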

In another embodiment, the processing module 1002 is further configured to receive an audio playing instruction input by a user and, in response to the audio playing instruction, to play the target audio while displaying, according to the playing progress of the target audio, the region of the target spectrogram corresponding to that progress.

According to an embodiment of the present invention, the steps of the methods shown in figs. 1, 5, and 9 may be performed by the modules of the audio generating apparatus shown in fig. 10. For example, steps S101 and S102 shown in fig. 1 may be performed by the acquisition module 1001 shown in fig. 10, and steps S103 and S104 by the processing module 1002 shown in fig. 10.

According to another embodiment of the present invention, the modules of the audio generating apparatus shown in fig. 10 may be combined, individually or collectively, into one or several other modules, or some modules may be further split into multiple functionally smaller modules, achieving the same operation without affecting the technical effect of the embodiments of the present invention. The modules are divided by logical function; in practical applications, the function of one module may be realized by multiple modules, or the functions of multiple modules by one module. In other embodiments of the present invention, the audio generating apparatus may also include other modules, and in practical applications these functions may be implemented with the assistance of, or through the cooperation of, multiple modules.

When an audio generation instruction is received, the apparatus responds by obtaining a target grayscale image of the two-dimensional image the user wants to embed in the generated target audio, converts the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram (thereby associating the two-dimensional image with the target spectrogram of the target audio), and then generates from the target spectrogram the corresponding target audio. In this way the target audio is generated from the two-dimensional image: image information is embedded in the audio, the image gains a sound-production function, the audio carries the image information, and the relevance between audio and image is greatly improved.

Based on the method and apparatus embodiments described above, an embodiment of the present invention further provides an audio generating device. Referring to fig. 11, the device includes at least a processor 1101 and a memory 1102, which are connected to each other. Optionally, the device may also include an input device 1103 and/or an output device 1104. The processor 1101, the input device 1103, the output device 1104, and the memory 1102 may be connected by a bus or by other means.

The memory 1102 may be used to store a computer program (or a computer-readable storage medium containing a computer program) comprising program instructions, and the processor 1101 is configured to invoke those program instructions. The processor 1101 (or CPU) is the computing and control core of the device and is specifically adapted to load and execute the program instructions so as to implement the method flows or corresponding functions described above. The input device 1103 may include one or more of a keyboard, a touch screen, a radio frequency receiver, or other input devices; the output device 1104 may include a display screen, and may also include one or more of a speaker, a radio frequency transmitter, or other output devices. Optionally, the device may further include a memory module, a power module, an application client, and the like.

For example, in one embodiment, the processor 1101 according to embodiments of the present invention may be configured to perform a series of audio generation operations, including: receiving an audio generation instruction input by a user, where the instruction indicates the two-dimensional image the user wants to embed in the generated target audio; in response to the instruction, obtaining a target grayscale image of the two-dimensional image; converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram; and generating from the target spectrogram the corresponding target audio. For details, refer to the description of the foregoing embodiments, which is not repeated here.

Embodiments of the present invention also provide a computer-readable storage medium, which may be a memory device in the device for storing programs and data. The computer storage medium here may include both the built-in storage medium of the device and any extended storage medium the device supports. The computer storage medium provides storage space that stores the operating system of an audio generating device, such as a terminal, as well as program instructions (one or more computer programs, including program code) suitable for being loaded and executed by the processor 1101. The computer storage medium may be a high-speed RAM or a non-volatile memory (e.g., at least one disk memory); optionally, it may also be at least one computer storage medium located remotely from the processor 1101.

In one embodiment, the program instructions in the computer storage medium may be loaded and executed by the processor 1101 to implement the corresponding steps of the methods in the above embodiments; in a specific implementation, the program instructions are loaded by the processor 1101 to perform the following steps:

receiving an audio generation instruction input by a user, where the audio generation instruction indicates the two-dimensional image the user wants to embed in the generated target audio;

in response to the audio generation instruction, obtaining a target grayscale image of the two-dimensional image;

converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram;

and generating, from the target spectrogram, the target audio corresponding to the target spectrogram.

In one embodiment, the program instructions may also be loaded and executed by the processor 1101 to: receive an audio selection instruction input by a user, where the audio selection instruction indicates the original audio required for generating the target audio, and, in response to the audio selection instruction, obtain an original spectrogram corresponding to the original audio. When converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in the spectrogram to obtain the target spectrogram, the program instructions may be further loaded and specifically executed by the processor 1101 to: process the frequency domain data of each pixel in the original spectrogram using the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram.

In another embodiment, the grayscale data of the pixels forms a grayscale data matrix, and when processing the frequency domain data of each pixel in the original spectrogram using the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram, the program instructions may be loaded and specifically executed by the processor 1101 to: flip the grayscale data matrix vertically; and, using the flipped grayscale data matrix as a weighting factor, weight the frequency domain data of each pixel in the original spectrogram to obtain the target spectrogram.

In another embodiment, the grayscale data of the pixels forms a grayscale data matrix, and when processing the frequency domain data of each pixel in the original spectrogram using the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram, the program instructions may be loaded and specifically executed by the processor 1101 to: flip the grayscale data matrix vertically and downsample the flipped matrix; and, using the downsampled grayscale data matrix as a weighting factor, weight part of the frequency domain data of the original spectrogram to obtain the target spectrogram.

In another embodiment, the grayscale data of the pixels forms a grayscale data matrix, and when converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain the target spectrogram, the program instructions may be further loaded and specifically executed by the processor 1101 to: flip the grayscale data matrix vertically and use the flipped matrix directly as the frequency domain data of each pixel in the spectrogram, obtaining the target spectrogram.

In yet another embodiment, when generating the target audio corresponding to the target spectrogram from the target spectrogram, the program instructions may be further loaded and specifically executed by the processor 1101 to: flip each frame of frequency domain data of the target spectrogram vertically and take the complex conjugate of the flipped frequency domain data; then apply an inverse Fourier transform to each conjugated frame of frequency domain data to obtain the time domain signal corresponding to that frame, and synthesize the time domain frames into the target audio.

In yet another embodiment, when obtaining the target grayscale image of the two-dimensional image, the program instructions may be further loaded and specifically executed by the processor 1101 to: obtain an original grayscale image of the two-dimensional image and scale it proportionally (preserving its aspect ratio) to obtain a proportionally scaled grayscale image; and normalize the scaled grayscale image to obtain the target grayscale image of the two-dimensional image.

In yet another embodiment, the two-dimensional image comprises a plurality of two-dimensional images that capture changes in a user's motion, and when obtaining the target grayscale image of the two-dimensional images, the program instructions may be further loaded and specifically executed by the processor 1101 to: compute the grayscale difference between each pair of two-dimensional images adjacent in acquisition time, obtaining a plurality of grayscale differences; and arrange the grayscale differences according to their corresponding acquisition times to obtain the target grayscale image.

In yet another embodiment, the program instructions may be further loaded and executed by the processor 1101 to: receive an audio playing instruction input by a user; and, in response to the audio playing instruction, play the target audio while displaying, according to the playing progress of the target audio, the region of the target spectrogram corresponding to that progress.

When an audio generation instruction is received, the device responds by obtaining a target grayscale image of the two-dimensional image the user wants to embed in the generated target audio, converts the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram (thereby associating the two-dimensional image with the target spectrogram of the target audio), and then generates from the target spectrogram the corresponding target audio. In this way the target audio is generated from the two-dimensional image: image information is embedded in the audio, the image gains a sound-production function, the audio carries the image information, and the relevance between audio and image is greatly improved.

It can be understood that, for the specific working processes of the audio generating apparatus and device described above, reference may be made to the relevant descriptions in the foregoing embodiments, which are not repeated here.

It will be understood by those skilled in the art that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a computer storage medium, which may be a computer-readable storage medium, and when executed, the program may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

It should be understood that the above-described embodiments are only some of the embodiments of the present invention and are not intended to limit the scope of the claims. Equivalent changes and modifications made within the scope of the claims of the present application remain within the scope of the present invention.
