Audio alignment method and device, electronic equipment and storage medium

Document No.: 737026 · Publication date: 2021-04-20 · Views: 10 · Original language: Chinese

Note: This technology, "Audio alignment method and device, electronic equipment and storage medium", was designed and created by Li Nan and Zhang Chen on 2021-01-25. Abstract: The disclosure relates to an audio alignment method, an audio alignment device, an electronic device and a storage medium. The audio alignment method may include: acquiring an accompaniment signal and a collected singing voice signal containing a play-out accompaniment; estimating a delay at each time instant between the accompaniment signal and the play-out accompaniment in the singing voice signal; and adjusting the singing voice signal according to the estimated delay at each time instant to align the singing voice signal with the accompaniment signal.

1. A method for audio alignment, comprising:

acquiring an accompaniment signal and a collected singing voice signal containing a play-out accompaniment;

estimating a delay at each time instant between the accompaniment signal and the play-out accompaniment in the singing voice signal;

adjusting the singing voice signal according to the estimated delay at each time instant to align the singing voice signal with the accompaniment signal.

2. The method of claim 1, wherein said estimating a delay at each time instant between said accompaniment signal and a play-out accompaniment in said singing voice signal comprises:

performing a short-time Fourier transform on the accompaniment signal and the singing voice signal, respectively, to obtain a first frequency-domain audio signal corresponding to the accompaniment signal and a second frequency-domain audio signal corresponding to the singing voice signal;

estimating, based on the first frequency-domain audio signal and the second frequency-domain audio signal, a number of delayed frames at each time instant between frequency-domain signal components corresponding to the play-out accompaniment in the first frequency-domain audio signal and the second frequency-domain audio signal.

3. The method of claim 2, wherein said adjusting the singing voice signal to align the singing voice signal with the accompaniment signal according to the estimated delay per time instant comprises:

forming, for each time instant, a delay sequence of a preset length from the delay frame number at that time instant and the delay frame numbers at a plurality of consecutive preceding time instants, as the delay sequence corresponding to that time instant;

performing a confidence judgment on the delay sequence corresponding to each time instant to determine the delay frame number with the highest confidence in the delay sequence, and taking the delay frame number with the highest confidence as the final delay frame number for that time instant;

adjusting the singing voice signal according to the final delay frame number at each time instant so as to align the singing voice signal with the accompaniment signal.

4. The method of claim 3, wherein the determining the number of delay frames with the highest confidence in the delay sequences by performing confidence determination on the delay sequences corresponding to each time instant comprises:

performing the confidence judgment on the delay sequence corresponding to each time instant based on Musical Instrument Digital Interface (MIDI) information or time-stamped lyric information corresponding to the singing voice signal, to determine the delay frame number with the highest confidence in the delay sequence.

5. The method of claim 4, wherein determining the number of delay frames with highest confidence in the delay sequence by performing confidence determination on the delay sequence corresponding to each time based on MIDI (musical instrument digital interface) information or lyric information with time stamp corresponding to the singing voice signal comprises:

determining, based on the MIDI information or the time-stamped lyric information corresponding to the singing voice signal, whether a singing voice is present in the singing voice signal at each time instant covered by the delay sequence;

obtaining a statistical histogram corresponding to the delay sequence according to the determination result and the delay sequence;

taking the delay frame number corresponding to the maximum value in the statistical histogram as the delay frame number with the highest confidence.

6. The method of claim 3, wherein said adjusting said singing voice signal to align said singing voice signal with said accompaniment signal according to said final number of delayed frames per time instant comprises:

determining the number of time-domain delay samples at each time instant according to a predetermined maximum tolerated delay frame number;

adjusting the singing voice signal based on the number of time-domain delay samples at each time instant;

smoothing the adjusted singing voice signal.

7. The method of claim 6, wherein determining the number of time-domain delay samples for each time instant according to a predetermined maximum tolerated delay frame number comprises:

in response to the final delay frame number at the previous time instant being less than or equal to the sum of the final delay frame number at the current time instant and the maximum tolerated delay frame number, and greater than or equal to the difference between the final delay frame number at the current time instant and the maximum tolerated delay frame number, determining the number of time-domain delay samples at the current time instant according to the final delay frame number at the previous time instant;

in response to the final delay frame number at the previous time instant being less than the difference between the final delay frame number at the current time instant and the maximum tolerated delay frame number, or greater than the sum of the final delay frame number at the current time instant and the maximum tolerated delay frame number, determining the number of time-domain delay samples at the current time instant according to the final delay frame number at the current time instant.

8. An apparatus for audio alignment, comprising:

a signal acquisition unit configured to acquire an accompaniment signal and a collected singing voice signal containing a play-out accompaniment;

a delay estimation unit configured to estimate a delay at each time instant between the accompaniment signal and a play-out accompaniment in the singing voice signal;

an adjusting unit configured to adjust the singing voice signal according to the estimated delay at each time instant to align the singing voice signal with the accompaniment signal.

9. An electronic device, comprising:

at least one processor;

at least one memory storing computer-executable instructions,

wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 7.

10. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 7.

Technical Field

The present disclosure relates to the field of signal processing, and in particular, to a method and an apparatus for audio alignment, an electronic device, and a storage medium.

Background

With the development of Internet and smart-device technologies, the use of audio recording software (e.g., karaoke software) on various smart devices (e.g., mobile phones, computers) has become very popular. When a user records a song with audio recording software on a smart device, the system delay of the device introduces a certain delay between the played accompaniment and its output at the speaker, and this delay can vary and jitter over time. Because the singer sings along with the accompaniment actually played by the speaker, the accompaniment and the singing voice may become noticeably misaligned, and this misalignment severely degrades the quality of the subsequently recorded song.

Disclosure of Invention

The present disclosure provides a method, an apparatus, an electronic device, and a storage medium for audio alignment, to at least solve the problem of a singing voice and an accompaniment being misaligned due to delay.

According to a first aspect of embodiments of the present disclosure, there is provided a method for audio alignment, the method comprising: acquiring an accompaniment signal and a collected singing voice signal containing a play-out accompaniment; estimating a delay at each time instant between the accompaniment signal and the play-out accompaniment in the singing voice signal; and adjusting the singing voice signal according to the estimated delay at each time instant to align the singing voice signal with the accompaniment signal.

Optionally, the estimating a delay at each time instant between the accompaniment signal and the play-out accompaniment in the singing voice signal comprises: performing a short-time Fourier transform on the accompaniment signal and the singing voice signal, respectively, to obtain a first frequency-domain audio signal corresponding to the accompaniment signal and a second frequency-domain audio signal corresponding to the singing voice signal; and estimating, based on the first frequency-domain audio signal and the second frequency-domain audio signal, the number of delayed frames at each time instant between the frequency-domain signal components corresponding to the play-out accompaniment in the first and second frequency-domain audio signals.

Optionally, the adjusting the singing voice signal according to the estimated delay at each time instant to align the singing voice signal with the accompaniment signal comprises: forming, for each time instant, a delay sequence of a preset length from the delay frame number at that time instant and the delay frame numbers at a plurality of consecutive preceding time instants, as the delay sequence corresponding to that time instant; performing a confidence judgment on the delay sequence corresponding to each time instant to determine the delay frame number with the highest confidence in the delay sequence, and taking it as the final delay frame number for that time instant; and adjusting the singing voice signal according to the final delay frame number at each time instant so as to align the singing voice signal with the accompaniment signal.

Optionally, the determining the delay frame number with the highest confidence in the delay sequence by performing a confidence judgment on the delay sequence corresponding to each time instant comprises: performing the confidence judgment on the delay sequence corresponding to each time instant based on Musical Instrument Digital Interface (MIDI) information or time-stamped lyric information corresponding to the singing voice signal, to determine the delay frame number with the highest confidence in the delay sequence.

Optionally, the determining the delay frame number with the highest confidence in the delay sequence by performing the confidence judgment based on the MIDI information or the time-stamped lyric information corresponding to the singing voice signal comprises: determining, based on the MIDI information or the time-stamped lyric information corresponding to the singing voice signal, whether a singing voice is present in the singing voice signal at each time instant covered by the delay sequence; obtaining a statistical histogram corresponding to the delay sequence according to the determination result and the delay sequence; and taking the delay frame number corresponding to the maximum value in the statistical histogram as the delay frame number with the highest confidence.

Optionally, the adjusting the singing voice signal according to the final delay frame number at each time instant to align the singing voice signal with the accompaniment signal includes: determining the number of time-domain delay samples at each time instant according to a predetermined maximum tolerated delay frame number; adjusting the singing voice signal based on the number of time-domain delay samples at each time instant; and smoothing the adjusted singing voice signal.
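The adjust-and-smooth step can be sketched as a delay line with a short crossfade when the applied delay changes. This is an illustrative sketch only: the disclosure does not specify the smoothing method, so a linear crossfade over `fade` samples is assumed here.

```python
import numpy as np

def align_with_crossfade(vocal, old_delay, new_delay, fade=256):
    """Advance the singing voice signal by new_delay samples,
    crossfading from the old_delay-advanced version so that a
    delay change does not produce an audible discontinuity."""
    def advanced(x, d):
        # advance by d samples: drop the first d, zero-pad the end
        return np.concatenate([x[d:], np.zeros(d)])
    a = advanced(vocal, old_delay)
    b = advanced(vocal, new_delay)
    # ramp rises linearly from 0 to 1 over the first `fade` samples
    ramp = np.minimum(np.arange(len(vocal)) / fade, 1.0)
    return (1.0 - ramp) * a + ramp * b
```

After the crossfade region, the output equals the signal advanced by the new delay; before it, the old alignment is blended out gradually.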

Optionally, the determining the number of time-domain delay samples at each time instant according to the predetermined maximum tolerated delay frame number includes: in response to the final delay frame number at the previous time instant being less than or equal to the sum of the final delay frame number at the current time instant and the maximum tolerated delay frame number, and greater than or equal to the difference between the final delay frame number at the current time instant and the maximum tolerated delay frame number, determining the number of time-domain delay samples at the current time instant according to the final delay frame number at the previous time instant; and in response to the final delay frame number at the previous time instant being less than the difference between the final delay frame number at the current time instant and the maximum tolerated delay frame number, or greater than the sum of the final delay frame number at the current time instant and the maximum tolerated delay frame number, determining the number of time-domain delay samples at the current time instant according to the final delay frame number at the current time instant.
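The two conditions above amount to a simple hysteresis rule: keep the previous final delay while it stays within the tolerance band around the current one, and switch otherwise. A minimal sketch follows; the mapping from frames to time-domain samples via a hop size of 512 is an assumption, since the disclosure does not fix the frame-to-sample correspondence.

```python
def time_domain_delay_samples(prev_final, cur_final, max_tol, hop=512):
    """Number of time-domain delay samples for the current instant.

    Keep the previous final delay frame number while it lies within
    [cur_final - max_tol, cur_final + max_tol]; otherwise switch to
    the current final delay frame number. One frame is assumed to
    correspond to `hop` time-domain samples."""
    if cur_final - max_tol <= prev_final <= cur_final + max_tol:
        frames = prev_final
    else:
        frames = cur_final
    return frames * hop

# Within tolerance: stick with the previous delay (10 frames)
within = time_domain_delay_samples(10, 12, max_tol=3)   # 10 * 512
# Outside tolerance: jump to the current delay (20 frames)
outside = time_domain_delay_samples(10, 20, max_tol=3)  # 20 * 512
```

The rule keeps the applied delay stable against small estimation jitter while still following genuine delay changes.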

Optionally, the method further comprises: mixing the aligned singing voice signal with the accompaniment signal to obtain a mixed song signal.

According to a second aspect of embodiments of the present disclosure, there is provided an apparatus for audio alignment, the apparatus comprising: a signal acquisition unit configured to acquire an accompaniment signal and a collected singing voice signal containing a play-out accompaniment; a delay estimation unit configured to estimate a delay at each time instant between the accompaniment signal and a play-out accompaniment in the singing voice signal; an adjusting unit configured to adjust the singing voice signal according to the estimated delay at each time instant to align the singing voice signal with the accompaniment signal.

Optionally, the estimating a delay at each time instant between the accompaniment signal and the play-out accompaniment in the singing voice signal comprises: performing a short-time Fourier transform on the accompaniment signal and the singing voice signal, respectively, to obtain a first frequency-domain audio signal corresponding to the accompaniment signal and a second frequency-domain audio signal corresponding to the singing voice signal; and estimating, based on the first frequency-domain audio signal and the second frequency-domain audio signal, the number of delayed frames at each time instant between the frequency-domain signal components corresponding to the play-out accompaniment in the first and second frequency-domain audio signals.

Optionally, the adjusting the singing voice signal according to the estimated delay at each time instant to align the singing voice signal with the accompaniment signal comprises: forming, for each time instant, a delay sequence of a preset length from the delay frame number at that time instant and the delay frame numbers at a plurality of consecutive preceding time instants, as the delay sequence corresponding to that time instant; performing a confidence judgment on the delay sequence corresponding to each time instant to determine the delay frame number with the highest confidence in the delay sequence, and taking it as the final delay frame number for that time instant; and adjusting the singing voice signal according to the final delay frame number at each time instant so as to align the singing voice signal with the accompaniment signal.

Optionally, the determining the delay frame number with the highest confidence in the delay sequence by performing a confidence judgment on the delay sequence corresponding to each time instant comprises: performing the confidence judgment on the delay sequence corresponding to each time instant based on Musical Instrument Digital Interface (MIDI) information or time-stamped lyric information corresponding to the singing voice signal, to determine the delay frame number with the highest confidence in the delay sequence.

Optionally, the determining the delay frame number with the highest confidence in the delay sequence by performing the confidence judgment based on the MIDI information or the time-stamped lyric information corresponding to the singing voice signal comprises: determining, based on the MIDI information or the time-stamped lyric information corresponding to the singing voice signal, whether a singing voice is present in the singing voice signal at each time instant covered by the delay sequence; obtaining a statistical histogram corresponding to the delay sequence according to the determination result and the delay sequence; and taking the delay frame number corresponding to the maximum value in the statistical histogram as the delay frame number with the highest confidence.

Optionally, the adjusting the singing voice signal according to the final delay frame number at each time instant to align the singing voice signal with the accompaniment signal includes: determining the number of time-domain delay samples at each time instant according to a predetermined maximum tolerated delay frame number; adjusting the singing voice signal based on the number of time-domain delay samples at each time instant; and smoothing the adjusted singing voice signal.

Optionally, the determining the number of time-domain delay samples at each time instant according to the predetermined maximum tolerated delay frame number includes: in response to the final delay frame number at the previous time instant being less than or equal to the sum of the final delay frame number at the current time instant and the maximum tolerated delay frame number, and greater than or equal to the difference between the final delay frame number at the current time instant and the maximum tolerated delay frame number, determining the number of time-domain delay samples at the current time instant according to the final delay frame number at the previous time instant; and in response to the final delay frame number at the previous time instant being less than the difference between the final delay frame number at the current time instant and the maximum tolerated delay frame number, or greater than the sum of the final delay frame number at the current time instant and the maximum tolerated delay frame number, determining the number of time-domain delay samples at the current time instant according to the final delay frame number at the current time instant.

Optionally, the apparatus further comprises: a mixing unit configured to mix the aligned singing voice signal with the accompaniment signal to obtain a mixed song signal.

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the method as described above.

According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the method as described above.

According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the method as described above.

The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effect: by estimating the delay at each time instant between the accompaniment signal and the play-out accompaniment in the singing voice signal, and adjusting the singing voice signal according to the estimated delay at each time instant, the embodiments of the present disclosure can align the singing voice signal with the accompaniment signal in real time.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is an exemplary system architecture to which exemplary embodiments of the present disclosure may be applied;

fig. 2 is a flowchart of a method for audio alignment of an exemplary embodiment of the present disclosure;

fig. 3 is a schematic diagram illustrating a method for audio alignment of an exemplary embodiment of the present disclosure;

fig. 4 is a block diagram of an apparatus for audio alignment of an exemplary embodiment of the present disclosure;

fig. 5 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

Herein, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.

Fig. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired links, wireless communication links, or fiber optic cables. A user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages (e.g., audio/video data upload requests, audio/video data acquisition requests). Various communication client applications, such as a singing application, audio/video recording software, an audio/video player, an instant messaging tool, a mailbox client, or social platform software, may be installed on the terminal devices 101, 102, 103. The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen and capable of playing and recording audio and video, including but not limited to smart phones, tablet computers, laptop computers, and desktop computers. When they are software, they may be installed in the electronic devices listed above and may be implemented as a plurality of software programs or software modules (for example, to provide distributed services) or as a single software program or module; this is not particularly limited herein.

The terminal devices 101, 102, 103 may be equipped with an image capture device (e.g., a camera) to capture video data. In practice, the smallest visual unit that makes up a video is a frame: each frame is a static image, and a temporally successive sequence of frames is composited to form a motion video. Further, the terminal devices 101, 102, 103 may also be equipped with a component (e.g., a speaker) that converts an electric signal into sound to play audio, and with a device (e.g., a microphone) that picks up sound and converts it into a digital audio signal.

The server 105 may be a server providing various services, for example a background server providing support for the multimedia applications installed on the terminal devices 101, 102, 103. The background server may analyze and store received data such as audio/video data upload requests, and may also receive audio/video data acquisition requests sent by the terminal devices 101, 102, 103 and feed the requested audio/video data back to them. Further, the server 105 may respond to a query request (e.g., a song query request) from a user by feeding back corresponding information (e.g., song information) to the terminal devices 101, 102, 103.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers or as a single server. When the server is software, it may be implemented as multiple software programs or software modules (e.g., to provide distributed services) or as a single software program or module; this is not particularly limited herein.

It should be noted that the method for audio alignment provided by the embodiment of the present disclosure is generally performed by the terminal devices 101, 102, 103, and accordingly, the apparatus for audio alignment is generally disposed in the terminal devices 101, 102, 103.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation, and the disclosure is not limited thereto.

As described in the background of the present disclosure, due to system delay, when a singer sings along with the accompaniment actually played by a speaker, the accompaniment and the singing voice become noticeably misaligned. Currently, two approaches are used to align the singing voice with the accompaniment. In the first approach, after the song has been recorded, the software interface displays a time slider for aligning the singing voice with the accompaniment and lets the recording user adjust the misaligned singing voice manually. However, this approach requires manual operation by the user, which is not only inconvenient but also hard to do accurately; moreover, when the misalignment varies over the course of the song, this approach can hardly support aligning different misalignment amounts in different time periods, and it offers essentially no real-time capability. In the second approach, the aligned time points are found by aligning the pitch of the singing voice with a MIDI score, comparing the correlation between the pitch of the singing voice signal and the score. However, this approach relies on accurate pitch detection, which requires the singing voice signal to have a high signal-to-noise ratio and to be free of noise interference. This is difficult to achieve when the user plays the accompaniment out loud: the singing voice signal is then interfered with by the accompaniment music, and the pitch is hard to detect accurately.

In contrast, the automatic alignment scheme of the present disclosure can track the varying delay when the accompaniment is played out through a speaker, and align the singing voice signal with the accompaniment signal in real time according to that delay.

Hereinafter, the concept of the present disclosure will be described in detail with reference to the accompanying drawings.

Fig. 2 is a flowchart of a method for audio alignment (hereinafter, simply referred to as "audio alignment method" for convenience of description) of an exemplary embodiment of the present disclosure.

In step S201, an accompaniment signal and a collected singing voice signal containing a play-out accompaniment are acquired. For example, the accompaniment signal may be read from a memory, and the singing voice signal containing the play-out accompaniment may be collected through a microphone while the accompaniment is played out through a speaker.

In step S202, a delay at each time instant between the accompaniment signal and the play-out accompaniment in the singing voice signal may be estimated. For example, in step S202, a short-time Fourier transform may first be performed on the accompaniment signal and the singing voice signal, respectively, to obtain a first frequency-domain audio signal corresponding to the accompaniment signal and a second frequency-domain audio signal corresponding to the singing voice signal; then, the number of delayed frames at each time instant between the frequency-domain signal components corresponding to the play-out accompaniment in the first and second frequency-domain audio signals may be estimated based on the first and second frequency-domain audio signals.

Specifically, performing a short-time fourier transform (STFT) on the accompaniment signal and the singing voice signal containing the play-out accompaniment may be expressed as:

BGM(n)=STFT(bgm(t))

VOCAL(n)=STFT(vocal(t))

wherein bgm(t) and vocal(t) are the time-domain audio signals of the accompaniment signal and of the singing voice signal containing the play-out accompaniment, respectively; BGM(n) and VOCAL(n) are the first frequency-domain audio signal corresponding to the accompaniment signal and the second frequency-domain audio signal corresponding to the singing voice signal containing the play-out accompaniment, respectively; n is a frame index with 0 < n ≤ N, and N is the total number of frames. Since the present disclosure performs the same processing in each frequency band, the frequency-band index is omitted from the frequency-domain signals. In addition, the vocal(t) and VOCAL(n) signals have the following compositions, respectively:

vocal(t)=cleanVocal(t)+spkBgm(t)

VOCAL(n)=CLEANVOCAL(n)+SPKBGM(n)

wherein cleanVocal(t) and spkBgm(t) are the time-domain audio signals of the pure singing voice and of the played-out accompaniment, respectively, and CLEANVOCAL(n) and SPKBGM(n) are the frequency-domain signal components corresponding to the pure singing voice and the played-out accompaniment, respectively.
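The framing and transform above can be sketched with NumPy as follows. This is an illustrative sketch only: the frame length, hop size, and Hann window are assumptions, since the disclosure does not fix these parameters, and the random signals merely stand in for bgm(t) and vocal(t).

```python
import numpy as np

def stft(x, frame_len=1024, hop=512):
    """Short-time Fourier transform: window each frame with a Hann
    window and take its FFT (rows: frame index n, cols: frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# Stand-ins for bgm(t) and vocal(t), where vocal(t) = cleanVocal(t) + spkBgm(t)
rng = np.random.default_rng(0)
bgm_t = rng.standard_normal(48000)
vocal_t = rng.standard_normal(48000)
BGM = stft(bgm_t)      # BGM(n): first frequency-domain audio signal
VOCAL = stft(vocal_t)  # VOCAL(n): second frequency-domain audio signal
```

Both transforms share the same framing, so each row index n of BGM and VOCAL refers to the same time instant.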

After obtaining the first frequency-domain audio signal BGM(n) and the second frequency-domain audio signal VOCAL(n), the delay between the SPKBGM(n) signal component in VOCAL(n) and BGM(n) may be estimated based on BGM(n) and VOCAL(n). For example, the delay delayRaw(n) at the n-th time instant between the SPKBGM(n) signal component in VOCAL(n) and BGM(n) may be estimated using a correlation-based delay estimation method, a delay estimation method based on spectral energy similarity, or the like; delayRaw(n) indicates the number of delayed frames at the n-th time instant between the BGM(n) and VOCAL(n) signals.
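As one possible realization of the correlation-based estimation mentioned above (an illustrative sketch, not necessarily the disclosure's exact method), delayRaw(n) can be taken as the frame lag that maximizes the normalized correlation between the magnitude spectra of the BGM and VOCAL frames:

```python
import numpy as np

def estimate_delay_frames(BGM, VOCAL, n, max_lag=20):
    """Estimate delayRaw(n) as the frame lag d that maximizes the
    normalized correlation between the magnitude spectrum of BGM
    frame (n - d) and that of VOCAL frame n."""
    v = np.abs(VOCAL[n])
    best_lag, best_score = 0, -np.inf
    for d in range(min(max_lag, n) + 1):
        b = np.abs(BGM[n - d])
        denom = np.linalg.norm(b) * np.linalg.norm(v)
        score = float(b @ v / denom) if denom > 0 else 0.0
        if score > best_score:
            best_lag, best_score = d, score
    return best_lag

# Synthetic check: VOCAL lags BGM by exactly 3 frames
rng = np.random.default_rng(0)
BGM = rng.standard_normal((50, 64))
VOCAL = np.vstack([np.zeros((3, 64)), BGM[:-3]])
delay_raw_30 = estimate_delay_frames(BGM, VOCAL, n=30)   # 3
```

In practice the VOCAL frames also contain the CLEANVOCAL(n) component, so individual per-frame estimates can be wrong, which is exactly why the later confidence judgment over a delay sequence is needed.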

Next, in step S203, the singing voice signal is adjusted according to the estimated delay at each time instant so as to align the singing voice signal with the accompaniment signal. According to an exemplary embodiment, in step S203, the number of delayed frames at each time instant and the numbers of delayed frames at a plurality of consecutive preceding time instants may first be combined into a delay sequence of predetermined length, taken as the delay sequence corresponding to that time instant. For example, the values delayRaw(n) obtained above from time instant n−M+1 to time instant n are combined into a delay sequence of length M:

delayRawVec(n) = [delayRaw(n−M+1), delayRaw(n−M+2), ..., delayRaw(n−1), delayRaw(n)]

Next, the number of delayed frames with the highest confidence in the delay sequence is determined by evaluating the confidence of the delay sequence corresponding to each time instant, and that number is taken as the final number of delayed frames for that time instant. Since delay estimation may produce false detections, performing a confidence judgment on the estimated delay sequence and selecting the number of delayed frames with the highest confidence can further improve the accuracy of the delay estimation.

Specifically, for example, the confidence of the delay sequence corresponding to each time instant may be judged based on musical instrument digital interface (MIDI) information or time-stamped lyric information corresponding to the singing voice signal, so as to determine the number of delayed frames with the highest confidence in the delay sequence. As an example, it may first be determined, based on the MIDI information or the time-stamped lyric information corresponding to the singing voice signal, whether singing voice is present at each time instant covered by the delay sequence; then a statistical histogram corresponding to the delay sequence is obtained from this determination result and the delay sequence; and finally, the number of delayed frames corresponding to the maximum value in the statistical histogram is taken as the number of delayed frames with the highest confidence.

Specifically, whether singing voice is currently present at the time instants in the above sequence can be determined from the MIDI information (or from lyric information with time labels, such as a lyrics file). The presence of singing voice can be marked by, for example, lyrics(n): lyrics(n) = 1 if singing voice is present, and lyrics(n) = 0 if not. The statistical histogram of the sequence delayRawVec(n) is then computed in combination with these marks, for example as follows:

First, a histogram delayMap[L] = {0} is initialized, where L is the maximum delay that can occur and delayMap[L] denotes a sequence of L elements. Then delayRawVec(n) is traversed to build the histogram, for example:

delayMap(delayRaw(i)) += 10, if lyrics(i) = 0

delayMap(delayRaw(i)) += 1, if lyrics(i) = 1

where i = {n−M+1, n−M+2, ..., n−1, n} and delayMap(x) denotes the x-th element of the histogram delayMap[L]. The basic meaning of this operation is to use a low statistical weight for time periods in which singing voice is present and a high weight for time periods without singing voice, thereby preventing the clean singing voice component CLEANVOCAL(n) from interfering with the delay estimation.

It should be noted that the above is only one example of the histogram statistics; the histogram statistics of the present disclosure are not limited thereto. For example, the values added to delayMap(delayRaw(i)) when lyrics(i) is 0 and when lyrics(i) is 1 are not limited to 10 and 1; other values may be used, as long as the value added when lyrics(i) is 0 is larger than the value added when lyrics(i) is 1. Nor is the update of delayMap(delayRaw(i)) limited to addition; another method may be used. For example, delayMap(delayRaw(i)) may instead be multiplied by different values for lyrics(i) = 0 and lyrics(i) = 1, the multiplier for lyrics(i) = 0 being larger than that for lyrics(i) = 1.

Finally, the number of delayed frames corresponding to the maximum value in the histogram is taken as the number of delayed frames with the highest confidence (i.e., the final number of delayed frames at each time instant), that is,

delayFinal(n)=max(delayMap[L])

wherein max(delayMap[L]) here denotes the number of delayed frames (the bin index) corresponding to the maximum value in delayMap[L].
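The confidence judgment described above can be sketched as follows; the helper name, the window contents, and the default L are hypothetical, while the 10/1 weights follow the example values discussed in the text:

```python
import numpy as np

def confident_delay(delay_raw_vec, lyrics_vec, L=64):
    """delayFinal(n): build the weighted histogram delayMap[L] over the
    window delayRawVec(n) and return the bin index of its maximum.
    Frames without singing voice (lyrics == 0) get weight 10, frames
    with singing voice (lyrics == 1) get weight 1, so the clean-vocal
    component interferes less with the estimate."""
    delay_map = np.zeros(L)
    for d, has_vocal in zip(delay_raw_vec, lyrics_vec):
        delay_map[d] += 1 if has_vocal else 10
    return int(delay_map.argmax())

# Window of M = 10 raw estimates: frames with singing voice carry a
# spurious value (7); the instrumental-only frames agree on 3.
delay_raw_vec = [3, 3, 7, 7, 7, 7, 7, 3, 3, 3]
lyrics_vec    = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
final_delay = confident_delay(delay_raw_vec, lyrics_vec)
```

With the weights applied, bin 3 accumulates 5 × 10 = 50 while the spurious bin 7 accumulates only 5 × 1 = 5, so the down-weighted sung frames do not flip the result.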

Finally, the singing voice signal may be adjusted according to the final number of delayed frames at each time instant so as to be aligned with the accompaniment signal. Specifically, for example, the number of time-domain delay samples at each time instant may first be determined according to a predetermined maximum tolerated number of delayed frames. According to an exemplary embodiment, in response to the final number of delayed frames at the time instant preceding each time instant being less than or equal to the sum of the final number of delayed frames at that time instant and the maximum tolerated number of delayed frames, and greater than or equal to the difference between the final number of delayed frames at that time instant and the maximum tolerated number of delayed frames, the number of time-domain delay samples at that time instant is determined from the final number of delayed frames at the preceding time instant; and in response to the final number of delayed frames at the preceding time instant being less than that difference or greater than that sum, the number of time-domain delay samples at that time instant is determined from the final number of delayed frames at that time instant.

For example, if delayFinal(n−1) ≤ delayFinal(n) + tolerance and delayFinal(n−1) ≥ delayFinal(n) − tolerance (where tolerance is the set maximum tolerated number of delayed frames, which may be, for example, the number of frames corresponding to 30 ms of audio), the number of delay samples at the nth time instant is determined from the final number of delayed frames delayFinal(n−1) at the preceding time instant (i.e., the (n−1)th time instant), e.g., delaySamples(n) = frame × delayFinal(n−1), where frame is the number of samples corresponding to one frame of data;

if delayFinal(n−1) < delayFinal(n) − tolerance or delayFinal(n−1) > delayFinal(n) + tolerance, the number of time-domain delay samples delaySamples(n) at the nth time instant is determined from the final number of delayed frames at the nth time instant, e.g., delaySamples(n) = frame × delayFinal(n).
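This tolerance rule can be sketched as a small hysteresis function; the frame size and tolerance below are illustrative values, and the function name is hypothetical:

```python
def delay_samples_per_instant(delay_final, frame=256, tolerance=2):
    """Convert the final per-instant delays (in frames) to time-domain
    sample counts delaySamples(n). While the previous delay stays within
    +/- tolerance frames of the current one, keep using the previous
    value (avoids needless jumps); otherwise switch to the current value."""
    out = []
    prev = delay_final[0]
    for d in delay_final:
        if not (d - tolerance <= prev <= d + tolerance):
            prev = d  # change larger than tolerated: follow the new delay
        out.append(prev * frame)
    return out

# Small fluctuations (4 -> 5 -> 4) are absorbed; the jump to 9 is followed.
samples = delay_samples_per_instant([4, 4, 5, 4, 9, 9])
```

Holding the previous value inside the tolerance band keeps delaySamples(n) piecewise constant, so the time-domain signal is not re-shifted for every one-frame jitter of the estimator.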

Next, the singing voice signal is adjusted based on the number of time-domain delay samples at each time instant. For example, vocalOut(t − delaySamples(n)) = vocal(t), where vocalOut(t) is the adjusted singing voice signal.

Finally, the adjusted singing voice signal is smoothed. Since the signal adjusted according to the delay may contain overlaps or discontinuities, smoothing the adjusted singing voice signal gives the signal better continuity.
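A minimal sketch of the adjustment and smoothing steps follows. It shifts each frame of the singing voice forward by its delay (vocalOut(t − delaySamples(n)) = vocal(t)) and applies a short linear cross-fade at seams where the delay changes; the disclosure does not specify the smoothing method, so the cross-fade and its length are assumptions:

```python
import numpy as np

def align_and_smooth(vocal, delay_per_frame, frame=256, fade=64):
    """Shift each frame of vocal forward by its per-frame delay (in
    samples) and cross-fade at seams where the delay changes, since the
    shifted frames may overlap or leave a gap."""
    out = np.zeros(len(vocal))
    ramp = np.linspace(0.0, 1.0, fade)
    for n, d in enumerate(delay_per_frame):
        src = vocal[n * frame:(n + 1) * frame]
        start = n * frame - d
        if start < 0:                      # frame shifted before t = 0
            src, start = src[-start:], 0
        end = min(start + len(src), len(out))
        src = src[:end - start]
        if n and d != delay_per_frame[n - 1] and len(src) >= fade:
            # seam: fade from what is already written into the new frame
            out[start:start + fade] = (out[start:start + fade] * (1 - ramp)
                                       + src[:fade] * ramp)
            out[start + fade:end] = src[fade:]
        else:
            out[start:end] = src
    return out

# Constant delay of 512 samples on a ramp signal: the output is simply
# the input advanced by 512 samples, with trailing zeros.
vocal = np.arange(4096.0)
out = align_and_smooth(vocal, [512] * 16)
```

With a constant delay no seam occurs and the function reduces to a plain shift; the cross-fade branch only activates when delaySamples(n) changes between frames.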

Optionally, the method shown in fig. 2 may further include: mixing the aligned singing voice signal and the accompaniment signal to obtain a mixed song signal (not shown). For example, the adjusted singing voice signal vocalOut(t) and the original accompaniment signal bgm(t) are mixed to obtain the final song signal:

music(t)=limitation(bgm(t)+vocalOut(t))

wherein limitation(*) denotes amplitude control of the signal, which prevents clipping distortion. A higher-quality song can be obtained in this way.
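The mixing step can be sketched as follows; a hard clip to [−1, 1] stands in for limitation(*), whose exact form the disclosure leaves open (a production limiter would typically apply smoothed gain reduction instead):

```python
import numpy as np

def limitation(x):
    """Amplitude control to avoid clipping distortion: here a simple
    hard clip to [-1, 1] (an assumption; not the only possible limiter)."""
    return np.clip(x, -1.0, 1.0)

def mix(bgm, vocal_out):
    """music(t) = limitation(bgm(t) + vocalOut(t))"""
    return limitation(bgm + vocal_out)

# Samples whose sum exceeds full scale are limited; the rest pass through.
music = mix(np.array([0.5, -0.8, 0.2]), np.array([0.7, -0.5, 0.1]))
```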

The audio alignment method according to the exemplary embodiment of the present disclosure has been described above with reference to fig. 2. With this audio alignment method, the delay at each time instant can be estimated while the accompaniment is played out through a loudspeaker, and the singing voice signal can be automatically aligned with the accompaniment signal in real time according to the estimated delay.

In order to more intuitively understand the audio alignment method according to the exemplary embodiment of the present disclosure, the audio alignment method according to the exemplary embodiment of the present disclosure will be briefly described below with reference to fig. 3.

Fig. 3 is a schematic diagram illustrating a method for audio alignment of an exemplary embodiment of the present disclosure. As shown in fig. 3, the accompaniment signal and the collected singing voice signal including the play-out accompaniment may be first acquired, and then delay estimation may be performed to estimate a delay at each time instant between the accompaniment signal and the play-out accompaniment in the singing voice signal, on the basis of which an estimated delay sequence corresponding to each time instant may be obtained. Then, the delay frame number with the highest confidence in the delay sequence can be determined as the final delay result by performing confidence judgment on the delay sequence corresponding to each time. For example, the delay sequence corresponding to each time may be subjected to confidence determination based on MIDI information to determine the number of delay frames with the highest confidence in the delay sequence. Next, the singing voice signal may be adjusted to align the singing voice signal with the accompaniment signal according to the final number of delayed frames at each time instant. Finally, the aligned singing voice signal and the accompaniment signal can be subjected to sound mixing processing to obtain a complete song signal. According to the audio alignment method shown in fig. 3, a higher quality song can be obtained.

Fig. 4 is a block diagram of an apparatus for audio alignment (hereinafter simply referred to as an "audio alignment apparatus" for convenience of description) of an exemplary embodiment of the present disclosure.

referring to fig. 4, the audio aligning apparatus 400 may include a signal acquisition unit 401, a delay estimation unit 402, and an adjustment unit 403. Specifically, the signal acquisition unit 401 may be configured to acquire an accompaniment signal and a collected singing voice signal including a play-out accompaniment. The delay estimation unit 402 may be configured to estimate a delay at each time instant between the accompaniment signal and the play-out accompaniment in the singing voice signal. The adjusting unit 403 may be configured to adjust the singing voice signal according to the estimated delay per time instant to align the singing voice signal with the accompaniment signal. Optionally, the audio aligning apparatus 400 may further include a mixing unit (not shown), and the mixing unit may be configured to mix the aligned singing sound signal and the aligned accompaniment signal to obtain a mixed song signal.

Since the audio alignment method shown in fig. 2 can be performed by the audio alignment apparatus 400 shown in fig. 4, and the signal obtaining unit 401, the delay estimation unit 402, and the adjustment unit 403 can respectively perform operations corresponding to step S201, step S202, and step S203 in fig. 2, any relevant details related to the operations performed by the units in fig. 4 can be referred to in the corresponding description of fig. 2, and are not repeated here.

Furthermore, it should be noted that although the audio aligning apparatus 400 is described above as being divided into units for respectively performing corresponding processes, it is clear to those skilled in the art that the processes performed by the units may be performed without any specific unit division by the audio aligning apparatus 400 or without explicit demarcation between the units. In addition, the audio aligning apparatus 400 may further include a communication unit (not shown), an audio playing unit (not shown), a processing unit (not shown), and a storage unit (not shown), among others.

Fig. 5 is a block diagram of an electronic device 500 according to an embodiment of the disclosure. Referring to fig. 5, the electronic device 500 may include at least one processor 502 and at least one memory 501 storing a set of computer-executable instructions that, when executed by the at least one processor 502, perform an audio alignment method according to an embodiment of the present disclosure.

By way of example, the electronic device may be a PC, a tablet device, a personal digital assistant, a smartphone, or any other device capable of executing the above set of instructions. The electronic device need not be a single electronic device but can be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or in combination. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).

In an electronic device, a processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.

The processor may execute instructions or code stored in the memory, which may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.

The memory may be integral to the processor, e.g., RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the memory may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the memory.

In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.

According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform an audio alignment method according to an exemplary embodiment of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or other optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium can run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems so that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.

According to an embodiment of the present disclosure, there may also be provided a computer program product comprising computer instructions which, when executed by a processor, implement an audio alignment method according to an exemplary embodiment of the present disclosure.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
