A lyrics timestamp generation method based on spectrogram recognition

Document No.: 1755576    Publication date: 2019-11-29

Note: This technology, "A lyrics timestamp generation method based on spectrogram recognition" (一种基于语谱图识别的歌词时间戳生成方法), was created by 鄢腊梅, 郑杰文, 蒋琤琤, 袁友伟, 王奕菲 and 施振浪 on 2019-07-18. Its main content is as follows. The invention discloses a lyrics timestamp generation method based on spectrogram recognition. Step S1: separate the vocals from the accompaniment using spatial position differences in the audio. Step S2: segment the processed audio into lines according to loudness and BPM. Step S3: convert the segmented audio into spectrograms and, using image recognition, segment each line in time down to individual characters, obtaining the required lyrics timestamps. With the technical solution of the invention, the vocals are extracted from the original audio and their spectrograms are recognized with an Adaboost model, which effectively improves alignment accuracy and greatly reduces the cost of manual alignment.

1. A lyrics timestamp generation method based on spectrogram recognition, characterized in that it comprises at least the following steps:

Step S1: separating the vocals and the accompaniment using spatial position differences in the audio;

Step S2: performing a line-based time segmentation of the processed audio according to loudness and BPM;

Step S3: converting the audio segmented in step S2 into spectrograms, and performing a character-based time segmentation on them using image recognition, thereby obtaining the required lyrics timestamps;

wherein step S1 further comprises:

Step S1.1: obtaining the audio of the left and right channels separately, and inverting the left-channel audio; the inversion is

x'_L(i) = -x_L(i)

where x'_L(i) denotes each left-channel sample after inversion and x_L(i) denotes each original left-channel sample, i.e. the amplitude x at time i; each left-channel sample can further be written as

x_L(i) = x_c(i) + x_d(i)

where x_c(i) denotes the sample located in the center channel and x_d(i) denotes the audio in the left channel that deviates from the center channel;

Step S1.2: superimposing the inverted left-channel audio onto the right channel to obtain the audio far from the center channel; the superposition is

x_s(i) = x_R(i) + x'_L(i) = x_R(i) - x_c(i) - x_d(i)

where x_s(i) is each audio sample far from the center channel, x_R(i) is the original data of each right-channel sample, and x_d(i) is the audio in the left channel that deviates from the center channel;

Step S1.3: extracting the center channel from the off-center audio obtained above, giving the center-channel audio x_c(i), where x_c(i) is each sample of the center-channel audio, sign(*) is the sign function, and x_i is the sample obtained by merging the left and right channels of the original audio;

Step S1.4: filtering the center channel obtained above:

x_f(i) = f(x_c(i))

where x_f(i) is the audio sample obtained after filtering and f(*) denotes the filter function;

Step S1.5: performing pitch detection on the original audio and on the audio obtained in step S1.4, where T denotes the detected pitch period, τmin the minimum pitch period, τmax the maximum pitch period, and argmin selects the argument that minimizes the detection function over [τmin, τmax];

Step S1.6: comparing the two fundamental frequencies, finding where they differ, and returning to step S1.4 to adjust the filter parameters within the corresponding time range according to the differences, finally obtaining the separated vocals;

Step S1.4 further comprises:

Step S1.4.1: estimating the noise using linear prediction;

Step S1.4.2: applying DFT to the original audio and the estimated noise and computing their power spectra;

Step S1.4.3: subtracting the two power spectra and taking the 2-norm of the difference;

Step S1.4.4: performing the phase transformation using the noise phase;

Step S1.4.5: applying IDFT to the final result;

Step S1.5 further comprises:

Step S1.5.1: applying DFT to the input audio data to obtain its complex spectrum;

Step S1.5.2: squaring the modulus of the complex spectrum to obtain the power spectrum;

Step S1.5.3: taking the logarithm of the power spectrum and then applying IDFT, finally obtaining the cepstrum;

Step S2 further comprises:

Step S2.1: segmenting the audio using a loudness threshold and comparing the result with the number of lines in the input lyrics, obtaining the line-based segments S_j, j = 1, 2, ...;

Step S2.2: performing BPM detection on each segment S_j, where BPM denotes the number of beats per minute;

Step S2.3: marking the segments whose BPM value changes abruptly, denoted S'_j;

Step S2.4: using bisection, taking the center time point of each marked segment as a new split point and re-splitting it, then detecting the BPM of each resulting segment again, until the BPM change is smaller than a given value A or the number of iterations exceeds a given range; if the BPM change is smaller than the given value, the mark on that segment is removed; otherwise the point of abrupt BPM change is treated as a new split point, and the originally marked segment is replaced by two new segments;

Step S2.5: comparing the new number of segments with the number of lyrics lines: if the number of segments equals the number of lyrics lines, proceed to step S3; if the number of segments is smaller than the number of lyrics lines, increase the BPM-change threshold and return to step S2.4 to obtain a new segmentation; if the number of segments is larger than the number of lyrics lines, decrease the BPM-change threshold and return to step S2.4 to obtain a new segmentation; once the number of segments equals the number of lyrics lines, proceed to step S3;

Step S2.2 further comprises:

Step S2.2.1: performing phase estimation on each small interval; if the estimated phase differs greatly from the actual value, the interval is identified as an onset;

Step S2.2.2: smoothing the onset information with a filter;

Step S2.2.3: applying moving-average thresholding to the onset function;

Step S2.2.4: computing its autocorrelation function;

Step S2.2.5: predicting the tempo of the next frame by building an HMM;

Step S2.2.6: outputting the beat period;

wherein the BPM is computed as

Φ_i = |φ_i - φ̂_i|,    BPM = (1 / Time) · Σ_{i=1..N} [Φ_i > A]

where Φ_i denotes the difference between the estimated and the actual phase at time point t_i, φ_i is the actual phase, φ̂_i is the estimated phase, N denotes the total number of time points in the segment, Time is the total duration of the segment in minutes, [·] equals 1 when the condition holds and 0 otherwise, and A denotes the phase-difference threshold;

Step S3 further comprises:

Step S3.1: converting the vocals separated in step S1 into spectrograms;

Step S3.2: setting up an image recognition module that uses the Adaboost machine learning algorithm to examine the spectrogram of each line of audio produced in step S2, finding the regions of highest energy density and re-segmenting the line in time according to energy density, thereby obtaining the audio time T_{j,w} of each part and the total number of segments N_j;

Step S3.3: comparing the number of characters in the corresponding line of the given lyrics with the total number of segments from step S3.2; if the character count equals the segment count, the segmentation times are taken as the timestamps of that line of lyrics; otherwise, returning to step S3.2 and recomputing after modifying the classifier weights; meanwhile, recording the number of iterations for the line and comparing it with those of the preceding lines; if the iteration count keeps increasing, taking the preceding line with the smallest iteration count as the reference, returning the lines after it to step S2 for re-segmentation of the audio, while that line and the sentences before it retain their lyrics timestamps;

Step S3.4: outputting the lyrics timestamps of the entire song;

Step S3.1 further comprises:

Step S3.1.1: applying pre-emphasis to the audio, yielding the pre-emphasized audio data;

Step S3.1.2: framing the audio with a frame length of 10 ms;

Step S3.1.3: applying a window to each frame, preferably a Hamming window;

Step S3.1.4: zero-padding each frame so that its length is a power of 2;

Step S3.1.5: applying DFT to the above data and computing its power spectrum;

Step S3.1.6: computing the Mel frequencies:

Mel(M, N) = 10 log10(X(m, n) · X(m, n)^T)

where m is the frame index, n the frame length, M the time, N the frequency, X(m, n) the data obtained by the DFT, the superscript T denotes transposition, and Mel(M, N) is the required energy density at the given time and frequency;

Step S3.1.7: taking the logarithm of the above output;

Step S3.1.8: applying a DCT (discrete cosine transform) to the output of S3.1.7;

Step S3.1.9: normalizing the above results;

Step S3.2 further comprises:

Step S3.2.1: establishing a weak classifier and initializing the weights of the input data;

Step S3.2.2: solving for the optimal weight vector;

Step S3.2.3: computing the error and outputting the confidence;

Step S3.2.4: updating the weights using the above results and normalizing them;

Step S3.2.5: reading in a new batch of data and using the weak classifier just obtained to compute the next weak classifier;

Step S3.2.6: outputting the strong classifier after all weak classifiers have been computed, obtaining the model result.

Technical field

The present invention relates to the field of music information retrieval, and in particular to a method for generating lyrics timestamps from given lyrics and audio.

Background art

In modern pop music, the lyrics, as a medium that conveys the content and message of a song, play an irreplaceable role in helping the audience understand the song; when the lyrics are out of sync with the song, the listening experience suffers severely. Existing lyrics alignment techniques require a person to locate the lyrics within the song by ear, and different people often produce substantially different timestamp divisions for the same song; the manual approach is not only labor-intensive, costly and time-consuming, but its accuracy also depends heavily on individual skill.

Some existing alignment methods place high demands on the original audio and usually give satisfactory line-level segmentation only for clean vocal tracks, whereas audio encountered in practice is mostly two-channel music, so their alignment accuracy is low. In addition, the accuracy of conventional methods varies greatly across musical styles and depends strongly on the training samples, so their robustness is poor and they are difficult to apply to practical problems.

Furthermore, existing manual lyrics alignment methods are mostly used for line-by-line alignment; word-by-word alignment dramatically increases the amount of content to be matched, and there is no unified file format or standard, so lyrics texts with word-level timestamps are rare.

In view of the above problems, no effective solution has yet been proposed.

Summary of the invention

The present invention addresses the current need for manually aligned lyrics timestamps, and in particular the scarcity of lyrics texts with word-level timestamps, by providing a convenient, fast and automatic timestamp generation method that solves the time-alignment problem in synchronized projection of lyrics with a song.

To achieve the above object, the present invention provides a lyrics timestamp generation method based on spectrogram recognition. The method includes the following steps: applying vocal extraction to the input song audio to obtain the separated vocal audio; performing line-based snippet extraction according to the given lyrics file; generating the spectrogram of the corresponding audio; segmenting the spectrogram of each part with an image recognition module, estimating the possible positions of individual characters, matching them against the number of characters in the corresponding lyrics line, and optimizing the recognition result, finally obtaining the lyrics text with timestamps.

The technical solution of the present invention is as follows:

A lyrics timestamp generation method based on spectrogram recognition, comprising the following steps:

Step S1: audio preprocessing, separating the vocals from the accompaniment using the spatial position differences of the input sources, which specifically includes the following steps:

Step S1.1: obtaining the audio of the left and right channels separately, and inverting the left-channel audio; the inversion is

x'_L(i) = -x_L(i)

where x'_L(i) denotes each left-channel sample after inversion and x_L(i) denotes each original left-channel sample, i.e. the amplitude x at time i; in addition, each left-channel sample can be written as

x_L(i) = x_c(i) + x_d(i)

where x_c(i) denotes the sample located in the center channel and x_d(i) denotes the audio in the left channel that deviates from the center channel.

Step S1.2: superimposing the inverted left-channel audio onto the right channel to obtain the audio far from the center channel; the superposition is

x_s(i) = x_R(i) + x'_L(i) = x_R(i) - x_c(i) - x_d(i)

where x_s(i) is each audio sample far from the center channel, x_R(i) is the original data of each right-channel sample, and x_d(i) is the audio in the left channel that deviates from the center channel.

Step S1.3: extracting the center channel from the off-center audio obtained above, giving the center-channel audio x_c(i), where x_c(i) is each sample of the center-channel audio, sign(*) is the sign function, and x_i is the sample obtained by merging the left and right channels of the original audio.
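To make the channel arithmetic of steps S1.1–S1.3 concrete, the following is a minimal Python sketch, assuming a 16-bit stereo WAV file; the function name, the 0.5 mixing factors and the max-based center estimate are illustrative assumptions, since the exact center-extraction formula is not reproduced in the text above.

```python
import numpy as np
from scipy.io import wavfile

def extract_center_channel(path):
    """Rough center-channel (vocal) estimate from a stereo file, per steps S1.1-S1.3."""
    rate, data = wavfile.read(path)                 # data shape: (samples, 2)
    left = data[:, 0].astype(np.float64)
    right = data[:, 1].astype(np.float64)

    inverted_left = -left                           # S1.1: invert the left channel
    off_center = right + inverted_left              # S1.2: audio far from the center channel
    mono = 0.5 * (left + right)                     # merged original audio x_i

    # S1.3 (assumed reading): keep the part of the merged signal not explained by
    # the off-center audio, restoring the original sign with sign(*).
    center = np.sign(mono) * np.maximum(np.abs(mono) - 0.5 * np.abs(off_center), 0.0)
    return rate, center
```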

Step S1.4: filtering the center channel obtained above:

x_f(i) = f(x_c(i))

where x_f(i) is the audio sample obtained after filtering and f(*) denotes the filter function; in this example spectral subtraction is used as the filter function, and its steps are as follows:

Step S1.4.1: preferably, estimating the noise using linear prediction;

Step S1.4.2: applying DFT (discrete Fourier transform) to the original audio and the estimated noise and computing their power spectra;

Step S1.4.3: subtracting the two power spectra and taking the 2-norm of the difference;

Step S1.4.4: performing the phase transformation using the noise phase;

Step S1.4.5: applying IDFT (inverse discrete Fourier transform) to the final result.
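A hedged sketch of the spectral subtraction used as the filter f(*) (steps S1.4.2–S1.4.5), operating on one frame; the noise frame stands in for the linear-prediction noise estimate of step S1.4.1, and the observed phase is reused where the text speaks of a phase transformation, which is the common choice:

```python
import numpy as np

def spectral_subtract_frame(noisy_frame, noise_frame):
    """One frame of spectral subtraction: subtract the noise power spectrum,
    keep the observed phase, and transform back."""
    X = np.fft.rfft(noisy_frame)                            # S1.4.2: DFT of the audio frame
    N = np.fft.rfft(noise_frame)                            # S1.4.2: DFT of the estimated noise
    residual_power = np.maximum(np.abs(X) ** 2 - np.abs(N) ** 2, 0.0)  # S1.4.3: power subtraction
    magnitude = np.sqrt(residual_power)
    phase = np.angle(X)                                     # S1.4.4: phase transformation
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(noisy_frame))  # S1.4.5: IDFT
```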

Step S1.5: performing pitch detection on the original audio and on the audio obtained in step S1.4, where T denotes the detected pitch period, τmin the minimum pitch period, τmax the maximum pitch period, and argmin selects the argument that minimizes the detection function over [τmin, τmax]. Specifically, the steps are as follows:

Step S1.5.1: applying DFT to the input audio data to obtain its complex spectrum;

Step S1.5.2: squaring the modulus of the complex spectrum to obtain the power spectrum;

Step S1.5.3: taking the logarithm of the power spectrum and then applying IDFT, finally obtaining the cepstrum.
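As an illustration of the cepstrum computation in steps S1.5.1–S1.5.3, the sketch below estimates the pitch period by picking the dominant quefrency in [τmin, τmax]; the peak is taken with argmax over the cepstrum, the usual convention, while the patent states its criterion as an argmin over the same range:

```python
import numpy as np

def cepstral_pitch_period(frame, tau_min, tau_max):
    """Cepstrum-based pitch period: DFT -> power spectrum -> log -> IDFT,
    then pick the dominant quefrency in [tau_min, tau_max] (in samples)."""
    spectrum = np.fft.rfft(frame)                           # S1.5.1: complex spectrum
    power = np.abs(spectrum) ** 2                           # S1.5.2: power spectrum
    cepstrum = np.fft.irfft(np.log(power + 1e-12))          # S1.5.3: real cepstrum
    lag = tau_min + int(np.argmax(cepstrum[tau_min:tau_max]))
    return lag                                              # pitch period T in samples
```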

Step S1.6: comparing the two fundamental frequencies, finding where they differ, and returning to step S1.4 to adjust the filter parameters within the corresponding time range according to the differences, finally obtaining the separated vocals.

It should be noted that the same operations can instead be applied to the right channel; the final result is the same.

Step S2: line-based audio segmentation and lyrics alignment, comprising the following steps:

Step S2.1: segmenting the audio using a loudness threshold and comparing the result with the number of lines in the input lyrics, obtaining the candidate line-based segments S_j, j = 1, 2, ....

Step S2.2: performing BPM (beats per minute) detection on each segment S_j, specifically:

Step S2.2.1: performing phase estimation on each small interval; if the estimated phase differs greatly from the actual value, the interval is identified as an onset;

Step S2.2.2: smoothing the onset information with a filter;

Step S2.2.3: applying moving-average thresholding to the onset function;

Step S2.2.4: computing its autocorrelation function;

Step S2.2.5: predicting the tempo of the next frame by building an HMM (hidden Markov model);

Step S2.2.6: outputting the beat period.

The BPM is computed as

Φ_i = |φ_i - φ̂_i|,    BPM = (1 / Time) · Σ_{i=1..N} [Φ_i > A]

where Φ_i denotes the difference between the estimated and the actual phase at time point t_i, φ_i is the actual phase, φ̂_i is the estimated phase, N denotes the total number of time points in the segment, Time is the total duration of the segment in minutes, [·] equals 1 when the condition holds and 0 otherwise, and A denotes the phase-difference threshold, which is set empirically.
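Under the reading of the BPM formula given above (count the time points whose phase deviation Φ_i exceeds the threshold A, then divide by the segment length in minutes), one possible sketch is the following; the phase arrays are assumed to come from the onset-detection steps S2.2.1–S2.2.6:

```python
import numpy as np

def estimate_bpm(actual_phase, estimated_phase, duration_minutes, threshold_a):
    """Beats = time points where |actual - estimated| phase exceeds A; BPM = beats / minutes."""
    phi = np.abs(np.asarray(actual_phase) - np.asarray(estimated_phase))  # Φ_i
    beats = int(np.sum(phi > threshold_a))
    return beats / duration_minutes
```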

Step S2.3: marking the segments whose BPM value changes abruptly, denoted S'_j.

Step S2.4: using bisection, taking the center time point of each marked segment as a new split point and re-splitting it, then detecting the BPM of each resulting segment again, until the BPM change is smaller than a given value A or the number of iterations exceeds a given range. If the BPM change is smaller than the given value, the mark on that segment is removed; otherwise the point of abrupt BPM change is treated as a new split point, and the originally marked segment is replaced by two new segments.

Step S2.5: comparing the new number of segments with the number of lyrics lines; there are three possible cases: if the number of segments equals the number of lyrics lines, proceed to step S3; if the number of segments is smaller than the number of lyrics lines, increase the BPM-change threshold and return to step S2.4 to obtain a new segmentation; if the number of segments is larger than the number of lyrics lines, decrease the BPM-change threshold and return to step S2.4 to obtain a new segmentation; once the number of segments equals the number of lyrics lines, proceed to step S3.
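The bisection pass of step S2.4 can be pictured with the simplified sketch below; `detect_bpm` stands in for the S2.2 detector and each segment is a (start, end) time pair (all names are illustrative). Step S2.5 then compares the number of resulting segments with the number of lyrics lines and adjusts the threshold A before running the pass again.

```python
def bisect_unstable_segments(segments, detect_bpm, threshold_a):
    """One pass of step S2.4: split each segment at its center time point and keep
    the split only where the BPM still changes abruptly."""
    refined = []
    for start, end in segments:
        mid = 0.5 * (start + end)                           # center time point as new split point
        change = abs(detect_bpm(start, mid) - detect_bpm(mid, end))
        if change < threshold_a:
            refined.append((start, end))                    # change below A: cancel the mark
        else:
            refined.extend([(start, mid), (mid, end)])      # abrupt change: two new segments
    return refined
```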

Step S3: word-based lyrics alignment and timestamp generation, specifically comprising the following steps:

Step S3.1: converting the vocals separated in step S1 into spectrograms, specifically:

Step S3.1.1: pre-emphasis processing, yielding the pre-emphasized audio data;

Step S3.1.2: framing the audio with a frame length of 10 ms;

Step S3.1.3: applying a window to each frame, preferably a Hamming window;

Step S3.1.4: zero-padding each frame so that its length is a power of 2;

Step S3.1.5: applying DFT to the above data and computing its power spectrum;

Step S3.1.6: computing the Mel frequencies:

Mel(M, N) = 10 log10(X(m, n) · X(m, n)^T)

where m is the frame index, n the frame length, M the time, N the frequency, X(m, n) the data obtained by the DFT, the superscript T denotes transposition, and Mel(M, N) is the required energy density at the given time and frequency.

Step S3.1.7: taking the logarithm of the above output;

Step S3.1.8: applying a DCT (discrete cosine transform) to the output of S3.1.7;

Step S3.1.9: normalizing the above results.
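A compact sketch of the feature pipeline of step S3.1 (10 ms frames, Hamming window, zero-padding to a power of two, power spectrum, 10·log10 energy, DCT, normalization); the Mel filterbank is omitted and the 0.97 pre-emphasis coefficient is a generic assumption, since the text above does not give its own value:

```python
import numpy as np
from scipy.fftpack import dct

def spectrogram_features(signal, sample_rate, preemph=0.97):
    """Log-power spectrogram features along the lines of step S3.1 (simplified).
    `signal` is a 1-D numpy array of audio samples."""
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])   # S3.1.1: pre-emphasis
    frame_len = int(0.010 * sample_rate)                                    # S3.1.2: 10 ms frames
    n_frames = len(emphasized) // frame_len
    frames = emphasized[: n_frames * frame_len].reshape(n_frames, frame_len)
    frames = frames * np.hamming(frame_len)                                 # S3.1.3: Hamming window
    nfft = 1 << int(np.ceil(np.log2(frame_len)))                            # S3.1.4: pad to power of 2
    power = np.abs(np.fft.rfft(frames, n=nfft)) ** 2                        # S3.1.5: power spectrum
    log_energy = 10.0 * np.log10(power + 1e-12)                             # S3.1.6/7: log energy
    cepstra = dct(log_energy, type=2, axis=1, norm='ortho')                 # S3.1.8: DCT
    return (cepstra - cepstra.mean()) / (cepstra.std() + 1e-12)             # S3.1.9: normalization
```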

Step S3.2: introducing an image recognition module, preferably using the Adaboost machine learning algorithm, to examine the spectrogram of each line of audio produced in step S2, finding the regions of highest energy density and re-segmenting the line in time according to energy density, thereby obtaining the audio time T_{j,w} of each part and the total number of segments N_j. In this method, the Adaboost algorithm mainly includes the following steps:

Step S3.2.1: establishing a weak classifier and initializing the weights of the input data;

Step S3.2.2: solving for the optimal weight vector;

Step S3.2.3: computing the error and outputting the confidence;

Step S3.2.4: updating the weights using the above results and normalizing them;

Step S3.2.5: reading in a new batch of data and using the weak classifier just obtained to compute the next weak classifier;

Step S3.2.6: outputting the strong classifier after all weak classifiers have been computed, obtaining the model result.
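Steps S3.2.1–S3.2.6 describe a standard Adaboost training loop. As an illustration only (not the patent's own implementation), the same weak-to-strong boosting is available in scikit-learn; the random feature matrix below merely stands in for spectrogram feature vectors and their high-energy-density labels from step S3.1:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Placeholder data standing in for per-region spectrogram features and labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 13))
y_train = rng.integers(0, 2, size=200)

# The default weak learner is a decision stump; boosting re-weights the samples
# after each round (S3.2.4) and combines the weak learners into a strong
# classifier (S3.2.6).
model = AdaBoostClassifier(n_estimators=50)
model.fit(X_train, y_train)
region_confidence = model.predict_proba(X_train)   # per-region confidence (cf. S3.2.3)
```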

Step S3.3: comparing the number of characters in the corresponding line of the given lyrics with the total number of segments from step S3.2; if the character count equals the segment count, the segmentation times are taken as the timestamps of that line of lyrics; otherwise, return to step S3.2 and recompute after modifying the classifier weights.

Meanwhile, the number of iterations for the line is recorded and compared with those of the preceding lines; if the iteration count keeps increasing, the preceding line with the smallest iteration count is taken as the reference, the lines after it return to step S2 for re-segmentation of the audio, and that line and the sentences before it retain their lyrics timestamps.

Step S3.4: outputting the lyrics timestamps of the entire song.
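The per-line bookkeeping of step S3.3 (accept the segmentation when the segment count matches the character count, otherwise retry with adjusted classifier weights, and let the caller fall back to step S2 when no match is found) might be organized as in the sketch below; `segment_line_audio` and the retry limit are illustrative stand-ins:

```python
def align_line(lyric_line, segment_line_audio, max_retries=5):
    """Return per-character timestamps for one lyric line, or None if the
    segmentation never matches the character count."""
    for attempt in range(max_retries):
        times, n_segments = segment_line_audio(attempt)   # step S3.2 with adjusted weights
        if n_segments == len(lyric_line):
            return list(zip(lyric_line, times))           # segment times become the timestamps
    return None                                           # caller falls back to step S2
```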

Compared with the prior art, the present invention has the following beneficial effects:

High accuracy: compared with conventional methods that align the lyrics directly against the song, the present invention extracts the center channel based on spatial differences in the audio, separating the vocals from the accompaniment and reducing the interference of irrelevant information.

Strong robustness: aligning lines first and words afterwards effectively prevents filler syllables in the song from disturbing the lyrics alignment, while the parameter feedback in the word-based alignment module prevents error accumulation.

Low cost: since the method only needs the song audio and the lyrics to generate the lyrics timestamps, no manual involvement is required, which significantly reduces the cost of lyrics alignment.

Brief description of the drawings

Fig. 1 is a flow chart of the lyrics timestamp generation method based on spectrogram recognition provided by the present invention;

Fig. 2 is a flow chart of the vocal separation in step 1;

Fig. 3(a) is the waveform of the left channel of the original audio;

Fig. 3(b) is the waveform after the left channel is inverted;

Fig. 3(c) is the waveform of the audio far from the center channel;

Fig. 3(d) is the waveform of the audio obtained after center-channel extraction;

Fig. 3(e) is the waveform of the vocal audio after separation;

Fig. 4 is a flow chart of the line-based audio segmentation in step 2;

Fig. 5 shows the line-based time segmentation of the audio waveform;

Fig. 6 is the audio phase diagram over a small interval of a line;

Fig. 7 is a flow chart of the character-based audio segmentation in step 3;

Fig. 8 is an audio spectrogram with a logarithmic frequency axis;

Fig. 9 shows the character-based spectrogram segmentation;

Fig. 10 shows the average alignment accuracy of different alignment methods for each music style;

The present invention is further explained in the following detailed description with reference to the above drawings.

Specific embodiment

The method provided by the present invention is described further below with reference to the drawings.

Lyrics timestamp generation based on spectrogram image recognition, in the field of music information retrieval, applies machine learning to perform image recognition on spectrograms, turning the traditional audio analysis problem into an image processing problem; compared with previous methods it is more intuitive and more concise, and in most cases gives better results. Pitch detection can be used to optimize the vocal separation based on channel position differences, which is a substantial advantage over a simple remix-and-downmix computation. Judging line breaks by BPM greatly improves operating efficiency while keeping the result controllable. Performing image recognition on the vocal spectrogram with an Adaboost model effectively excludes interference such as noise and accompaniment. Therefore, the present invention provides a lyrics timestamp generation method based on spectrogram recognition.

The present invention provides a lyrics timestamp generation method based on spectrogram recognition. Overall, the invention comprises three main steps. Step S1: audio preprocessing, separating the vocals from the accompaniment; Step S2: performing line-based time segmentation of the preprocessed audio; Step S3: performing character-based time segmentation of each line of audio, obtaining the required lyrics timestamps.

Referring to Fig. 1, which is a flow chart of the method according to an embodiment of the present invention, the method comprises the following steps:

Step S1: audio preprocessing, separating the vocals from the accompaniment using the spatial position differences of the input sources; Fig. 2 shows the flow chart of the vocal separation in the lyrics timestamp generation method based on spectrogram recognition provided by the present invention; this step specifically includes the following steps:

Step S1.1: obtaining the audio of the left and right channels separately, and inverting the left-channel audio. As shown in Fig. 3(a) and Fig. 3(b), which show the waveform of the original left channel and the waveform after the left-channel inversion respectively, the inversion flips the waveform about the zero-amplitude axis. The inversion is

x'_L(i) = -x_L(i)

where x'_L(i) denotes each left-channel sample after inversion and x_L(i) denotes each original left-channel sample, i.e. the amplitude x at time i; in addition, each left-channel sample can be written as

x_L(i) = x_c(i) + x_d(i)

where x_c(i) denotes the sample located in the center channel and x_d(i) denotes the audio in the left channel that deviates from the center channel.

Step S1.2: superimposing the inverted left-channel audio onto the right channel to obtain the audio far from the center channel. As shown in Fig. 3(c), after step S1.2 the mean amplitude of the waveform is larger than that of the original single channel, indicating that the difference between the left- and right-channel information of this song may be large. The superposition is

x_s(i) = x_R(i) + x'_L(i) = x_R(i) - x_c(i) - x_d(i)

where x_s(i) is each audio sample far from the center channel, x_R(i) is the original data of each right-channel sample, and x_d(i) is the audio in the left channel that deviates from the center channel.

Step S1.3: extracting the center channel from the off-center audio obtained above, giving the center-channel audio, where x_c(i) is each sample of the center-channel audio, sign(*) is the sign function, and x_i is the sample obtained by merging the left and right channels of the original audio. The final result is shown in Fig. 3(d); the figure contains clear waveform change points that are likely segmentation points, but accompaniment that clearly does not belong to the vocals can still be seen at its beginning, so this part is handled in the next step.

Step S1.4: filtering the center channel obtained above:

x_f(i) = f(x_c(i))

where x_f(i) is the audio sample obtained after filtering and f(*) denotes the filter function; in this example spectral subtraction is used as the filter function, and its steps are as follows:

Step S1.4.1: preferably, estimating the noise using linear prediction;

Step S1.4.2: applying DFT (discrete Fourier transform) to the original audio and the estimated noise and computing their power spectra;

Step S1.4.3: subtracting the two power spectra and taking the 2-norm of the difference;

Step S1.4.4: performing the phase transformation using the noise phase;

Step S1.4.5: applying IDFT (inverse discrete Fourier transform) to the final result.

Step S1.5: performing pitch detection on the original audio and on the audio obtained in step S1.4, where T denotes the detected pitch period, τmin the minimum pitch period, τmax the maximum pitch period, and argmin selects the argument that minimizes the detection function over [τmin, τmax]. Specifically, the steps are as follows:

Step S1.5.1: applying DFT to the input audio data to obtain its complex spectrum;

Step S1.5.2: squaring the modulus of the complex spectrum to obtain the power spectrum;

Step S1.5.3: taking the logarithm of the power spectrum and then applying IDFT, finally obtaining the cepstrum.

Step S1.6: comparing the two fundamental frequencies, finding where they differ, and returning to step S1.4 to adjust the filter parameters within the corresponding time range according to the differences, finally obtaining the separated vocals. The waveform of the final output vocal audio is shown in Fig. 3(e); it can be seen that the non-vocal sections in the time domain have been largely removed, the processed audio of the mixed vocal-and-accompaniment sections is much cleaner, and the result can be used for the lyrics alignment operation of the next step.

It should be noted that the same operations can instead be applied to the right channel; the final result is the same.

Step S2: line-based audio segmentation and lyrics alignment; referring to the flow chart in Fig. 4, this step comprises the following steps:

Step S2.1: segmenting the audio using a loudness threshold and comparing the result with the number of lines in the input lyrics, obtaining the candidate line-based segments S_j, j = 1, 2, ....

Step S2.2: performing BPM (Beats Per Minute) detection on each segment S_j, specifically:

Step S2.2.1: performing phase estimation on each small interval; if the estimated phase differs greatly from the actual value, the interval is identified as an onset. Fig. 5 shows the two-channel audio phase diagram over one small interval; the bright parts of the figure indicate the phase positions of the high-energy audio. This step predicts the possible phase positions of the bright (high-energy) audio at future times and compares them with the actual values;

Step S2.2.2: smoothing the onset information with a filter;

Step S2.2.3: applying moving-average thresholding to the onset function;

Step S2.2.4: computing its autocorrelation function;

Step S2.2.5: predicting the tempo of the next frame by building an HMM (hidden Markov model);

Step S2.2.6: outputting the beat period.

The BPM is computed as

Φ_i = |φ_i - φ̂_i|,    BPM = (1 / Time) · Σ_{i=1..N} [Φ_i > A]

where Φ_i denotes the difference between the estimated and the actual phase at time point t_i, φ_i is the actual phase, φ̂_i is the estimated phase, N denotes the total number of time points in the segment, Time is the total duration of the segment in minutes, [·] equals 1 when the condition holds and 0 otherwise, and A denotes the phase-difference threshold, which is set empirically.

Step S2.3: marking the segments whose BPM value changes abruptly, denoted S'_j.

Step S2.4: using bisection, taking the center time point of each marked segment as a new split point and re-splitting it, then detecting the BPM of each resulting segment again, until the BPM change is smaller than a given value A or the number of iterations exceeds a given range. If the BPM change is smaller than the given value, the mark on that segment is removed; otherwise the point of abrupt BPM change is treated as a new split point, and the originally marked segment is replaced by two new segments.

Step S2.5: comparing the new number of segments with the number of lyrics lines; there are three possible cases: if the number of segments equals the number of lyrics lines, proceed to step S3; if the number of segments is smaller than the number of lyrics lines, increase the BPM-change threshold and return to step S2.4 to obtain a new segmentation; if the number of segments is larger than the number of lyrics lines, decrease the BPM-change threshold and return to step S2.4 to obtain a new segmentation; once the number of segments equals the number of lyrics lines, proceed to step S3. The result of segmenting with text lines as the unit by this method is shown in Fig. 6, and at this point the usual line-aligned lyrics timestamps can already be output.

Step S3: word-based lyrics alignment and timestamp generation; Fig. 7 shows the corresponding flow chart; this step specifically comprises the following steps:

Step S3.1: converting the vocals separated in step S1 into spectrograms, specifically:

Step S3.1.1: pre-emphasis processing, yielding the pre-emphasized audio data;

Step S3.1.2: framing the audio with a frame length of 10 ms;

Step S3.1.3: applying a window to each frame, preferably a Hamming window;

Step S3.1.4: zero-padding each frame so that its length is a power of 2;

Step S3.1.5: applying DFT to the above data and computing its power spectrum;

Step S3.1.6: computing the Mel frequencies:

Mel(M, N) = 10 log10(X(m, n) · X(m, n)^T)

where m is the frame index, n the frame length, M the time, N the frequency, X(m, n) the data obtained by the DFT, the superscript T denotes transposition, and Mel(M, N) is the required energy density at the given time and frequency.

Step S3.1.7: taking the logarithm of the above output;

Step S3.1.8: applying a DCT (discrete cosine transform) to the output of S3.1.7;

Step S3.1.9: normalizing the above results.

Fig. 8 shows the spectrogram produced by the above steps; notably, the vertical axis uses a logarithmic scale, so the distribution of the high-energy audio can be observed more clearly, which benefits the image recognition in the next step.

Step S3.2: introducing an image recognition module, preferably using the Adaboost machine learning algorithm, to examine the spectrogram of each line of audio produced in step S2, finding the regions of highest energy density and re-segmenting the line in time according to energy density, thereby obtaining the audio time T_{j,w} of each part and the total number of segments N_j. In this method, the Adaboost algorithm mainly includes the following steps:

Step S3.2.1: establishing a weak classifier and initializing the weights of the input data;

Step S3.2.2: solving for the optimal weight vector;

Step S3.2.3: computing the error and outputting the confidence;

Step S3.2.4: updating the weights using the above results and normalizing them;

Step S3.2.5: reading in a new batch of data and using the weak classifier just obtained to compute the next weak classifier;

Step S3.2.6: outputting the strong classifier after all weak classifiers have been computed, obtaining the model result.

Step S3.3: comparing the number of characters in the corresponding line of the given lyrics with the total number of segments from step S3.2; if the character count equals the segment count, the segmentation times are taken as the timestamps of that line of lyrics; otherwise, return to step S3.2 and recompute after modifying the classifier weights.

Meanwhile, the number of iterations for the line is recorded and compared with those of the preceding lines; if the iteration count keeps increasing, the preceding line with the smallest iteration count is taken as the reference, the lines after it return to step S2 for re-segmentation of the audio, and that line and the sentences before it retain their lyrics timestamps.

Step S3.4: outputting the lyrics timestamps of the entire song. Fig. 9 shows one of the segments; it can be seen that, under this method, the sounding time of each character in the given lyrics text can be found automatically, from which the timestamps are generated.

In order to verify the technical effects of the present invention, the lyrics timestamp generation method based on spectrogram recognition is compared below with other traditional lyrics alignment methods:

The test songs used in the experiment are WAV files with a sample rate of 44.1 kHz and a bit depth of 16 (without dithering). Taking manually aligned timestamps as the reference, the method provided by the present invention is compared with the LyricAlly system and with an alignment method based on GMM-HMM (Gaussian mixture model - hidden Markov model). It should be noted that, because manual alignment always contains errors, this experiment regards a timestamp with an error of less than 0.1 s as a valid timestamp (assuming a song has an average BPM of 240, one beat lasts about 0.25 s, and the BPM of most popular songs is below 200). The LyricAlly system, an early alignment scheme in this field, splits a popular song into parts such as intro, chorus and climax, performs vocal detection on the candidate parts, and then estimates sentence durations from phoneme durations and aligns them with the lyrics text; the GMM-HMM alignment method converts the song audio into MFCCs (Mel-frequency cepstral coefficients) and then performs alignment with a GMM-HMM model, and is currently the most mainstream alignment method.

Fig. 10 shows the average alignment accuracy of the different alignment methods for each music style. Because the LyricAlly system was developed for English songs and is difficult to port, this experiment selected 25 English songs of different music styles, 5 songs of each style. It can be seen that the LyricAlly system, as an early scheme in this field, is only of research interest and cannot be applied in actual operation; the GMM-HMM alignment achieves a relatively good alignment accuracy and, with manual cooperation, can greatly reduce the workload of personnel; the lyrics timestamp generation method based on spectrogram recognition proposed by the present invention outperforms both schemes overall, with high alignment accuracy, and adapts to music of different styles.

Table 1 below gives the detailed data of the above comparative experiment. It can be seen that, compared with the GMM-HMM model, the method provided by the present invention achieves higher alignment accuracy on 21 of the 25 songs of different styles (84%), with an average alignment accuracy of 81.91% and a lowest alignment accuracy of 70.52%, performing well compared with conventional methods.

Table 1: Alignment accuracy (%) of the present invention and other conventional methods

The above description of the embodiments is only intended to help understand the method of the present invention and its core idea. It should be pointed out that, for those skilled in the art, several improvements and modifications can be made to the present invention without departing from its principle, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
