Method and device for determining timbre similarity, and computer storage medium

Document No.: 1568592. Publication date: 2020-01-24.

Reading note: This technique, "Method and device for determining timbre similarity, and computer storage medium" (音色相似度的确定方法、装置及计算机存储介质), was created by Lao Zhenfeng (劳振锋) and Xiao Chunzhi (肖纯智) on 2019-10-15. Abstract: The disclosure provides a method and a device for determining timbre similarity, and a computer storage medium, and relates to the technical field of data processing. The method determines the timbre similarity of a first audio and a second audio based on k first mean feature vectors and k second mean feature vectors. Because each first mean feature vector is determined from the average of a plurality of the m first feature vectors, and each second mean feature vector is determined from the average of a plurality of the m second feature vectors, the timbre similarity determined by this method is more accurate than in the prior art, where the timbre similarity is determined directly from the first feature vectors and the second feature vectors.

1. A method for determining timbre similarity, the method comprising:

acquiring m first feature vectors of a first audio and m second feature vectors of a second audio, wherein the first audio and the second audio are different audios, and m is an integer greater than 1;

determining k first mean feature vectors from the m first feature vectors, each first mean feature vector being determined based on an average of a plurality of the m first feature vectors, k being a positive integer;

determining k second mean feature vectors corresponding to the k first mean feature vectors in a one-to-one manner according to the m second feature vectors, wherein each second mean feature vector is determined based on the average value of a plurality of second feature vectors in the m second feature vectors;

determining the timbre similarity of the first audio and the second audio based on the k first mean feature vectors and the k second mean feature vectors.

2. The method of claim 1, wherein determining k first mean feature vectors from the m first feature vectors comprises:

dividing the m first feature vectors into k different first vector groups, wherein each first vector group comprises n consecutive first feature vectors, and n is an integer greater than 1 and smaller than m;

for each first vector group, determining the average value of the n consecutive first feature vectors included in the first vector group as one first mean feature vector, to obtain k first mean feature vectors;

the determining k second mean feature vectors corresponding to the k first mean feature vectors one to one according to the m second feature vectors comprises:

dividing the m second feature vectors into k different second vector groups, each second vector group comprising n consecutive second feature vectors;

for each second vector group, determining the average value of the n consecutive second feature vectors included in the second vector group as one second mean feature vector, to obtain k second mean feature vectors.

3. The method according to claim 2, wherein the intersection of any two of the first vector groups is empty, and the first feature vectors included in two adjacent first vector groups are consecutive;

the intersection of any two second vector groups is empty, and the second feature vectors included in two adjacent second vector groups are consecutive.

4. The method according to any one of claims 1 to 3, wherein k is an integer greater than 1, and wherein determining the timbre similarity of the first audio and the second audio based on the k first mean feature vectors and the k second mean feature vectors comprises:

processing each first mean feature vector and its corresponding second mean feature vector by using a Pearson algorithm to determine one timbre distance, thereby obtaining k timbre distances;

and determining the average value of the k timbre distances as the timbre similarity of the first audio and the second audio.

5. The method according to any one of claims 1 to 3, wherein the acquiring m first feature vectors of the first audio and m second feature vectors of the second audio comprises:

acquiring a plurality of first initial feature vectors of the first audio and a plurality of second initial feature vectors of the second audio;

and aligning the plurality of first initial feature vectors and the plurality of second initial feature vectors to obtain m first feature vectors and m second feature vectors.

6. The method of claim 5, wherein said aligning the plurality of first initial feature vectors and the plurality of second initial feature vectors comprises:

aligning the plurality of first initial feature vectors and the plurality of second initial feature vectors by using a dynamic time warping (DTW) algorithm.

7. The method of claim 5, wherein obtaining a plurality of first initial feature vectors of the first audio and a plurality of second initial feature vectors of the second audio comprises:

extracting a plurality of first mel-frequency cepstral parameters from the first audio as the plurality of first initial feature vectors;

and extracting a plurality of second mel-frequency cepstral parameters from the second audio as the plurality of second initial feature vectors.

8. A determination device for timbre similarity, the device comprising:

an acquisition module, configured to acquire m first feature vectors of a first audio and m second feature vectors of a second audio, wherein the first audio and the second audio are different audios, and m is an integer greater than 1;

a first determining module, configured to determine k first mean feature vectors according to the m first feature vectors, where each first mean feature vector is determined based on an average of a plurality of the m first feature vectors, and k is a positive integer;

a second determining module, configured to determine, according to the m second feature vectors, k second mean feature vectors that are in one-to-one correspondence with the k first mean feature vectors, where each second mean feature vector is determined based on an average value of a plurality of second feature vectors in the m second feature vectors;

a third determining module, configured to determine timbre similarities of the first audio and the second audio based on the k first mean feature vectors and the k second mean feature vectors.

9. A determination device for timbre similarity, the device comprising: a memory and a processor; the memory stores a computer program which, when executed by the processor, causes the processor to execute the method of determining timbre similarity as claimed in any one of claims 1 to 7.

10. A computer storage medium having stored therein instructions that, when run on a computer, cause the computer to execute the method for determining timbre similarity according to any one of claims 1 to 7.

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for determining timbre similarity, and a computer storage medium.

Background

After a user records audio with an audio client installed on a terminal such as a mobile phone, the audio client can determine the timbre similarity between the user's audio and other audio, so that the user can find other audio whose timbre is similar to the user's own.

Disclosure of Invention

The present disclosure provides a method and an apparatus for determining timbre similarity, and a computer storage medium, which can solve the problem in the related art that timbre similarity is determined with low accuracy. The technical solutions are as follows:

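In one aspect, a method for determining timbre similarity is provided, where the method includes:

acquiring m first feature vectors of a first audio and m second feature vectors of a second audio, where the first audio and the second audio are different audios, and m is an integer greater than 1;

determining k first mean feature vectors according to the m first feature vectors, where each first mean feature vector is determined based on the average of a plurality of the m first feature vectors, and k is a positive integer;

determining, according to the m second feature vectors, k second mean feature vectors corresponding to the k first mean feature vectors one to one, where each second mean feature vector is determined based on the average of a plurality of the m second feature vectors;

determining the timbre similarity of the first audio and the second audio based on the k first mean feature vectors and the k second mean feature vectors.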

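Optionally, the determining k first mean feature vectors according to the m first feature vectors includes:

dividing the m first feature vectors into k different first vector groups, where each first vector group includes n consecutive first feature vectors, and n is an integer greater than 1 and smaller than m;

for each first vector group, determining the average value of the n consecutive first feature vectors included in the first vector group as one first mean feature vector, to obtain k first mean feature vectors;

the determining k second mean feature vectors corresponding to the k first mean feature vectors one to one according to the m second feature vectors includes:

dividing the m second feature vectors into k different second vector groups, each second vector group including n consecutive second feature vectors;

for each second vector group, determining the average value of the n consecutive second feature vectors included in the second vector group as one second mean feature vector, to obtain k second mean feature vectors.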

Optionally, an intersection of any two of the first vector groups is empty, and the first feature vectors included in two adjacent first vector groups are consecutive;

the intersection of any two second vector groups is empty, and the second feature vectors included in two adjacent second vector groups are consecutive.

Optionally, the determining the timbre similarity of the first audio and the second audio based on the k first mean feature vectors and the k second mean feature vectors includes:

processing each first mean feature vector and its corresponding second mean feature vector by using a Pearson algorithm to determine one timbre distance, thereby obtaining k timbre distances;

and determining the average value of the k timbre distances as the timbre similarity of the first audio and the second audio.

Optionally, the obtaining m first feature vectors of the first audio and m second feature vectors of the second audio includes:

acquiring a plurality of first initial feature vectors of the first audio and a plurality of second initial feature vectors of the second audio;

and aligning the plurality of first initial feature vectors and the plurality of second initial feature vectors to obtain m first feature vectors and m second feature vectors.

Optionally, the aligning the plurality of first initial feature vectors and the plurality of second initial feature vectors includes:

aligning the plurality of first initial feature vectors and the plurality of second initial feature vectors by using a dynamic time warping (DTW) algorithm.

Optionally, the obtaining a plurality of first initial feature vectors of the first audio and a plurality of second initial feature vectors of the second audio includes:

extracting a plurality of first mel-frequency cepstral parameters from the first audio as the plurality of first initial feature vectors;

and extracting a plurality of second mel-frequency cepstral parameters from the second audio as the plurality of second initial feature vectors.

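In another aspect, an apparatus for determining timbre similarity is provided, the apparatus including:

an acquisition module, configured to acquire m first feature vectors of a first audio and m second feature vectors of a second audio, where the first audio and the second audio are different audios, and m is an integer greater than 1;

a first determining module, configured to determine k first mean feature vectors according to the m first feature vectors, where each first mean feature vector is determined based on the average of a plurality of the m first feature vectors, and k is a positive integer;

a second determining module, configured to determine, according to the m second feature vectors, k second mean feature vectors corresponding to the k first mean feature vectors one to one, where each second mean feature vector is determined based on the average of a plurality of the m second feature vectors;

a third determining module, configured to determine the timbre similarity of the first audio and the second audio based on the k first mean feature vectors and the k second mean feature vectors.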

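Optionally, the first determining module is configured to:

divide the m first feature vectors into k different first vector groups, where each first vector group includes n consecutive first feature vectors, and n is an integer greater than 1 and smaller than m;

for each first vector group, determine the average value of the n consecutive first feature vectors included in the first vector group as one first mean feature vector, to obtain k first mean feature vectors;

the second determining module is configured to:

divide the m second feature vectors into k different second vector groups, each second vector group including n consecutive second feature vectors;

for each second vector group, determine the average value of the n consecutive second feature vectors included in the second vector group as one second mean feature vector, to obtain k second mean feature vectors.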

Optionally, an intersection of any two of the first vector groups is empty, and the first feature vectors included in two adjacent first vector groups are consecutive;

the intersection of any two second vector groups is empty, and the second feature vectors included in two adjacent second vector groups are consecutive.

Optionally, k is an integer greater than 1, and the third determining module is configured to:

process each first mean feature vector and its corresponding second mean feature vector by using a Pearson algorithm to determine one timbre distance, thereby obtaining k timbre distances;

and determine the average value of the k timbre distances as the timbre similarity of the first audio and the second audio.

Optionally, the obtaining module includes:

an obtaining submodule, configured to obtain a plurality of first initial feature vectors of the first audio and a plurality of second initial feature vectors of the second audio;

and the alignment submodule is used for performing alignment processing on the plurality of first initial characteristic vectors and the plurality of second initial characteristic vectors to obtain m first characteristic vectors and m second characteristic vectors.

Optionally, the alignment sub-module is configured to:

align the plurality of first initial feature vectors and the plurality of second initial feature vectors by using a dynamic time warping (DTW) algorithm.

Optionally, the obtaining sub-module is configured to:

extract a plurality of first mel-frequency cepstral parameters from the first audio as the plurality of first initial feature vectors; and extract a plurality of second mel-frequency cepstral parameters from the second audio as the plurality of second initial feature vectors.

In another aspect, there is provided a device for determining timbre similarity, the device including: a memory and a processor; the memory stores a computer program that, when executed by the processor, causes the processor to execute the method of determining timbre similarity as described in the above aspect.

In still another aspect, a computer storage medium is provided, which stores instructions that, when executed on a computer, cause the computer to perform the method for determining timbre similarity as described in the above aspect.

In a further aspect, a computer program product comprising instructions is provided, which when run on the computer causes the computer to perform the method for determining timbre similarities of the above aspects.

The technical solutions provided by the present disclosure bring at least the following beneficial effects:

The present disclosure provides a method, an apparatus, and a computer storage medium for determining timbre similarity, which determine the timbre similarity of a first audio and a second audio based on k first mean feature vectors and k second mean feature vectors. Because each first mean feature vector is determined from the average of a plurality of the m first feature vectors, and each second mean feature vector is determined from the average of a plurality of the m second feature vectors, the timbre similarity determined by this method is more accurate than in the prior art, where the timbre similarity is determined directly from the first feature vectors and the second feature vectors.

Drawings

To illustrate the technical solutions in the embodiments of the present disclosure more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a terminal provided in an embodiment of the present disclosure;

Fig. 2 is a flowchart of a method for determining timbre similarity provided by an embodiment of the present disclosure;

Fig. 3 is a flowchart of another method for determining timbre similarity provided by an embodiment of the present disclosure;

Fig. 4 is a flowchart of a method for determining timbre similarity based on k first mean feature vectors and k second mean feature vectors according to an embodiment of the present disclosure;

Fig. 5 is a schematic structural diagram of an apparatus for determining timbre similarity according to an embodiment of the present disclosure;

Fig. 6 is a schematic structural diagram of an acquisition module provided in an embodiment of the present disclosure;

Fig. 7 is a schematic structural diagram of another apparatus for determining timbre similarity according to an embodiment of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.

The method for determining timbre similarity provided by the embodiments of the present disclosure can be applied to a terminal. Fig. 1 is a schematic diagram of a terminal according to an embodiment of the present disclosure. As shown in Fig. 1, the terminal 100 may have an audio client 10a installed on it. The audio client 10a may obtain the first audio and the second audio and determine the timbre similarity of the first audio and the second audio.

The terminal 100 may be a smartphone, a tablet computer, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, or the like. The audio client 10a may be a client capable of recording and playing audio, for example a karaoke client.

The embodiment of the disclosure provides a method for determining timbre similarity, which can be applied to an audio client in the implementation environment shown in fig. 1. Referring to fig. 2, the method may include:

Step 101, obtaining m first feature vectors of a first audio and m second feature vectors of a second audio.

Here, m is an integer greater than 1. The first audio and the second audio are different audio.

Step 102, determining k first mean feature vectors according to the m first feature vectors.

Here, k is a positive integer, and each first mean feature vector is determined based on an average of a plurality of the m first feature vectors.

Step 103, determining, according to the m second feature vectors, k second mean feature vectors corresponding to the k first mean feature vectors one to one.

Each second mean feature vector is determined based on an average of a plurality of the m second feature vectors.

Step 104, determining the timbre similarity of the first audio and the second audio based on the k first mean feature vectors and the k second mean feature vectors.

In the embodiments of the present disclosure, the Pearson algorithm may be used to process the k first mean feature vectors and the k second mean feature vectors that correspond to them one to one, so as to determine the timbre similarity of the first audio and the second audio. Alternatively, the cosine distances between the k first mean feature vectors and the corresponding k second mean feature vectors may be computed to determine the timbre similarity of the first audio and the second audio.
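As a minimal sketch of step 104, the following plain-Python code scores each pair of corresponding mean feature vectors and averages the k scores. It assumes that the "Pearson algorithm" refers to the Pearson correlation coefficient; the function names and the sample vectors are hypothetical and only illustrate the idea.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den if den else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def timbre_similarity(first_means, second_means, score=pearson):
    """Average the per-pair scores of the k corresponding mean vectors."""
    if not first_means or len(first_means) != len(second_means):
        raise ValueError("both inputs must contain the same k >= 1 vectors")
    scores = [score(u, v) for u, v in zip(first_means, second_means)]
    return sum(scores) / len(scores)


# Hypothetical mean feature vectors (k = 2, three components each).
first_means = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0]]
second_means = [[2.0, 4.0, 6.0], [1.0, 0.0, 2.0]]
similarity = timbre_similarity(first_means, second_means)
```

Passing `score=cosine` switches the sketch to the cosine-based alternative mentioned above.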

In summary, the embodiments of the present disclosure provide a method for determining timbre similarity, which determines the timbre similarity of a first audio and a second audio based on k first mean feature vectors and k second mean feature vectors. Because each first mean feature vector is determined from the average of a plurality of the m first feature vectors, and each second mean feature vector is determined from the average of a plurality of the m second feature vectors, the timbre similarity determined by this method is more accurate than in the prior art, where the timbre similarity is determined directly from the first feature vectors and the second feature vectors.

Fig. 3 is a flowchart of another method for determining timbre similarity provided by the embodiment of the present disclosure, which may be applied to an audio client in the implementation environment shown in fig. 1. Referring to fig. 3, the method may include:

step 201, a plurality of first initial feature vectors of a first audio and a plurality of second initial feature vectors of a second audio are obtained.

Here, the first audio and the second audio are different audio. For example, the first audio and the second audio may be audio recorded by different users. The content of the first audio and the second audio may be the same. For example, the first audio and the second audio may be recordings of the same content made by different users and obtained by the audio client. The content may include a musical score, lyrics, sentences, and the like. The number of first initial feature vectors of the first audio and the number of second initial feature vectors of the second audio may be the same or different.

In an optional implementation, the audio client may extract mel-cepstrum parameters from the first audio and the second audio, using a plurality of first mel-cepstrum parameters extracted from the first audio as the plurality of first initial feature vectors and a plurality of second mel-cepstrum parameters extracted from the second audio as the plurality of second initial feature vectors.

For example, the audio client may use an audio processing tool (e.g., Librosa) to first obtain a first mel spectrum from the first audio and a second mel spectrum from the second audio, and then take the logarithm of each mel spectrum. A discrete cosine transform may then be applied to the log-compressed first and second mel spectra to obtain a plurality of first mel-cepstrum parameters and a plurality of second mel-cepstrum parameters. Finally, the plurality of first mel-cepstrum parameters may be used as the plurality of first initial feature vectors, and the plurality of second mel-cepstrum parameters may be used as the plurality of second initial feature vectors. Alternatively, the audio client may directly use the Speech Signal Processing Toolkit (SPTK) to perform feature extraction on the first audio and the second audio, so as to obtain the plurality of first mel-cepstrum parameters and the plurality of second mel-cepstrum parameters.
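The logarithm and discrete cosine transform steps can be sketched as follows. This is an illustration under stated assumptions rather than the tool-based extraction described above: the mel spectrum frames are assumed to be already available as lists of band energies, the DCT is an unnormalised DCT-II written in plain Python, and the names `dct_ii`, `mel_cepstrum`, and `mel_frames` are hypothetical.

```python
import math


def dct_ii(values):
    """Unnormalised type-II discrete cosine transform of a sequence."""
    n = len(values)
    return [
        sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, x in enumerate(values))
        for k in range(n)
    ]


def mel_cepstrum(mel_frame, num_coeffs, floor=1e-10):
    """Log-compress one mel spectrum frame, apply the DCT, keep num_coeffs values."""
    # The floor avoids log(0) for empty bands; its value is an assumption.
    log_mel = [math.log(max(energy, floor)) for energy in mel_frame]
    return dct_ii(log_mel)[:num_coeffs]


# Hypothetical mel spectrum: one list of band energies per audio frame.
mel_frames = [[1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 4.0, 8.0]]
initial_vectors = [mel_cepstrum(frame, num_coeffs=3) for frame in mel_frames]
```

Each entry of `initial_vectors` then plays the role of one initial feature vector with three components.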

In another alternative implementation, the audio client may perform feature extraction of Mel-Frequency Cepstral Coefficients (MFCCs) on the first audio and the second audio to extract a plurality of first Mel-Frequency Cepstral parameters from the first audio as a plurality of first initial feature vectors and a plurality of second Mel-Frequency Cepstral parameters from the second audio as a plurality of second initial feature vectors.

In yet another alternative implementation, the audio client may perform feature extraction of weighted Mel-Frequency Cepstral Coefficients (WMFCC) on the first audio and the second audio to extract a plurality of first weighted Mel-Frequency Cepstral parameters from the first audio as a plurality of first initial feature vectors and a plurality of second weighted Mel-Frequency Cepstral parameters from the second audio as a plurality of second initial feature vectors.

In yet another optional implementation manner, the audio client may directly obtain frequency domain energy spectrums of the first audio and the second audio, extract a plurality of first energy values from the frequency domain energy spectrum of the first audio as a plurality of first initial feature vectors, and extract a plurality of second energy values from the frequency domain energy spectrum of the second audio as a plurality of second initial feature vectors.

For example, the audio client may perform a Fourier transform on the first audio and the second audio to obtain a frequency domain energy spectrum of the first audio and a frequency domain energy spectrum of the second audio.
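A minimal sketch of this energy-spectrum variant is shown below. It assumes the audio is first cut into fixed-length frames and that each frame's squared DFT magnitudes form one initial feature vector; the framing parameters, the function names, and the test signal are hypothetical, and a direct DFT is used only to keep the example dependency-free.

```python
import cmath


def frame_signal(samples, frame_len, hop):
    """Cut a sample sequence into frames of frame_len samples, hop samples apart."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def energy_spectrum(frame):
    """Squared DFT magnitude for the non-negative frequency bins of one frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# Hypothetical signal: a cosine with a period of 4 samples, 16 samples long.
samples = [1.0, 0.0, -1.0, 0.0] * 4
frames = frame_signal(samples, frame_len=8, hop=4)
initial_vectors = [energy_spectrum(frame) for frame in frames]
```

With these values every frame contains two full periods, so all of its energy falls into frequency bin 2.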

It should be noted that the dimensions of each of the first initial feature vectors and each of the second initial feature vectors may be equal, and each of the first initial feature vectors and each of the second initial feature vectors may be a multi-dimensional feature vector, that is, each of the initial feature vectors may include a plurality of components.

Step 202, performing alignment processing on the plurality of first initial feature vectors and the plurality of second initial feature vectors to obtain m first feature vectors and m second feature vectors.

Wherein m is an integer greater than 1.

In an optional implementation manner, the audio client may perform alignment processing on the plurality of first initial feature vectors and the plurality of second initial feature vectors by using a Dynamic Time Warping (DTW) algorithm to obtain m first feature vectors and m second feature vectors.

For example, the audio client may first build a matrix grid from the plurality of first initial feature vectors and the plurality of second initial feature vectors, where the number of rows of the grid may equal the number of first initial feature vectors and the number of columns may equal the number of second initial feature vectors. The audio client then calculates the value corresponding to each grid point. It takes a point outside the matrix grid as the starting point and sets the value of that point to 0 (this point can be regarded as the grid point at row 0 and column 1, at row 1 and column 0, or at row 0 and column 0), and takes any grid point in the last row (or last column) of the matrix grid as the end point. Connecting grid points in sequence from the starting point to the end point, in row (or column) order, yields a plurality of paths through the grid points of the matrix grid. The cumulative distance of each path can then be calculated, giving a plurality of cumulative distances. Finally, the path with the smallest cumulative distance is selected, and the first initial feature vectors and the second initial feature vectors are adjusted according to that path, thereby aligning the plurality of first initial feature vectors with the plurality of second initial feature vectors.

If the audio client obtains P first initial feature vectors and Q second initial feature vectors, the matrix grid may have P rows and Q columns, and the value corresponding to the grid point in row p and column q is the distance between the p-th first initial feature vector and the q-th second initial feature vector, where the distance may be a cosine distance or a Euclidean distance. Here, p is a positive integer not greater than P, and q is a positive integer not greater than Q.
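The grid-based alignment can be sketched with a textbook dynamic time warping routine. This is a simplified illustration, not the exact procedure of the embodiment: it assumes Euclidean distance, uses the point at row 0 and column 0 as the starting point, and fixes the end point at the last row and last column. The function names and sample sequences are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_align(first, second, dist=euclidean):
    """Return both sequences re-sampled along the minimum-cost warping path."""
    p, q = len(first), len(second)
    inf = float("inf")
    # cost[i][j] is the smallest cumulative distance of a path ending at grid
    # point (i, j). Row 0 and column 0 lie outside the P x Q grid, and
    # cost[0][0] = 0 is the starting point.
    cost = [[inf] * (q + 1) for _ in range(p + 1)]
    cost[0][0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            d = dist(first[i - 1], second[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from (P, Q) to recover the path with the smallest cumulative distance.
    path = []
    i, j = p, q
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return [first[a] for a, _ in path], [second[b] for _, b in path]


# Hypothetical one-dimensional initial feature vectors (P = 3, Q = 4).
first_initial = [[0.0], [1.0], [2.0]]
second_initial = [[0.0], [0.0], [1.0], [2.0]]
first_aligned, second_aligned = dtw_align(first_initial, second_initial)
m = len(first_aligned)
```

Both returned sequences have the same length m, which is what the later grouping steps rely on.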

In another alternative implementation manner, the audio client may perform alignment processing on the plurality of first initial feature vectors and the plurality of second initial feature vectors by using a speech recognition manner to obtain m first feature vectors and m second feature vectors.

For example, the audio client may decode the obtained plurality of first initial feature vectors and the plurality of second initial feature vectors, and during the decoding process, a phoneme probability of each initial feature vector may be obtained by using information such as a pronunciation dictionary, an acoustic model, a language model, and the like. Then, for each initial feature vector, the phoneme corresponding to the initial feature vector can be determined according to the phoneme probability of the initial feature vector. Furthermore, a plurality of initial feature vectors corresponding to the same phoneme can be aligned, so that the alignment of the plurality of first initial feature vectors and the plurality of second initial feature vectors can be realized.

When aligning the feature vectors corresponding to the same phoneme, the plurality of first initial feature vectors and the plurality of second initial feature vectors corresponding to that phoneme may be obtained. If the number of first initial feature vectors corresponding to the phoneme is greater than the number of second initial feature vectors corresponding to the phoneme, interpolation padding may be performed on the second initial feature vectors, or decimation may be performed on the first initial feature vectors. If the number of first initial feature vectors corresponding to the phoneme is smaller than the number of second initial feature vectors, decimation may be performed on the second initial feature vectors, or interpolation padding may be performed on the first initial feature vectors.
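A minimal sketch of equalising the two per-phoneme segments is given below. The text does not specify how the padding or decimation is carried out, so linear interpolation and evenly spaced selection are assumptions, as are all function names and sample values.

```python
def interpolate(vectors, target_len):
    """Stretch a sequence of vectors to target_len by linear interpolation."""
    if len(vectors) == 1 or target_len == 1:
        return [list(vectors[0]) for _ in range(target_len)]
    out = []
    for t in range(target_len):
        pos = t * (len(vectors) - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(vectors) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(vectors[lo], vectors[hi])])
    return out


def decimate(vectors, target_len):
    """Keep target_len evenly spaced vectors and drop the others."""
    step = len(vectors) / target_len
    return [vectors[int(t * step)] for t in range(target_len)]


def equalize_segment(first, second, shrink=False):
    """Give both per-phoneme segments the same number of vectors."""
    if shrink:
        target = min(len(first), len(second))
        return decimate(first, target), decimate(second, target)
    target = max(len(first), len(second))
    return interpolate(first, target), interpolate(second, target)


# Hypothetical segments for one phoneme: three vectors versus two vectors.
first_segment = [[0.0], [1.0], [2.0]]
second_segment = [[0.0], [2.0]]
padded_first, padded_second = equalize_segment(first_segment, second_segment)
short_first, short_second = equalize_segment(first_segment, second_segment, shrink=True)
```

Padding keeps every original vector of the longer segment, while decimation shortens the longer segment and discards information; which is preferable depends on the application.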

Since the audio client performs alignment processing on the plurality of first initial feature vectors and the plurality of second initial feature vectors, on one hand, it can be ensured that the number of the aligned first initial feature vectors (i.e., first feature vectors) is the same as that of the aligned second initial feature vectors (i.e., second feature vectors), so as to facilitate subsequent determination of the timbre similarities of the first audio and the second audio. On the other hand, the influence of the difference between the first audio and the second audio on the accuracy of the subsequently determined timbre similarity of the first audio and the second audio can be avoided. Wherein the difference of the first audio and the second audio may include: a difference in start times of the first audio and the second audio.

Step 203, determining k first mean feature vectors according to the m first feature vectors.

In the embodiments of the present disclosure, each first mean feature vector is determined based on an average of a plurality of the m first feature vectors. These first feature vectors may be consecutive or non-consecutive. Here, k is a positive integer.

Optionally, the audio client may first divide the m first feature vectors into k different first vector groups. Then, for each first vector group, the audio client may determine the average value of the first feature vectors included in that group as one first mean feature vector, thereby obtaining k first mean feature vectors. "k different first vector groups" means that the first feature vectors included in any two first vector groups are not all identical; that is, any two first vector groups contain first feature vectors that are either completely different or only partially the same.

It should be noted that the number of first feature vectors included in each of the k first vector groups may differ, in which case the k first mean feature vectors are determined from averages over different numbers of first feature vectors. Alternatively, each of the k first vector groups may include the same number of first feature vectors, for example n, in which case the k first mean feature vectors are determined from averages over the same number of first feature vectors. Here, n may be an integer greater than 1 and less than m, and n satisfies m = n × k.

It should be further noted that the playing duration corresponding to each first vector group may be 100 milliseconds (ms) to 1 second (s). Accordingly, the audio client may determine the value of n from this playing duration; that is, the sum of the interval playing durations corresponding to the first n-1 of the n first feature vectors and the playing duration corresponding to one first feature vector must be between 100 ms and 1 s. Each first feature vector corresponds to one first audio frame, which is one of the plurality of discrete first audio frames that the audio client obtains by sampling the first audio when obtaining the first initial feature vectors. The interval playing duration corresponding to each first feature vector is the playback interval between the first sampling points of two adjacent first audio frames. The playing duration corresponding to a first feature vector is the playback interval between the first sampling point and the last sampling point of its first audio frame.
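The constraint on n can be written as (n - 1) × interval + frame duration, which must lie between 100 ms and 1 s. The sketch below computes the admissible range of n from that expression; the 10 ms interval and 25 ms frame duration are hypothetical values chosen only for illustration, and the function names are not from the source.

```python
import math


def group_duration_ms(n, hop_ms, frame_ms):
    """Playing duration covered by n consecutive frames, in milliseconds."""
    return (n - 1) * hop_ms + frame_ms


def valid_group_sizes(hop_ms, frame_ms, min_ms=100.0, max_ms=1000.0):
    """Smallest and largest n whose group duration lies in [min_ms, max_ms]."""
    # n must also be greater than 1, hence the lower bound of 2.
    n_min = max(2, math.ceil((min_ms - frame_ms) / hop_ms) + 1)
    n_max = math.floor((max_ms - frame_ms) / hop_ms) + 1
    return n_min, n_max


# Hypothetical framing: adjacent frames start 10 ms apart, each frame lasts 25 ms.
n_min, n_max = valid_group_sizes(hop_ms=10.0, frame_ms=25.0)
```

Under these assumed values, any n between `n_min` and `n_max` keeps each vector group within the stated duration range.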

In the embodiment of the present disclosure, the audio client may group the m first feature vectors in multiple ways, and correspondingly, the k first vector groups obtained by the division may take multiple forms. The embodiments of the present disclosure are described below by taking the following optional implementations as examples:

In a first optional implementation, the audio client may divide every n consecutive first feature vectors of the m first feature vectors into one first vector group, where the first first feature vector in each first vector group is adjacent to the last first feature vector in the previous first vector group, so as to obtain k first vector groups. The intersection of any two of the k first vector groups is empty, that is, the n first feature vectors included in any two first vector groups are all different. In addition, the first feature vectors included in two adjacent first vector groups are consecutive, that is, the last first feature vector in the former group and the first first feature vector in the latter group are two adjacent first feature vectors among the m first feature vectors.

By dividing the m first feature vectors equally into k first vector groups, that is, by dividing every n consecutive first feature vectors into one first vector group, the audio client can, on the one hand, reduce its computational complexity and effectively improve its computational efficiency. On the other hand, it can improve the stability of the timbre of the first audio corresponding to each first vector group, which helps ensure the accuracy of the subsequently determined timbre similarity of the first audio and the second audio.
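The equal division described in the first implementation can be sketched as follows. This is only an illustration under the assumption that each feature vector is a plain list of floats and that m = n × k; the helper names `mean_vector` and `consecutive_group_means` are hypothetical and are not taken from the disclosure.

```python
def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def consecutive_group_means(vectors, n):
    """Split the vectors into non-overlapping runs of n consecutive
    vectors and return one mean feature vector per run."""
    if len(vectors) % n != 0:
        raise ValueError("m must equal n * k for an equal division")
    return [
        mean_vector(vectors[start:start + n])
        for start in range(0, len(vectors), n)
    ]


# m = 6 toy feature vectors of dimension 2, grouped with n = 2 (so k = 3).
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
            [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
means = consecutive_group_means(features, n=2)
```

Because the groups do not overlap, each input vector is read exactly once, which is consistent with the reduced computational complexity mentioned above.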

In a second optional implementation, the audio client may divide every n consecutive first feature vectors of the m first feature vectors into one first vector group, where the first first feature vector in each first vector group may be the same as any first feature vector other than the first one in the previous first vector group, so as to obtain k first vector groups. Among the k first vector groups, there are two first vector groups whose intersection is not empty, that is, there are two first vector groups that include the same first feature vector.

For example, assume that the audio client has obtained 8 first feature vectors: a, b, c, d, e, f, g, and h. Every 4 consecutive first feature vectors may be divided into one first vector group to obtain 4 first vector groups. The first of the 4 first vector groups may include the four first feature vectors a, b, c, and d; the second may include b, c, d, and e; the third may include c, d, e, and f; and the fourth may include d, e, f, and g. For any two adjacent first vector groups among the four, the first first feature vector in the latter group is the same as the second first feature vector in the former group.
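The overlapping grouping in this example behaves like a sliding window. The sketch below assumes a window that advances by one vector at a time, which matches the example; the function name `sliding_groups` and its `step` parameter are illustrative assumptions. The number of groups k is passed explicitly because the example stops at four groups.

```python
def sliding_groups(vectors, n, k, step=1):
    """Return k groups of n consecutive items, where each group starts
    `step` items after the start of the previous group."""
    groups = [vectors[start:start + n] for start in range(0, k * step, step)]
    # Slicing past the end silently shortens a group, so check lengths.
    if any(len(group) != n for group in groups):
        raise ValueError("not enough vectors for k groups of size n")
    return groups


# Letters stand in for the eight first feature vectors a to h.
labels = list("abcdefgh")
groups = sliding_groups(labels, n=4, k=4)
print(["".join(group) for group in groups])
```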

In a third optional implementation, the audio client may divide n first feature vectors arranged at intervals among the m first feature vectors into one first vector group. For example, the audio client may divide the odd-numbered first feature vectors of the m first feature vectors into one first vector group, and divide the even-numbered first feature vectors into another first vector group, thereby obtaining 2 first vector groups.

For example, assume that the audio client has obtained 8 first feature vectors: a, b, c, d, e, f, g, and h. Every 4 first feature vectors arranged at intervals of one may be divided into one first vector group to obtain 2 first vector groups, where one first vector group includes the four first feature vectors a, c, e, and g, and the other first vector group includes the four first feature vectors b, d, f, and h.
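The interval-based grouping can be expressed with strided slicing. This is a minimal sketch in which the function name `interleaved_groups` is a hypothetical helper; with k = 2 it reproduces the odd-numbered and even-numbered split of the example.

```python
def interleaved_groups(vectors, k):
    """Distribute the vectors over k groups by taking every k-th item,
    starting at offsets 0, 1, ..., k - 1."""
    return [vectors[offset::k] for offset in range(k)]


# Letters stand in for the eight first feature vectors a to h.
odd_group, even_group = interleaved_groups(list("abcdefgh"), k=2)
print("".join(odd_group), "".join(even_group))
```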

Step 204, determining, according to the m second feature vectors, k second mean feature vectors in one-to-one correspondence with the k first mean feature vectors.

In the embodiment of the present disclosure, each second mean feature vector is determined based on the average value of a plurality of second feature vectors among the m second feature vectors. The plurality of second feature vectors may be consecutive or non-consecutive.

Optionally, the audio client may first divide the m second feature vectors into k different second vector groups. Then, for each second vector group, the audio client may determine the average value of the plurality of second feature vectors included in that group as one second mean feature vector, thereby obtaining k second mean feature vectors.

Here, "k different second vector groups" may mean that the second feature vectors included in any two second vector groups are not completely identical; that is, the second feature vectors included in any two second vector groups are either completely different or only partially identical.

It should be noted that the number of second feature vectors included in each of the k second vector groups may be different, that is, the k second mean feature vectors are determined based on the average values of different numbers of second feature vectors. Alternatively, the number of second feature vectors included in each of the k second vector groups may be the same, for example, n, that is, the k second mean feature vectors are determined based on the average values of the same number of second feature vectors.

For the manner in which the audio client groups the m second feature vectors, reference may be made to the above manner of grouping the m first feature vectors, and details are not repeated here.

It should be noted that the audio client needs to group the m first feature vectors and the m second feature vectors in the same grouping manner. That is, it needs to ensure that the number of first feature vectors included in each first vector group is the same as the number of second feature vectors included in the corresponding second vector group, that the continuity between the first feature vectors is the same as the continuity between the second feature vectors, and that the continuity between adjacent first vector groups is the same as the continuity between adjacent second vector groups, so as to ensure the accuracy of the finally determined timbre similarity of the first audio and the second audio.
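One way to guarantee the same grouping manner for both audios is to build the groups once as lists of indices and apply them to both sequences. The sketch below illustrates that idea under the assumption that the feature vectors are lists of floats; `mean_vector`, `grouped_means`, and `index_groups` are hypothetical names, and the disclosure does not prescribe this particular representation.

```python
def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def grouped_means(vectors, index_groups):
    """Average the vectors selected by each index group."""
    return [
        mean_vector([vectors[index] for index in group])
        for group in index_groups
    ]


# A single grouping (k = 2, n = 2) shared by both audios.
index_groups = [[0, 1], [2, 3]]
first_vectors = [[1.0], [3.0], [5.0], [7.0]]
second_vectors = [[2.0], [2.0], [4.0], [8.0]]

first_means = grouped_means(first_vectors, index_groups)
second_means = grouped_means(second_vectors, index_groups)
```

Since both calls use the same `index_groups`, the i-th first mean feature vector and the i-th second mean feature vector cover the same positions, which gives the one-to-one correspondence required in step 204.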

Step 205, determining the timbre similarity of the first audio and the second audio based on the k first mean feature vectors and the k second mean feature vectors.

In the embodiment of the present disclosure, the audio client may use a Pearson algorithm to process the k first mean feature vectors and the k second mean feature vectors in one-to-one correspondence with them, so as to determine the timbre similarity of the first audio and the second audio. Alternatively, the audio client may determine the cosine distances between the k first mean feature vectors and the k second mean feature vectors, and thereby determine the timbre similarity of the first audio and the second audio.
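For the cosine-based alternative, a minimal sketch is given below. It uses the common convention that the cosine distance equals one minus the cosine similarity; the disclosure does not state which convention it intends, so this choice and the function names are assumptions.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """Assumed convention: distance = 1 - cosine similarity."""
    return 1.0 - cosine_similarity(x, y)


# One first mean feature vector and its corresponding second mean feature vector.
distance = cosine_distance([1.0, 2.0], [2.0, 4.0])
```

Parallel vectors give a distance close to 0 and orthogonal vectors give a distance of 1, so a smaller value indicates more similar mean feature vectors under this convention.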

It should be noted that if k is equal to 1, the audio client may determine the timbre similarity of the first audio and the second audio directly based on the k first mean feature vectors and the k second mean feature vectors. If k is an integer greater than 1, the audio client may determine k timbre distances based on the k first mean feature vectors and the k second mean feature vectors, and then determine timbre similarity between the first audio and the second audio according to the k timbre distances.

In the embodiment of the present disclosure, the implementation process of step 205 is described by taking the case where k is an integer greater than 1 and the Pearson algorithm is used as an example. Referring to fig. 4, the implementation process may include:

Step 2051, processing each first mean feature vector and its corresponding second mean feature vector by using the Pearson algorithm to determine one timbre distance, thereby obtaining k timbre distances.

The Pearson algorithm satisfies the following formula:

$$corr_i = \frac{l\sum_{j=1}^{l} X_{ij}Y_{ij} - \sum_{j=1}^{l} X_{ij}\sum_{j=1}^{l} Y_{ij}}{\sqrt{l\sum_{j=1}^{l} X_{ij}^{2} - \left(\sum_{j=1}^{l} X_{ij}\right)^{2}}\,\sqrt{l\sum_{j=1}^{l} Y_{ij}^{2} - \left(\sum_{j=1}^{l} Y_{ij}\right)^{2}}}$$

In the formula, $corr_i$ is the i-th timbre distance among the k timbre distances, and i is a positive integer not greater than k. l is the dimension of each first mean feature vector and each second mean feature vector, that is, the number of components included in each mean feature vector, and l is an integer greater than 1. $X_{ij}$ refers to the j-th component of the i-th first mean feature vector, and $Y_{ij}$ refers to the j-th component of the i-th second mean feature vector, where j is a positive integer not greater than l.

Step 2052, determining the average value of the k timbre distances as the timbre similarity of the first audio and the second audio.

Optionally, the average value may be an arithmetic mean, a geometric mean, or a root mean square value. The embodiments of the present disclosure do not limit this.
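Steps 2051 and 2052 can be sketched together as follows. The sketch uses the standard Pearson correlation coefficient and the arithmetic mean; the function names are hypothetical, and the code is an illustration under these assumptions rather than the implementation of the disclosure. The variable `dim` plays the role of l, the number of components in each mean feature vector.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation coefficient of two equal-length vectors.
    Undefined (division by zero) if either vector is constant."""
    dim = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = dim * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(dim * sum_xx - sum_x ** 2)
                   * math.sqrt(dim * sum_yy - sum_y ** 2))
    return numerator / denominator


def timbre_similarity(first_means, second_means):
    """Step 2051: one Pearson value per corresponding pair of mean vectors.
    Step 2052: arithmetic mean of the k values."""
    distances = [pearson(x, y) for x, y in zip(first_means, second_means)]
    return sum(distances) / len(distances)


# k = 2 pairs of mean feature vectors with l = 3 components each.
first_means = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
second_means = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
similarity = timbre_similarity(first_means, second_means)
```

In this toy input the first pair is perfectly correlated and the second pair is perfectly anti-correlated, so the arithmetic mean is approximately zero. A geometric mean or root mean square value could be substituted in the final line of `timbre_similarity` if one of the other averaging options is preferred.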

It should be noted that the order of the steps of the method for determining timbre similarity provided by the embodiment of the present disclosure may be appropriately adjusted, and steps may also be added or removed according to the situation. For example, step 203 may be performed synchronously with step 204. Any variation of the method that can be easily conceived by those skilled in the art within the technical scope of the present disclosure shall fall within the protection scope of the present disclosure, and is therefore not described in detail.

In summary, the embodiments of the present disclosure provide a method for determining timbre similarity, which may determine timbre similarities of a first audio and a second audio based on k first mean feature vectors and k second mean feature vectors. Since each first mean feature vector is determined based on the average value of a plurality of first feature vectors in the m first feature vectors, and each second mean feature vector is determined based on the average value of a plurality of second feature vectors in the m second feature vectors, the accuracy of determination by the tone similarity determination method provided by the embodiment of the disclosure is higher compared with the determination of the tone similarity directly from the first feature vectors and the second feature vectors in the prior art.

The embodiment of the present disclosure provides a determination apparatus for determining timbre similarity, and referring to fig. 6, the apparatus may include:

an obtaining module 301, configured to obtain m first feature vectors of a first audio and m second feature vectors of a second audio.

The first audio and the second audio are different audio, and m is an integer greater than 1.

A first determining module 302, configured to determine k first mean feature vectors according to the m first feature vectors, where each first mean feature vector is determined based on an average of multiple first feature vectors in the m first feature vectors, and k is a positive integer.

A second determining module 303, configured to determine, according to the m second feature vectors, k second mean feature vectors that are in one-to-one correspondence with the k first mean feature vectors, where each second mean feature vector is determined based on an average value of a plurality of second feature vectors in the m second feature vectors.

A third determining module 304, configured to determine the timbre similarity of the first audio and the second audio based on the k first mean feature vectors and the k second mean feature vectors.

Optionally, the first determining module 302 is configured to:

dividing the m first feature vectors into k different first vector groups, wherein each first vector group comprises n continuous first feature vectors, and n is an integer greater than 1 and smaller than m;

for each first vector group, determining the average value of n continuous first feature vectors included in the first vector group as a first average feature vector to obtain k first average feature vectors.

The second determining module 303 is configured to:

dividing the m second eigenvectors into k different second vector groups, each second vector group comprising n consecutive second eigenvectors;

and for each second vector group, determining the average value of n continuous second feature vectors included in the second vector group as one second mean feature vector to obtain k second mean feature vectors.

Optionally, an intersection of any two first vector groups is empty, and the first feature vectors included in two adjacent first vector groups are continuous; the intersection of any two second vector groups is empty, and the second eigenvectors included in the two adjacent second vector groups are continuous.

Optionally, k is an integer greater than 1, and the third determining module 304 is configured to:

processing each first mean value feature vector and a corresponding second mean value feature vector by adopting a Pearson algorithm, determining a tone distance, and obtaining k tone distances;

and determining the average value of the k tone color distances as the tone color similarity of the first audio and the second audio.

Optionally, referring to fig. 6, the obtaining module 301 may include:

an obtaining sub-module 3011, configured to obtain a plurality of first initial feature vectors of the first audio and a plurality of second initial feature vectors of the second audio;

the alignment submodule 3012 is configured to perform alignment processing on the multiple first initial feature vectors and the multiple second initial feature vectors to obtain m first feature vectors and m second feature vectors.

Optionally, the alignment sub-module 3012 is configured to:

performing alignment processing on the plurality of first initial feature vectors and the plurality of second initial feature vectors by using a dynamic time warping (DTW) algorithm.
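The alignment step can be sketched with a textbook dynamic time warping routine. This section does not specify how the warping path is converted into m first feature vectors and m second feature vectors; the sketch assumes that each pair on the optimal path contributes one vector to each output sequence, so that both outputs have the same length m. The function name `dtw_align` and the use of Euclidean distance between frames are likewise assumptions.

```python
import math


def dtw_align(first, second):
    """Align two sequences of vectors along the optimal DTW path and
    return two sequences of equal length (assumed expansion strategy)."""
    rows, cols = len(first), len(second)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning first[:i] with second[:j].
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            frame_distance = math.dist(first[i - 1], second[j - 1])
            cost[i][j] = frame_distance + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = rows, cols
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()

    aligned_first = [first[p] for p, _ in path]
    aligned_second = [second[q] for _, q in path]
    return aligned_first, aligned_second


# Toy one-dimensional "initial feature vectors" of different lengths.
first_initial = [[0.0], [1.0], [2.0]]
second_initial = [[0.0], [0.0], [1.0], [2.0]]
first_aligned, second_aligned = dtw_align(first_initial, second_initial)
```

In this toy input the second sequence repeats its first frame, so the alignment repeats the first frame of the first sequence and both outputs contain four vectors.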

Optionally, the obtaining sub-module 3011 is configured to:

extracting a plurality of first Mel-frequency cepstral coefficients (MFCCs) from the first audio as the plurality of first initial feature vectors;

extracting a plurality of second Mel-frequency cepstral coefficients (MFCCs) from the second audio as the plurality of second initial feature vectors.

In summary, the present disclosure provides a device for determining timbre similarity, which may determine timbre similarities of a first audio and a second audio based on k first mean feature vectors and k second mean feature vectors. Since each first mean feature vector is determined based on the average value of a plurality of first feature vectors in the m first feature vectors, and each second mean feature vector is determined based on the average value of a plurality of second feature vectors in the m second feature vectors, the accuracy of determining the timbre similarity by the timbre similarity determining method provided by the embodiment of the disclosure is higher than that in the prior art, in which the timbre similarity is determined directly from the first feature vectors and the second feature vectors.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus, the modules and the sub-modules described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Fig. 7 is a schematic structural diagram of another apparatus for determining timbre similarity according to an embodiment of the present disclosure, and referring to fig. 7, the apparatus 400 may include: a processor 401, a memory 402 and a computer program stored on the memory 402 and operable on the processor 401, wherein the processor 401, when executing the computer program, can implement the method for determining timbre similarity provided by the above method embodiment, for example, the method shown in fig. 2 or fig. 3.

The embodiment of the present disclosure also provides a computer-readable storage medium, in which instructions are stored, and when the computer-readable storage medium runs on a computer, the computer is caused to execute the method for determining timbre similarity provided by the above method embodiment, for example, the method shown in fig. 2 or fig. 3.

The embodiment of the present disclosure further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the method for determining the timbre similarity provided by the above method embodiment, for example, the method shown in fig. 2 or fig. 3.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.
