Music recognition method, device, electronic equipment and computer readable storage medium

Document No.: 193392    Publication date: 2021-11-02

Reading note: This technology, "Music recognition method, apparatus, electronic device and computer-readable storage medium", was designed and created by 潘颂声, 曹偲, 朱一闻, 刘华平, 赵翔宇 and 李鹏 on 2021-08-03. Main content: The present disclosure provides a music recognition method, a music recognition apparatus, an electronic device, and a computer-readable storage medium, and relates to the field of artificial intelligence. The music recognition method is applied to a terminal device comprising an audio input device and comprises: receiving audio input by a user through the audio input device; analyzing the audio and determining profile information of the audio, wherein the profile information indicates invalid audio, low signal-to-noise ratio (SNR) audio, or high-SNR audio; when the profile information indicates that the audio is low-SNR audio, recognizing the audio according to a low-SNR recognition strategy and outputting a low-SNR recognition result; when the profile information indicates that the audio is high-SNR audio, recognizing the audio according to a high-SNR recognition strategy and outputting a high-SNR recognition result; and determining the recognized music piece based on the low-SNR or high-SNR recognition result. The present disclosure can perform effective recognition across a variety of scenes and audio types and output matching results with high accuracy.

1. A music recognition method applied to a terminal device comprising an audio input device, the method comprising:

receiving audio input by a user through the audio input device;

analyzing the audio and determining profile information of the audio, wherein the profile information indicates invalid audio, low signal-to-noise ratio (SNR) audio, or high-SNR audio;

when the profile information indicates that the audio is low-SNR audio, recognizing the audio according to a low-SNR recognition strategy and outputting a low-SNR recognition result;

when the profile information indicates that the audio is high-SNR audio, recognizing the audio according to a high-SNR recognition strategy and outputting a high-SNR recognition result; and

determining the recognized music piece based on the low-SNR recognition result or the high-SNR recognition result.

2. The music recognition method according to claim 1, wherein the analyzing the audio and determining profile information of the audio comprises:

performing framing processing on the audio to obtain a plurality of audio frames;

calculating, for each audio frame, the probability that the frame is a valid audio frame, and determining the frame to be a valid audio frame when that probability is greater than or equal to a preset first threshold, or an invalid audio frame otherwise;

counting the number of valid audio frames and calculating the ratio of the number of valid audio frames to the total number of audio frames; and

when the ratio is smaller than a preset second threshold, determining that the profile information is invalid audio and stopping recognition of the audio.

3. The music recognition method according to claim 2, further comprising:

when the ratio is greater than or equal to the second threshold, calculating the mean of the valid-frame probabilities over the valid audio frames; and

when the mean is greater than or equal to a preset third threshold, determining that the profile information is high-SNR audio, or low-SNR audio otherwise.

4. The music recognition method according to claim 1, wherein the low-SNR recognition strategy comprises a plurality of recognition sub-strategies, and the recognizing the audio according to the low-SNR recognition strategy comprises:

invoking the plurality of recognition sub-strategies in a preset order to recognize the audio, and determining in turn, based on a preset music library, at least one recognition result produced by each recognition sub-strategy and the low-SNR similarity between that recognition result and the corresponding music piece in the music library; and

when the low-SNR similarity corresponding to one of the recognition sub-strategies is greater than or equal to a preset fourth threshold, determining the music piece corresponding to that sub-strategy as the recognized music piece and stopping invocation of the remaining recognition sub-strategies.

5. The music recognition method according to claim 1, wherein the high-SNR recognition strategy comprises a plurality of recognition sub-strategies, and the recognizing the audio according to the high-SNR recognition strategy comprises:

performing scene classification on the audio, and determining, according to the scene classification result, one or more of the plurality of recognition sub-strategies to invoke for recognizing the audio;

determining, based on a preset music library, one or more recognition results produced by the one or more recognition sub-strategies and one or more high-SNR similarities between those recognition results and the corresponding music pieces in the music library; and

comparing each high-SNR similarity with a preset fifth threshold, and determining each music piece whose high-SNR similarity is greater than or equal to the fifth threshold as a recognized music piece.

6. The music recognition method according to claim 5, wherein each recognition sub-strategy corresponds to an audio category and a sub-strategy threshold, and the performing scene classification on the audio and determining, according to the scene classification result, one or more recognition sub-strategies to invoke comprises:

determining, based on a classification model, the classification probability that the audio belongs to each audio category;

comparing each classification probability with the sub-strategy threshold corresponding to the respective audio category;

when a classification probability is greater than or equal to the corresponding sub-strategy threshold, determining that the audio belongs to the corresponding audio category and invoking the corresponding recognition sub-strategy to recognize the audio; and

when a classification probability is smaller than the corresponding sub-strategy threshold, not invoking the corresponding recognition sub-strategy.

7. The music recognition method according to claim 4 or 5, further comprising:

sorting the recognized music pieces according to their corresponding low-SNR or high-SNR similarities; and

merging and deduplicating the sorted music pieces.

8. A music recognition apparatus applied to a terminal device comprising an audio input device, the apparatus comprising:

a receiving module configured to receive audio input by a user through the audio input device;

an audio analysis module configured to analyze the audio and determine profile information of the audio, wherein the profile information indicates invalid audio, low signal-to-noise ratio (SNR) audio, or high-SNR audio;

an audio recognition module configured to recognize the audio according to a low-SNR recognition strategy and output a low-SNR recognition result when the profile information indicates that the audio is low-SNR audio, and to recognize the audio according to a high-SNR recognition strategy and output a high-SNR recognition result when the profile information indicates that the audio is high-SNR audio; and

a recognition decision module configured to determine the recognized music piece based on the low-SNR recognition result or the high-SNR recognition result.

9. An electronic device, comprising:

a memory; and

a processor coupled to the memory, the processor being configured to perform the music recognition method according to any one of claims 1 to 7 based on instructions stored in the memory.

10. A computer-readable storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the music recognition method according to any one of claims 1 to 7.

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a music recognition method, a music recognition apparatus, an electronic device, and a computer-readable storage medium based on an artificial intelligence technology.

Background

This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

With growing demand for entertainment, more and more users try to find a song they have in mind from an audio clip. Identifying the corresponding piece of music from a piece of audio is commonly referred to as "song recognition by listening". Existing approaches mainly include audio fingerprint recognition, cover song recognition, and humming recognition. However, when any one of these methods is applied alone, audio with a low signal-to-noise ratio, a live performance, or a heavily adapted song may not be recognized effectively, so that recognition accuracy drops and incorrect results are matched, which seriously degrades the user experience.

There is therefore a need for an improved music recognition method and apparatus that can recognize effectively across a variety of scenes and audio types and output matching results with high accuracy.

Disclosure of Invention

In view of the above, there is a need for a music recognition scheme that can solve, at least to some extent, the poor recognition accuracy and incorrect matches that a single listening-based recognition technique exhibits for low signal-to-noise ratio audio or heavily adapted songs.

In this context, embodiments of the present disclosure desirably provide a music recognition method, a music recognition apparatus, an electronic device, and a computer-readable storage medium.

According to a first aspect of the present disclosure, there is provided a music recognition method applied to a terminal device comprising an audio input device, the method comprising: receiving audio input by a user through the audio input device; analyzing the audio and determining profile information of the audio, wherein the profile information indicates invalid audio, low signal-to-noise ratio (SNR) audio, or high-SNR audio; when the profile information indicates that the audio is low-SNR audio, recognizing the audio according to a low-SNR recognition strategy and outputting a low-SNR recognition result; when the profile information indicates that the audio is high-SNR audio, recognizing the audio according to a high-SNR recognition strategy and outputting a high-SNR recognition result; and determining the recognized music piece based on the low-SNR recognition result or the high-SNR recognition result.
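For illustration, the overall flow described above can be sketched in a few lines of Python. This is a non-authoritative sketch: the function and parameter names are hypothetical, and the three injected callables stand in for the audio analysis and the two recognition flows of the disclosure.

```python
def recognize_music(audio, analyze_profile, low_snr_strategy, high_snr_strategy):
    """Top-level dispatch: profile the audio, then route it to one strategy.

    analyze_profile(audio) is assumed to return one of
    'invalid', 'low_snr', or 'high_snr'."""
    profile = analyze_profile(audio)
    if profile == "invalid":
        return None                        # pure noise etc.: stop recognition early
    if profile == "low_snr":
        return low_snr_strategy(audio)     # low-SNR recognition strategy
    return high_snr_strategy(audio)        # high-SNR recognition strategy
```

A usage example: `recognize_music(clip, vad_profiler, fingerprint_flow, scene_flow)` would return `None` for pure-noise input and a recognition result otherwise.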

Optionally, the analyzing the audio and determining profile information of the audio includes: performing framing processing on the audio to obtain a plurality of audio frames; calculating, for each audio frame, the probability that the frame is a valid audio frame, and determining the frame to be a valid audio frame when that probability is greater than or equal to a preset first threshold, or an invalid audio frame otherwise; counting the number of valid audio frames and calculating the ratio of the number of valid audio frames to the total number of audio frames; and, when the ratio is smaller than a preset second threshold, determining that the profile information is invalid audio and stopping recognition of the audio.

Optionally, the method further includes: when the ratio is greater than or equal to the second threshold, calculating the mean of the valid-frame probabilities over the valid audio frames; and, when the mean is greater than or equal to a preset third threshold, determining that the profile information is high-SNR audio, or low-SNR audio otherwise.
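The frame-level profile analysis described above can be sketched as follows. This is a minimal Python sketch under the assumption that a voice-activity or music-detection model supplies a per-frame valid-audio probability; the function name and the concrete threshold values are hypothetical, not taken from the disclosure.

```python
def analyze_audio_profile(frame_probs, valid_prob_threshold=0.5,
                          ratio_threshold=0.5, mean_threshold=0.8):
    """Classify audio as 'invalid', 'low_snr', or 'high_snr' from the
    per-frame probabilities that each frame contains valid audio."""
    # First threshold: a frame is valid if its probability clears it.
    valid = [p for p in frame_probs if p >= valid_prob_threshold]
    ratio = len(valid) / len(frame_probs)
    # Second threshold: too few valid frames means mostly noise/silence.
    if ratio < ratio_threshold:
        return "invalid"
    # Third threshold: mean probability over the valid frames decides SNR class.
    mean_valid = sum(valid) / len(valid)
    return "high_snr" if mean_valid >= mean_threshold else "low_snr"
```

For example, a clip whose frames are mostly confident music (probabilities near 0.9) would be classified as high-SNR, while uniformly uncertain frames (around 0.6) would fall into the low-SNR branch.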

Optionally, the low-SNR recognition strategy includes a plurality of recognition sub-strategies, and the recognizing the audio according to the low-SNR recognition strategy includes: invoking the plurality of recognition sub-strategies in a preset order to recognize the audio, and determining in turn, based on a preset music library, at least one recognition result produced by each recognition sub-strategy and the low-SNR similarity between that recognition result and the corresponding music piece in the music library; and, when the low-SNR similarity corresponding to one of the recognition sub-strategies is greater than or equal to a preset fourth threshold, determining the music piece corresponding to that sub-strategy as the recognized music piece and stopping invocation of the remaining recognition sub-strategies.
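The sequential low-SNR flow above, with its early stop at the fourth threshold, might look roughly like the sketch below. All names are hypothetical; each sub-strategy is modeled as a callable that returns candidate (piece, similarity) pairs against the music library.

```python
def recognize_low_snr(audio, sub_strategies, library, similarity_threshold=0.85):
    """Invoke recognition sub-strategies in a preset order; stop as soon as
    one yields a match whose similarity clears the (fourth) threshold.

    Each sub-strategy is a callable: (audio, library) -> [(piece_id, sim), ...]."""
    for strategy in sub_strategies:                # preset order
        candidates = strategy(audio, library)
        if not candidates:
            continue
        piece_id, similarity = max(candidates, key=lambda c: c[1])
        if similarity >= similarity_threshold:     # fourth threshold
            return piece_id, similarity            # early stop: skip later strategies
    return None                                    # no confident match found
```

The early return is the point of the design: cheaper or more reliable sub-strategies (e.g. fingerprinting before humming matching) can be placed first, so later ones only run when needed.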

Optionally, the high-SNR recognition strategy includes a plurality of recognition sub-strategies, and the recognizing the audio according to the high-SNR recognition strategy includes: performing scene classification on the audio, and determining, according to the scene classification result, one or more of the plurality of recognition sub-strategies to invoke for recognizing the audio; determining, based on a preset music library, one or more recognition results produced by the one or more recognition sub-strategies and one or more high-SNR similarities between those recognition results and the corresponding music pieces in the music library; and comparing each high-SNR similarity with a preset fifth threshold, and determining each music piece whose high-SNR similarity is greater than or equal to the fifth threshold as a recognized music piece.

Optionally, each recognition sub-strategy corresponds to an audio category and a sub-strategy threshold, and the performing scene classification on the audio and determining, according to the scene classification result, one or more of the plurality of recognition sub-strategies to invoke includes: determining, based on a classification model, the classification probability that the audio belongs to each audio category; comparing each classification probability with the sub-strategy threshold corresponding to the respective audio category; when a classification probability is greater than or equal to the corresponding sub-strategy threshold, determining that the audio belongs to the corresponding audio category and invoking the corresponding recognition sub-strategy to recognize the audio; and, when a classification probability is smaller than the corresponding sub-strategy threshold, not invoking the corresponding recognition sub-strategy.
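The per-category gating described above amounts to a simple threshold comparison, as in the following sketch. Names and example categories are hypothetical; the probabilities are assumed to come from the scene-classification model.

```python
def select_sub_strategies(class_probs, sub_strategy_thresholds):
    """Decide which recognition sub-strategies to invoke for high-SNR audio.

    class_probs: {category: probability from the scene-classification model}
    sub_strategy_thresholds: {category: per-category sub-strategy threshold}
    Returns the categories whose probability clears the matching threshold;
    an unknown category defaults to a threshold of 1.0, i.e. never invoked."""
    return [cat for cat, p in class_probs.items()
            if p >= sub_strategy_thresholds.get(cat, 1.0)]
```

Note that several categories can clear their thresholds at once, so the high-SNR flow may run multiple sub-strategies in parallel, unlike the sequential low-SNR flow.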

Optionally, the method further includes: sorting the recognized music pieces according to their corresponding low-SNR or high-SNR similarities; and merging and deduplicating the sorted music pieces.
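The sorting and merge-deduplication step might be sketched as follows (hypothetical names; duplicates are merged by keeping the highest similarity reported for each piece across sub-strategies):

```python
def rank_and_deduplicate(results):
    """Sort recognized pieces by similarity (descending) and merge duplicates.

    results: list of (piece_id, similarity) pairs, possibly from several
    sub-strategies, with the same piece appearing more than once."""
    best = {}
    for piece_id, sim in results:
        if sim > best.get(piece_id, -1.0):   # keep the best score per piece
            best[piece_id] = sim
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

Merging before sorting keeps the final candidate list free of repeated pieces while still ranking by the strongest evidence available for each.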

According to a second aspect of the present disclosure, there is provided a music recognition apparatus applied to a terminal device comprising an audio input device, the apparatus including: a receiving module configured to receive audio input by a user through the audio input device; an audio analysis module configured to analyze the audio and determine profile information of the audio, wherein the profile information indicates invalid audio, low signal-to-noise ratio (SNR) audio, or high-SNR audio; an audio recognition module configured to recognize the audio according to a low-SNR recognition strategy and output a low-SNR recognition result when the profile information indicates that the audio is low-SNR audio, and to recognize the audio according to a high-SNR recognition strategy and output a high-SNR recognition result when the profile information indicates that the audio is high-SNR audio; and a recognition decision module configured to determine the recognized music piece based on the low-SNR recognition result or the high-SNR recognition result.

Optionally, the audio analysis module is configured to: perform framing processing on the audio to obtain a plurality of audio frames; calculate, for each audio frame, the probability that the frame is a valid audio frame, and determine the frame to be a valid audio frame when that probability is greater than or equal to a preset first threshold, or an invalid audio frame otherwise; count the number of valid audio frames and calculate the ratio of the number of valid audio frames to the total number of audio frames; and, when the ratio is smaller than a preset second threshold, determine that the profile information is invalid audio and stop recognition of the audio.

Optionally, the audio analysis module is further configured to: when the ratio is greater than or equal to the second threshold, calculate the mean of the valid-frame probabilities over the valid audio frames; and, when the mean is greater than or equal to a preset third threshold, determine that the profile information is high-SNR audio, or low-SNR audio otherwise.

Optionally, the low-SNR recognition strategy includes a plurality of recognition sub-strategies, and the audio recognition module is configured to: invoke the plurality of recognition sub-strategies in a preset order to recognize the audio, and determine in turn, based on a preset music library, at least one recognition result produced by each recognition sub-strategy and the low-SNR similarity between that recognition result and the corresponding music piece in the music library; and, when the low-SNR similarity corresponding to one of the recognition sub-strategies is greater than or equal to a preset fourth threshold, determine the music piece corresponding to that sub-strategy as the recognized music piece and stop invocation of the remaining recognition sub-strategies.

Optionally, the high-SNR recognition strategy includes a plurality of recognition sub-strategies, and the audio recognition module is configured to: perform scene classification on the audio, and determine, according to the scene classification result, one or more of the plurality of recognition sub-strategies to invoke for recognizing the audio; determine, based on a preset music library, one or more recognition results produced by the one or more recognition sub-strategies and one or more high-SNR similarities between those recognition results and the corresponding music pieces in the music library; and compare each high-SNR similarity with a preset fifth threshold, and determine each music piece whose high-SNR similarity is greater than or equal to the fifth threshold as a recognized music piece.

Optionally, each recognition sub-strategy corresponds to an audio category and a sub-strategy threshold, and the audio recognition module is configured to: determine, based on a classification model, the classification probability that the audio belongs to each audio category; compare each classification probability with the sub-strategy threshold corresponding to the respective audio category; when a classification probability is greater than or equal to the corresponding sub-strategy threshold, determine that the audio belongs to the corresponding audio category and invoke the corresponding recognition sub-strategy to recognize the audio; and, when a classification probability is smaller than the corresponding sub-strategy threshold, not invoke the corresponding recognition sub-strategy.

Optionally, the apparatus further includes a sorting and deduplication module configured to: sort the recognized music pieces according to their corresponding low-SNR or high-SNR similarities; and merge and deduplicate the sorted music pieces.

According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of the above via execution of the executable instructions.

According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.

According to the music recognition method, music recognition apparatus, electronic device, and computer-readable storage medium of the embodiments of the present disclosure: on the one hand, multiple music recognition strategies are integrated together and only one query entry is presented to the user, which prevents a user unfamiliar with the recognition strategies from selecting an unsuitable one and failing to find the intended music; this lowers the difficulty of operating song recognition and the user's cognitive cost, and correspondingly improves the user experience. On the other hand, by analyzing whether the audio profile is invalid audio or valid audio (comprising low-SNR audio and high-SNR audio) and selecting different recognition strategies according to the profile information of valid audio, different sub-recognition systems are started in a targeted manner for audio with different characteristics, effectively reducing the overall computational load of the music recognition system. Finally, by subdividing the music recognition strategy into a low-SNR recognition strategy and a high-SNR recognition strategy, the corresponding recognition flows can be configured differently, so that different music audio can be recognized effectively and more accurately, improving the overall recognition accuracy of the music recognition system.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

fig. 1 shows a schematic diagram of an application scenario of a music recognition method according to an embodiment of the present disclosure;

FIG. 2 schematically shows audio waveforms corresponding to pure noise, low signal-to-noise ratio music, and an original high signal-to-noise ratio music recording, respectively;

FIG. 3 schematically shows a flow diagram of a music recognition method according to one embodiment of the present disclosure;

FIG. 4 schematically illustrates a flow diagram for performing audio profile analysis according to one embodiment of the present disclosure;

FIG. 5 schematically illustrates a flow diagram of a low signal-to-noise ratio identification strategy according to one embodiment of the present disclosure;

FIG. 6 illustrates a flow chart for implementing audio fingerprinting according to an embodiment of the present disclosure;

FIG. 7 schematically illustrates a flow diagram of a music recognition method including scene classification of high signal-to-noise ratio audio, according to one embodiment of the present disclosure;

FIG. 8 schematically shows a flow chart for audio scene classification according to an embodiment of the present disclosure;

FIG. 9 schematically illustrates a flow diagram of a high signal-to-noise ratio identification strategy according to one embodiment of the present disclosure;

fig. 10 schematically shows a block diagram of a music recognition device according to one embodiment of the present disclosure;

FIG. 11 illustrates a schematic structural diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

According to an embodiment of the present disclosure, a music recognition method, a music recognition apparatus, an electronic device, and a computer-readable storage medium are provided.

In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.

The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.

Summary of the Invention

In the related art, a user usually performs song recognition in one of three ways: audio fingerprint recognition, cover song recognition, or humming recognition. However, these three methods are often provided with separate recognition entries, which may confuse a user unfamiliar with song-recognition strategies, who may not know which entry is best suited to the current query. Moreover, each of the three methods is usually used alone to recognize music, and a single recognition method tends to be highly limited. As shown in fig. 2, comparing the audio waveform diagrams 201, 202 and 203 corresponding to pure noise, low signal-to-noise ratio music, and a high signal-to-noise ratio original recording, when audio is classified along the signal-to-noise ratio dimension there are significant differences between the types, so a single music recognition method is not suitable for recognizing all types of audio.

For example, audio fingerprint recognition generally cannot recognize audio with a low signal-to-noise ratio effectively. In addition, for clearer audio, if the music corresponding to the audio is not in the recognition music library but the audio fingerprint is nevertheless matched successfully to some other piece in the library, that piece will be reported to the user as the recognition result, causing a false recognition and harming the user experience. Likewise, humming recognition cannot effectively recognize audio with a low signal-to-noise ratio.

The inventors found that these problems can be better addressed by analyzing the input audio to determine whether it is high signal-to-noise ratio or low signal-to-noise ratio audio and applying different recognition strategies accordingly, which reduces the user's cognitive cost regarding recognition strategies and improves the accuracy of music recognition.

Based on the above, the basic idea of the present disclosure is as follows: after determining that the input audio is not invalid pure-noise audio, analyze whether it is high signal-to-noise ratio or low signal-to-noise ratio audio; meanwhile, combine multiple existing music recognition strategies and configure the corresponding recognition flows, so that different recognition strategies and different recognition thresholds are applied to high-SNR and low-SNR audio respectively, achieving more accurate music recognition.

This technical scheme can integrate multiple music recognition strategies while presenting only one query entry to the user, avoiding the situation in which a user unfamiliar with the recognition strategies selects an unsuitable one and cannot find the intended music; this lowers the difficulty of operating song recognition and the user's cognitive cost, and correspondingly improves the user experience. Furthermore, by analyzing whether the audio profile is invalid audio or valid audio (comprising low-SNR audio and high-SNR audio) and selecting different recognition strategies according to the profile information of valid audio, different sub-recognition systems are started in a targeted manner for audio with different characteristics, effectively reducing the overall computational load of the music recognition system. Finally, subdividing the music recognition strategy into a low-SNR recognition strategy and a high-SNR recognition strategy allows the corresponding recognition flows to be configured differently, so that different music audio can be recognized effectively and more accurately, improving the overall recognition accuracy of the system.

Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.

Application scenario overview

It should be noted that the following application scenarios are merely illustrated to facilitate understanding of the spirit and principles of the present disclosure, and embodiments of the present disclosure are not limited in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.

Fig. 1 shows an application scenario of the music recognition method according to an embodiment of the present disclosure, wherein the system architecture 100 may include one or more of the terminal devices 101, 102, 103, the network 104 and the server 105. The network 104 is used to provide communication links between the terminal devices 101, 102, 103 and the server 105, and may include various types of connections, such as wired links, wireless communication links, or fiber optic cables. The terminal devices 101, 102, 103 may be various electronic devices including an audio input means, including but not limited to desktop computers, portable computers, smartphones, and tablet computers equipped with a microphone for capturing audio. It should be understood that the numbers of terminal devices, networks, and servers in fig. 1 are merely illustrative; there may be any number of terminal devices, networks, and servers, as desired. For example, server 105 may be a server cluster composed of multiple servers, or the like.

For example, in an exemplary embodiment, a user may input audio to be recognized to the terminal device 101, 102, or 103, and the terminal device 101, 102, or 103 may analyze and determine the face information of the input audio, recognize the input audio according to a low signal-to-noise ratio recognition policy or a high signal-to-noise ratio recognition policy, and send the corresponding recognition result to the server 105 through the network 104, so that the server 105 performs matching in a music library according to the received recognition result and finally sends the matched music to the terminal device 101, 102, or 103 through the network 104. Alternatively, the terminal device 101, 102, or 103 may be configured only to collect input audio and upload the collected audio to the server 105 via the network 104, with the server 105 performing the series of operations such as analyzing face information, recognizing the audio, and matching the recognized music. It should be understood by those skilled in the art that the foregoing application scenarios are only examples, and the exemplary embodiment is not limited thereto.

By the music recognition method, the operation difficulty of listening to the voice to recognize the music and the cognitive cost of the user can be reduced, and the overall recognition accuracy of the music recognition system is improved.

Exemplary method

A music piece recognition method according to an aspect of an exemplary embodiment of the present disclosure is described with reference to fig. 3.

The present exemplary embodiment provides a music piece recognition method applied to a terminal device including an audio input device. Referring to fig. 3, the music recognition method may include the steps of:

S310, receiving audio input by a user through an audio input device;

S320, analyzing the audio and determining the face information of the audio, wherein the face information comprises invalid audio, low signal-to-noise ratio audio or high signal-to-noise ratio audio;

S330, when the face information indicates that the audio is low signal-to-noise ratio audio, identifying the audio according to a low signal-to-noise ratio identification strategy, and outputting a low signal-to-noise ratio identification result;

S340, when the face information indicates that the audio is high signal-to-noise ratio audio, identifying the audio according to a high signal-to-noise ratio identification strategy, and outputting a high signal-to-noise ratio identification result;

S350, determining the identified music based on the low signal-to-noise ratio identification result or the high signal-to-noise ratio identification result.

In the above-provided music piece recognition method, the audio input by the user may be received through an audio input device such as a microphone, and the face information of the audio may be analyzed by calculating, based on a VAD algorithm, the probability that each audio frame belongs to a valid audio frame. If the face information of the input audio is determined to be pure noise, that is, invalid audio, the recognition process ends. If the face information of the input audio is determined to be high signal-to-noise ratio audio or low signal-to-noise ratio audio, different recognition strategies are adopted for the two cases respectively. For example, the high snr audio may be scene-classified to determine which one or more music recognition schemes it is applicable to, and the corresponding recognition scheme or schemes are invoked to recognize it; or multiple music recognition schemes may be combined and invoked in a preset sequence to recognize the low snr audio. In this way, a targeted, optimized recognition strategy can be realized for input audios with different face information, and the final recognition result is not limited to a single music recognition strategy. By the music recognition method, the operation difficulty of listening to audio to recognize music and the cognitive cost of the user can be reduced, the overall computation amount of the music recognition system is effectively reduced, and the overall recognition accuracy of the music recognition system is improved.

Next, in another embodiment, the above steps are explained in more detail.

In step S310, audio input by a user is received through an audio input device.

In the present exemplary embodiment, as described above, the user can input audio to be recognized to a terminal device such as a desktop computer, a portable computer, a smartphone, or a tablet computer. In order to capture or receive the audio input by the user, the terminal device may include an audio input means, such as a microphone or a sound pickup, for capturing audio. The input audio may be, for example, a piece of a music soundtrack, a cover of a song performed by the user or another performer, or humming by a human voice. In addition, the input audio may be, for example, a piece of rap performed by the user or another performer, and the like, which is not particularly limited in the present exemplary embodiment.

In step S320, the audio is analyzed and face information of the audio is determined, where the face information includes invalid audio, low snr audio, or high snr audio.

In the present exemplary embodiment, the input audio may be analyzed, for example, according to the flow shown in fig. 4. Therein, at S410, a voice endpoint detection (VAD) algorithm may be used to process the input audio. For example, an end-to-end neural network VAD algorithm may be selected, wherein the neural network may be, for example, a conventional deep neural network (DNN) or a convolutional neural network (CNN).

According to an embodiment of the present disclosure, based on the VAD algorithm, the input audio may be framed, for example in units of 20 milliseconds. Since the audio may be regarded as a sound waveform distributed along the time axis, the clip of the input audio in the interval of 0 to 20 ms on the time axis may be divided into one audio frame; then, taking 20 milliseconds as a fixed time window length, the time window may be shifted backward by 10 milliseconds along the time axis, that is, shifted to correspond to the interval of 10 to 30 ms on the time axis, and the audio clip in that interval may be divided into another audio frame; the time window may then be shifted again. By analogy, a plurality of audio frames can be obtained.
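The 20 ms window / 10 ms hop framing described above may be sketched as follows (a minimal sketch assuming a mono signal sampled at 16 kHz; the function name and parameters are illustrative, not part of the disclosure):

```python
import numpy as np

def frame_audio(signal, sample_rate=16000, win_ms=20, hop_ms=10):
    """Split a 1-D audio signal into overlapping frames.

    A 20 ms window is shifted along the time axis in 10 ms steps,
    so frame 0 covers 0-20 ms, frame 1 covers 10-30 ms, and so on.
    """
    win = int(sample_rate * win_ms / 1000)   # samples per frame (320 at 16 kHz)
    hop = int(sample_rate * hop_ms / 1000)   # samples per hop   (160 at 16 kHz)
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop: i * hop + win] for i in range(n_frames)])

# one second of audio -> 99 overlapping 20 ms frames
frames = frame_audio(np.zeros(16000))
print(frames.shape)  # (99, 320)
```

With a hop of half the window length, consecutive frames overlap by 50%, which is why one second of audio yields 99 rather than 50 frames.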

Based on the framing strategy, the training audio used for training the neural network model can be framed to obtain a plurality of training audio frames. Spectral features may be extracted for each training audio frame, where the spectral features may include, for example, mel-frequency cepstral coefficients (MFCCs), log-domain mel spectra, and so on. The extracted spectral features of each training audio frame have a fixed dimension and are input into the neural network model, and the output of the neural network model is the label corresponding to the training audio frame. The training audio frames are divided into two categories, noise and valid audio: if a training audio frame contains music or human voice, it is considered valid audio and its label is set to 1; otherwise it is considered noise and its label is set to 0. The neural network model is then trained with a cross-entropy criterion using a gradient descent algorithm with a learning rate of 0.001, and the trained neural network model is obtained when the network converges, that is, when the loss value no longer decreases.
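The cross-entropy training described above may be illustrated with a deliberately simplified stand-in: a logistic-regression "VAD" trained by gradient descent at the stated learning rate of 0.001 on synthetic 13-dimensional MFCC-like features. The disclosure itself uses a DNN or CNN; everything below (data, dimensions, iteration count) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spectral features: 13 MFCC-like coefficients per frame.
# Valid-audio frames (label 1) and noise frames (label 0) are drawn
# from two shifted Gaussians so the toy problem is separable.
X = np.vstack([rng.normal(0.5, 1, (200, 13)), rng.normal(-0.5, 1, (200, 13))])
y = np.concatenate([np.ones(200), np.zeros(200)])

w, b = np.zeros(13), 0.0

def forward(X):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))   # p_s per frame

def cross_entropy(p, y):
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

lr = 0.001  # learning rate stated in the text
losses = []
for _ in range(2000):
    p = forward(X)
    losses.append(cross_entropy(p, y))
    grad = p - y                     # d(loss)/d(logit) for cross-entropy
    w -= lr * (X.T @ grad) / len(y)  # gradient descent step
    b -= lr * grad.mean()

print(losses[0] > losses[-1])  # True: the loss decreases toward convergence
```

In practice training stops when the loss curve flattens, which is the convergence condition described in the text.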

In the above manner, the input audio may be framed to obtain a plurality of input audio frames, and spectral features may be extracted for each input audio frame in the same manner as for the training audio frames. Using the trained neural network model, the probability p_n that each input audio frame belongs to noise and the probability p_s that it belongs to a valid audio frame are calculated from its spectral features, such that p_n + p_s = 1 is satisfied. All input audio frames are traversed in this manner, and for each input audio frame it is determined whether the probability p_s of belonging to a valid audio frame is greater than or equal to a preset first threshold, which may generally be set to 0.2 empirically. If p_s ≥ 0.2 is satisfied, the corresponding input audio frame is judged to be a valid audio frame; otherwise, the corresponding input audio frame is judged to be noise. In practical applications, the value of the first threshold may also be set to other values between 0 and 1 according to actual requirements, and is not limited to the above exemplary value.

In S420, the number of valid audio frames may be counted, and the ratio of that number to the number of all audio frames may be calculated and compared with a preset second threshold. In S430, if the ratio is smaller than the second threshold, it is determined that the face information of the input audio indicates invalid audio, that is, the main component of the input audio is noise. At this time, for example, a text prompt "the corresponding music cannot be found" may be fed back to the user, and the recognition process may be terminated. The second threshold can be flexibly set according to actual requirements: for example, when there is a higher requirement on the input audio, the second threshold can be set to 0.08 or lower; conversely, the second threshold may be set to 0.09 or higher, which is not particularly limited in this example embodiment.
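The first-threshold and second-threshold checks of S410 to S430 may be sketched as follows (threshold values taken from the text; the function and variable names are illustrative):

```python
import numpy as np

def audio_profile(p_s, t1=0.2, t2=0.09):
    """Classify input audio as 'invalid' or 'valid' from per-frame
    valid-audio probabilities p_s (the output of the trained VAD model).

    t1: first threshold  - a frame with p_s >= t1 counts as valid.
    t2: second threshold - minimum ratio of valid frames to all frames.
    """
    valid = p_s >= t1
    ratio = float(valid.mean())   # valid frames / all frames
    return ("valid", ratio) if ratio >= t2 else ("invalid", ratio)

# 2 of 10 frames clear the first threshold -> ratio 0.2 >= 0.09 -> valid
print(audio_profile(np.array([0.9, 0.8, 0.1, 0.05, 0.1, 0.0, 0.1, 0.1, 0.0, 0.1])))
```

An audio whose ratio falls below t2 is rejected here without invoking any recognition sub-strategy, which is the computation saving described below.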

In the above embodiment, before the input audio is identified, it is first determined whether the audio belongs to an invalid audio, that is, whether the main component of the audio is noise, and if so, the identification process is terminated. Through the flow, the invalid audio is effectively prevented from being identified and processed by wasting the computing power, so that the overall computation of the music identification system is reduced.

According to an embodiment of the present disclosure, if the ratio is greater than or equal to the second threshold, in S440 the mean value p̄_s of the probabilities p_s of belonging to a valid audio frame is further calculated over the valid audio frames by the following formula:

p̄_s = (1/n) · Σ_{i=1}^{n} p_s^(i)

where n is a natural number greater than or equal to 1 representing the number of valid audio frames, and p_s^(i) represents the probability p_s of the i-th valid audio frame among the n valid audio frames.

After p̄_s is calculated, in S450 p̄_s may be compared with a preset third threshold p_v. If p̄_s ≥ p_v is satisfied, that is, the valid audio frames in the input audio reach a certain quality level, the face information of the input audio may be determined as high signal-to-noise ratio audio in S470; conversely, the face information of the input audio may be determined as low signal-to-noise ratio audio in S460. The third threshold p_v can be flexibly set according to actual requirements: for example, if higher music recognition accuracy is desired, p_v may take a smaller value, such as 0.5; and if a higher recall rate is desired, p_v may take a larger value, such as 0.8; this example embodiment is not particularly limited thereto.
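The S440 to S470 decision, computing the mean of p_s over the valid audio frames and comparing it with the third threshold p_v, may be sketched as follows (names are illustrative):

```python
import numpy as np

def snr_profile(p_s, t1=0.2, p_v=0.5):
    """Refine a valid audio's face information into high or low SNR.

    The mean of p_s over the n valid frames (those with p_s >= t1)
    is compared with the third threshold p_v.
    """
    valid_ps = p_s[p_s >= t1]   # the n valid audio frames
    mean_ps = valid_ps.mean()   # (1/n) * sum of p_s over valid frames
    return "high_snr" if mean_ps >= p_v else "low_snr"

print(snr_profile(np.array([0.9, 0.8, 0.7, 0.1, 0.05])))  # mean 0.8 -> high_snr
```

Raising p_v routes more inputs to the low-SNR cascade, which tries every sub-strategy in turn; this is consistent with the recall/accuracy trade-off described above.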

Through this embodiment, on the premise that the input audio is valid audio, its face information can be further refined into high signal-to-noise ratio audio or low signal-to-noise ratio audio, so that a corresponding recognition strategy flow can subsequently be adopted in a targeted manner according to the face information of the input audio, which helps improve the overall recognition accuracy of the music recognition system.

In step S330, when the face information indicates that the audio is a low snr audio, identifying the audio according to a low snr identification policy, and outputting a low snr identification result.

In this exemplary embodiment, if the face information of the input audio is determined to be a low snr audio, a corresponding low snr identification procedure is started to identify the audio, that is, a preset low snr identification policy is invoked to identify the input audio, and thus a low snr identification result is obtained.

According to an embodiment of the present disclosure, the low snr recognition strategy may include, for example, a plurality of recognition sub-strategies such as an audio fingerprinting strategy, a singing (cover-song) recognition strategy, and a humming recognition strategy. Besides the three recognition sub-strategies listed above, the low snr recognition strategy may also include other types of recognition sub-strategies, such as a rap recognition strategy and a dialect recognition strategy, according to actual needs. In the embodiments of the present disclosure, the three recognition sub-strategies, namely the audio fingerprinting strategy, the singing recognition strategy, and the humming recognition strategy, are taken as an example for explanation; and in the following description, for simplicity and readability, the audio fingerprinting strategy is referred to as sub-strategy 1, the singing recognition strategy as sub-strategy 2, and the humming recognition strategy as sub-strategy 3.

When the low signal-to-noise ratio recognition strategy is called to recognize the input audio, the three recognition sub-strategies can be invoked in sequence in the order "audio fingerprint recognition - singing recognition - humming recognition". Under normal conditions, the recognition accuracy decreases in turn from audio fingerprint recognition to singing recognition and then to humming recognition, so this order ensures that the recognition result output by the low signal-to-noise ratio recognition strategy is as accurate as possible. For example, as shown in fig. 5, at S510, sub-strategy 1 may be invoked to recognize the input audio, at least one recognized sub-result may be determined based on a preset music library, and the low signal-to-noise ratio similarity between each recognized sub-result and the corresponding music piece in the music library may be determined. For example, an audio fingerprint in the input audio may be extracted and matched against candidate music pieces in the preset music library; if the matching rate of the input audio with sections of candidate music pieces A and B respectively exceeds a preset threshold, two sub-results corresponding to candidate music pieces A and B may be determined, and the low signal-to-noise ratio similarities of the two sub-results with candidate music pieces A and B in the music library, for example 0.53 and 0.56 respectively, may be determined according to the matching rates.

At S520, the low snr similarity may be compared with a preset fourth threshold. The fourth threshold may be a value greater than 0 and less than 1, set according to actual requirements; for example, when the requirement for recognition accuracy is high, the fourth threshold may be set to a relatively large value, and when as many recognition candidates as possible need to be reported for the user's reference, the fourth threshold may be set to a relatively small value. In the example given above, if the fourth threshold is set to 0.6, the two recognized sub-results have insufficient low signal-to-noise ratio similarity with candidate music pieces A and B, and candidate music pieces A and B cannot be taken as the final recognition result; whereas if the fourth threshold is set to 0.4, the low signal-to-noise ratio similarity between the two recognized sub-results and candidate pieces A and B is sufficiently high, and candidate pieces A and B can then be taken as the finally recognized music result. Therefore, if the low snr similarity is greater than or equal to the fourth threshold, the flow goes to S580: the music corresponding to the sub-result recognized by sub-strategy 1 is taken as the finally recognized music, that is, the low snr recognition result, and invoking subsequent sub-strategies to recognize the input audio is stopped.

If the low snr similarity is smaller than the fourth threshold, in S530 sub-strategy 2 may be invoked to recognize the input audio, obtaining at least one corresponding recognition sub-result and its low snr similarity with the candidate music pieces based on the preset music library.

In S540, the low snr similarity is again compared with the fourth threshold; if it is greater than or equal to the fourth threshold, the flow goes to S580, the music corresponding to the sub-result recognized by sub-strategy 2 is taken as the finally recognized music, that is, the low snr recognition result, and invoking subsequent sub-strategies to recognize the input audio is stopped.

If the low snr similarity is smaller than the fourth threshold, in S550 sub-strategy 3 may be invoked to recognize the input audio, obtaining at least one corresponding recognition sub-result and its low snr similarity with the candidate music pieces based on the preset music library.

In S560, the low snr similarity is again compared with the fourth threshold; if it is greater than or equal to the fourth threshold, the flow goes to S580, the music corresponding to the sub-result recognized by sub-strategy 3 is taken as the finally recognized music, that is, the low snr recognition result, and the music recognition process is terminated.

If the low signal-to-noise ratio similarity is still smaller than the fourth threshold, the flow goes to S570: sub-strategy 1, sub-strategy 2, and sub-strategy 3 have all been invoked and have all failed to recognize a result that meets the requirement; in this case, a text prompt such as "the corresponding music cannot be found" may be returned to feed back the recognition failure to the user.
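The sequential cascade of S510 to S580 may be sketched as follows (a minimal sketch; the sub-strategies here are stand-in callables returning (candidate, similarity) pairs, and all names and values are illustrative):

```python
def recognize_low_snr(audio, sub_strategies, fourth_threshold=0.5):
    """Invoke the recognition sub-strategies in a fixed order
    (fingerprint -> cover-song -> humming) and stop at the first one
    whose best candidate reaches the fourth threshold.

    Each sub-strategy is a callable returning a list of
    (candidate_song, similarity) pairs; all names are illustrative.
    """
    for strategy in sub_strategies:
        results = [(song, sim) for song, sim in strategy(audio)
                   if sim >= fourth_threshold]
        if results:                      # similarity high enough: stop here
            return max(results, key=lambda r: r[1])
    return None                          # all sub-strategies failed (S570)

# Toy sub-strategies: fingerprinting misses, cover-song recognition hits.
fingerprint = lambda a: [("A", 0.3), ("B", 0.2)]
cover_song  = lambda a: [("C", 0.7)]
humming     = lambda a: [("D", 0.9)]     # never reached: cascade stops at C
print(recognize_low_snr("query", [fingerprint, cover_song, humming]))  # ('C', 0.7)
```

Returning None corresponds to the "the corresponding music cannot be found" prompt of S570.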

Through this embodiment, when the face information of the audio is determined to be low signal-to-noise ratio audio, a recognition strategy flow dedicated to low signal-to-noise ratio audio is set, namely sub-strategy 1, sub-strategy 2, and sub-strategy 3 are invoked in sequence, so that low signal-to-noise ratio audio, which is usually difficult to recognize, can be recognized effectively and a recognition result that is as accurate as possible can be output, improving the overall recognition accuracy of the music recognition system.

Among the above three recognition sub-strategies, the process of recognizing the input audio by sub-strategy 1 is described as an example. As shown in fig. 6, at S610 an audio fingerprint of the input audio may be extracted. An audio fingerprint refers to a unique data feature of a piece of audio; it may be extracted in the form of an identifier by a specific algorithm, which may include, but is not limited to, the Shazam algorithm, the Landmark algorithm, and so on.

After the audio fingerprint of the input audio is extracted, an audio fingerprint matching the extracted audio fingerprint may be retrieved in the candidate melody library at S620.

After the matching audio fingerprint is retrieved, in S630, the audio fingerprint sequence in each time frame of the input audio and the audio fingerprint sequence of the candidate melody may be matched along the time axis, and the number of times the input audio and the audio fingerprint sequence of the candidate melody are matched at each time point is counted.

At S640, candidate songs may be found and formed into a list according to the number of times of audio fingerprint matching, and the list may further include the number of times of matching corresponding to the candidate songs.

In S650, a higher or lower matching-count discrimination threshold may be selected according to the signal-to-noise ratio condition of the input audio (higher or lower signal-to-noise ratio), and each candidate song may be discriminated against the selected threshold to determine whether its matching count is above or below that threshold.

In S660, each candidate song whose matching count is higher than the discrimination threshold may be selected as a final recognition result, and the recognized song list may be output.
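The fingerprint-matching count and threshold discrimination of S620 to S660 may be sketched as follows (simple hash-set matching stands in for the time-aligned Shazam/Landmark matching; all data are illustrative):

```python
from collections import Counter

def match_fingerprints(query_fp, library, threshold):
    """Count, per candidate song, how many query fingerprint hashes
    match the song's fingerprint set, then keep the candidates whose
    match count exceeds the chosen discrimination threshold.

    query_fp: list of fingerprint hashes extracted from the input audio.
    library:  dict song_name -> set of fingerprint hashes.
    """
    counts = Counter()
    for h in query_fp:
        for song, hashes in library.items():
            if h in hashes:
                counts[song] += 1
    # candidate list with match counts, filtered by the threshold
    return sorted(((s, c) for s, c in counts.items() if c > threshold),
                  key=lambda x: -x[1])

library = {"song_A": {1, 2, 3, 4, 5}, "song_B": {4, 5, 6}}
print(match_fingerprints([1, 2, 3, 5, 9], library, threshold=2))  # [('song_A', 4)]
```

A higher threshold, chosen for clean high-SNR input, keeps only strongly matching candidates; a lower threshold tolerates the sparser matches of noisy input.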

Similarly, the remaining two recognition sub-strategies 2 and 3 may recognize the input audio according to their respective implementations, which are not described in detail in this example embodiment.

In step S340, when the face information indicates that the audio is a high snr audio, the audio is identified according to a high snr identification policy, and a high snr identification result is output.

In this exemplary embodiment, if the face information of the input audio is determined to be a high snr audio, a corresponding high snr identification procedure is started to identify the audio, that is, a preset high snr identification policy is invoked to identify the input audio, and thus a high snr identification result is obtained.

According to an embodiment of the present disclosure, the high snr recognition strategy may also include multiple recognition sub-strategies such as an audio fingerprinting strategy, a singing recognition strategy, and a humming recognition strategy, for example, and may also include other types of recognition sub-strategies as described above. The present example is still illustrated by taking three recognition sub-strategies including an audio fingerprinting strategy (sub-strategy 1), a singing recognition strategy (sub-strategy 2) and a humming recognition strategy (sub-strategy 3) as an example, but the recognition sub-strategies involved in practical applications may not be limited to the three listed recognition sub-strategies.

When the high snr identification strategy is applied to recognize the input audio, a scene classification process is first performed on the input audio, that is, it is determined which one or more of the audio fingerprint category, the singing category, and the humming category the input audio belongs to. As shown in fig. 7, for the audio input by the user, at S710 the face information of the input audio may first be analyzed in the manner described above; when the face information of the input audio is determined to be low snr audio, the flow goes to S730 to invoke the low snr identification policy described above to make a recognition decision on the input audio. When the face information of the input audio is determined to be high snr audio, the flow accordingly goes to S720 to perform scene classification on the input audio, so as to determine whether it belongs to one or more of the audio fingerprint category, the singing category, or the humming category. After the scene classification of the input audio is determined, the flow goes to S730, a corresponding high snr identification strategy is used to make a recognition decision on the input audio, and finally a recognition result is output.

Because the audio features of high signal-to-noise ratio audio are cleaner, the accuracy of classification recognition is generally higher. The scene classification process may be implemented as a machine learning classification process, which takes the audio signal or its spectral features as input and outputs the probabilities that the input audio belongs to the various categories. A classifier can be constructed, for example, based on a recurrent neural network (RNN) and trained in a supervised manner using labeled data. A scene classification process of the input audio will now be described with reference to fig. 8.

At S810, training audio for training the RNN-based classification model may be input.

At S820, spectral feature extraction may be performed on the training audio, and the extracted spectral features may be labeled with labels (x1, x2, x3). Here, the labels x1, x2, x3 may correspond to the audio fingerprint category, the singing category, and the humming category, respectively; each label takes the value 0 or 1, and each audio belongs to only one category at a time. That is, for a given training audio, its label takes one of the forms (1,0,0), (0,1,0), or (0,0,1), which respectively indicate that the training audio belongs to the audio fingerprint category, the singing category, or the humming category.

At S830, the annotated spectral feature data may be used as input data to the RNN, and supervised training is performed based on the RNN, and the output data is a probability that the training audio belongs to an audio fingerprint class, a singing class, or a humming class (y1, y2, y3), wherein a value range of y1, y2, or y3 is greater than 0 and less than 1. After the training is completed, a trained classification model is obtained.

At S840, audio to be classified, i.e., audio whose face information is high snr audio, may be input.

At S850, spectral feature extraction may be performed on the audio to be classified, and the extracted spectral features may be input into the trained classification model.

At S860, by applying the trained classification model, the audio to be classified may be classified in combination with the RNN and eventually a probability (y1, y2, y3) that the audio to be classified belongs to the audio fingerprint class, the singing class or the humming class is output. For example, for an audio to be classified, after the classification calculation of the classification model, the output probability may be (0.7,0.3,0.2), the probability y1 representing that the audio to be classified belongs to the audio fingerprint class is 0.7, the probability y2 belongs to the singing class is 0.3, and the probability y3 belongs to the humming class is 0.2.

The audio scene classification process is described above with an RNN as an exemplary embodiment because audio itself has a temporal distribution characteristic, and an RNN is suitable for processing sequential inputs whose elements at different time instants are related to one another. It should be noted that, besides an RNN, a classification model may also be constructed based on a deep neural network (DNN) to implement scene classification, which is not particularly limited in this example embodiment.

As shown in fig. 9, after the input audio is scene-classified in the above manner at S910 and the probabilities of the input audio belonging to the respective categories are obtained, at S920, the respective probabilities may be compared with the preset sub-policy thresholds of the respective audio categories, respectively. Wherein each audio class may correspond to an identification sub-strategy and a sub-strategy threshold, respectively, e.g. the audio fingerprint class may correspond to sub-strategy 1 and audio fingerprint threshold, the singing class may correspond to sub-strategy 2 and singing recognition threshold, and the humming class may correspond to sub-strategy 3 and humming recognition threshold. The sub-strategy thresholds may be set independently of each other according to actual requirements, for example, the audio fingerprint threshold, the singing recognition threshold and the humming recognition threshold may be set to 0.6, 0.3 and 0.3 respectively according to the recognition accuracy of each recognition sub-strategy. In this case, for example, the probabilities output according to the above example classification model are (0.7,0.3,0.2), then comparing each probability to the sub-policy threshold for each audio class may yield: the probability y1 that the input audio belongs to the audio fingerprint class is greater than the audio fingerprint threshold, the probability y2 that it belongs to the singing class is equal to the singing recognition threshold, and the probability y3 that it belongs to the humming class is less than the humming recognition threshold. Thus, it may be determined that the input audio belongs to the audio fingerprint category and also belongs to the singing category, but not to the humming category.

In this case, then at S930 and S940, sub-policy 1 and sub-policy 2 may be invoked accordingly to identify the input audio, while sub-policy 3 ceases to be invoked for identification. It should be noted that if in another example, the probability y3 of belonging to the humming category is 0.4 and is greater than the humming recognition threshold 0.3, then the sub-strategy 3 may be invoked to recognize the input audio at S950 accordingly.
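The comparison of the classifier's output probabilities with the per-category sub-strategy thresholds may be sketched as follows (the threshold values 0.6/0.3/0.3 and the example probabilities are taken from the text; names are illustrative):

```python
def select_sub_strategies(probs, thresholds=(0.6, 0.3, 0.3)):
    """Compare the classifier's per-category probabilities (y1, y2, y3)
    with the per-category sub-strategy thresholds and return which
    sub-strategies to invoke.

    Categories, in order: audio fingerprint, singing, humming.
    A probability >= its threshold selects the matching sub-strategy
    (the text treats equality, e.g. y2 = 0.3, as a match).
    """
    names = ("sub_strategy_1", "sub_strategy_2", "sub_strategy_3")
    return [n for n, p, t in zip(names, probs, thresholds) if p >= t]

# Classifier output from the example: fingerprint and singing selected,
# humming (0.2 < 0.3) skipped.
print(select_sub_strategies((0.7, 0.3, 0.2)))  # ['sub_strategy_1', 'sub_strategy_2']
```

Only the selected sub-strategies are then invoked, which is the per-category computation saving described in this step.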

In the recognition process, one or more recognition sub-results obtained by invoking sub-strategy 1 and sub-strategy 2 can be determined based on a preset music library, together with the high signal-to-noise ratio similarities between those sub-results and the corresponding music pieces in the music library. For example, two sub-results corresponding to candidate music pieces C and D may be determined using audio fingerprinting and singing (cover-song) recognition respectively, and it may be determined that the two sub-results have high signal-to-noise ratio similarities of 0.78 and 0.57 with candidate music pieces C and D in the music library, respectively.

At S960, the high snr similarity may be compared with a preset fifth threshold. The fifth threshold may be a value greater than 0 and less than 1, set according to actual requirements. In the above example, the fifth threshold may be set to 0.6; then 0.78 > 0.6, that is, the recognizer result obtained with sub-strategy 1 has a sufficiently high signal-to-noise ratio similarity with the candidate music piece C, whereas 0.57 < 0.6, that is, the high snr similarity between the recognizer result obtained with sub-strategy 2 and the candidate music piece D is insufficient. The candidate music piece C can therefore be taken as the finally recognized music piece, i.e., the high snr recognition result.
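The fifth-threshold check at S960 can be sketched as below; the function name is an illustrative assumption, and the candidate pieces and similarities are the example values from the text. A similarity strictly above the threshold is kept, matching the 0.78 > 0.6 comparison above.

```python
FIFTH_THRESHOLD = 0.6  # example value from the text

def filter_by_fifth_threshold(results, threshold=FIFTH_THRESHOLD):
    """Keep only (piece, similarity) pairs whose high-SNR similarity
    exceeds the fifth threshold."""
    return [(piece, sim) for piece, sim in results if sim > threshold]

# Recognizer results from sub-strategy 1 and sub-strategy 2.
results = [("C", 0.78), ("D", 0.57)]
print(filter_by_fifth_threshold(results))  # [('C', 0.78)]
```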

Through this embodiment, when the face information of the audio is determined to be high signal-to-noise ratio audio, a recognition strategy flow dedicated to high signal-to-noise ratio audio is applied: one or more of sub-strategy 1, sub-strategy 2 or sub-strategy 3 is invoked for recognition according to the probabilities that the input audio belongs to the audio fingerprint category, the singing category or the humming category. In this way, suitable recognition sub-strategies, rather than all of them, are selected according to the actual category of the input audio, which effectively reduces the overall computation of the music recognition system. At the same time, when the input audio may belong to more than one category, each corresponding recognition sub-strategy is invoked, which preserves the accuracy of the recognition result and thereby improves the overall recognition accuracy of the music recognition system.

In step S350, the identified music piece is determined based on the low signal-to-noise ratio identification result or the high signal-to-noise ratio identification result.

In the present exemplary embodiment, after the low snr identification result or the high snr identification result is determined in the above-mentioned manner, it may be used as the finally identified target music piece, so that related information of the music piece can be reported to the user.

According to a further embodiment of the present disclosure, as shown in fig. 9, at S970 the recognized music pieces may be ranked from high to low according to the determined low snr similarity or high snr similarity. For example, if music pieces E, F and G are recognized according to the low snr recognition strategy, with corresponding low snr similarities of 0.4, 0.6 and 0.5, the recognized pieces may be ranked F, G, E. Likewise, if the high snr recognition strategy recognizes music pieces H and I via sub-strategy 1, with corresponding high snr similarities of 0.6 and 0.8, and music pieces I and J via sub-strategy 3, with corresponding high snr similarities of 0.4 and 0.3, the recognized pieces may be ranked I (0.8), H, I (0.4), J.
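The ranking step at S970 is a plain descending sort on similarity. A minimal sketch, using the high-snr example values from the text (the function name is an assumption):

```python
def rank_by_similarity(results):
    """Sort (piece, similarity) pairs from highest to lowest similarity."""
    return sorted(results, key=lambda item: item[1], reverse=True)

# Sub-strategy 1 found H and I; sub-strategy 3 found I and J.
results = [("H", 0.6), ("I", 0.8), ("I", 0.4), ("J", 0.3)]
print(rank_by_similarity(results))
# [('I', 0.8), ('H', 0.6), ('I', 0.4), ('J', 0.3)]
```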

At S980, the sorted music pieces may also be merged and deduplicated. In the above example, the music pieces recognized according to the high snr recognition strategy may finally be merged and deduplicated to I, H, J; that is, the recognition result I (0.4) is removed as a duplicate. If there is no duplication among the pieces recognized according to either the low snr or the high snr recognition strategy, the merging and deduplication step may be skipped.
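The merge-and-deduplicate step at S980 can be sketched as keeping only the first (i.e., highest-similarity) occurrence of each piece in the already-sorted list. The function name is an illustrative assumption; the input is the sorted example from the text.

```python
def merge_and_dedup(ranked):
    """Given pieces sorted by descending similarity, drop any repeated
    piece, keeping its highest-similarity occurrence."""
    seen, merged = set(), []
    for piece, sim in ranked:
        if piece not in seen:
            seen.add(piece)
            merged.append((piece, sim))
    return merged

ranked = [("I", 0.8), ("H", 0.6), ("I", 0.4), ("J", 0.3)]
print(merge_and_dedup(ranked))  # [('I', 0.8), ('H', 0.6), ('J', 0.3)]
```

This yields the order I, H, J from the text, with the duplicate I (0.4) removed.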

Finally, at S990, the recognized music piece list may be output to the user based on the sorted and deduplicated recognition results, so that the user can select a desired music piece from the list.

Through this embodiment, recognition results with higher similarity are fed back to the user first and duplicate results are removed, so that the reported recognition results are concise and easy to read. The user can thus intuitively and quickly find the desired result, which improves the user-friendliness of the recognition system and the user experience.

Exemplary apparatus

Having introduced the music recognition method according to the exemplary embodiment of the present disclosure, a music recognition apparatus according to an exemplary embodiment of the present disclosure will next be described with reference to fig. 10. The apparatus embodiment may inherit the related descriptions from the method embodiment, which thus support the specific details of the apparatus embodiment.

Referring to fig. 10, the music recognition apparatus 1000 may be applied to a terminal device including an audio input apparatus, and the music recognition apparatus 1000 may include a receiving module 1010, an audio analysis module 1020, an audio recognition module 1030, and a recognition decision module 1040, wherein:

the receiving module 1010 may be configured to receive audio input by a user through an audio input device;

audio analysis module 1020 may be configured to analyze the audio and determine face information of the audio, the face information including invalid audio, low signal-to-noise ratio audio, or high signal-to-noise ratio audio;

the audio identification module 1030 may be configured to identify the audio according to a low snr identification policy and output a low snr identification result when the face information indicates that the audio is a low snr audio; when the face information indicates that the audio is the high signal-to-noise ratio audio, identifying the audio according to a high signal-to-noise ratio identification strategy, and outputting a high signal-to-noise ratio identification result; and

the recognition decision module 1040 may be configured to determine the recognized musical composition based on the low snr identification result or the high snr identification result.
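A minimal, self-contained sketch of how the four modules above could cooperate is shown below. All class and method names, the dict-based audio representation and the SNR cut-off of 10 dB are illustrative assumptions, not the actual implementation of the apparatus 1000.

```python
class MusicRecognitionApparatus:
    SNR_CUTOFF_DB = 10  # assumed boundary between low- and high-SNR audio

    def receive(self, raw_audio):
        # receiving module 1010: accept audio from the audio input device
        return raw_audio

    def analyze(self, audio):
        # audio analysis module 1020: determine the face information
        if not audio.get("samples"):
            return "invalid"
        return "high_snr" if audio["snr_db"] >= self.SNR_CUTOFF_DB else "low_snr"

    def recognize_low_snr(self, audio):
        # audio recognition module 1030: low-SNR recognition strategy
        return ("low_snr_result", audio["samples"])

    def recognize_high_snr(self, audio):
        # audio recognition module 1030: high-SNR recognition strategy
        return ("high_snr_result", audio["samples"])

    def decide(self, result):
        # recognition decision module 1040: determine the recognized piece
        return result

    def recognize(self, raw_audio):
        audio = self.receive(raw_audio)
        face_info = self.analyze(audio)
        if face_info == "invalid":
            return None
        if face_info == "low_snr":
            result = self.recognize_low_snr(audio)
        else:
            result = self.recognize_high_snr(audio)
        return self.decide(result)

print(MusicRecognitionApparatus().recognize({"samples": "clip", "snr_db": 20}))
# ('high_snr_result', 'clip')
```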

Since the functional modules of the music recognition apparatus in this embodiment of the present disclosure correspond to the steps of the method embodiment described above, they are not described again here.

Exemplary electronic device

Next, an electronic device of an exemplary embodiment of the present disclosure will be described. The electronic device of the exemplary embodiment of the present disclosure includes the music recognition apparatus described above.

As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."

In some possible embodiments, an electronic device according to the present disclosure may include at least one processing unit and at least one storage unit. The storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps of the music recognition method according to the various exemplary embodiments of the present disclosure described in the "methods" section above. For example, the processing unit may perform steps S310 to S350 as described in fig. 3.

An electronic device 1100 according to this embodiment of the disclosure is described below with reference to fig. 11. The electronic device 1100 shown in fig. 11 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.

As shown in fig. 11, the computer system 1100 includes a Central Processing Unit (CPU) 1101, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. The RAM 1103 also stores various programs and data necessary for system operation. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.

The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output section 1107 including a display such as a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), and a speaker; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a LAN card or a modem. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as necessary. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as necessary, so that a computer program read therefrom is installed into the storage section 1108 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1109 and/or installed from the removable medium 1111. When executed by the Central Processing Unit (CPU) 1101, the computer program performs the various functions defined in the method and system of the present application.

Exemplary program product

In some possible embodiments, various aspects of the present disclosure may also be implemented in the form of a program product including program code. When the program product is run on a terminal device, the program code causes the terminal device to perform the steps of the music recognition method according to the various exemplary embodiments of the present disclosure described in the "method" section above; for example, the terminal device may perform steps S310 to S350 as described in fig. 3.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical disk, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In addition, as technology advances, the term "readable storage medium" should be interpreted accordingly.

Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user computing device, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the Internet using an Internet service provider).

It should be noted that although several modules or sub-modules of the music recognition apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more of the modules described above may be embodied in one module. Conversely, the features and functions of one module described above may be further divided so as to be embodied by a plurality of modules.

Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed. The division into aspects is for convenience of description only and does not mean that features in those aspects cannot be combined to advantage. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
