Audio library generation method and device, electronic equipment and storage medium

Document No.: 1937565  Publication date: 2021-12-07

Description: This technology, "Audio library generation method and device, electronic equipment and storage medium", was created by Zhang Yifei and Kang Bin on 2021-04-25. Abstract: The application relates to the field of computer technology, and in particular to the field of artificial intelligence, and provides a method and an apparatus for generating an audio library, an electronic device, and a storage medium, which are used to improve the timeliness of the audio library. The method comprises: after the audio to be identified fails to match a first audio library, matching the audio to be identified against a second audio library, where the second audio library is established based on second audios that previously failed to match the first audio library; and if, among the successfully matched second audios, there is a target second audio whose accumulated number of successful matches reaches a preset threshold, transferring the target second audio to the first audio library. By storing audio that fails to match the first audio library in the second audio library, and transferring it to the first audio library once its accumulated match count reaches the preset threshold, the timeliness of the audio library is improved, which in turn enables accurate audio recommendation.

1. A method for generating an audio library, the method comprising:

matching the audio to be identified with a preset first audio library, and if the matching fails, matching the audio to be identified with a preset second audio library to obtain a target matching result; wherein the second audio library is established based on respective second audios that failed to match the first audio library;

if it is determined, based on the target matching result, that the audio to be identified successfully matches at least one second audio in the second audio library, and a target second audio whose accumulated number of successful matches reaches a preset threshold exists among the at least one second audio, obtaining the target second audio;

and transferring the target second audio, as a new first audio, to the first audio library.

2. The method of claim 1, further comprising:

and if it is determined, based on the target matching result, that the audio to be identified fails to match the second audio library, then, when the audio to be identified is determined to meet a preset audio detection condition, transferring the audio to be identified to the second audio library as a new second audio.

3. The method of claim 2, wherein determining that the audio to be identified meets a preset audio detection condition comprises at least one of:

if the audio to be identified comprises an audio clip of a specified type and the duration of the audio clip reaches a preset duration threshold value, determining that the audio to be identified meets the audio detection condition;

if the audio to be identified comprises an audio clip of a specified type and the duration proportion of the audio clip reaches a preset proportion threshold, determining that the audio to be identified meets the audio detection condition; wherein the duration proportion represents the ratio of the duration of the audio clip to the total duration of the audio to be identified;

and if at least one audio clip contained in the audio to be identified is of a specified type, determining that the audio to be identified meets the audio detection condition.

4. The method of claim 2, wherein determining that the audio to be identified meets a preset audio detection condition comprises:

inputting the audio to be identified into a trained target audio detection model to obtain a detection prediction value; wherein the target audio detection model is obtained by training an audio detection model to be trained based on an audio annotation data set;

and when the detection prediction value reaches a preset prediction threshold, determining that the audio to be identified meets the audio detection condition.

5. The method of any of claims 1-4, prior to matching the audio to be identified with a preset first audio library, further comprising:

acquiring multimedia content to be identified, and extracting the audio to be identified from the multimedia content;

after the target second audio is transferred to the first audio library as a new first audio, the method further comprises:

and taking the new first audio as an audio recognition result of the multimedia content, and recording a multimedia identifier of the multimedia content corresponding to the new first audio.

6. The method of claim 5, further comprising:

generating a respective first audio group based on every two first audios in the first audio library;

for each obtained first audio group, the following operations are respectively performed:

determining two first audios contained in one of the first audio groups;

acquiring a multimedia identifier set corresponding to each of the two first audios; each multimedia identifier represents one multimedia content successfully matched with the corresponding first audio;

and if the number of multimedia identifiers that appear in both of the two obtained multimedia identifier sets reaches a preset number threshold, associating the two first audios and generating a new audio identifier corresponding to them.

7. The method of claim 5, further comprising, after recording the multimedia identifier of the multimedia content corresponding to the new first audio:

in response to an input operation triggered in a client, determining a recommended audio set from the first audios based on the multimedia identifier set corresponding to each first audio contained in the first audio library; wherein each multimedia identifier represents one multimedia content successfully matched with the corresponding first audio;

presenting the set of recommended audio in the client.

8. The method of claim 7, wherein determining the recommended audio set from the first audios based on the multimedia identifier sets respectively corresponding to the first audios contained in the first audio library comprises:

acquiring a multimedia identification set corresponding to each first audio contained in the first audio library, and acquiring an evaluation value corresponding to each first audio based on each acquired multimedia identification set; wherein each evaluation value is used for representing the use state of the corresponding first audio;

and sorting the first audios by their corresponding evaluation values to obtain a target sequence, and selecting a set number of first audios in order from the target sequence as the recommended audio set.

9. The method of claim 8, wherein obtaining the evaluation value corresponding to each first audio based on the obtained multimedia identifier sets comprises:

acquiring interaction state information of each corresponding multimedia content based on the acquired multimedia identifier set corresponding to one first audio among the first audios;

obtaining the weight corresponding to each multimedia content based on the interaction state information of that multimedia content and a preset mapping between interaction state information and weights;

and obtaining the evaluation value of the first audio based on the weight corresponding to each multimedia content.

10. The method of any one of claims 1-4, wherein matching the audio to be identified with a preset first audio library comprises:

based on a preset audio fingerprint extraction algorithm, performing audio fingerprint extraction on the audio to be identified to obtain an audio fingerprint to be identified corresponding to the audio to be identified; the audio fingerprint to be identified is used for representing the audio characteristic corresponding to the audio to be identified;

based on the audio fingerprint extraction algorithm, performing audio fingerprint extraction on each first audio contained in the first audio library to obtain a first audio fingerprint corresponding to each first audio; wherein each first audio fingerprint is used for representing the audio characteristics of a corresponding one of the first audios;

respectively calculating the similarity between the audio fingerprint to be identified and each obtained first audio fingerprint;

and if the similarity between the audio fingerprint to be identified and at least one of the first audio fingerprints reaches a preset similarity threshold, determining that the audio to be identified is successfully matched with the first audio library.

11. An apparatus for generating an audio library, comprising:

a matching unit, configured to match the audio to be identified with a preset first audio library, and if the matching fails, match the audio to be identified with a preset second audio library to obtain a target matching result; wherein the second audio library is established based on respective second audios that failed to match the first audio library;

a determining unit, configured to obtain a target second audio if it is determined, based on the target matching result, that the audio to be identified successfully matches at least one second audio in the second audio library, and a target second audio whose accumulated number of successful matches reaches a preset threshold exists among the at least one second audio;

and a transfer unit, configured to transfer the target second audio, as a new first audio, to the first audio library.

12. The apparatus of claim 11, wherein the transfer unit is further configured to:

if it is determined, based on the target matching result, that the audio to be identified fails to match the second audio library, then, when the audio to be identified is determined to meet a preset audio detection condition, transfer the audio to be identified to the second audio library as a new second audio.

13. The apparatus according to claim 12, wherein, when determining that the audio to be identified meets a preset audio detection condition, the determining unit is configured to perform at least one of the following operations:

if the audio to be identified comprises an audio clip of a specified type and the duration of the audio clip reaches a preset duration threshold value, determining that the audio to be identified meets the audio detection condition;

if the audio to be identified comprises an audio clip of a specified type and the duration proportion of the audio clip reaches a preset proportion threshold, determining that the audio to be identified meets the audio detection condition; wherein the duration proportion represents the ratio of the duration of the audio clip to the total duration of the audio to be identified;

and if at least one audio clip contained in the audio to be identified is of a specified type, determining that the audio to be identified meets the audio detection condition.

14. An electronic device, comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of any of claims 1 to 10.

15. A computer-readable storage medium, characterized in that it comprises a computer program for causing an electronic device to carry out the steps of the method according to any one of claims 1 to 10, when said computer program is run on said electronic device.

Technical Field

The application relates to the technical field of computers, and provides a method and a device for generating an audio library, electronic equipment and a storage medium.

Background

With the rapid development of user-generated content (UGC), audio content (referred to simply as audio) has become an important component of multimedia content, for example as background music (BGM); the use of audio content plays an important role in tasks such as the classification, recommendation, and re-creation of multimedia content.

In the related art, an audio library is constructed in advance so that audio content can be used, for example identified. However, with this approach, audio content that does not exist in the audio library cannot be identified in time; in particular, when a large amount of multimedia content is uploaded simultaneously, it is difficult to identify newly uploaded audio content promptly, resulting in poor timeliness of the audio library.

Disclosure of Invention

The embodiment of the application provides a method and an apparatus for generating an audio library, an electronic device, and a storage medium, which are used for improving the timeliness of the audio library and ensuring that audio content not present in the audio library is identified in time.

In a first aspect, an embodiment of the present application provides a method for generating an audio library, including:

matching the audio to be identified with a preset first audio library, and if the matching fails, matching the audio to be identified with a preset second audio library to obtain a target matching result; wherein the second audio library is established based on respective second audios that failed to match the first audio library;

if it is determined, based on the target matching result, that the audio to be identified successfully matches at least one second audio in the second audio library, and a target second audio whose accumulated number of successful matches reaches a preset threshold exists among the at least one second audio, obtaining the target second audio;

and transferring the target second audio, as a new first audio, to the first audio library.
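The two-tier matching and promotion flow described above can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the class name `AudioLibraries`, the in-memory data structures, and the use of exact-key lookup in place of fingerprint matching are all assumptions made for brevity.

```python
# Illustrative sketch of the two-tier matching flow (assumption: libraries
# are in-memory collections keyed by an audio identifier/fingerprint string).
class AudioLibraries:
    def __init__(self, promotion_threshold=3):
        self.first_library = set()    # established first audios
        self.second_library = {}      # candidate second audio -> success count
        self.promotion_threshold = promotion_threshold

    def match(self, audio):
        """Return which library matched, promoting a second audio to the
        first library once its accumulated success count hits the threshold."""
        if audio in self.first_library:
            return "first"
        if audio in self.second_library:
            self.second_library[audio] += 1
            if self.second_library[audio] >= self.promotion_threshold:
                # Target second audio: transfer to the first audio library.
                self.first_library.add(audio)
                del self.second_library[audio]
            return "second"
        # Neither library matched: store as a new second audio (claim 2
        # additionally gates this step on an audio detection condition).
        self.second_library[audio] = 0
        return "none"
```

In this sketch the second library doubles as the promotion counter, which mirrors the claim's "accumulated number of successful matches reaching a preset threshold".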

In a second aspect, an embodiment of the present application provides an apparatus for generating an audio library, including:

a matching unit, configured to match the audio to be identified with a preset first audio library, and if the matching fails, match the audio to be identified with a preset second audio library to obtain a target matching result; wherein the second audio library is established based on respective second audios that failed to match the first audio library;

a determining unit, configured to obtain a target second audio if it is determined, based on the target matching result, that the audio to be identified successfully matches at least one second audio in the second audio library, and a target second audio whose accumulated number of successful matches reaches a preset threshold exists among the at least one second audio;

and a transfer unit, configured to transfer the target second audio, as a new first audio, to the first audio library.

Optionally, when determining that the audio to be identified meets a preset audio detection condition, the determining unit is configured to:

input the audio to be identified into a trained target audio detection model to obtain a detection prediction value; wherein the target audio detection model is obtained by training an audio detection model to be trained based on an audio annotation data set;

and when the detection prediction value reaches a preset prediction threshold, determine that the audio to be identified meets the audio detection condition.

Optionally, before matching the audio to be identified with the preset first audio library, the matching unit is further configured to:

acquiring multimedia content to be identified, and extracting the audio to be identified from the multimedia content;

and after the target second audio is transferred, as a new first audio, to the first audio library, the transfer unit is further configured to:

and taking the new first audio as an audio recognition result of the multimedia content, and recording a multimedia identifier of the multimedia content corresponding to the new first audio.

Optionally, the apparatus further includes an association unit, where the association unit is configured to:

generating a respective first audio group based on every two first audios in the first audio library;

for each obtained first audio group, the following operations are respectively performed:

determining two first audios contained in one of the first audio groups;

acquiring a multimedia identifier set corresponding to each of the two first audios; each multimedia identifier represents one multimedia content successfully matched with the corresponding first audio;

and if the number of multimedia identifiers that appear in both of the two obtained multimedia identifier sets reaches a preset number threshold, associating the two first audios and generating a new audio identifier corresponding to them.
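The association operation performed by the association unit amounts to counting the overlap between pairs of multimedia identifier sets. The sketch below is illustrative, assuming plain in-memory sets and a simple pairwise comparison rather than any particular claimed data structure.

```python
from itertools import combinations

def associate_audios(id_sets, count_threshold):
    """For each first audio group (every pair of first audios), associate
    the two audios when the multimedia identifiers appearing in both of
    their sets reach the preset number threshold.

    id_sets: dict mapping first-audio name -> set of multimedia identifiers.
    Returns the list of (audio_a, audio_b) pairs to be given a shared
    new audio identifier.
    """
    associated = []
    for a, b in combinations(sorted(id_sets), 2):
        # Identifiers in both sets = multimedia contents matching both audios.
        overlap = id_sets[a] & id_sets[b]
        if len(overlap) >= count_threshold:
            associated.append((a, b))
    return associated
```

A large overlap suggests the two library entries are in fact the same underlying audio (for example, two fingerprints of the same song), which is why they are merged under one new identifier.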

Optionally, the apparatus further includes a recommending unit, where the recommending unit is configured to:

in response to an input operation triggered in a client, determine a recommended audio set from the first audios based on the multimedia identifier set corresponding to each first audio contained in the first audio library; wherein each multimedia identifier represents one multimedia content successfully matched with the corresponding first audio;

presenting the set of recommended audio in the client.

Optionally, when determining the recommended audio set from the first audios based on the multimedia identifier sets corresponding to the first audios in the first audio library, the recommending unit is specifically configured to:

acquiring a multimedia identification set corresponding to each first audio contained in the first audio library, and acquiring an evaluation value corresponding to each first audio based on each acquired multimedia identification set; wherein each evaluation value is used for representing the use state of the corresponding first audio;

and sort the first audios by their corresponding evaluation values to obtain a target sequence, and select a set number of first audios in order from the target sequence as the recommended audio set.

Optionally, when obtaining the evaluation value corresponding to each first audio based on the obtained multimedia identifier sets, the recommending unit is specifically configured to:

acquire interaction state information of each corresponding multimedia content based on the acquired multimedia identifier set corresponding to one first audio among the first audios;

obtain the weight corresponding to each multimedia content based on the interaction state information of that multimedia content and a preset mapping between interaction state information and weights;

and obtaining the evaluation value of the first audio based on the weight corresponding to each multimedia content.
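The evaluation-value computation described above can be sketched as a weighted sum over the multimedia contents that matched each first audio, followed by ranking. The weight mapping `WEIGHT_MAP` and the coarse interaction-state buckets below are illustrative assumptions, since the claims leave the preset mapping unspecified.

```python
# Hypothetical preset mapping from an interaction-state bucket to a weight;
# the real mapping is a preset of the claimed method, not defined here.
WEIGHT_MAP = {"low": 1.0, "medium": 2.0, "high": 4.0}

def evaluation_value(multimedia_ids, interaction_state):
    """Score one first audio from the interaction state of the multimedia
    contents (keyed by multimedia identifier) that matched it."""
    return sum(WEIGHT_MAP[interaction_state[mid]] for mid in multimedia_ids)

def recommend(audio_to_ids, interaction_state, top_n):
    """Sort first audios by evaluation value and keep the top_n as the
    recommended audio set."""
    scores = {audio: evaluation_value(ids, interaction_state)
              for audio, ids in audio_to_ids.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Here a first audio used by many highly-interacted-with multimedia contents accumulates a high evaluation value and rises in the target sequence.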

Optionally, when the audio to be identified is matched with a preset first audio library, the matching unit is specifically configured to:

based on a preset audio fingerprint extraction algorithm, performing audio fingerprint extraction on the audio to be identified to obtain an audio fingerprint to be identified corresponding to the audio to be identified; the audio fingerprint to be identified is used for representing the audio characteristic corresponding to the audio to be identified;

based on the audio fingerprint extraction algorithm, performing audio fingerprint extraction on each first audio contained in the first audio library to obtain a first audio fingerprint corresponding to each first audio; wherein each first audio fingerprint is used for representing the audio characteristics of a corresponding one of the first audios;

respectively calculating the similarity between the audio fingerprint to be identified and each obtained first audio fingerprint;

and if the similarity between the audio fingerprint to be identified and at least one of the first audio fingerprints reaches a preset similarity threshold, determine that the audio to be identified is successfully matched with the first audio library.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the steps of the audio library generation method.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which includes a computer program, and when the computer program runs on an electronic device, the computer program is configured to enable the electronic device to execute the steps of the audio library generation method described above.

In the embodiment of the application, after the audio to be identified fails to match the first audio library, it is matched against the second audio library, where the second audio library is established based on second audios that failed to match the first audio library; if, among the successfully matched second audios, there is a target second audio whose accumulated number of successful matches reaches the preset threshold, the target second audio is transferred to the first audio library.

In this way, after failing to match the first audio library, the audio to be identified can be stored in the second audio library as a second audio, improving its subsequent matching success rate. In addition, transferring the target second audio to the first audio library improves the timeliness of the first audio library, ensures that audio not yet present in the first audio library is identified in time, and further improves the matching success rate of audio to be identified.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1A is a schematic diagram of a possible application scenario provided in an embodiment of the present application;

fig. 1B is a schematic diagram of another possible application scenario provided in the embodiment of the present application;

fig. 1C is a schematic diagram of a blockchain provided in an embodiment of the present application;

fig. 1D is a flowchart of a block generation method provided in an embodiment of the present application;

fig. 2 is a schematic flowchart of a method for generating an audio library provided in an embodiment of the present application;

fig. 3A is a schematic flowchart of a matching method of an audio to be recognized and a first audio library provided in an embodiment of the present application;

fig. 3B is a logic diagram illustrating a matching process of the audio to be recognized and the first audio library provided in the embodiment of the present application;

fig. 4 is a logic diagram illustrating a case that matching between the audio to be recognized and the second audio library is successful according to an embodiment of the present application;

FIG. 5 is a logic diagram illustrating a case where matching between an audio to be identified and a second audio library fails according to an embodiment of the present application;

FIG. 6 is a schematic diagram of logic for acquiring audio to be recognized according to an embodiment of the present application;

fig. 7A is a schematic diagram of a multimedia identifier set provided in an embodiment of the present application;

FIG. 7B is a diagram of a first audio library provided in an embodiment of the present application;

fig. 8 is a flowchart illustrating an audio recommendation method based on a first audio library provided in an embodiment of the present application;

fig. 9A is a schematic diagram of an application operation interface provided in an embodiment of the present application;

FIG. 9B is a schematic representation of a presentation of a set of recommended audios provided in an embodiment of the present application;

FIG. 10 is a schematic diagram illustrating generation of a popular audio library according to an embodiment of the present application;

fig. 11 is a schematic structural diagram of an audio library generation apparatus provided in an embodiment of the present application;

fig. 12 is a schematic diagram of a hardware component structure of an electronic device provided in an embodiment of the present application;

fig. 13 is a schematic diagram of a hardware composition structure of a terminal device provided in an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, embodiments of the technical solutions of the present application. All other embodiments obtained by a person of ordinary skill in the art without inventive effort, based on the embodiments described in the present application, fall within the scope of protection of the present application.

Some concepts related to the embodiments of the present application are described below.

1. Multimedia content. Multimedia content in the embodiments of the present application is content combining two or more media, including but not limited to text, data, images, animation, and audio.

2. Audio content. The audio content in the embodiment of the present application includes, but is not limited to, background music, special effect sound, and the like, wherein the special effect sound may be laughing, cheering, and the like. For convenience of description, the audio content will be hereinafter simply referred to as audio.

3. Audio fingerprint: the audio signal is transformed into a time-frequency spectrogram, for example using the Fast Fourier Transform (FFT), and an audio fingerprint representing the identity of the audio content is constructed from the statistical features of the time-frequency peaks in the spectrogram. The similarity of the audio fingerprints reflects the similarity of the two audio contents; when it exceeds a similarity threshold, the two audio contents can be judged to be the same.
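A minimal illustration of this kind of fingerprint, assuming only NumPy: each frame is transformed with the FFT, the peak frequency bin per frame is kept as the fingerprint, and two fingerprints are compared by the fraction of agreeing bins. Production systems use far more robust peak-constellation schemes; this sketch only shows the spectrogram-peak idea, and the frame size is an arbitrary assumption.

```python
import numpy as np

def fingerprint(signal, frame_size=256):
    """Per-frame peak frequency bin of the magnitude spectrum
    (a crude stand-in for a spectrogram-peak audio fingerprint)."""
    n_frames = len(signal) // frame_size
    frames = signal[:n_frames * frame_size].reshape(n_frames, frame_size)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return np.argmax(spectra, axis=1)          # one peak bin per frame

def similarity(fp_a, fp_b):
    """Fraction of frames whose peak bins agree."""
    n = min(len(fp_a), len(fp_b))
    return float(np.mean(fp_a[:n] == fp_b[:n]))
```

Because the fingerprint keeps only spectral peaks, small perturbations of the waveform (mild noise, re-encoding) leave most peak bins unchanged, which is what makes threshold-based matching workable.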

The embodiments of the present application relate to Artificial Intelligence (AI) and machine learning technologies, and are designed based on speech technology and Machine Learning (ML) within AI.

Artificial intelligence is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence: to perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.

Artificial intelligence is a comprehensive discipline covering a wide range of fields, at both the hardware and the software level. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied across all fields of AI. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.

In the embodiment of the application, a machine-learning audio detection model is used when determining whether the audio to be identified meets the preset audio detection condition. Training such a model can be divided into two parts: a training part and an application part. In the training part, the audio detection model is trained with machine learning techniques: the audio annotation data set given in the embodiment of the application serves as the training data set; training data from this set are input into the audio detection model, the model's output is obtained, and the model parameters are continuously adjusted by an optimization algorithm in combination with that output. In the application part, the trained audio detection model is used to detect the audio to be identified and obtain its detection prediction value. It should also be noted that, in the embodiment of the present application, the audio detection model may be trained online or offline, which is not limited here.
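The application part described above reduces to scoring the audio with the trained model and comparing the detection prediction value against the preset prediction threshold. The sketch below is illustrative only and treats the trained model as an arbitrary callable returning a score in [0, 1], since the model architecture is not specified in the application.

```python
def meets_detection_condition(audio_features, model, prediction_threshold=0.5):
    """Apply a trained audio detection model and compare its detection
    prediction value against the preset prediction threshold.

    model: any callable mapping audio features to a score in [0, 1]
    (a stand-in for the trained target audio detection model).
    """
    prediction_value = model(audio_features)
    return prediction_value >= prediction_threshold
```

Only audio whose prediction value reaches the threshold is transferred to the second audio library as a new second audio, which filters out content that is unlikely to be reusable audio.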

Cloud technology refers to a hosting technology that unifies resources such as hardware, software, and networks within a wide area network or a local area network, so as to realize computation, storage, processing, and sharing of data.

Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied under the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems require large amounts of computing and storage resources, for example video websites, image websites, and web portals. With the rapid development of the internet industry, each item may come to have its own identification mark that needs to be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industry data require strong backend system support, which can only be realized through cloud computing.

Big data refers to a data set that cannot be captured, managed and processed by conventional software tools within a certain time range; it is a massive, high-growth-rate and diversified information asset whose stronger decision-making power, insight discovery power and process optimization capability can only be realized through new processing modes. With the advent of the cloud era, big data has attracted more and more attention; it requires special techniques to effectively process a large amount of data within a tolerable elapsed time. Technologies suitable for big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet and scalable storage systems.

With the rapid development of UGC (user-generated content) multimedia, audio content has become an important component of multimedia content, for example in the form of BGM, sound effects, and the like. Audio content plays an important role in the use of multimedia content, such as in the classification, recommendation and re-creation of multimedia content.

In the related art, an audio library is pre-constructed in order to use audio content: for example, the audio content is identified, and hot audio recommendation, multimedia content classification and the like are then performed based on the identification result. However, with this method, audio content that does not exist in the audio library cannot be identified in time; in particular, when a large amount of multimedia content is uploaded simultaneously, it is difficult to identify newly uploaded audio content in time, which results in poor timeliness of the audio library.

Because of this poor timeliness, audio content that does not yet exist in the audio library cannot be identified in time.

Therefore, after the audio to be recognized fails to match the hot audio library, it can be stored in a temporary audio library as temporary audio, thereby improving the matching success rate of subsequent audio to be recognized. In addition, temporary audio whose accumulated number of successful matches reaches a set threshold value is transferred to the hot audio library, which improves the timeliness of the hot audio library, ensures that audio not yet in the hot audio library is identified in time, and further improves the matching success rate of the audio to be recognized.

The preferred embodiments of the present application will be described in conjunction with the drawings of the specification, it should be understood that the preferred embodiments described herein are for purposes of illustration and explanation only and are not intended to limit the present application, and features of the embodiments and examples of the present application may be combined with each other without conflict.

Fig. 1A is a schematic diagram illustrating a possible application scenario in an embodiment of the present application. In this application scenario, a terminal device 110, a server 120 and a data storage node 130 are included. The terminal device 110, the server 120 and the data storage node 130 communicate with each other via a communication network.

In one possible embodiment, the communication network is a wired network or a wireless network. The terminal device 110, the server 120, and the data storage node 130 may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

A user logs in to an application operation interface through the terminal device 110, and the terminal device 110 uploads multimedia content to a multimedia service system deployed on the server 120 in response to an operation triggered by the user on the application operation interface, so that the server 120 generates an audio library based on the multimedia content uploaded by the terminal device 110. For example, based on the multimedia content uploaded by the terminal device 110, the server obtains hot audio to establish a hot audio library, and obtains temporary audio to establish a temporary audio library. Illustratively, after the terminal device 110 responds to a user operation, it can receive and present the recommended audio set returned by the server 120.

In this embodiment of the application, the application may be social software, such as instant messaging software and short video software, and may also be an applet, a web page, and the like, which is not limited herein. The terminal device 110 needs to have an application installed thereon, where the application may be software, or an application such as a web page or an applet, and the server 120 is a server corresponding to the software, or the web page or the applet.

In the embodiment of the present application, the terminal device 110 is an electronic device used by a user, and the electronic device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. Each terminal device 110 is connected to the server 120 through a wireless network, and the server 120 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

The data storage node 130 communicates with the server 120 through a communication network, and the data storage node 130 is used for storing hot audio obtained by the server 120 and for storing temporary audio obtained by the server 120.

In one possible embodiment, the data storage node 130 may store data in the form of a database. The hot audio and the temporary audio may be stored in the same database, or may be stored in different databases, which is not limited in this application.

In another possible implementation, referring to FIG. 1B, the data storage node 130 may store data in the form of a data sharing system 140.

The data sharing system 140 refers to a system for sharing data between nodes. The data sharing system may include a plurality of nodes 141, and the plurality of nodes 141 may refer to the respective clients in the data sharing system. Each node 141 may receive input information during normal operation and maintain shared data within the data sharing system based on the received input information. In order to ensure information intercommunication in the data sharing system, an information connection may exist between any two nodes in the data sharing system, and information can be transmitted between the nodes through the information connection. For example, when any node in the data sharing system receives input information, the other nodes in the data sharing system acquire the input information according to a consensus algorithm and store it as data in the shared data, so that the data stored on all the nodes in the data sharing system are consistent.

Each node in the data sharing system has a corresponding node identifier, and each node may store the node identifiers of the other nodes in the data sharing system, so that a generated block can subsequently be broadcast to the other nodes according to their node identifiers. Each node may maintain a node identifier list as shown in the following table, storing node names and node identifiers correspondingly. The node identifier may be an Internet Protocol (IP) address or any other information that can be used to identify the node; table 1 takes the IP address as an example.

TABLE 1 node identification List

Node name Node identification
Node 1 117.114.151.174
Node 2 117.116.189.145
…… ……
Node N 119.123.789.258

Each node in the data sharing system stores one identical blockchain. The blockchain is composed of a plurality of blocks. Referring to fig. 1C, the starting block includes a block header and a block body: the block header stores the input information characteristic value, the version number, the timestamp and the difficulty value, and the block body stores the input information. The next block takes the starting block as its parent block and likewise comprises a block header and a block body; its block header stores the input information characteristic value of the current block, the block header characteristic value of the parent block, the version number, the timestamp, the difficulty value, and the like. In this way, the block data stored in each block of the blockchain is associated with the block data stored in its parent block, which ensures the security of the input information in the blocks.

When each block in the blockchain is generated, referring to fig. 1D: when the node where the blockchain is located receives input information, it verifies the input information; after the verification is completed, the input information is stored in the memory pool and the hash tree recording the input information is updated. The node then updates the update timestamp to the time at which the input information was received and tries different random numbers, calculating the characteristic value multiple times, until the calculated characteristic value satisfies the following formula:

SHA256(SHA256(version+prev_hash+merkle_root+ntime+nbits+x))<TARGET

wherein SHA256 is the characteristic value algorithm used to calculate the characteristic value; version is the version information of the relevant block protocol in the blockchain; prev_hash is the block header characteristic value of the parent block of the current block; merkle_root is the characteristic value of the input information; ntime is the update time of the update timestamp; nbits is the current difficulty, which is a fixed value over a period of time and is re-determined after a fixed time period elapses; x is the random number; and TARGET is the characteristic value threshold, which can be determined from nbits.

Therefore, when a random number satisfying the above formula is obtained through calculation, the information can be stored accordingly, and the block header and block body are generated to obtain the current block. The node where the blockchain is located then sends the newly generated block to the other nodes in its data sharing system according to their node identifiers; the other nodes verify the newly generated block and, after the verification is completed, add it to the blockchains they store.
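The block-generation search described above can be sketched as follows. This is a minimal illustration of the double-SHA256 formula, not the actual node implementation: the field values are hypothetical, and TARGET is approximated here by a required leading-zero prefix on the hexadecimal digest.

```python
import hashlib

def block_hash(version, prev_hash, merkle_root, ntime, nbits, nonce):
    # Double SHA-256 over the concatenated header fields, following the
    # formula SHA256(SHA256(version + prev_hash + merkle_root + ntime + nbits + x)).
    payload = f"{version}{prev_hash}{merkle_root}{ntime}{nbits}{nonce}".encode()
    return hashlib.sha256(hashlib.sha256(payload).digest()).hexdigest()

def mine(version, prev_hash, merkle_root, ntime, nbits, target_prefix="0000"):
    # Try successive random numbers (here, an incrementing nonce) until the
    # characteristic value falls below TARGET, approximated by a zero prefix.
    nonce = 0
    while True:
        h = block_hash(version, prev_hash, merkle_root, ntime, nbits, nonce)
        if h.startswith(target_prefix):
            return nonce, h
        nonce += 1
```

A longer `target_prefix` corresponds to a higher difficulty (smaller TARGET), so the expected number of attempts grows exponentially with the prefix length.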

Referring to fig. 2, which is a schematic flowchart of the method for generating an audio library provided in the embodiment of the present application. The method is applied to a generating device of an audio library, where the generating device may be the server 120 or a device deployed in the server 120. The specific implementation flow of the method is as follows:

S201, the generating device matches the audio to be recognized with a preset first audio library; if the matching fails, the audio to be recognized is matched with a preset second audio library to obtain a target matching result, wherein the second audio library is created based on the respective second audios that failed to match the first audio library.

Specifically, when S201 is executed, the following steps may be adopted, but not limited to:

S2011, the generating device matches the audio to be identified with the preset first audio library to obtain a first matching result.

In the embodiment of the present application, the first audio library may also be referred to as a hot audio library, and the second audio library may also be referred to as a temporary audio library.

In some embodiments, referring to fig. 3A, in order to improve the matching efficiency, when performing S2011, the following steps may be adopted:

s20111, the generating device extracts the audio fingerprint of the audio to be identified based on a preset audio fingerprint extraction algorithm, and obtains the audio fingerprint to be identified corresponding to the audio to be identified.

The audio fingerprint to be identified is used to represent the audio characteristics corresponding to the audio to be identified. In the embodiment of the present application, the audio features include, but are not limited to, one or more of the zero-crossing rate, short-term energy, short-term average amplitude difference, spectrogram, short-term power spectral density, spectral entropy, fundamental frequency, formants, and the like. The zero-crossing rate, for example, refers to the number of times the sign of the signal changes in each frame, e.g., from positive to negative or from negative to positive.

In the embodiment of the present application, the audio fingerprint extraction algorithm may employ, but is not limited to, the Echoprint algorithm, the Landmark algorithm, the Chromaprint algorithm, and the like.

Taking the audio to be identified as BGM-A as an example, the generating device extracts the audio fingerprint of BGM-A by adopting the Echoprint algorithm to obtain the audio fingerprint to be identified corresponding to BGM-A, where the audio fingerprint to be identified represents, for example, that the zero-crossing rate of the first frame of BGM-A is 9 and the zero-crossing rate of the second frame is 10.
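The zero-crossing rate feature mentioned above can be sketched as follows. This is a minimal illustration with hypothetical sample values; real fingerprint algorithms such as Echoprint combine many features beyond this one.

```python
def zero_crossing_rate(frame):
    # Count sign changes between consecutive samples in one frame
    # (positive-to-negative or negative-to-positive), per the text above.
    crossings = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            crossings += 1
    return crossings
```

For example, a frame whose samples alternate in sign, such as `[1, -1, 1, -1]`, yields a zero-crossing rate of 3, while a frame that stays positive yields 0.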

S20112, the generating device extracts the audio fingerprints of the first audios contained in the first audio library based on the audio fingerprint extraction algorithm, and obtains the first audio fingerprints corresponding to the first audios.

Wherein each first audio fingerprint is used for characterizing the audio characteristics of a corresponding one of the first audios.

For example, referring to fig. 3B, it is assumed that the first audio library includes BGM1, BGM2, and BGM3, and the generating device performs audio fingerprint extraction on BGM1, BGM2, and BGM3 included in the first audio library to obtain audio fingerprints corresponding to BGM1, BGM2, and BGM3, respectively.

S20113, the generating device respectively calculates the similarity between the audio fingerprint to be identified and each obtained first audio fingerprint.

It should be noted that, in the embodiment of the present application, the similarity between the audio fingerprint to be identified and one first audio fingerprint is used to characterize the number of frames in which the audio fingerprint to be identified and that first audio fingerprint contain the same frame-level fingerprint.

For example, referring to fig. 3B, the generating device calculates the similarity between the audio fingerprint to be recognized and the audio fingerprints of BGM1, BGM2, and BGM3, respectively, where the similarity 1 between the audio fingerprint to be recognized and the audio fingerprint of BGM1 is 50%, the similarity 2 between the audio fingerprint to be recognized and the audio fingerprint of BGM2 is 80%, and the similarity 3 between the audio fingerprint to be recognized and the audio fingerprint of BGM3 is 90%.
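The frame-overlap similarity used in S20113 can be sketched as follows. This is a simplified stand-in, assuming each fingerprint is a sequence of per-frame hash values and that frames are compared position by position; production fingerprint matchers also handle time offsets.

```python
def fingerprint_similarity(fp_a, fp_b):
    # Fraction of frame positions at which the two fingerprints carry the
    # same per-frame hash, normalized by the longer fingerprint.
    if not fp_a or not fp_b:
        return 0.0
    matches = sum(1 for a, b in zip(fp_a, fp_b) if a == b)
    return matches / max(len(fp_a), len(fp_b))
```

With this definition, two fingerprints agreeing on 3 of 4 frames score 0.75, analogous to the percentage similarities in the BGM1/BGM2/BGM3 example above.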

S20114, the generating device judges whether, among the first audio fingerprints, there is at least one first audio fingerprint whose similarity with the audio fingerprint to be identified reaches a preset similarity threshold value; if so, S20115 is executed, and if not, S20116 is executed.

S20115, the generating device obtains a first matching result, and the first matching result represents that the audio to be identified is successfully matched with a preset first audio library.

For example, assuming that the preset similarity threshold value is 85%, at this time, the similarities between the audio fingerprint to be recognized and the audio fingerprints of the BGM1 and the BGM2 do not reach 85%, and the similarity between the audio fingerprint to be recognized and the audio fingerprint of the BGM3 reaches 85%, the generating device determines that a first audio fingerprint, the similarity of which with the audio fingerprint to be recognized reaches the preset similarity threshold value, exists in the audio fingerprints of the BGM1, the BGM2, and the BGM3, and obtains a first matching result, where the first matching result represents that the BGM-a and the first audio library are successfully matched.

S20116, the generating device obtains a first matching result, and the first matching result represents that the audio to be identified fails to be matched with a preset first audio library.

For example, assuming that the preset similarity threshold is 95%, at this time, the similarity between the audio fingerprint to be recognized and the first audio fingerprints of BGM1, BGM2, and BGM3 does not reach 95%, the generating device determines that, in the audio fingerprints of BGM1, BGM2, and BGM3, there is no first audio fingerprint whose similarity with the audio fingerprint to be recognized reaches the preset similarity threshold, and obtains a first matching result, where the first matching result represents that the matching between BGM-a and the first audio library fails.
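The decision of S20114 to S20116 can be sketched as follows, assuming the per-audio similarities have already been computed; the similarity values and thresholds mirror the examples above and are otherwise hypothetical.

```python
def match_audio(query_sims, threshold):
    # query_sims: {library_audio_id: similarity with the audio to be identified}.
    # Matching succeeds iff at least one similarity reaches the preset
    # similarity threshold value (S20115); otherwise it fails (S20116).
    hits = [aid for aid, sim in query_sims.items() if sim >= threshold]
    return (len(hits) > 0), hits
```

With similarities of 50%, 80% and 90% for BGM1, BGM2 and BGM3, a threshold of 85% yields a successful match (BGM3), while a threshold of 95% yields a failed match, as in the two examples above.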

S2012, the generating device judges, based on the first matching result, whether the matching is successful; if so, S2011 is executed for the next audio to be identified, and if not, S2013 is executed.

S2013, the generating device matches the audio to be recognized with the preset second audio library to obtain a target matching result.

In the embodiment of the application, the second audio library is established based on each second audio which fails to be matched with the first audio library.

The matching process of the audio to be recognized and the preset second audio library is the same as the matching process of the audio to be recognized and the preset first audio library, and therefore the details are not repeated herein.

S202, if the generating device determines, based on the target matching result, that the audio to be identified is successfully matched with at least one second audio in the second audio library, and that among the at least one second audio there is a target second audio whose accumulated number of successful matches reaches a preset threshold value, the target second audio is obtained.

For example, assuming that the preset threshold value is 10 and the second audio library contains BGM4, if the generating device determines based on the target matching result that BGM-A matches BGM4 in the second audio library successfully, and the accumulated number of successful matches of BGM4 reaches 10, then BGM4 is taken as the target second audio.

For another example, assuming that the preset threshold value is 10 and the second audio library includes BGM4 and BGM5, if the generating device determines based on the target matching result that BGM-A is successfully matched with both BGM4 and BGM5, and the accumulated number of successful matches of BGM4 reaches 10 while that of BGM5 does not, then BGM4 is taken as the target second audio.

S203, the generating device transfers the target second audio to the first audio library as a new first audio.

For example, the generating means may dump the BGM4 to the first audio library as a new first audio.

Next, S201 to S203 will be described with a specific example.

Taking the audio to be recognized as BGM-B as an example, referring to fig. 4, if BGM-B fails to match the first audio library, the generating device matches BGM-B with the second audio library; if BGM-B succeeds in matching with BGM-X in the second audio library and the accumulated number of successful matches of BGM-X reaches the preset threshold value, the generating device transfers BGM-X to the first audio library as a new first audio.

In some embodiments, if the generating device determines, based on the target matching result, that the audio to be recognized fails to match the second audio library, then, when it determines that the audio to be recognized meets the preset audio detection condition, it saves the audio to be recognized to the second audio library as a new second audio.

Specifically, the following two ways, but not limited to, may be adopted to determine that the audio to be identified meets the preset audio detection condition:

mode A:

the generation device determines that the audio to be identified meets a preset audio detection condition, and comprises at least one of the following operations:

In operation a1, if the audio to be recognized includes an audio segment of the specified type, and the duration of the audio segment reaches a preset duration threshold value, the generating device determines that the audio to be recognized meets the audio detection condition.

In the embodiment of the present application, the audio to be identified may include, but is not limited to, one or more of the following types of audio clips: speech, singing, music, silence, noise, machine sound, environmental sound, and so forth.

For example, assume that the specified type of audio segment is an audio segment of the music or singing type, the preset duration threshold value is 3 minutes, and BGM-A includes an audio segment 1 of the speech type and an audio segment 2 of the music type, where the duration of audio segment 2 is 4 minutes. At this time, the duration of audio segment 2 in BGM-A reaches the preset duration threshold value of 3 minutes, and the generating device determines that BGM-A meets the audio detection condition.

In operation a2, if the audio to be recognized includes an audio segment of the specified type, and the duration ratio of the audio segment reaches a preset ratio threshold, the generating device determines that the audio to be recognized meets the audio detection condition.

The duration proportion is used for representing the ratio of the duration of the audio segments to the total duration of the audio to be identified.

For example, assume that the specified type of audio segment is an audio segment of the music or singing type, the preset proportion threshold value is 80%, and BGM-A includes an audio segment 1 of the speech type and an audio segment 2 of the music type, where the duration of audio segment 1 is 1 minute and the duration of audio segment 2 is 4 minutes. At this time, the duration proportion of audio segment 2 in BGM-A is 80%, which reaches the preset proportion threshold value of 80%, so the generating device determines that BGM-A meets the audio detection condition.

In operation a3, if at least one audio segment contained in the audio to be recognized is of a specified type, the generating device determines that the audio to be recognized conforms to the audio detection condition.

For example, assuming that the audio segment of the specified genre is an audio segment of music or a singing genre, and the audio segment 1 of the singing genre and the audio segment 2 of the music genre are included in the BGM-a, at this time, both the audio segment 1 and the audio segment 2 included in the BGM-a are of the specified genre, the generating means determines that the BGM-a meets the audio detection condition.

For another example, assuming that the audio segment of the specified genre is an audio segment of a music or singing genre and the audio segment 1 of the singing genre is included in the BGM-a, at this time, the audio segment 1 included in the BGM-a is the specified genre, the generating device determines that the BGM-a meets the audio detection condition.
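Operations a1 through a3 can be sketched together as one check. This is a simplified model assuming each segment is a (type, duration-in-seconds) pair; the default thresholds (3 minutes, 80%) are the values used in the examples above.

```python
def meets_detection_condition(segments, total_duration,
                              specified_types=("music", "singing"),
                              min_duration=180.0, min_ratio=0.8):
    # segments: list of (segment_type, duration_seconds) pairs.
    spec_durations = [d for t, d in segments if t in specified_types]
    # Operation a1: a specified-type segment reaches the duration threshold.
    if any(d >= min_duration for d in spec_durations):
        return True
    # Operation a2: a specified-type segment's share of the total duration
    # reaches the proportion threshold.
    if total_duration > 0 and any(d / total_duration >= min_ratio
                                  for d in spec_durations):
        return True
    # Operation a3: every contained segment is of a specified type.
    return bool(segments) and all(t in specified_types for t, _ in segments)
```

With the BGM-A example (1 minute of speech plus 4 minutes of music, total 5 minutes), the condition holds by both a1 and a2.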

It should be noted that, in the embodiment of the present application, a preset audio event detection algorithm may be adopted to determine whether the audio to be identified meets a preset audio detection condition. The audio event detection algorithm may employ, but is not limited to, an Artificial Neural Network (ANN), a Hidden Markov Model (HMM), and the like.

In the above embodiment, when the matching between the audio to be recognized and the second audio library fails, audio to be recognized that meets the duration or duration-proportion requirement, and that is not superimposed with audio segments such as speech or environmental noise (e.g., wind noise, rain noise), may be saved in the second audio library as a new second audio.

Mode B: in order to improve the matching efficiency, the audio to be recognized can be determined to accord with the preset audio detection condition by combining with a machine learning technology. Specifically, the generating device determines that the audio to be identified meets the preset audio detection condition, and includes the following steps:

b1, the generating device inputs the audio to be identified into the trained target audio detection model to obtain a detection prediction value.

The target audio detection model is obtained by training an audio detection model to be trained on the basis of an audio labeling data set, i.e., a training data set in which the audio types have been labeled.

It should be noted that, in the embodiment of the present application, the target audio detection model may adopt, but is not limited to, a deep learning model. The detection prediction value may be represented by a level or a numerical value, which is not limited in the present application; hereinafter, the detection prediction value is described by taking a numerical value as an example.

For example, the generation means inputs BGM-a to the trained target audio detection model, and obtains a detection prediction value of BGM-a as 90.

b2, when the generating device determines that the detection prediction value reaches a preset prediction threshold value, it determines that the audio to be identified accords with the audio detection condition.

For example, assuming that the preset prediction threshold value is 85 and the detection prediction value of BGM-a is 90, the generation device determines that the detection prediction value of BGM-a reaches the preset prediction threshold value, and determines that BGM-a meets the audio detection condition.

In the following, a specific embodiment will be described for the case where the matching between the audio to be recognized and the second audio library fails.

Taking the audio to be recognized as BGM-X as an example, referring to fig. 5, assume that the first audio library does not yet contain a matching first audio. The generating device matches BGM-X with the first audio library, and if BGM-X fails to match the first audio library, matches BGM-X with the second audio library. If this matching also fails and the generating device determines that BGM-X accords with the preset audio detection condition, BGM-X is added into the second audio library.

In some embodiments, to implement establishing an audio library for audio content contained in multimedia content, before performing S201, the generating device acquires the multimedia content to be identified and extracts the audio to be identified from the multimedia content.

Specifically, the generating device obtains the multimedia content to be identified, extracts the audio content from the multimedia content to be identified, and takes the audio content as the audio to be identified when it is determined that the audio content meets the preset matching condition.

It should be noted that, in the embodiment of the present application, an audio file may be extracted from the multimedia content to be identified by using, but not limited to, an audio/video processing tool FFmpeg. When the generating device determines that the audio file meets the preset matching condition, at least one of the operations a1 and a2 is adopted, which is not described herein again.

For example, referring to fig. 6, the multimedia content to be recognized includes images, text and audio. The generating device obtains the multimedia content to be recognized and extracts the audio content from it, where the audio content includes an audio segment of the speech type, an audio segment of the machine sound type, and an audio segment of the music type. The generating device determines that the audio content contains an audio segment of the music type whose duration reaches the preset duration threshold value, and takes the audio content as the audio to be identified.
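The audio extraction step using FFmpeg can be sketched as follows; the file paths and sample rate are placeholders, and this shows only one common FFmpeg invocation for stripping the video stream and decoding the audio track to mono PCM for downstream fingerprinting.

```python
import subprocess

def build_extract_cmd(video_path, wav_path, sample_rate=16000):
    # -vn drops the video stream; pcm_s16le/-ac 1/-ar decode the audio
    # track to 16-bit mono PCM at the given sample rate.
    return ["ffmpeg", "-y", "-i", video_path, "-vn",
            "-acodec", "pcm_s16le", "-ac", "1", "-ar", str(sample_rate),
            wav_path]

def extract_audio(video_path, wav_path, sample_rate=16000):
    # Runs FFmpeg; raises CalledProcessError if extraction fails.
    subprocess.run(build_extract_cmd(video_path, wav_path, sample_rate),
                   check=True)
```

Separating the command builder from the runner makes the invocation easy to inspect and test without needing FFmpeg installed.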

Furthermore, after the generating device transfers the target second audio to the first audio library as a new first audio, the method further includes: the generating device takes the new first audio as the audio recognition result of the multimedia content and records the multimedia identifier of the multimedia content in correspondence with the new first audio. The multimedia identifier may be represented by, but is not limited to, an Identity Document (ID).

Taking the audio to be recognized as BGM-B as an example, and assuming that BGM-B was extracted from video 1, after the generating device stores BGM-X as a new first audio in the first audio library, it takes BGM-X as the audio recognition result of video 1.

In the above embodiment, a corresponding audio library may be established for the audio content included in the multimedia content, and when the multimedia content is subsequently acquired, the audio content may be identified, the multimedia content may be classified, the multimedia content may be recommended, and the like based on the audio library.

It should be noted that, in the foregoing embodiment, only one multimedia content is taken as an example, and the generating device may further obtain a multimedia content stream, and obtain each audio to be identified by using the above extraction method for each multimedia content in the multimedia content stream.

In some embodiments, during the use of audio, the user may partially alter the original audio, for example, accelerate or decelerate a portion of the original audio, insert other segments into the original audio, superimpose other audio on the original audio, etc., so that a large amount of similar first audio may exist in the first audio library. In order to aggregate similar audios by association, in this embodiment of the application, the generating device forms a corresponding first audio group from every two first audios in the first audio library, and performs the following operations for each obtained first audio group:

The generating device determines the two first audios contained in one first audio group, and acquires the multimedia identifier set corresponding to each of the two first audios; if the number of multimedia identifiers that appear in both of the two obtained multimedia identifier sets reaches a preset number threshold value, a new audio identifier is generated in association with the two first audios.

Wherein each multimedia identifier characterizes a multimedia content successfully matched with the corresponding first audio.

Taking one first audio group (first audio group 1) in the first audio library as an example, assume that first audio group 1 contains BGM1 and BGM2 and that the preset number threshold is 2. Referring to fig. 7A, the generating device determines that first audio group 1 contains BGM1 and BGM2, then obtains the multimedia identifier set corresponding to BGM1, which contains the multimedia identifiers of video 1, video 2, video 3, video 4, video 8, and video 9, and the multimedia identifier set corresponding to BGM2, which contains the multimedia identifiers of video 3, video 4, video 6, and video 7. The multimedia identifiers that appear in both sets are those of video 3 and video 4, so their number reaches the preset number threshold; as shown in fig. 7B, the generating device therefore generates a new audio identifier BGM-1-2 associated with BGM1 and BGM2.
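As an illustration of the association step above, the following Python sketch pairs every two first audios and associates a pair under a new audio identifier when their multimedia identifier sets overlap by at least the preset number threshold. The function name, the format of the new identifier, and the data layout are illustrative assumptions, not the patent's concrete implementation.

```python
from itertools import combinations

def associate_similar_audio(id_sets, threshold=2):
    """Associate every pair of first audios whose multimedia identifier
    sets share at least `threshold` identifiers under a new audio id.

    id_sets maps an audio id (e.g. "BGM1") to its set of multimedia ids.
    Returns {(audio_a, audio_b): new_audio_id} for each associated pair.
    """
    associations = {}
    for a, b in combinations(sorted(id_sets), 2):   # each first audio group
        overlap = id_sets[a] & id_sets[b]           # repeated multimedia ids
        if len(overlap) >= threshold:               # preset number threshold
            associations[(a, b)] = f"{a}-{b}"       # new associated audio id
    return associations

# The fig. 7A/7B example: BGM1 and BGM2 share video3 and video4.
sets = {
    "BGM1": {"video1", "video2", "video3", "video4", "video8", "video9"},
    "BGM2": {"video3", "video4", "video6", "video7"},
}
print(associate_similar_audio(sets))  # {('BGM1', 'BGM2'): 'BGM1-BGM2'}
```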

In some embodiments, referring to fig. 8, a flowchart of an audio recommendation method based on a generated first audio library provided in an embodiment of the present application is shown, where the method includes the following steps:

s801, the generating device responds to an input operation triggered in the client, and determines a recommended audio set from each first audio based on a multimedia identification set corresponding to each first audio contained in the first audio library.

The input operation triggered in the client includes, but is not limited to, a selection operation for an audio recommendation function, and the like. For example, referring to fig. 9A, the generating device responds to a selection operation, triggered in the client, for control 901 ("template library").

Specifically, when S801 is executed, the following steps may be adopted, but not limited to, to determine a recommended audio set from each first audio based on the corresponding multimedia identifier set of each first audio included in the first audio library:

s8011, the generating device obtains a multimedia identifier set corresponding to each first audio included in the first audio library, and obtains an evaluation value corresponding to each first audio based on each obtained multimedia identifier set. Wherein each evaluation value is used for representing the use state of the corresponding first audio.

Specifically, when S8011 is executed, the following steps may be adopted, but are not limited to:

s80111, the generating device obtains interaction state information of each corresponding multimedia content based on the obtained multimedia identifier set corresponding to the first audio X. The first audio X is any one of the first audios.

In the embodiment of the present application, the interaction state information includes, but is not limited to, one or more of the following information: click times, praise times, comment times, forwarding times and the like.

Taking the first audio X as BGM3 as an example, assume that BGM3 corresponds to multimedia identifier set 1, which contains the multimedia identifiers of video 1, video 2, and video 3. Based on the acquired multimedia identifier set 1 corresponding to BGM3, the generating device acquires the numbers of praises of video 1, video 2, and video 3, which are 0, 10, and 100, respectively.

S80112, the generating device obtains weights corresponding to the multimedia contents based on the interaction state information of the multimedia contents and a mapping relationship between preset interaction state information and the weights.

Still taking the first audio X as BGM3 as an example, assume that in the preset mapping relationship between interaction state information and weights, multimedia content with 0 praises has a weight of 0, multimedia content with 10 praises has a weight of 2, and multimedia content with 100 praises has a weight of 3. Based on the numbers of praises of video 1, video 2, and video 3 and this mapping relationship, the generating device obtains the weights of video 1, video 2, and video 3 as 0, 2, and 3, respectively.

S80113, the generating device obtains an evaluation value of the first audio X based on the respective weights corresponding to the respective multimedia contents.

Still taking the first audio X as BGM3 as an example, based on the weights corresponding to video 1, video 2, and video 3, the generating device obtains the evaluation value of BGM3 as 0 + 2 + 3 = 5.

It should be noted that, in the embodiment of the present application, in the preset mapping relationship between the interaction state information and the weights, values of different interaction state information may correspond to the same weights, that is, the weights of the multimedia contents are the same.

Still taking the first audio X as BGM3 as an example, assuming that the preset mapping relationship between interaction state information and weights assigns every multimedia content the same weight of 1, the generating device obtains the evaluation value of BGM3 as 1 + 1 + 1 = 3.

It should be noted that, in the embodiment of the present application, the number of multimedia identifiers in the multimedia identifier set corresponding to the first audio X may also be used directly as the evaluation value of the first audio X; this application does not limit this, and it is not described again here.

S8012, the generating device sorts the first audios based on their corresponding evaluation values to obtain a target sequence, and sequentially selects a set number of first audios from the target sequence as the recommended audio set.

Taking BGM1, BGM2, and BGM3 as the first audios, assume that their evaluation values are 1, 2, and 3, respectively, and that the set number is 2. The generating device sorts BGM1, BGM2, and BGM3 based on their evaluation values to obtain the target sequence BGM3, BGM2, BGM1, and sequentially selects BGM3 and BGM2 from the target sequence as the recommended audio set.
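The evaluation and ranking steps (S80111 through S8012) can be sketched as follows. The tiered praise-to-weight mapping mirrors the BGM3 example above, and all function names and the data layout are illustrative assumptions.

```python
def weight_for(praises):
    # Hypothetical tiered mapping from praise count to weight; the tiers
    # follow the BGM3 example (0 -> 0, 10 -> 2, 100 -> 3).
    if praises >= 100:
        return 3
    if praises >= 10:
        return 2
    return 0

def evaluation_value(praise_counts):
    # S80112/S80113: map each matched video's interaction state to a
    # weight, then sum the weights into the first audio's evaluation value.
    return sum(weight_for(p) for p in praise_counts)

def recommend(library, top_n=2):
    # S8012: sort first audios by evaluation value (descending) and take
    # the top `top_n` as the recommended audio set.
    ranked = sorted(library, key=lambda bgm: evaluation_value(library[bgm]),
                    reverse=True)
    return ranked[:top_n]

# library maps each first audio to the praise counts of its matched videos.
library = {"BGM3": [0, 10, 100], "BGM2": [10, 10], "BGM1": [10]}
print(evaluation_value(library["BGM3"]))  # 0 + 2 + 3 = 5
print(recommend(library))                 # ['BGM3', 'BGM2']
```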

S802, the generating device presents the recommended audio set in the client.

For example, referring to fig. 9B, the generating device presents a recommended audio set in the client, and the recommended audio set includes audio such as "whale of avatar island", "Lemon", "big fish", and the like.

In some embodiments, in order to facilitate subsequent classification, recommendation, and re-creation of multimedia content, the first audio library further includes audio knowledge information corresponding to each first audio. Each piece of audio knowledge information includes the audio identifier of the corresponding first audio, and may further include, but is not limited to, one or more of the audio name, audio tag, related multimedia content tag, and the like of the corresponding first audio. For example, the audio knowledge information of BGM1 includes the audio identifier, audio name, audio tag, and related multimedia content tag of BGM1, where the audio name of BGM1 is "big fish", the audio tag of BGM1 is "healing", and the related multimedia content tag of BGM1 is "entertainment".

Each multimedia content has corresponding multimedia knowledge information. Each piece of multimedia knowledge information includes the multimedia identifier of the corresponding multimedia content, and may further include, but is not limited to, one or more of the multimedia name, keywords, multimedia tag, audio name, and the like of the corresponding multimedia content. For example, the multimedia knowledge information of video 1 is {"What is a blue-blue flower bud?", #spring equinox, pattern#, "Big Wind Blows"}, where the video name of video 1 is "What is a blue-blue flower bud?", the keywords of video 1 are "spring equinox" and "pattern", and the audio name of video 1 is "Big Wind Blows".

In order to facilitate subsequent classification and recommendation of the multimedia content, if the audio to be recognized is successfully matched with the first audio library, the generating device may add the audio knowledge information of the first audio successfully matched with the audio to be recognized to the multimedia knowledge information of the multimedia content to which the audio to be recognized belongs.

In order to further improve the first audio library, in this embodiment of the application, according to a set statistical interval, for each first audio contained in the first audio library, the multimedia knowledge information of the multimedia content successfully matched with each first audio may be added to the audio knowledge information of the corresponding first audio.
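The two-way knowledge propagation described above (audio knowledge added to matched multimedia content on a match, and multimedia knowledge folded back into the audio per statistical interval) can be sketched as follows, assuming simple dictionary schemas; the field names `audio_id`, `media_id`, and the rest are hypothetical.

```python
def merge_knowledge(audio_kb, media_kb, matched_pairs):
    """Propagate knowledge information in both directions.

    matched_pairs lists (first_audio_id, multimedia_id) successful
    matches. On a match, the first audio's identifier is recorded in the
    multimedia content's knowledge; the multimedia identifier is likewise
    folded back into the first audio's knowledge.
    """
    for audio_id, media_id in matched_pairs:
        media_kb[media_id].setdefault("audio_knowledge", []).append(
            audio_kb[audio_id]["audio_id"])
        audio_kb[audio_id].setdefault("media_knowledge", []).append(
            media_kb[media_id]["media_id"])
    return audio_kb, media_kb

audio_kb = {"BGM1": {"audio_id": "BGM1", "audio_name": "big fish"}}
media_kb = {"video1": {"media_id": "video1"}}
merge_knowledge(audio_kb, media_kb, [("BGM1", "video1")])
print(media_kb["video1"]["audio_knowledge"])  # ['BGM1']
```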

In some embodiments, the generating means may periodically clean the second audio library to discover new first audio in a timely manner. As an example, the generating device may clean the second audio library according to the storage time of each second audio when the number of the second audio included in the second audio library exceeds a preset number threshold. As another example, the generating device may clean the second audio library according to a preset cleaning time interval and according to a storage time of each second audio.
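Both cleaning strategies mentioned here can be sketched as a single routine, assuming each second audio is stored with its storage timestamp; the size cap and age limit below are illustrative defaults, not values fixed by the patent.

```python
import time

def clean_second_library(second_library, max_size=1000,
                         max_age_seconds=7 * 24 * 3600):
    """Periodically clean the temporary (second) audio library.

    second_library maps a second-audio id to its storage timestamp.
    Entries older than `max_age_seconds` are evicted; if the library
    still exceeds `max_size`, the oldest remaining entries are evicted.
    """
    now = time.time()
    # Age-based cleaning: drop audios stored too long ago.
    fresh = {k: t for k, t in second_library.items()
             if now - t <= max_age_seconds}
    # Size-based cleaning: keep only the most recently stored entries.
    if len(fresh) > max_size:
        keep = sorted(fresh, key=fresh.get, reverse=True)[:max_size]
        fresh = {k: fresh[k] for k in keep}
    return fresh
```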

Next, the process of generating a hot sound effect library is described, taking video as the multimedia content, a sound effect as the audio to be identified, the hot sound effect library as the first audio library, and the temporary sound effect library as the second audio library.

Referring to fig. 10, initially both the preset hot sound effect library and the preset temporary sound effect library contain no sound effects.

First, the generating device acquires video 1 to be identified and extracts the audio to be identified, sound effect-C, from video 1. It matches sound effect-C against the hot sound effect library; after that matching fails, it matches sound effect-C against the temporary sound effect library; after that matching also fails, and sound effect-C meets the preset audio detection condition, it stores sound effect-C in the temporary sound effect library.

Then, the generating device acquires video 2 to be identified and extracts the audio to be identified, sound effect-D, from video 2. It matches sound effect-D against the hot sound effect library; after that matching fails, it matches sound effect-D against the temporary sound effect library; after that matching also fails, and sound effect-D meets the preset audio detection condition, it stores sound effect-D in the temporary sound effect library.

Next, the generating device acquires video 3 to be identified and extracts the audio to be identified, sound effect-C-1, from video 3. It matches sound effect-C-1 against the hot sound effect library; after that matching fails, it matches sound effect-C-1 against the temporary sound effect library, where sound effect-C-1 successfully matches sound effect-C, so the cumulative number of successful matches of sound effect-C is updated.

Finally, the generating device acquires video 4 to be identified and extracts the audio to be identified, sound effect-C-2, from video 4. It matches sound effect-C-2 against the hot sound effect library; after that matching fails, it matches sound effect-C-2 against the temporary sound effect library, where sound effect-C-2 successfully matches sound effect-C, so the cumulative number of successful matches of sound effect-C is updated again.

When the cumulative number of successful matches of sound effect-C reaches the preset threshold, the generating device transfers sound effect-C to the hot sound effect library.
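The fig. 10 flow can be sketched end to end as follows. The stand-in matcher (prefix equality in place of fingerprint matching), the threshold of 2, and the assumption that the audio detection condition is always met are simplifications for illustration only.

```python
def identify(audio, hot_library, temp_library, counts, threshold=2,
             matcher=None):
    """Sketch of the fig. 10 flow. hot_library and temp_library are sets
    of stored audio ids; counts holds cumulative successful-match counts
    for temporary entries. The audio detection condition is assumed met.
    """
    if matcher is None:
        # Stand-in for fingerprint matching: "sound-C-1" matches "sound-C".
        matcher = lambda a, b: a == b or a.startswith(b + "-")
    if any(matcher(audio, h) for h in hot_library):
        return "hot"                             # first library hit
    for stored in list(temp_library):
        if matcher(audio, stored):
            counts[stored] = counts.get(stored, 0) + 1
            if counts[stored] >= threshold:      # reached preset threshold:
                temp_library.discard(stored)     # dump the target second
                hot_library.add(stored)          # audio into the hot library
            return "temp"                        # second library hit
    temp_library.add(audio)                      # store as new second audio
    return "new"

hot, temp, counts = set(), set(), {}
for clip in ["sound-C", "sound-D", "sound-C-1", "sound-C-2"]:
    identify(clip, hot, temp, counts)
print("sound-C" in hot)  # True: dumped after two successful matches
```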

Based on the same inventive concept, the embodiment of the application provides a device for generating an audio library. As shown in fig. 11, it is a schematic structural diagram of an apparatus 1100 for generating an audio library, and may include:

the matching unit 1101 is configured to match the audio to be recognized with a preset first audio library, and if the matching fails, match the audio to be recognized with a preset second audio library to obtain a target matching result; wherein the second audio library is established based on respective second audio that failed to match the first audio library;

a determining unit 1102, configured to, if it is determined that the audio to be identified is successfully matched with at least one second audio in the second audio library based on the target matching result, and there is a target second audio in the at least one second audio, where the cumulative number of successful matches reaches a preset threshold, obtain the target second audio;

a dump unit 1103, configured to dump the target second audio as a new first audio to the first audio library.

Optionally, the dump unit 1103 is further configured to:

if it is determined, based on the target matching result, that the audio to be recognized fails to match the second audio library, and the audio to be recognized meets the preset audio detection condition, dump the audio to be recognized into the second audio library as a new second audio.

Optionally, when determining that the audio to be identified meets the preset audio detection condition, the determining unit 1102 is configured to perform at least one of the following operations:

if the audio to be identified comprises the audio clip of the specified type and the duration of the audio clip reaches a preset duration threshold value, determining that the audio to be identified meets the audio detection condition;

if the audio to be identified comprises the audio clips of the specified type and the time length proportion of the audio clips reaches a preset proportion threshold value, determining that the audio to be identified meets audio detection conditions; the duration proportion is used for representing the ratio of the duration of the audio clip to the total duration of the audio to be identified;

and if at least one audio clip contained in the audio to be identified is of the designated type, determining that the audio to be identified accords with the audio detection condition.
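The three alternative detection conditions above can be sketched as one function; the specified type "music", the duration threshold, and the proportion threshold are illustrative assumptions.

```python
def meets_detection_condition(segments, total_duration, target_type="music",
                              min_duration=5.0, min_ratio=0.5):
    """Check the three alternative audio detection conditions.

    segments is a list of (segment_type, duration_seconds) pairs for the
    audio to be identified; total_duration is its total length in seconds.
    """
    typed = sum(d for t, d in segments if t == target_type)
    # Condition 1: the specified-type portion is long enough absolutely.
    if typed >= min_duration:
        return True
    # Condition 2: the specified-type portion is long enough relative to
    # the total duration of the audio to be identified.
    if total_duration > 0 and typed / total_duration >= min_ratio:
        return True
    # Condition 3: every segment of the audio is of the specified type.
    return bool(segments) and all(t == target_type for t, _ in segments)
```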

Optionally, when determining that the audio to be identified meets the preset audio detection condition, the determining unit 1102 is configured to:

inputting the audio to be recognized into a trained target audio detection model to obtain a detection prediction value; the target audio detection model is obtained by training an audio detection model to be trained based on an audio labeling data set;

and when the detection predicted value reaches a preset prediction threshold value, determining that the audio to be identified accords with the audio detection condition.

Optionally, before matching the audio to be recognized with the preset first audio library, the matching unit 1101 is further configured to:

acquiring multimedia content to be identified, and extracting audio to be identified from the multimedia content;

and after the target second audio is dumped into the first audio library as the new first audio, the dump unit 1103 is further configured to:

and taking the new first audio as an audio recognition result of the multimedia content, and recording a multimedia identifier of the multimedia content corresponding to the new first audio.

Optionally, the apparatus further includes an associating unit 1104, where the associating unit 1104 is configured to:

generating a respective first audio group based on every two first audios in a first audio library;

for each obtained first audio group, the following operations are respectively performed:

determining two first audios contained in one of the first audio groups;

acquiring a multimedia identifier set corresponding to each of the two first audios; each multimedia identifier represents one multimedia content successfully matched with the corresponding first audio;

and if the number of the multimedia identifications which repeatedly appear in the two obtained multimedia identification sets reaches a preset number threshold value, generating a new audio identification corresponding to the two first audio associations.

Optionally, the apparatus further includes a recommending unit 1105, where the recommending unit 1105 is configured to:

responding to an input operation triggered in a client, and determining a recommended audio set from each first audio based on a multimedia identification set corresponding to each first audio contained in a first audio library; each multimedia identifier represents one multimedia content successfully matched with the corresponding first audio;

presenting the set of recommended audio in the client.

Optionally, when determining the recommended audio set from each first audio based on the multimedia identifier set corresponding to each first audio in the first audio library, the recommending unit 1105 is specifically configured to:

acquiring a multimedia identification set corresponding to each first audio contained in a first audio library, and acquiring an evaluation value corresponding to each first audio based on each acquired multimedia identification set; wherein each evaluation value is used for representing the use state of the corresponding first audio;

and sequencing the first audios based on the evaluation values corresponding to the first audios to obtain a target sequence, and sequentially selecting a set number of first audios from the target sequence to serve as a recommended audio set.

Optionally, when obtaining the evaluation value corresponding to each first audio based on each obtained multimedia identifier set, the recommending unit 1105 is specifically configured to:

acquiring interaction state information of each corresponding multimedia content based on a multimedia identifier set corresponding to one first audio in each acquired first audio;

obtaining the weight corresponding to each multimedia content based on the interaction state information of each multimedia content and the preset mapping relation between the interaction state information and the weight;

an evaluation value of the first audio is obtained based on the respective weights corresponding to the respective multimedia contents.

Optionally, when the audio to be identified is matched with the preset first audio library, the matching unit 1101 is specifically configured to:

based on a preset audio fingerprint extraction algorithm, performing audio fingerprint extraction on the audio to be identified to obtain an audio fingerprint to be identified corresponding to the audio to be identified; the audio fingerprint to be identified is used for representing the audio characteristic corresponding to the audio to be identified;

based on an audio fingerprint extraction algorithm, performing audio fingerprint extraction on each first audio contained in a first audio library to obtain a first audio fingerprint corresponding to each first audio; wherein each first audio fingerprint is used for representing the audio characteristics of a corresponding one of the first audios;

respectively calculating the similarity between the audio fingerprints to be identified and each obtained first audio fingerprint;

and if the similarity between the audio fingerprint to be identified and at least one first audio fingerprint reaching a preset similarity threshold exists in the first audio fingerprints, determining that the audio to be identified is successfully matched with the first audio library.
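A minimal sketch of this matching step, modeling audio fingerprints as bit sequences and similarity as the fraction of equal bits; both choices, and the 0.8 threshold, are illustrative assumptions, since the patent fixes neither the fingerprint algorithm nor the similarity measure.

```python
def match_first_library(query_fp, library_fps, sim_threshold=0.8):
    """Match a query fingerprint against every first audio fingerprint.

    Returns (matched, hits): whether any first audio fingerprint reaches
    the preset similarity threshold, and the similarities that do.
    """
    def similarity(a, b):
        # Fraction of positions where the two fingerprints agree.
        n = min(len(a), len(b))
        if n == 0:
            return 0.0
        return sum(x == y for x, y in zip(a, b)) / n

    sims = {name: similarity(query_fp, fp) for name, fp in library_fps.items()}
    hits = {name: s for name, s in sims.items() if s >= sim_threshold}
    return len(hits) > 0, hits
```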

For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same one or more pieces of software or hardware when implementing the present application.

With regard to the apparatus in the above embodiment, the specific manner in which each unit performs operations has been described in detail in the embodiments of the method, and is not elaborated here.

As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method, or program product. Accordingly, various aspects of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."

Having described the method and apparatus for generating an audio library according to an exemplary embodiment of the present application, an electronic device according to another exemplary embodiment of the present application will be described.

Fig. 12 is a block diagram illustrating an electronic device 1200 according to an exemplary embodiment, the electronic device comprising:

a processor 1210;

a memory 1220 for storing instructions executable by the processor 1210;

wherein the processor 1210 is configured to execute instructions to implement a method of generating an audio library in embodiments of the present disclosure, such as the steps shown in fig. 2, fig. 3A, or fig. 8.

In an exemplary embodiment, a storage medium including instructions, such as the memory 1220 including instructions, is also provided; the instructions are executable by the processor 1210 of the electronic device 1200 to perform the above-described method. Optionally, the storage medium may be a non-transitory computer-readable storage medium; for example, the non-transitory computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a portable Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.

Based on the same inventive concept, referring to fig. 13, an embodiment of the present application further provides a terminal device 1300, where the terminal device 1300 may be an electronic device such as a smart phone, a tablet computer, a laptop computer, or a PC.

The terminal device 1300 includes a display unit 1340, a processor 1380, and a memory 1320, where the display unit 1340 includes a display panel 1341 for displaying information input by a user or information provided to the user, various operation interfaces of the terminal device 1300, and the like, and in the embodiment of the present application, the display panel 1341 is mainly used for displaying an operation interface, a shortcut window, and the like of an application program installed in the terminal device 1300. Alternatively, the Display panel 1341 may be configured in the form of an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).

The processor 1380 is used to read a computer program and then execute a method defined by the computer program, for example, the processor 1380 reads an application, thereby running the application on the terminal device 1300 and displaying an operation interface on the display unit 1340. The Processor 1380 may include one or more general purpose processors and may also include one or more DSPs (Digital Signal processors) for performing the relevant operations to implement the solutions provided by the embodiments of the present application.

Memory 1320 typically includes internal memory and external storage; the internal memory may be RAM, ROM, cache (CACHE), and the like, and the external storage may be a hard disk, an optical disc, a USB disk, a floppy disk, a tape drive, or the like. The memory 1320 is used to store computer programs, including application programs and the like, and other data, which may include data generated after the operating system or the application programs are run, including system data (e.g., configuration parameters of the operating system) and user data. In the embodiment of the present application, program instructions are stored in the memory 1320, and the processor 1380 executes the program instructions in the memory 1320 to implement the method for generating an audio library discussed above.

In addition, the display unit 1340 of the terminal device 1300 may also be used to receive input numerical information, character information, or contact/contactless touch gestures, and to generate signal input related to user settings and function control of the terminal device 1300. Specifically, in the embodiment of the present application, the display unit 1340 may include the display panel 1341. The display panel 1341, such as a touch screen, can collect touch operations performed by a user on or near it (for example, operations performed by the user on or near the display panel 1341 using a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the display panel 1341 may include a touch detection device and a touch controller. The touch detection device detects the position touched by the user and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 1380; the touch controller can also receive and execute commands sent by the processor 1380.
In this embodiment, if a user performs a selection operation on a widget in the operation interface, and a touch detection device in the display panel 1341 detects a touch operation, the touch detection device transmits a signal corresponding to the detected touch operation, the touch controller converts the signal into a touch point coordinate and transmits the touch point coordinate to the processor 1380, and the processor 1380 determines the widget selected by the user according to the received touch point coordinate.

The display panel 1341 can be implemented by various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the display unit 1340, the terminal device 1300 may also include an input unit 1330, the input unit 1330 may include, but is not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like. In fig. 13, an example is given in which the input unit 1330 includes an image input device 1331 and other input devices 1332.

In addition to the above, the terminal device 1300 may further include a power supply 1390 for supplying power to the other modules, an audio circuit 1360, a near field communication module 1370, and an RF circuit 1310. The terminal device 1300 may also include one or more sensors 1350, such as an acceleration sensor, a light sensor, and a pressure sensor. The audio circuit 1360 specifically includes a speaker 1361, a microphone 1362, and the like. For example, the user may use voice control: the terminal device 1300 collects the user's voice through the microphone 1362, responds to the user's voice control, and, when a prompt is required, plays the corresponding prompt sound through the speaker 1361.

Based on the same inventive concept, the present application also provides a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method for generating the audio library provided in the various alternative implementations of the above embodiment.

In some possible embodiments, various aspects of the audio library generation method provided by the present application may also be implemented in the form of a program product including a computer program for causing a computer device to perform the steps in the audio library generation method according to various exemplary embodiments of the present application described above in this specification when the program product is run on the computer device, for example, the computer device may perform the steps as shown in fig. 2, fig. 3A or fig. 8.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The program product of the embodiments of the present application may be a CD-ROM and include program code and may run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.

A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.

While the preferred embodiments of the present application have been described, additional variations and modifications of those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiments and all alterations and modifications that fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
