Comment output method and device and computer storage medium

Document No.: 1391430    Publication date: 2020-02-28

Reading note: This technology, "Comment output method and device and computer storage medium" (一种评论输出方法、装置、以及计算机存储介质), was designed and created by 缪畅宇 on 2019-11-11. Its main content is as follows: the embodiment of the application discloses a comment output method, a comment output device, and a computer storage medium, the method relating to the natural language processing direction in the field of artificial intelligence. The method comprises: obtaining text information and audio melody information corresponding to a target audio; respectively extracting features of the text information and the audio melody information to obtain a text feature and a plurality of audio melody features; performing semantic decoding on the text feature based on a semantic decoding model to obtain decoded data; fusing the plurality of audio melody features based on current state information to obtain attention feature data representing the attention degree of the audio melody features; performing semantic decoding on the decoded data and the attention feature data based on the semantic decoding model to obtain a comment corresponding to the target audio; and outputting the comment. This scheme can improve the matching degree between the comment and the target audio.

1. A comment output method is characterized by comprising:

acquiring text information and audio melody information corresponding to target audio;

respectively extracting the characteristics of the text information and the audio melody information to obtain a text characteristic and a plurality of audio melody characteristics;

semantic decoding is carried out on the text features based on a semantic decoding model to obtain decoded data, and the decoded data comprise current state information representing the hidden state of the semantic decoding model;

based on the current state information, fusing a plurality of audio melody characteristics to obtain attention characteristic data representing the attention degree of the audio melody characteristics;

semantic decoding is carried out on the decoded data and the attention feature data based on the semantic decoding model, and comments corresponding to the target audio are obtained;

and outputting the comment.

2. The comment output method of claim 1, wherein the feature extraction is performed on the text information and the audio melody information respectively to obtain a text feature and a plurality of audio melody features, and the method comprises:

performing feature extraction on the text information to obtain text features;

and performing feature extraction on the audio melody information based on various audio melody feature extraction methods to obtain the audio melody features corresponding to each audio melody feature extraction method.

3. A comment output method according to claim 2, characterized in that the text feature includes a plurality of text sub-features;

performing feature extraction on the text information to obtain text features, wherein the feature extraction comprises the following steps:

determining a plurality of types of text sub-information from the text information;

and respectively extracting the characteristics of the text sub-information of the multiple types to obtain the text sub-characteristics corresponding to the text sub-information of each type.

4. The comment output method according to claim 3, wherein the semantic decoding is performed on the text feature based on a semantic decoding model to obtain decoded data, and the decoded data includes current state information representing a hidden state of the semantic decoding model, and includes:

fusing a plurality of text sub-features in the text features to obtain initial text features;

and performing semantic decoding on the initial text features based on a semantic decoding model to obtain decoded data, wherein the decoded data comprises current state information representing the hidden state of the semantic decoding model.

5. The comment output method according to claim 1, wherein the fusion of a plurality of audio melody features based on the current state information to obtain attention feature data representing the degree of attention to which the audio melody features are focused comprises:

acquiring the weight corresponding to each audio melody feature based on the current state information;

and performing weighting operation on the plurality of audio melody characteristics based on the weight to obtain attention characteristic data representing the attention degree of the audio melody characteristics.

6. The comment output method according to claim 1, wherein performing semantic decoding on the decoded data and the attention feature data based on the semantic decoding model to obtain the comment corresponding to the target audio includes:

taking the decoded data and the attention feature data as current inputs to the semantic decoding model;

based on the semantic decoding model, performing semantic decoding on the current input to obtain semantic decoded data;

updating decoded data based on the semantically decoded data;

when the decoded data does not meet the termination condition, returning to execute the step of fusing the plurality of audio melody characteristics based on the current state information to obtain attention characteristic data representing the attention degree of the audio melody characteristics;

and when the decoded data meets the termination condition, obtaining a comment corresponding to the target audio based on the decoded data.

7. The comment output method according to claim 6, wherein the decoded data further includes comment content;

when the decoded data meet a termination condition, obtaining a comment corresponding to the target audio based on the decoded data, including:

determining the content information quantity of comment content in the decoded data;

and when the content information quantity reaches a preset information quantity, combining comment contents in the decoded data to obtain a comment corresponding to the target audio.

8. The comment output method according to claim 6, wherein the decoded data further includes comment content;

when the decoded data meet a termination condition, obtaining a comment corresponding to the target audio based on the decoded data, including:

detecting the decoded data to obtain a detection result;

and when the detection result determines that the decoded data comprises a preset ending identifier, combining comment contents in the decoded data to obtain a comment corresponding to the target audio.

9. The comment output method of claim 1, wherein acquiring the text information and the audio melody information corresponding to the target audio comprises:

determining a target song for which a comment is to be output, based on a song selection operation of a user on a song selection page of a terminal;

acquiring text information and audio melody information corresponding to the target song;

the step of outputting the comment includes:

and displaying the comment on a song comment page of the terminal.

10. A comment output apparatus characterized by comprising:

the acquisition module is used for acquiring text information and audio melody information corresponding to the target audio;

the feature extraction module is used for respectively extracting features of the text information and the audio melody information to obtain a text feature and a plurality of audio melody features;

the first decoding module is used for carrying out semantic decoding on the text features based on a semantic decoding model to obtain decoded data, and the decoded data comprises current state information representing the hidden state of the semantic decoding model;

the fusion module is used for fusing the audio melody characteristics based on the current state information to obtain attention characteristic data representing the attention degree of the audio melody characteristics;

the second decoding module is used for carrying out semantic decoding on the decoded data and the attention feature data based on the semantic decoding model to obtain comments corresponding to the target audio;

and the output module is used for outputting the comments.

11. A computer storage medium having stored thereon a computer program, characterized in that when the computer program is run on a computer, the computer is caused to execute the comment output method according to any one of claims 1 to 9.

Technical Field

The application relates to the technical field of computers, in particular to a comment output method, a comment output device and a computer storage medium.

Background

Song comments are texts that summarize, analyze, and evaluate a song. Matching a song with proper comments can attract users to listen to the song, increase its play count, open the market for newly released or promoted songs, improve the exposure of new singers, and activate and support music communities.

However, the cost of inviting professional music reviewers to write song comments is high. In the prior art, a machine learning method can analyze the lyrics, type, and other attributes of a song and output corresponding song comments; however, the matching degree between the song comments obtained this way and the song itself is low.

Disclosure of Invention

The embodiment of the application provides a comment output method, a comment output device and a computer storage medium, which can improve the matching degree between comments and target audio.

The embodiment of the application provides a comment output method, which comprises the following steps:

acquiring text information and audio melody information corresponding to target audio;

respectively extracting the characteristics of the text information and the audio melody information to obtain a text characteristic and a plurality of audio melody characteristics;

semantic decoding is carried out on the text features based on a semantic decoding model to obtain decoded data, and the decoded data comprise current state information representing the hidden state of the semantic decoding model;

based on the current state information, fusing a plurality of audio melody characteristics to obtain attention characteristic data representing the attention degree of the audio melody characteristics;

semantic decoding is carried out on the decoded data and the attention feature data based on the semantic decoding model, and comments corresponding to the target audio are obtained;

and outputting the comment.

Correspondingly, an embodiment of the present application further provides a comment output apparatus, including:

the acquisition module is used for acquiring text information and audio melody information corresponding to the target audio;

the feature extraction module is used for respectively extracting features of the text information and the audio melody information to obtain a text feature and a plurality of audio melody features;

the first decoding module is used for carrying out semantic decoding on the text features based on a semantic decoding model to obtain decoded data, and the decoded data comprises current state information representing the hidden state of the semantic decoding model;

the fusion module is used for fusing the audio melody characteristics based on the current state information to obtain attention characteristic data representing the attention degree of the audio melody characteristics;

the second decoding module is used for carrying out semantic decoding on the decoded data and the attention feature data based on the semantic decoding model to obtain comments corresponding to the target audio;

and the output module is used for outputting the comments.

Optionally, in some embodiments, the feature extraction module may include a first extraction sub-module and a second extraction sub-module, as follows:

the first extraction submodule is used for extracting the characteristics of the text information to obtain text characteristics;

and the second extraction submodule is used for extracting the characteristics of the audio melody information based on various audio melody characteristic extraction methods to obtain the audio melody characteristics corresponding to each audio melody characteristic extraction method.

At this time, the first extraction sub-module may be specifically configured to determine a plurality of types of text sub-information from the text information, and perform feature extraction on the plurality of types of text sub-information respectively to obtain a text sub-feature corresponding to each type of text sub-information.

At this time, the first decoding module may be specifically configured to fuse a plurality of text sub-features in the text feature to obtain an initial text feature, perform semantic decoding on the initial text feature based on a semantic decoding model to obtain decoded data, where the decoded data includes current state information representing a hidden state of the semantic decoding model.

At this time, the fusion module may be specifically configured to obtain a weight corresponding to each audio melody feature based on the current state information, and perform a weighting operation on the plurality of audio melody features based on the weights to obtain attention feature data representing the attention degree of the audio melody features.

Optionally, in some embodiments, the second decoding module may include a determining sub-module, a decoding sub-module, an updating sub-module, a returning sub-module, and an obtaining sub-module, as follows:

a determination submodule for taking said decoded data and said attention feature data as current inputs to said semantic decoding model;

the decoding submodule is used for carrying out semantic decoding on the current input based on the semantic decoding model to obtain data after the semantic decoding;

an update submodule for updating the decoded data based on the semantically decoded data;

a returning submodule, configured to, when the decoded data does not satisfy a termination condition, return to perform a step of performing fusion on a plurality of audio melody features based on the current state information to obtain attention feature data representing a degree of attention to which the audio melody features are focused;

and the obtaining sub-module is used for obtaining the comment corresponding to the target audio based on the decoded data when the decoded data meets the termination condition.

At this time, the obtaining sub-module may be specifically configured to determine the content information amount of the comment content in the decoded data, and when the content information amount reaches a preset information amount, combine the comment content in the decoded data to obtain the comment corresponding to the target audio.

At this time, the obtaining sub-module may be specifically configured to detect the decoded data to obtain a detection result, and when the detection result determines that the decoded data includes a preset end identifier, combine comment contents in the decoded data to obtain a comment corresponding to the target audio.

At this time, the obtaining module may be specifically configured to determine a target song to be commented on based on a song selection operation of the user on the song selection page on the terminal, and obtain text information and audio melody information corresponding to the target song.

In addition, the embodiment of the present application further provides a computer storage medium, where a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor to execute the steps in any one of the comment output methods provided in the embodiment of the present application.

According to the embodiment of the application, text information and audio melody information corresponding to the target audio can be obtained; feature extraction is performed on the text information and the audio melody information respectively to obtain a text feature and a plurality of audio melody features; semantic decoding is performed on the text feature based on a semantic decoding model to obtain decoded data, where the decoded data includes current state information representing the hidden state of the semantic decoding model; the plurality of audio melody features are fused based on the current state information to obtain attention feature data representing the attention degree of the audio melody features; the decoded data and the attention feature data are semantically decoded based on the semantic decoding model to obtain a comment corresponding to the target audio; and the comment is output. According to this scheme, the comment corresponding to the target audio can be automatically output by analyzing the text information and the audio melody information corresponding to the target audio, improving the matching degree between the comment and the target audio.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic scene diagram of a comment output system provided in an embodiment of the present application;

FIG. 2 is a first flowchart of a comment output method provided by an embodiment of the present application;

FIG. 3 is a second flowchart of a comment output method provided by an embodiment of the present application;

FIG. 4 is an overall framework diagram of a comment output method provided by an embodiment of the present application;

FIG. 5 is a detailed framework diagram of a comment output method provided by an embodiment of the present application;

FIG. 6 is a flow chart of LSTM model decoding provided by an embodiment of the present application;

FIG. 7 is a decoding flow diagram of a decoder provided by an embodiment of the present application;

FIG. 8 is a schematic diagram of a song selection page provided by an embodiment of the present application;

FIG. 9 is a schematic diagram of a comment page provided by an embodiment of the present application;

fig. 10 is a schematic structural diagram of a comment output device provided in an embodiment of the present application;

fig. 11 is a schematic structural diagram of a network device according to an embodiment of the present application.

Detailed Description

Referring to the drawings, wherein like reference numbers refer to like elements, the principles of the present application are illustrated as being implemented in a suitable computing environment. The following description is based on illustrated embodiments of the application and should not be taken as limiting the application with respect to other embodiments that are not detailed herein.

In the description that follows, specific embodiments of the present application will be described with reference to steps and symbols executed by one or more computers, unless otherwise indicated. Accordingly, these steps and operations will at times be referred to as being computer-executed: the computer's processing unit manipulates electronic signals that represent data in a structured form. This manipulation transforms the data or maintains it at locations in the computer's memory system, which may reconfigure or otherwise alter the operation of the computer in a manner well known to those skilled in the art. The data structures are physical locations of the memory that have particular properties defined by the data format. However, while the principles of the application are described in the foregoing terms, this is not meant to be limiting; those of ordinary skill in the art will recognize that various steps and operations described below may also be implemented in hardware.

The term "module" as used herein may be considered a software object executing on the computing system. The different components, modules, engines, and services described herein may be considered as implementation objects on the computing system. The apparatus and method described herein may be implemented in software, but may also be implemented in hardware, and are within the scope of the present application.

The terms "first", "second", and "third", etc. in this application are used to distinguish between different objects and not to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to only those steps or modules listed, but rather, some embodiments may include other steps or modules not listed or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

An execution subject of the comment output method may be the comment output device provided in the embodiment of the present application, or a network device integrated with the comment output device, where the comment output device may be implemented in a hardware or software manner. The network device may be a smart phone, a tablet computer, a palm computer, a notebook computer, or a desktop computer. Network devices include, but are not limited to, computers, network hosts, a single network server, multiple sets of network servers, or a cloud of multiple servers.

Referring to fig. 1, fig. 1 is a schematic view of an application scenario of the comment output method provided in an embodiment of the present application, taking as an example a comment output device integrated in a network device. The network device may obtain text information and audio melody information corresponding to a target audio, perform feature extraction on the text information and the audio melody information respectively to obtain a text feature and a plurality of audio melody features, perform semantic decoding on the text feature based on a semantic decoding model to obtain decoded data, where the decoded data includes current state information representing a hidden state of the semantic decoding model, fuse the plurality of audio melody features based on the current state information to obtain attention feature data representing the attention degree of the audio melody features, perform semantic decoding on the decoded data and the attention feature data based on the semantic decoding model to obtain a comment corresponding to the target audio, and output the comment.

The comment output method provided by the embodiment of the application relates to a natural language processing direction in the field of artificial intelligence. According to the embodiment of the application, the comment corresponding to the target audio can be generated based on the text information corresponding to the target audio and the audio melody information through a text generation technology.

Among them, artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive subject covering a wide range of fields, involving both hardware-level and software-level technologies. Artificial intelligence software technology mainly includes computer vision, machine learning/deep learning, and the like.

Among them, natural language processing (NLP) studies the theories and methods that enable effective communication between humans and computers using natural language. Research in natural language processing involves natural language, i.e., the language people use daily, and is therefore closely linked with linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.

Referring to fig. 2, fig. 2 is a schematic flow chart of a comment output method provided in the embodiment of the present application, which is specifically described by the following embodiment:

201. Acquiring text information and audio melody information corresponding to the target audio.

The target audio may be a song with lyrics, pure music without lyrics, a portion of audio captured from a song, a piece of downloaded audio, or recorded audio such as bird song, speech, or musical instrument sounds.

The text information represents song-related information in text form and may be of various types; for example, it may include one or more of a song name, lyrics, a singer name, a lyrics author name, a song author name, a song type name, and the like.

The audio melody information is a sequence describing audio amplitude in a time dimension and can represent the melody of a song. For example, the audio melody information may be regular sound waves including voice, music, sound effects, and the like.

In practical application, for example, when a song comment corresponding to a certain song needs to be obtained, the song may be used as a target audio, and one or more of a song name, lyrics, a singer name, a lyrics author name, a song author name, and a song type name corresponding to the song may be obtained as text information corresponding to the target audio, and audio data of the song may be obtained as audio melody information corresponding to the target audio.
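For illustration only, the inputs of this step can be sketched as follows, assuming Python with the librosa package; the field names, example values, and file name are illustrative assumptions rather than details fixed by this application.

    import librosa

    # Text information for a target song: song-related information in text form.
    # The dictionary fields below are illustrative, not prescribed keys.
    text_information = {
        "song_name": "Example Song",
        "lyrics": "moon river wider than a mile",
        "singer_name": "Example Singer",
        "lyrics_author_name": "Example Author",
        "song_type_name": "folk",
    }

    # Audio melody information: a sequence describing audio amplitude over time.
    audio_melody_information, sample_rate = librosa.load("example_song.wav")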

In an embodiment, the timing sequence between the step of acquiring the text information and the step of acquiring the audio melody information is not limited, for example, the step of acquiring the text information may be before the step of acquiring the audio melody information, or the step of acquiring the audio melody information may be before the step of acquiring the text information, and of course, the step of acquiring the text information and the step of acquiring the audio melody information, and the like may be performed at the same time.

In an embodiment, the comment output method can be applied to a terminal, and a user can determine target audio through song selection operation and display generated comments on a terminal page. Specifically, the step of "acquiring text information and audio melody information corresponding to the target audio" may include:

determining a target song for which a comment is to be output, based on a song selection operation of the user on a song selection page of the terminal;

and acquiring text information and audio melody information corresponding to the target song.

In practical application, for example, as shown in fig. 8, when a user wants to match comments to a certain song using the comment output method, the user may click a preset area on a terminal page to send a comment output request to the terminal. After receiving the request, the terminal may display a song selection page, which may include a list of songs, and the user may select a song from the list as the target audio by clicking or the like.

For another example, the song selection page may further include a song title editing area; the user may enter a song title in this area, and the target audio for which a comment is to be output is determined according to the title entered by the user.

202. Respectively extracting features of the text information and the audio melody information to obtain a text feature and a plurality of audio melody features.

In practical applications, for example, after the text information and the audio melody information corresponding to the target audio are acquired, in order to facilitate subsequent processing of them by a network model, a text feature corresponding to the text information and a plurality of audio melody features of the same dimension corresponding to the audio melody information may be extracted by feature extraction. The text feature and the audio melody features may be expressed in the form of a vector or a matrix, or the like.

In an embodiment, the accuracy of extracting the audio melody information features can be improved by extracting various types of audio melody features from the audio melody information, so that the matching degree between the comment and the target audio is improved, and therefore, the audio melody features can be extracted by various audio melody feature extraction methods. Specifically, the step of "respectively performing feature extraction on the text information and the audio melody information to obtain a text feature, and a plurality of audio melody features" may include:

performing feature extraction on the text information to obtain text features;

and performing feature extraction on the audio melody information based on various audio melody feature extraction methods to obtain the audio melody features corresponding to each audio melody feature extraction method.

The audio melody feature extraction method may be any method capable of extracting a feature vector from the audio melody information, and there may be various such methods; for example, the method may be the fast Fourier transform (FFT), the short-time Fourier transform (STFT), Mel-frequency cepstral coefficients (MFCC), or the like.

The fast Fourier transform (FFT) is an efficient algorithm for computing the discrete Fourier transform on a computer, obtained by exploiting symmetry properties of the discrete Fourier transform (odd, even, imaginary, real, and the like). The FFT reduces the number of multiplications required to compute the discrete Fourier transform; the larger the number N of sampling points, the more significant the computational savings.

The short-time Fourier transform (STFT) is a mathematical transform that determines the frequency and phase of the sinusoidal components in a local region of a time-varying signal. The STFT uses a fixed window function, and the resolution can be changed by re-selecting the window function.

Mel-frequency cepstral coefficients (MFCC) are the coefficients that make up the Mel-frequency cepstrum; they are obtained by a linear transform of the log energy spectrum on the nonlinear Mel scale of sound frequency, and are derived from the cepstrum of an audio segment. The frequency bands of the Mel-frequency cepstrum are equally spaced on the Mel scale, which approximates the human auditory system more closely than the linearly spaced bands of the normal log cepstrum. Such a nonlinear representation can better represent an audio signal in fields such as audio compression.

In practical application, for example, after the text information corresponding to the target audio is acquired, feature extraction may be performed on it to obtain the text feature. There may be various audio melody feature extraction methods, each extracting audio melody features with a different physical meaning. Therefore, to improve the accuracy of audio melody feature extraction, the audio melody information can be processed by multiple extraction methods, such as the fast Fourier transform, the short-time Fourier transform, and Mel-frequency cepstral coefficients, to obtain the audio melody feature corresponding to each method; the audio melody features extracted by each method can be compressed into an audio melody feature vector.

For example, as shown in fig. 5, the audio melody information corresponding to the target audio is input into the audio encoder, and its features may be extracted by three methods, i.e., the fast Fourier transform, the short-time Fourier transform, and Mel-frequency cepstral coefficients, to obtain audio melody feature vector 1 (extracted by FFT), audio melody feature vector 2 (extracted by STFT), and audio melody feature vector 3 (extracted by MFCC).
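For illustration, the following is a minimal sketch of extracting the three kinds of audio melody feature vectors, assuming Python with the numpy and librosa packages; the file name, frame sizes, and the common 128-dimensional compression are illustrative assumptions, not details fixed by this application.

    import numpy as np
    import librosa

    y, sr = librosa.load("song.wav", sr=22050)  # waveform and sample rate

    # 1) FFT: magnitude spectrum of the whole signal, resampled to 128 dimensions.
    fft_mag = np.abs(np.fft.rfft(y))
    fft_vec = np.interp(np.linspace(0, len(fft_mag) - 1, 128),
                        np.arange(len(fft_mag)), fft_mag)

    # 2) STFT: time-frequency magnitudes, averaged over frames into one vector.
    stft_mag = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
    stft_vec = stft_mag.mean(axis=1)[:128]

    # 3) MFCC: Mel-frequency cepstral coefficients, averaged over frames and
    #    tiled to the common dimension (illustrative compression choice).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    mfcc_vec = np.resize(mfcc.mean(axis=1), 128)

    # The plurality of audio melody features x_2i, all of the same dimension.
    audio_melody_features = [fft_vec, stft_vec, mfcc_vec]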

In an embodiment, since the text information of the target audio may include a plurality of types of text sub-information, and each type of text sub-information represents different types of information of the target audio, different types of text sub-information cannot be mixed together for encoding, but the plurality of types of text sub-information are encoded respectively to obtain text sub-features corresponding to each type of text sub-information. Specifically, the step of "extracting features of the text information to obtain text features" may include:

determining a plurality of types of text sub-information from the text information;

and respectively extracting the characteristics of the text sub-information of the multiple types to obtain the text sub-characteristics corresponding to the text sub-information of each type.

In practical applications, for example, after the text information corresponding to the target audio is obtained, the text information may be classified; for example, it may be divided into multiple types of text sub-information such as a song name, lyrics, a singer name, a lyrics author name, a song author name, and a song type name, where each type of text sub-information represents a specific aspect of the target audio. Then, feature extraction is performed on each type of text sub-information to obtain the text sub-feature corresponding to each type.

For example, as shown in fig. 5, the text information may include three types of text sub-information of lyrics, singer and audio type, the three types of text sub-information of lyrics, singer and audio type are input into a text encoder, and features of the three types of text sub-information are respectively extracted to obtain a lyrics feature vector, a singer feature vector and an audio type feature vector.

In an embodiment, there may be a plurality of methods for extracting features of the text information; for example, the features may be extracted by obtaining a topic vector corresponding to the text, or by obtaining word vectors through labeling of keywords in the text. For example, models such as LDA (Latent Dirichlet Allocation), word2vec (word to vector), and doc2vec (paragraph to vector) may be used.

LDA (Latent Dirichlet Allocation, a document topic generation model) is an unsupervised machine learning technique that can identify latent topic information in a large-scale document set or corpus. The LDA model adopts the bag-of-words approach, treating each document as a word frequency vector and thereby converting text information into digital information that is easy to model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over words.

The word2vec (word to vector) model is a model for producing word vectors. It is a two-layer neural network trained to reconstruct the linguistic contexts of words, i.e., to predict the words at adjacent positions of an input word. The word2vec model can map each word to a vector, and the vectors can represent word-to-word relationships.

doc2vec (paragraph to vector) is an unsupervised algorithm through which vector representations of sentences, paragraphs, and documents can be obtained. The similarity between sentences, paragraphs, and documents can be computed as distances between the learned vectors, which can be applied to text clustering; for labeled data, text classification can be performed by supervised learning, for example in sentiment analysis scenarios.
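For illustration, the following is a minimal sketch of turning lyrics into one text sub-feature with a word2vec model, assuming Python with the gensim package; the toy corpus, the 128-dimensional vector size, and the averaging of word vectors are illustrative assumptions.

    import numpy as np
    from gensim.models import Word2Vec

    # Toy tokenized lyrics corpus (illustrative); real training would use a
    # large corpus of song texts.
    lyrics_tokens = [["moon", "river", "wider", "than", "a", "mile"]]
    model = Word2Vec(sentences=lyrics_tokens, vector_size=128,
                     min_count=1, epochs=50)

    # Average the word vectors of the lyrics into a single lyrics sub-feature.
    lyrics_vec = np.mean([model.wv[w] for w in lyrics_tokens[0]], axis=0)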

In an embodiment, as shown in fig. 5, the comment output method uses a text encoder capable of extracting the text features corresponding to the text information and an audio encoder capable of extracting the audio melody features corresponding to the audio melody information. The embodiment of the present application does not limit the specific forms of the text encoder and the audio encoder; any model that can convert text information or audio melody information into features in vector or matrix form may be used as an encoder in the embodiment of the present application.

In an embodiment, the timing sequence between the step of acquiring the text feature and the step of acquiring the audio melody feature is not limited, for example, the step of acquiring the text feature may be before the step of acquiring the audio melody feature, or the step of acquiring the audio melody feature may be before the step of acquiring the text feature, or of course, the step of acquiring the text feature and the step of acquiring the audio melody feature may be performed at the same time, and the like.

203. Performing semantic decoding on the text features based on the semantic decoding model to obtain decoded data, where the decoded data includes current state information representing the hidden state of the semantic decoding model.

In practical application, after the text features and the audio melody features are obtained, they can be decoded by a decoder to finally obtain the comment corresponding to the target audio. The decoder may be a model capable of converting the fixed vectors generated by the encoding into an output sequence; for example, it may be the decoding part of a seq2seq (sequence to sequence) model. The decoder may include a semantic decoding model that decodes the input data multiple times to obtain multiple output sequences; the multiple output sequences output by the semantic decoding model are then combined to obtain the comment finally output by the decoder.

Here, sequence to sequence (seq2seq) is a neural network with an encoder-decoder structure; it is called sequence to sequence because its input is a sequence and its output is also a sequence. Given an input sequence, seq2seq generates an output sequence by a specific method. For example, seq2seq may be applied to translation, where an input sentence in one language yields the corresponding sentence in another language; for another example, seq2seq may be applied to man-machine conversation, where the input sequence may be "Who are you?" and the output an answer such as "I am so-and-so."

The semantic decoding model may be a model, located in the decoder, that summarizes information from historical time steps and passes it to the current state, so as to learn the state of every node in a sequence. For example, the semantic decoding model may be an RNN (recurrent neural network) model, an LSTM (long short-term memory) model, or the like.

Here, a recurrent neural network (RNN) takes sequence data as input and recurses in the direction of the sequence's evolution, with all nodes connected in a chain. Recurrent neural networks have the characteristics of memory, parameter sharing, and Turing completeness, and are suitable for learning the nonlinear characteristics of sequences. Owing to these characteristics, recurrent neural networks are applied in natural language processing fields such as speech recognition, language modeling, and machine translation.

Among them, the long short-term memory network (LSTM) is a temporal recurrent neural network that can solve the gradient vanishing problem of general recurrent neural networks. At each sequence index position, the long short-term memory network propagates forward a hidden state h_t and a cell state c_t, and includes a forget gate, an input gate, and an output gate.

The decoded data may be the data output by the semantic decoding model. For example, when the semantic decoding model is an LSTM model and a text feature vector corresponding to the text information is input, current state information h_t representing the hidden state of the semantic decoding model and comment content y_i can be obtained. The comment content may represent one piece of the output sequence produced by the semantic decoding model, and the required comment can be obtained by combining at least one comment content output by the LSTM model.

In practical application, for example, the text feature may be used as the initial input data of a semantic decoding model, where the semantic decoding model may be an LSTM model. The input text feature is semantically decoded based on the semantic decoding model to obtain decoded data, where the decoded data includes current state information h_t representing the hidden state of the semantic decoding model and comment content y_i.
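For illustration, the following is a minimal sketch of a single semantic-decoding step, assuming Python with PyTorch and using an LSTM cell as the semantic decoding model; the dimensions, vocabulary size, and random stand-in feature are illustrative assumptions.

    import torch
    import torch.nn as nn

    feat_dim, hidden_dim, vocab_size = 128, 256, 5000
    cell = nn.LSTMCell(input_size=feat_dim, hidden_size=hidden_dim)
    to_vocab = nn.Linear(hidden_dim, vocab_size)

    x_avg = torch.randn(1, feat_dim)   # initial text feature (random stand-in)
    h0 = torch.zeros(1, hidden_dim)    # initial hidden state
    c0 = torch.zeros(1, hidden_dim)    # initial cell state

    # One decoding step: produces current state information h_t and, through a
    # projection to the vocabulary, one comment content y_i (a token id).
    h_t, c_t = cell(x_avg, (h0, c0))
    y_i = to_vocab(h_t).softmax(dim=-1).argmax(dim=-1)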

In an embodiment, since the text feature includes a plurality of text sub-features, the comment output method further includes a fusion process of the text sub-features. Specifically, the step "performing semantic decoding on the text feature based on a semantic decoding model to obtain decoded data, where the decoded data includes current state information representing a hidden state of the semantic decoding model", may include:

fusing a plurality of text sub-features in the text features to obtain initial text features;

and performing semantic decoding on the initial text features based on a semantic decoding model to obtain decoded data, wherein the decoded data comprises current state information representing the hidden state of the semantic decoding model.

In practical applications, for example, since the text feature includes a plurality of text sub-features with different meanings, each text sub-feature may be denoted x_1i and assigned a weight α_i, and the initial text feature may be denoted x_avg; the initial text feature x_avg contains the fused feature information of the plurality of text sub-features. The initial text feature x_avg may be calculated as follows:

x_avg = Σ_i (α_i · x_1i)

where Σ_i α_i = 1. As shown in fig. 6, after the initial text feature is computed, it may be input into the LSTM model; the LSTM model then outputs current state information h_t representing the current hidden state of the semantic decoding model and comment content y_i.
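For illustration, the following is a minimal sketch of this weighted fusion, assuming Python with numpy; the three random sub-features and the uniform weights are illustrative assumptions.

    import numpy as np

    # Stand-ins for lyrics / singer / audio-type text sub-features x_1i.
    x1 = [np.random.rand(128) for _ in range(3)]
    alpha = np.array([1/3, 1/3, 1/3])   # weights α_i, which must sum to 1

    # x_avg = Σ_i α_i · x_1i
    x_avg = sum(a * x for a, x in zip(alpha, x1))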

In an embodiment, the weight corresponding to each text sub-feature in the embodiment of the present application may be adjusted according to an actual situation, for example, the weight corresponding to each text sub-feature may also be obtained by learning according to the actual situation.

In one embodiment, the method of fusing the plurality of text sub-features into the initial text feature is not limited to the weighted average method, as long as the method can fuse the plurality of text sub-features into one initial text feature.

204. Based on the current state information, fusing the plurality of audio melody features to obtain attention feature data representing the attention degree of the audio melody features.

The attention mechanism can be applied in the decoder: when the input is long, the information of the input sequence can be introduced into the decoder through the attention mechanism. For example, the information of the input sequence may be added to the decoder's computation by assigning a weight to each input element, where each weight represents the amount of attention allocated to the corresponding element. Because the original information is introduced, decoding accuracy can be improved.

In practical application, when music enthusiasts comment on a song, they consider not only information such as the song's lyrics, authors, and type, but also intuitive audio melody information such as the song's melody. Therefore, by adding an attention mechanism to the decoder, the audio melody information of the target audio can be taken into account, and a comment with a higher matching degree with the target audio can be obtained. For example, after the current state information h_t representing the current hidden state of the semantic decoding model is obtained, the attention allocated to each audio melody feature is calculated from the current state information, and the audio melody features are fused according to the attention allocated to each, obtaining attention feature data representing the attention degree of the audio melody features.

In one embodiment, the attention amount assigned to each audio melody feature may be expressed in the form of a weight. Specifically, the step "based on the current state information, fusing a plurality of audio melody features to obtain attention feature data representing the attention degree of the audio melody features" may include:

acquiring the weight corresponding to each audio melody feature based on the current state information;

and performing weighting operation on the plurality of audio melody characteristics based on the weight to obtain attention characteristic data representing the attention degree of the audio melody characteristics.

In practical applications, for example, the currently calculated current state information h_t and the plurality of audio melody features x_2i may be input into the weight calculation function f_attention to calculate the weight β_i corresponding to each audio melody feature x_2i; the plurality of audio melody features are then weighted by the calculated weights β_i to obtain attention feature data h_avg representing the attention degree of the audio melody features. The weight β_i and the attention feature data h_avg may be calculated as follows:

β_i = f_attention(h_t, x_2i)

h_avg = Σ_i (β_i · x_2i)

The form of the weight calculation function f_attention is not fixed; any weight calculation method of the attention mechanism may be used.
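For illustration, the following is a minimal sketch of this attention fusion, assuming Python with numpy. Since the application does not fix the form of f_attention, a dot-product score followed by softmax normalization is used here as one common, illustrative choice.

    import numpy as np

    def f_attention(h_t, x2):
        # Score each audio melody feature x_2i against the hidden state h_t
        # (dot product), then normalize to weights β_i with a stable softmax.
        scores = np.array([h_t @ x for x in x2])
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()

    h_t = np.random.rand(128)                      # current state information
    x2 = [np.random.rand(128) for _ in range(3)]   # FFT / STFT / MFCC features

    beta = f_attention(h_t, x2)                    # β_i = f_attention(h_t, x_2i)
    h_avg = sum(b * x for b, x in zip(beta, x2))   # h_avg = Σ_i β_i · x_2i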

In one embodiment, since adding the original sequence information to the decoder through the attention mechanism improves decoding accuracy, the original text features can also be added to the decoder to enrich its data. For example, the plurality of audio melody features and the plurality of text sub-features may together be denoted x_i; the current state information h_t and the plurality of features x_i are input into the weight calculation function f_attention to calculate the weight β_i of each feature, and the features are then weighted by the calculated weights β_i to obtain attention feature data h_avg representing the attention degree of each feature. The weight β_i and the attention feature data h_avg may be calculated as follows:

β_i = f_attention(h_t, x_i)

h_avg = Σ_i (β_i · x_i)

The form of the weight calculation function f_attention is not fixed; any weight calculation method of the attention mechanism may be used.

205. Performing semantic decoding on the decoded data and the attention feature data based on the semantic decoding model to obtain the comment corresponding to the target audio.

The comment may be interactive information related to the target audio. For example, when the target audio is a song, the comment may be a comment corresponding to the song; by reading the comment, a user can learn about the song's content, type, lyrics, author, and the like, improving the song's exposure.

In practical applications, for example, after the current state information h_t, the comment content y_i, and the attention feature data h_avg are obtained through the semantic decoding model, the parameters of the semantic decoding model may be updated; then the current state information h_t, the comment content y_i, and the attention feature data h_avg are input into the updated semantic decoding model, and the prediction of the comment content continues until it is complete, obtaining the comment corresponding to the target audio.

In an embodiment, since the seq2seq decoder passes through the LSTM model multiple times to obtain a plurality of comment contents and obtains the final comment from them, the comment output method includes a cyclic semantic decoding step. Specifically, the step of performing semantic decoding on the decoded data and the attention feature data based on the semantic decoding model to obtain the comment corresponding to the target audio may include:

taking the decoded data and the attention feature data as current inputs to the semantic decoding model;

based on the semantic decoding model, performing semantic decoding on the current input to obtain semantic decoded data;

updating decoded data based on the semantically decoded data;

when the decoded data does not meet the termination condition, returning to execute the step of fusing the audio melody characteristics based on the current state information to obtain attention characteristic data representing the attention degree of the audio melody characteristics;

and when the decoded data meets the termination condition, obtaining the comment corresponding to the target audio based on the decoded data.

In practical applications, for example, the current state information h_t in the decoded data, the comment content y_i, and the attention feature data h_avg may be taken as the current input of the semantic decoding model; semantic decoding is then performed by the semantic decoding model to obtain semantically decoded data, which includes the decoded current state information h_{t+1} and the decoded comment content y_{i+1}. The decoded current state information h_{t+1} may then be taken as the current state information h_t, and the decoded comment content y_{i+1} as the comment content y_i; that is, the semantically decoded data is taken as the decoded data. When the decoded data does not satisfy the termination condition, the weight β_i corresponding to each audio melody feature can again be determined from the current state information h_t, the plurality of audio melody features weighted by the obtained weights β_i to obtain the attention feature data h_avg, and the current state information h_t, the comment content y_i, and the attention feature data h_avg again taken as the current input of the semantic decoding model to obtain, through the model, the decoded current state information h_{t+1} and the decoded comment content y_{i+1}. This repeats until the decoded data satisfies the termination condition, at which point the obtained comment contents can be combined to obtain the comment corresponding to the target audio.
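For illustration, the following is a minimal end-to-end sketch of this cyclic decoding, assuming Python with PyTorch; the dimensions, vocabulary, start token, and count-based termination are illustrative assumptions (the two termination conditions are described below).

    import torch
    import torch.nn as nn

    feat_dim, hidden_dim, vocab_size, max_len = 128, 256, 5000, 20
    # The cell consumes the embedded previous comment content y_i together
    # with the fused attention feature data h_avg.
    cell = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
    embed = nn.Embedding(vocab_size, feat_dim)
    to_vocab = nn.Linear(hidden_dim, vocab_size)

    def attention(h_t, x2):
        # β_i from dot-product scores, then h_avg = Σ_i β_i · x_2i.
        beta = torch.softmax(torch.stack([h_t @ x for x in x2]), dim=0)
        return sum(b * x for b, x in zip(beta, x2))

    x2 = [torch.randn(hidden_dim) for _ in range(3)]   # audio melody features
    h_t = torch.zeros(1, hidden_dim)
    c_t = torch.zeros(1, hidden_dim)
    y_i = torch.tensor([0])                            # start token (assumed)

    comment_tokens = []
    while len(comment_tokens) < max_len:               # termination condition
        h_avg = attention(h_t[0], x2)                  # fuse audio features
        inp = torch.cat([embed(y_i), h_avg.unsqueeze(0)], dim=-1)
        h_t, c_t = cell(inp, (h_t, c_t))               # semantic decoding step
        y_i = to_vocab(h_t).argmax(dim=-1)             # next comment content
        comment_tokens.append(y_i.item())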

In an embodiment, the termination condition of the loop may be determined by controlling the number of acquired comment contents. Specifically, the step "when the decoded data satisfies the termination condition, obtaining the comment corresponding to the target audio based on the decoded data" may include:

determining the content information quantity of comment content in the decoded data;

and when the content information quantity reaches a preset information quantity, combining comment contents in the decoded data to obtain a comment corresponding to the target audio.

In practical application, when the number of the comment contents is detected to reach the preset information amount, the decoded data can be considered to satisfy the termination condition, and the obtained comment contents are combined into the comment corresponding to the target audio. For example, the preset information amount may be set to 20: after the comment contents y_0, y_1, y_2, …, y_20 are acquired, the number of comment contents reaches 20, the decoded data can be considered to satisfy the termination condition, and the comment contents y_0, y_1, y_2, …, y_20 can be combined to obtain the comment corresponding to the target audio.

In an embodiment, the termination condition of the loop may also be determined by detecting whether the decoded data includes a preset end identifier. Specifically, the step "when the decoded data satisfies the termination condition, obtaining the comment corresponding to the target audio based on the decoded data" may include:

detecting the decoded data to obtain a detection result;

and when the detection result determines that the decoded data comprises a preset ending identifier, combining comment contents in the decoded data to obtain a comment corresponding to the target audio.

In practical applications, for example, after the decoded data is obtained, it may be examined; when the decoded data is detected to include the preset end identifier "end", the decoded data may be considered to satisfy the termination condition, and the obtained comment contents are combined to obtain the comment corresponding to the target audio.
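For illustration, the two termination checks described above can be sketched as follows, assuming Python; the threshold of 20 and the identifier "end" follow the examples in this application, and combining both checks in one function is an illustrative choice.

    PRESET_AMOUNT = 20       # preset information amount (example from above)
    END_IDENTIFIER = "end"   # preset end identifier (example from above)

    def decoding_terminated(comment_contents):
        # Termination condition 1: the content amount reaches the preset amount.
        if len(comment_contents) >= PRESET_AMOUNT:
            return True
        # Termination condition 2: the preset end identifier was decoded.
        if END_IDENTIFIER in comment_contents:
            return True
        return False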

In an embodiment, the termination condition is not limited to the two conditions described above; the embodiment of the present application does not restrict it, and the termination condition may be adjusted according to actual requirements, as long as a comment meeting the requirements is obtained when the loop terminates.

In an embodiment, for example, as shown in fig. 7, a weighting operation may be performed on the plurality of text sub-features in the text feature to obtain the initial text feature x_avg, and the initial text feature x_avg is then input into an LSTM model to obtain the current state information h_0 and the comment content y_0. At this point, the weight corresponding to each audio melody feature can be obtained from the current state information h_0, a weighting operation is performed on the plurality of audio melody features to obtain the attention feature data h_avg, and the parameters in the LSTM model are updated. The current state information h_0, the comment content y_0, and the attention feature data h_avg may then be input into the LSTM model to obtain the current state information h_1 and the comment content y_1. If the comment content does not satisfy the termination condition, the attention feature data h_avg continues to be computed and fed back into the LSTM model in the same way until the output comment content satisfies the termination condition, at which point the comment corresponding to the target audio can be obtained from the acquired comment contents y_0, y_1, ..., y_n. After each pass of semantic decoding through the LSTM model, the parameters in the LSTM model are updated, and the updated LSTM model is used to continue the semantic decoding.
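The fig. 7 loop can be summarized in code. The following is a minimal sketch, assuming PyTorch and a single `nn.LSTMCell`; the dimensions, the dot-product form of the attention scoring function, and the greedy projection from hidden state to comment content are illustrative assumptions, not the specific design of this embodiment. (In an `LSTMCell`, the state h_i is carried in the recurrent state rather than concatenated into the input.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, VOCAB = 128, 5000                                     # illustrative sizes
cell = nn.LSTMCell(input_size=2 * DIM, hidden_size=DIM)    # input is [y_i ; h_avg]
embed = nn.Embedding(VOCAB, DIM)                           # embeds comment content y_i
out_proj = nn.Linear(DIM, VOCAB)                           # maps h_i to a distribution over y_i

def attention_fuse(h, melody_feats):
    # beta_i = f_attention(h_i, x_2i), here a scaled dot product; h_avg = sum(beta_i * x_2i).
    scores = melody_feats @ h.squeeze(0) / DIM ** 0.5      # one score per audio melody feature
    beta = F.softmax(scores, dim=0)                        # weights sum to 1
    return (beta.unsqueeze(1) * melody_feats).sum(dim=0, keepdim=True)

def decode(x_avg, melody_feats, max_steps=20):
    # Step 0: the initial text feature x_avg (zero-padded where h_avg will later go)
    # produces h_0, c_0 and the first comment content y_0.
    inp = torch.cat([x_avg, torch.zeros(1, DIM)], dim=1)
    h, c = cell(inp)
    contents = [out_proj(h).argmax(dim=1)]
    for _ in range(max_steps - 1):                         # count-based termination condition
        h_avg = attention_fuse(h, melody_feats)            # fuse the audio melody features
        inp = torch.cat([embed(contents[-1]), h_avg], dim=1)
        h, c = cell(inp, (h, c))                           # h_{i+1}: the recurrent state update
        contents.append(out_proj(h).argmax(dim=1))
    return torch.stack(contents)                           # combined into the comment

# Usage with random stand-ins for the encoder outputs:
comment = decode(torch.randn(1, DIM), torch.randn(3, DIM))
```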

In an embodiment, the rectangles in fig. 7 may represent LSTM models. Although fig. 7 includes a plurality of rectangles, an actual decoder does not necessarily include a plurality of LSTM models; the same LSTM model may instead undergo a plurality of parameter updates, and for convenience of description the repeatedly updated LSTM model is represented by a plurality of rectangles in the figure.

206. Output the comment.

In practical applications, for example, after a comment corresponding to a target audio is acquired, the comment can be output so that a user can view the comment and know the content of the target audio according to the comment.

In one embodiment, the comment output method can be applied to various scenes, for example, in music playing software, the comment output method can be applied to automatically generate comments for a song, so that a user is guided to listen to the song and share the song. For another example, in a music recommendation scenario, the comment output method may be applied to automatically generate a comment for a song, and the comment is used as a reason for recommending the song, thereby attracting a user to listen to the song. For another example, in music social software, the comment output method can be applied to automatically generate comments for songs, so that the purposes of guiding community public opinion and activating community atmosphere are achieved.

In an embodiment, for example, as shown in fig. 9, after the terminal generates the comment corresponding to the target audio through the comment output method, the comment may be displayed on a song comment page of the terminal. As another example, the song comment page may be a sub-page of a page on the terminal, and the comment content corresponding to the target audio is displayed on that sub-page.

As can be seen from the above, in the embodiment of the application, text information and audio melody information corresponding to a target audio may be obtained; feature extraction is performed on the text information and the audio melody information respectively to obtain a text feature and a plurality of audio melody features; semantic decoding is performed on the text feature based on a semantic decoding model to obtain decoded data, the decoded data including current state information representing a hidden state of the semantic decoding model; the plurality of audio melody features are fused based on the current state information to obtain attention feature data representing the degree of attention paid to the audio melody features; the decoded data and the attention feature data are semantically decoded based on the semantic decoding model to obtain a comment corresponding to the target audio; and the comment is output. In this scheme, the text information and the audio melody information corresponding to the target audio are encoded by an encoder, the text information is decoded by a decoder, and the audio melody information is introduced into the decoder through an attention mechanism, so that the decoder can take both the text information and the audio melody information of the target audio into account and automatically output the comment corresponding to the target audio, thereby improving the degree of matching between the comment and the target audio.

According to the method described in the foregoing embodiment, the following describes the method in further detail by way of an example in which the comment output apparatus is specifically integrated in a network device.

Referring to fig. 3, a specific flow of the comment output method according to the embodiment of the present application may be as follows:

301. The network device acquires a plurality of types of text sub-information and the audio melody information corresponding to the target song.

In practical applications, for example, a plurality of types of text sub-information corresponding to the target song, such as a song name, lyrics, a singer name, a lyrics author name, a song author name, and a song type name, may be acquired, and audio melody information representing the melody of the target song may be acquired.
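For concreteness, the acquired inputs for one target song might look like the following; the field names and values are purely hypothetical.

```python
# Hypothetical shape of the acquired data for one target song.
text_sub_info = {
    "song_name": "Shape of You",
    "lyrics": "The club isn't the best place to find a lover ...",
    "singer_name": "Ed Sheeran",
    "lyricist_name": "Ed Sheeran",
    "composer_name": "Ed Sheeran",
    "song_type": "pop",
}
audio_melody_info = "target_song.mp3"   # raw audio consumed by the audio encoder
```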

In an embodiment, the timing sequence between the step of acquiring the text sub-information and the step of acquiring the audio melody information is not limited, for example, the step of acquiring the text sub-information may be before the step of acquiring the audio melody information, or the step of acquiring the audio melody information may be before the step of acquiring the text sub-information, or of course, the step of acquiring the text sub-information and the step of acquiring the audio melody information may be performed at the same time, and the like.

302. The network device extracts, through a text encoder, the text sub-features of the plurality of types of text sub-information.

In practical application, for example, the plurality of types of text sub-information may be input into the text encoder, and the text sub-feature corresponding to each piece of text sub-information is extracted by a text feature extraction method preset in the text encoder. The text feature extraction method may, for example, extract the text sub-features through models such as LDA (Latent Dirichlet Allocation), word2vec (word to vector), and doc2vec.
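As one concrete possibility (assuming gensim 4.x; the averaging step is an illustrative choice, since the embodiment leaves the exact extraction method open), each type of text sub-information could be mapped to a fixed-length sub-feature like this:

```python
import numpy as np
from gensim.models import Word2Vec

# Train (or load) word vectors, then average them per text sub-information.
corpus = [["shape", "of", "you"], ["ed", "sheeran"]]       # hypothetical tokenized sub-info
w2v = Word2Vec(sentences=corpus, vector_size=128, min_count=1)

def text_sub_feature(tokens):
    # x_1i: the mean of the word2vec vectors of the tokens in one sub-information.
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0)

text_sub_features = [text_sub_feature(tokens) for tokens in corpus]
```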

303. The network device extracts a plurality of audio melody features of the audio melody information through the audio encoder.

In practical applications, for example, the audio melody information may be input into the audio encoder, and the audio melody feature corresponding to each audio melody feature extraction method may be extracted by a plurality of audio melody feature extraction methods preset in the audio encoder. The audio melody feature extraction methods may be various, for example, FFT (Fast Fourier Transform), STFT (Short-Time Fourier Transform), or MFCC (Mel-Frequency Cepstral Coefficients).
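A sketch of what the three extraction methods could look like, assuming librosa and NumPy; the time-pooling to one fixed-length vector per method (and the projection to a common dimension that a real decoder would need) are illustrative assumptions:

```python
import numpy as np
import librosa

y, sr = librosa.load("target_song.mp3")                    # hypothetical file path

# One audio melody feature x_2i per extraction method, pooled over time.
stft_feat = np.abs(librosa.stft(y)).mean(axis=1)           # STFT magnitude spectrum
mfcc_feat = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)  # MFCCs
fft_feat = np.abs(np.fft.rfft(y))[:128]                    # coarse FFT spectrum slice

audio_melody_features = [stft_feat, mfcc_feat, fft_feat]
```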

In an embodiment, the timing sequence between the step of obtaining the text sub-feature and the step of obtaining the audio melody feature is not limited, for example, the step of obtaining the text sub-feature may be before the step of obtaining the audio melody feature, or the step of obtaining the audio melody feature may be before the step of obtaining the text sub-feature, or of course, the step of obtaining the text sub-feature and the step of obtaining the audio melody feature may be performed at the same time, and the like.

304. The network device fuses the plurality of text sub-features into an initial text feature.

In practical applications, for example, each text sub-feature may be denoted by x_1i and assigned a weight α_i, the initial text feature may be denoted by x_avg, and the plurality of text sub-features may be fused into the initial text feature by weighted averaging. The initial text feature x_avg may be calculated as follows:

x_avg = ∑(α_i · x_1i)

where ∑ α_i = 1.
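A small worked example of this weighted average, with three two-dimensional sub-features and equal (illustrative) weights:

```python
import numpy as np

x_1 = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])   # three text sub-features x_1i
alpha = np.array([1 / 3, 1 / 3, 1 / 3])                # weights alpha_i, summing to 1
x_avg = (alpha[:, None] * x_1).sum(axis=0)             # -> array([3.0, 2.0])
```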

305. The network device inputs the initial text feature into a semantic decoding model for semantic decoding to obtain the current state information h_i and the comment content y_i.

In practical application, for example, the initial text feature may be input into the LSTM model, and the current state information h_i representing the current hidden state of the LSTM model and the comment content y_i are obtained through the operation of the LSTM model. The parameters in the LSTM model are then updated.

306. The network device determines the weight β_i corresponding to each audio melody feature according to the current state information h_i.

In practical applications, for example, the current state information h_i and each of the plurality of audio melody features x_2i may be input into the weight calculation function f_attention, so that the weight β_i corresponding to each audio melody feature is calculated respectively. The weight β_i may be calculated as follows:

β_i = f_attention(h_i, x_2i)

307. The network device fuses the plurality of audio melody features based on the weights β_i to obtain the attention feature data h_avg.

In practical applications, for example, the plurality of audio melody features may be weighted according to the calculated weights β_i to obtain the attention feature data h_avg representing the degree of attention paid to the audio melody features. The attention feature data h_avg may be calculated as follows:

h_avg = ∑(β_i · x_2i)
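Steps 306 and 307 together amount to an attention-weighted average. A minimal NumPy sketch, assuming f_attention is a softmax over dot-product scores (the embodiment leaves the scoring function unspecified):

```python
import numpy as np

def fuse_audio_features(h_i, melody_feats):
    scores = melody_feats @ h_i                 # one score per audio melody feature x_2i
    beta = np.exp(scores - scores.max())
    beta /= beta.sum()                          # beta_i = f_attention(h_i, x_2i), summing to 1
    return beta @ melody_feats                  # h_avg = sum(beta_i * x_2i)

h_i = np.random.rand(128)                       # current state information
melody_feats = np.random.rand(3, 128)           # three audio melody features
h_avg = fuse_audio_features(h_i, melody_feats)
```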

308. The network device inputs the current state information h_i, the comment content y_i, and the attention feature data h_avg into the semantic decoding model for semantic decoding to obtain the current state information h_{i+1} and the comment content y_{i+1}.

In practical application, for example, the current state information h_i, the comment content y_i, and the attention feature data h_avg may be taken as the current input of the semantic decoding model, and semantic decoding is then performed through the semantic decoding model to obtain the current state information h_{i+1} and the comment content y_{i+1}. The parameters in the LSTM model are then updated.

309. When the comment content does not satisfy the termination condition, the network device returns to the step of determining the weight β_i corresponding to each audio melody feature according to the current state information h_i (step 306).

In practical applications, for example, the preset information quantity may be set in advance to 20. When the number of acquired comment contents has not reached 20, the comment content at this point can be considered not to satisfy the termination condition, that is, the acquisition of comment content may continue. Execution then returns to determining the weight β_i corresponding to each audio melody feature according to the current state information h_i, fusing the plurality of audio melody features according to the weights β_i to obtain the attention feature data h_avg, and inputting the current state information h_i, the comment content y_i, and the attention feature data h_avg into the semantic decoding model for semantic decoding to obtain the current state information h_{i+1} and the comment content y_{i+1}, after which it is again determined whether the comment content satisfies the termination condition.

In practical application, for example, after the comment content is obtained, the comment content may be checked; when it is detected that the comment content does not include the preset end identifier "end", the comment content at this point can be considered not to satisfy the termination condition, that is, the acquisition of comment content may continue. Execution then returns to determining the weight β_i corresponding to each audio melody feature according to the current state information h_i, fusing the plurality of audio melody features according to the weights β_i to obtain the attention feature data h_avg, and inputting the current state information h_i, the comment content y_i, and the attention feature data h_avg into the semantic decoding model for semantic decoding to obtain the current state information h_{i+1} and the comment content y_{i+1}, after which it is again determined whether the comment content satisfies the termination condition.

310. When the comment content satisfies the termination condition, the network device determines the song comment corresponding to the target song based on the plurality of comment contents.

In practical applications, for example, as shown in fig. 4, the preset information quantity may be set in advance to 20; when the number of obtained comment contents reaches 20, the comment content at this point can be considered to satisfy the termination condition, that is, the acquisition of comment content need not continue. The obtained comment contents may then be combined to determine the song comment corresponding to the target song.

In practical application, for example, after the comment content is obtained, the comment content may be checked; when it is detected that the comment content includes the preset end identifier "end", the comment content at this point can be considered to satisfy the termination condition, that is, the acquisition of comment content need not continue. The obtained comment contents may then be combined to determine the song comment corresponding to the target song.

As can be seen from the above, in the embodiment of the present application, text information and audio melody information corresponding to a target song may be obtained; feature extraction is performed on the text information and the audio melody information respectively to obtain a text feature and a plurality of audio melody features; semantic decoding is performed on the text feature based on a semantic decoding model to obtain decoded data, the decoded data including current state information representing a hidden state of the semantic decoding model; the plurality of audio melody features are fused based on the current state information to obtain attention feature data representing the degree of attention paid to the audio melody features; the decoded data and the attention feature data are semantically decoded based on the semantic decoding model to obtain the song comment corresponding to the target song; and the song comment is output. In this scheme, the text information and the audio melody information corresponding to the target song are encoded by an encoder, the text information is decoded by a decoder, and the audio melody information is introduced into the decoder through an attention mechanism, so that the decoder can take both the text information and the audio melody information of the target song into account and automatically generate the song comment corresponding to the target song, thereby improving the degree of matching between the song comment and the target song.

In order to better implement the above method, an embodiment of the present application may further provide a comment output apparatus, where the comment output apparatus may be specifically integrated in a network device, and the network device may include a server, a terminal, and the like, where the terminal may include: a mobile phone, a tablet computer, a notebook computer, a Personal Computer (PC), or the like.

For example, as shown in fig. 10, the comment output apparatus may include an acquisition module 101, a feature extraction module 102, a first decoding module 103, a fusion module 104, a second decoding module 105, and an output module 106, as follows:

the acquiring module 101 is configured to acquire text information and audio melody information corresponding to a target audio;

the feature extraction module 102 is configured to perform feature extraction on the text information and the audio melody information respectively to obtain a text feature and a plurality of audio melody features;

the first decoding module 103 is configured to perform semantic decoding on the text features based on a semantic decoding model to obtain decoded data, where the decoded data includes current state information representing a hidden state of the semantic decoding model;

a fusion module 104, configured to fuse the multiple audio melody characteristics based on the current state information to obtain attention characteristic data representing the attention degree of the audio melody characteristics;

a second decoding module 105, configured to perform semantic decoding on the decoded data and the attention feature data based on the semantic decoding model to obtain a comment corresponding to the target audio;

an output module 106, configured to output the comment.
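As a plain-Python sketch of how the modules in fig. 10 might be wired together (the attribute names and the `decoded.state` access are assumptions for illustration):

```python
class CommentOutputDevice:
    # Composes callables standing in for modules 101-106.
    def __init__(self, acquire, extract, decode_text, fuse, decode_comment, output):
        self.acquire, self.extract = acquire, extract
        self.decode_text, self.fuse = decode_text, fuse
        self.decode_comment, self.output = decode_comment, output

    def run(self, target_audio):
        text_info, melody_info = self.acquire(target_audio)             # module 101
        text_feat, melody_feats = self.extract(text_info, melody_info)  # module 102
        decoded = self.decode_text(text_feat)                           # module 103
        h_avg = self.fuse(decoded.state, melody_feats)                  # module 104
        comment = self.decode_comment(decoded, h_avg)                   # module 105
        self.output(comment)                                            # module 106
```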

In an embodiment, the feature extraction module 102 may include a first extraction submodule 1021 and a second extraction submodule 1022, as follows:

a first extraction submodule 1021, configured to perform feature extraction on the text information to obtain a text feature;

the second extraction sub-module 1022 is configured to perform feature extraction on the audio melody information based on multiple audio melody feature extraction methods, so as to obtain audio melody features corresponding to each audio melody feature extraction method.

In an embodiment, the first extraction sub-module 1021 may be specifically configured to:

determining a plurality of types of text sub-information from the text information;

and respectively extracting the characteristics of the text sub-information of the multiple types to obtain the text sub-characteristics corresponding to the text sub-information of each type.

In an embodiment, the first decoding module 103 may be specifically configured to:

fusing a plurality of text sub-features in the text features to obtain initial text features;

and performing semantic decoding on the initial text features based on a semantic decoding model to obtain decoded data, wherein the decoded data comprises current state information representing the hidden state of the semantic decoding model.

In an embodiment, the fusion module 104 may be specifically configured to:

acquiring the weight corresponding to each audio melody feature based on the current state information;

and performing weighting operation on the plurality of audio melody characteristics based on the weight to obtain attention characteristic data representing the attention degree of the audio melody characteristics.

In an embodiment, the second decoding module 105 may include a determination sub-module 1051, a decoding sub-module 1052, an update sub-module 1053, a return sub-module 1054, and an acquisition sub-module 1055, as follows:

a determination submodule 1051, configured to take the decoded data and the attention feature data as the current input of the semantic decoding model;

a decoding submodule 1052, configured to perform semantic decoding on the current input based on the semantic decoding model to obtain semantically decoded data;

an update sub-module 1053 for updating the decoded data based on the semantically decoded data;

a returning submodule 1054, configured to, when the decoded data does not meet the termination condition, return to perform a step of fusing the multiple audio melody features based on the current state information to obtain attention feature data representing the attention degree of the audio melody features;

the obtaining sub-module 1055 is configured to, when the decoded data meets a termination condition, obtain a comment corresponding to the target audio based on the decoded data.

In an embodiment, the obtaining sub-module 1055 may specifically be configured to:

determining the content information quantity of comment content in the decoded data;

and when the content information quantity reaches a preset information quantity, combining comment contents in the decoded data to obtain a comment corresponding to the target audio.

In an embodiment, the obtaining sub-module 1055 may specifically be configured to:

detecting the decoded data to obtain a detection result;

and when the detection result determines that the decoded data comprises a preset ending identifier, combining comment contents in the decoded data to obtain a comment corresponding to the target audio.

In an embodiment, the obtaining module 101 may be specifically configured to:

determining a target song for which a comment is to be output, based on a song selection operation performed by a user on a song selection page of a terminal;

and acquiring text information and audio melody information corresponding to the target song.
