Audio processing method and device in video

Document No.: 306962    Publication date: 2021-11-26

Reading note: This technology, "Audio processing method and device in video," was designed and created by Li Binchao on 2021-09-02. Abstract: Embodiments of the present invention provide an audio processing method and device in a video. During playback of a target video, when an audio masking instruction is received, the target audio, which contains the sounds of a plurality of target characters in the target video, is separated to obtain the to-be-processed audios of the plurality of target characters. Based on the voiceprint features of each to-be-processed audio, the target voiceprint model matching that audio is determined from among the to-be-matched voiceprint models, one per target character, generated in advance from the target characters' sample audios. The target character to whom the to-be-processed audio belongs in the target video is determined to be the target character to whom the sample audio used to train the target voiceprint model belongs. During playback of the target video, the remaining to-be-processed audios are played after the to-be-processed audio corresponding to the character to be masked, as indicated by the user, has been masked. With this processing, the voice of a specific character indicated by the user can be masked, meeting the user's personalized needs.

1. An audio processing method in video, which is applied to a client and comprises the following steps:

during playback of a target video, when an audio masking instruction for the target video is received, separating the target audio, which contains the sounds of a plurality of target characters in the target video, to obtain each to-be-processed audio contained in the target audio; wherein one to-be-processed audio represents the sound emitted by a single target character in the target video;

for each to-be-processed audio, based on the voiceprint features of that audio, determining a voiceprint model matching it as a target voiceprint model from among the to-be-matched voiceprint models, one corresponding to each target character, generated in advance from the sample audios of the plurality of target characters;

determining that the target character to whom the to-be-processed audio belongs in the target video is the target character to whom the sample audio used to train the target voiceprint model belongs;

and, during playback of the target video, playing the remaining to-be-processed audios after masking the to-be-processed audio corresponding to the character to be masked indicated by the user.

2. The method of claim 1, wherein the voiceprint characteristics of a piece of audio to be processed comprise spectral characteristics of each audio frame in the piece of audio to be processed.

3. The method according to claim 1, wherein the determining, for each audio to be processed, a voiceprint model matching the audio to be processed from a voiceprint model to be matched, which is generated in advance based on the sample audio of the multiple target characters and corresponds to each target character, as the target voiceprint model, comprises:

for each audio to be processed, respectively calculating the similarity between the voiceprint features of the audio to be processed and each to-be-matched voiceprint model, generated in advance from the corresponding target character's sample audio;

and determining, from among the to-be-matched voiceprint models, the voiceprint model with the greatest similarity to the audio to be processed as the target voiceprint model.

4. The method of claim 3, wherein the calculating, for each audio to be processed, a similarity between the voiceprint feature of the audio to be processed and the voiceprint model to be matched, which is generated in advance based on the sample audio of each target person, comprises:

respectively calculating, for each audio to be processed, the log-likelihood of the voiceprint features of that audio under the to-be-matched voiceprint model generated in advance from each target character's sample audio, and taking the log-likelihood as the similarity between the voiceprint features of the audio to be processed and that to-be-matched voiceprint model.

5. The method of claim 1, wherein the training step of the voiceprint model to be matched corresponding to each target person comprises:

acquiring voiceprint characteristics of preset sample audio;

training a Gaussian mixture model of an initial structure based on an expectation-maximization algorithm and the voiceprint features of the preset sample audio to obtain a candidate network model;

and, for each target character, adjusting the model parameters of the candidate network model based on an adaptive algorithm and the voiceprint features of the target character's sample audio, to obtain the voiceprint model to be matched corresponding to that target character.

6. The method of claim 1, wherein before playing, during playback of the target video, the remaining to-be-processed audios after masking the to-be-processed audio corresponding to the character to be masked indicated by the user, the method further comprises:

displaying, in a display interface of the client, the character identifications of the plurality of target characters in the target video;

when a character selection instruction input by the user is received, determining the target character in the target video to whom the character identification indicated by the instruction belongs, and taking that target character as the character to be masked indicated by the user.

7. An audio processing apparatus in video, the apparatus being applied to a client, the apparatus comprising:

the separation module is used for, during playback of a target video, when an audio masking instruction for the target video is received, separating the target audio, which contains the sounds of a plurality of target characters of the target video, to obtain each to-be-processed audio contained in the target audio; wherein one to-be-processed audio represents the sound emitted by a single target character in the target video;

the first determining module is used for determining, for each to-be-processed audio and based on its voiceprint features, a voiceprint model matching the to-be-processed audio as a target voiceprint model from among the to-be-matched voiceprint models, one corresponding to each target character, generated in advance from the sample audios of the plurality of target characters;

the second determining module is used for determining that the target character to whom the to-be-processed audio belongs in the target video is the target character to whom the sample audio used to train the target voiceprint model belongs;

and the playing module is used for playing, during playback of the target video, the remaining to-be-processed audios after masking the to-be-processed audio corresponding to the character to be masked indicated by the user.

8. The apparatus of claim 7, wherein the voiceprint characteristics of a to-be-processed audio comprise spectral characteristics of audio frames in the to-be-processed audio.

9. A client, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.

10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 6.

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for processing audio in a video.

Background

With the development of computer technology, clients provide more and more functions to users; for example, users can watch videos through a client. When the client plays a target video, it may synchronously play the target audio corresponding to the target video; for example, when the target video is a TV series, the client may synchronously play the audio of the character dialogue in the series.

However, if the user dislikes the sound of a certain character in the target video, the user can only reduce the volume of the target audio as a whole, i.e., in the related art, it is impossible to mask the sound of the specific character.

Disclosure of Invention

The embodiment of the present invention aims to provide an audio processing method and device in a video, so as to mask the sound of a specific character indicated by a user and meet the user's personalized needs. The specific technical scheme is as follows:

in a first aspect of the present invention, there is provided a method for processing audio in a video, where the method is applied to a client, and the method includes:

during playback of a target video, when an audio masking instruction for the target video is received, separating the target audio, which contains the sounds of a plurality of target characters in the target video, to obtain each to-be-processed audio contained in the target audio; wherein one to-be-processed audio represents the sound emitted by a single target character in the target video;

for each to-be-processed audio, based on the voiceprint features of that audio, determining a voiceprint model matching it as a target voiceprint model from among the to-be-matched voiceprint models, one corresponding to each target character, generated in advance from the sample audios of the plurality of target characters;

determining that the target character to whom the to-be-processed audio belongs in the target video is the target character to whom the sample audio used to train the target voiceprint model belongs;

and, during playback of the target video, playing the remaining to-be-processed audios after masking the to-be-processed audio corresponding to the character to be masked indicated by the user.

Optionally, the voiceprint feature of one audio to be processed includes a spectral feature of each audio frame in the audio to be processed.

Optionally, the determining, for each audio to be processed and based on its voiceprint features, a voiceprint model matching the audio to be processed as the target voiceprint model, from among the to-be-matched voiceprint models, one corresponding to each target character, generated in advance from the target characters' sample audios, includes:

for each audio to be processed, respectively calculating the similarity between the voiceprint features of the audio to be processed and each to-be-matched voiceprint model, generated in advance from the corresponding target character's sample audio;

and determining, from among the to-be-matched voiceprint models, the voiceprint model with the greatest similarity to the audio to be processed as the target voiceprint model.

Optionally, the calculating, for each audio to be processed, the similarity between the voiceprint features of the audio to be processed and each to-be-matched voiceprint model generated in advance from the corresponding target character's sample audio includes:

respectively calculating, for each audio to be processed, the log-likelihood of the voiceprint features of that audio under the to-be-matched voiceprint model generated in advance from each target character's sample audio, and taking the log-likelihood as the similarity between the voiceprint features of the audio to be processed and that to-be-matched voiceprint model.

Optionally, the training step of the voiceprint model to be matched corresponding to each target person includes:

acquiring voiceprint characteristics of preset sample audio;

training a Gaussian mixture model of an initial structure based on an expectation-maximization algorithm and the voiceprint features of the preset sample audio to obtain a candidate network model;

and, for each target character, adjusting the model parameters of the candidate network model based on an adaptive algorithm and the voiceprint features of the target character's sample audio, to obtain the voiceprint model to be matched corresponding to that target character.

Optionally, before playing, during playback of the target video, the remaining to-be-processed audios after masking the to-be-processed audio corresponding to the character to be masked indicated by the user, the method further includes:

displaying, in a display interface of the client, the character identifications of the plurality of target characters in the target video;

when a character selection instruction input by the user is received, determining the target character in the target video to whom the character identification indicated by the instruction belongs, and taking that target character as the character to be masked indicated by the user.

In a second aspect of the present invention, there is also provided an apparatus for processing audio in video, the apparatus being applied to a client, the apparatus including:

the separation module is used for, during playback of a target video, when an audio masking instruction for the target video is received, separating the target audio, which contains the sounds of a plurality of target characters of the target video, to obtain each to-be-processed audio contained in the target audio; wherein one to-be-processed audio represents the sound emitted by a single target character in the target video;

the first determining module is used for determining, for each to-be-processed audio and based on its voiceprint features, a voiceprint model matching the to-be-processed audio as a target voiceprint model from among the to-be-matched voiceprint models, one corresponding to each target character, generated in advance from the sample audios of the plurality of target characters;

the second determining module is used for determining that the target character to whom the to-be-processed audio belongs in the target video is the target character to whom the sample audio used to train the target voiceprint model belongs;

and the playing module is used for playing, during playback of the target video, the remaining to-be-processed audios after masking the to-be-processed audio corresponding to the character to be masked indicated by the user.

Optionally, the voiceprint feature of one audio to be processed includes a spectral feature of each audio frame in the audio to be processed.

Optionally, the first determining module is specifically configured to, for each audio to be processed, respectively calculate the similarity between the voiceprint features of the audio to be processed and each to-be-matched voiceprint model generated in advance from the corresponding target character's sample audio;

and determine, from among the to-be-matched voiceprint models, the voiceprint model with the greatest similarity to the audio to be processed as the target voiceprint model.

Optionally, the first determining module is specifically configured to, for each audio to be processed, respectively calculate the log-likelihood of the voiceprint features of the audio to be processed under the to-be-matched voiceprint model generated in advance from each target character's sample audio, as the similarity between the voiceprint features of the audio to be processed and that to-be-matched voiceprint model.

Optionally, the apparatus further comprises:

the training module is used for acquiring the voiceprint features of preset sample audio;

training a Gaussian mixture model of an initial structure based on an expectation-maximization algorithm and the voiceprint features of the preset sample audio to obtain a candidate network model;

and, for each target character, adjusting the model parameters of the candidate network model based on an adaptive algorithm and the voiceprint features of the target character's sample audio, to obtain the voiceprint model to be matched corresponding to that target character.

Optionally, the apparatus further comprises:

the processing module is used for displaying, in a display interface of the client, the character identifications of the plurality of target characters in the target video before the playing module plays, during playback of the target video, the remaining to-be-processed audios after masking the to-be-processed audio corresponding to the character to be masked indicated by the user;

when a character selection instruction input by the user is received, determining the target character in the target video to whom the character identification indicated by the instruction belongs, and taking that target character as the character to be masked indicated by the user.

In another aspect of the implementation of the present invention, there is also provided a client, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

and the processor is used for realizing the steps of the audio processing method in the video when executing the program stored in the memory.

In yet another aspect of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements any of the audio processing methods in video described above.

In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the above described methods of audio processing in video.

In the audio processing method provided by the embodiment of the present invention, during playback of the target video, when an audio masking instruction for the target video is received, the target audio, which contains the sounds of a plurality of target characters in the target video, is separated to obtain each to-be-processed audio contained in it; one to-be-processed audio represents the sound made by a single target character in the target video. For each to-be-processed audio, based on its voiceprint features, a voiceprint model matching it is determined as the target voiceprint model from among the to-be-matched voiceprint models, one per target character, generated in advance from the sample audios of the plurality of target characters. The target character to whom the to-be-processed audio belongs in the target video is determined to be the target character to whom the sample audio used to train the target voiceprint model belongs. During playback of the target video, the remaining to-be-processed audios are played after masking the to-be-processed audio corresponding to the character to be masked indicated by the user.

Based on this processing, the remaining to-be-processed audios can be played after the to-be-processed audio corresponding to the character to be masked indicated by the user has been masked; that is, the voice of the specific character indicated by the user can be masked, meeting the user's personalized needs.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

Fig. 1 is a flowchart of an audio processing method in video according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for training a voiceprint model provided in an embodiment of the invention;

fig. 3 is a flow chart of another method for audio processing in video provided in an embodiment of the invention;

fig. 4 is a flow chart of another method for audio processing in video provided in an embodiment of the invention;

fig. 5 is a block diagram of an audio processing apparatus in video according to an embodiment of the present invention;

fig. 6 is a structural diagram of a client provided in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.

In the related art, if a user dislikes the sound of a certain character in a target video, the user can only reduce the volume of the target audio as a whole; that is, the sound of a specific character cannot be masked.

In order to solve the above problem, referring to fig. 1, fig. 1 is a flowchart of an audio processing method in a video, which may be applied to a client, according to an embodiment of the present invention, where the method may include the following steps:

s101: in the process of playing the target video, when an audio shielding instruction for the target video is received, the target audio containing the sounds of a plurality of target characters in the target video is separated, and each audio to be processed contained in the target audio is obtained.

Wherein one audio to be processed represents the sound of the same target character in the target video.

S102: and for each audio to be processed, determining a voiceprint model matched with the audio to be processed as a target voiceprint model from voiceprint models to be matched, which are generated in advance based on the voiceprint characteristics of the audio to be processed and are corresponding to each target character, in the sample audio.

S103: and determining a target character to which the audio to be processed belongs in the target video, wherein the target character belongs to a sample audio adopted for training the target voiceprint model.

S104: and correspondingly playing other to-be-processed audios after the to-be-processed audio corresponding to the to-be-shielded character indicated by the shielding user in the process of playing the target video.

Based on the audio processing method provided by the embodiment of the present invention, the remaining to-be-processed audios can be played after masking the to-be-processed audio corresponding to the character to be masked indicated by the user; that is, the voice of a specific character indicated by the user can be masked, meeting the user's personalized needs and improving the user experience.
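The masking in step S104 ultimately amounts to re-mixing every separated track except the masked character's. A minimal sketch, assuming the separated to-be-processed audios are equal-length mono arrays keyed by character name (the names, the dictionary representation, and the function `mask_and_mix` are illustrative, not from the source):

```python
import numpy as np

def mask_and_mix(tracks, masked):
    """Sum all separated character tracks except the masked character's.

    tracks: mapping character name -> mono waveform (equal lengths).
    masked: name of the character whose voice should be silenced.
    """
    kept = [wav for name, wav in tracks.items() if name != masked]
    if not kept:  # everyone masked: return silence
        length = len(next(iter(tracks.values())))
        return np.zeros(length)
    return np.sum(kept, axis=0)

# Toy example: two "characters" as pure tones.
t = np.linspace(0, 1, 8000)
tracks = {
    "character_a": np.sin(2 * np.pi * 220 * t),
    "character_b": np.sin(2 * np.pi * 440 * t),
}
remix = mask_and_mix(tracks, masked="character_a")
```

In a real client the kept tracks would be mixed and handed to the audio output in sync with the video; the sketch only shows the selection-and-sum step.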

For step S101, during playback of the target video, the client may synchronously play the audio file corresponding to the target video. If the user dislikes the sound of a certain target character in the target video, the user may input an audio masking instruction for the target video to the client so as to mask that character's sound.

Correspondingly, when the client receives the audio masking instruction, it can determine the target audio from the audio file corresponding to the target video. For example, the client may take the whole audio file corresponding to the target video as the target audio; or it may take the unplayed part of that audio file as the target audio.

Further, the client may perform a separation process on the target audio to separate an audio portion (i.e., the audio to be processed in the embodiment of the present invention) of each target person in the target video from the target audio.

In one implementation, the client may determine the number of target characters in the target video (which may be referred to as the target number) based on the MDL (Minimum Description Length) criterion. Then, the client can separate the target audio based on the FastICA (Fast Independent Component Analysis) algorithm and the target number, obtaining the target number of to-be-processed audios.

It is understood that the separated to-be-processed audios are audios of target characters in the target video, and each to-be-processed audio represents a sound emitted by the same target character in the target video.

It is understood that, currently, only the to-be-processed audio of a plurality of target persons is separated from the target audio, but the corresponding relationship between each target person and each to-be-processed audio is not determined, that is, which to-be-processed audio is the audio of each target person is not determined. For example, the 3 target characters corresponding to the target audio are: person a, person B, and person C. Separating the target audio to obtain: audio to be processed 1, audio to be processed 2 and audio to be processed 3. The 3 pieces of audio to be processed are the audio of the 3 people, but the corresponding relationship between the 3 people and the 3 pieces of audio to be processed is not determined, that is, it is not determined whether the audio of the person a is the audio 1 to be processed, the audio 2 to be processed, or the audio 3 to be processed, and similarly, it is not determined which audio of the person B and the person C is the audio to be processed.
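The separation step can be illustrated with a compact symmetric FastICA iteration on a toy two-source mixture. This is a sketch of the algorithm family named above under simplifying assumptions (instantaneous linear mixing, as many mixture channels as speakers); production speech separation is considerably more involved:

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Separate linearly mixed signals.  X: (n_mixtures, n_samples)."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=1, keepdims=True)
    # Whitening: decorrelate the mixtures and normalize their variance.
    d, E = np.linalg.eigh(np.cov(X))
    Z = (E @ np.diag(d ** -0.5) @ E.T) @ X
    n = Z.shape[0]
    W = rng.standard_normal((n, n))
    for _ in range(n_iter):
        # Fixed-point update with contrast g = tanh, g' = 1 - tanh^2.
        G = np.tanh(W @ Z)
        W = (G @ Z.T) / Z.shape[1] - np.diag((1 - G ** 2).mean(axis=1)) @ W
        # Symmetric decorrelation keeps the rows of W orthonormal.
        U, _, Vt = np.linalg.svd(W)
        W = U @ Vt
    return W @ Z  # estimated sources, up to order and sign

# Two toy "voices" mixed by an unknown 2x2 matrix.
t = np.linspace(0, 1, 4000)
s = np.vstack([np.sin(2 * np.pi * 5 * t),
               np.sign(np.sin(2 * np.pi * 3 * t))])
X = np.array([[0.6, 0.4], [0.45, 0.55]]) @ s
est = fastica(X)
```

Each estimated component should correlate strongly with one original source, up to order and sign, which is exactly the ambiguity discussed above: separation alone does not say which audio belongs to which character.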

For step S102, the voiceprint features of one to-be-processed audio may include the spectral features of each audio frame in that audio. The spectral feature of an audio frame may be, but is not limited to, the MFCC (Mel-Frequency Cepstral Coefficients), LPCC (Linear Prediction Cepstral Coefficients), or PLP (Perceptual Linear Prediction) features of the audio frame.

In one implementation, when the voiceprint feature of each to-be-processed audio includes a mel-frequency cepstrum coefficient of each audio frame in the to-be-processed audio, the client may calculate the mel-frequency cepstrum coefficient of each audio frame in the to-be-processed audio by the following method.

The client can perform pre-emphasis processing on the audio to be processed to boost its high-frequency components, obtaining the pre-emphasized audio. The client may then frame the pre-emphasized audio based on a preset window function (e.g., a rectangular window or a Hanning window), obtaining each audio frame of the audio to be processed.

Then, for each audio frame, the client may perform FFT (Fast Fourier Transform) processing on the audio frame to obtain a frequency domain signal corresponding to the audio frame. Further, the power spectrum of the audio frame is calculated based on the frequency domain signal corresponding to the audio frame, and the power spectrum of the audio frame is filtered based on a Mel (Mel) frequency filter, so that a Mel frequency spectrum of the audio frame is obtained.

Furthermore, the client may perform logarithm processing on the Mel spectrum corresponding to the audio frame, and perform DCT (Discrete Cosine Transform) processing on the logarithm of the Mel spectrum of the audio frame to obtain the Mel-frequency cepstrum coefficient of the audio frame.
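The pre-emphasis, framing, FFT, mel filtering, log, and DCT steps described above can be sketched end to end as follows. The parameter values (16 kHz sample rate, 25 ms frames with a 10 ms hop, 26 mel filters, 13 coefficients) are common defaults, not values given in the source:

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Compute MFCCs: one n_ceps-dim vector per 25 ms frame (10 ms hop)."""
    # 1. Pre-emphasis boosts the high-frequency components.
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing with a Hann window.
    n_frames = 1 + (len(emph) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hanning(frame_len)
    # 3. FFT power spectrum of each frame.
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Triangular mel filterbank.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = power @ fbank.T
    # 5. Log, then DCT-II to decorrelate into cepstral coefficients.
    log_mel = np.log(mel_spec + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

For a one-second 16 kHz signal this yields a (98, 13) feature matrix, one 13-dimensional MFCC vector per frame, which is the per-frame voiceprint feature the matching step consumes.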

In one implementation, the client may locally pre-store a plurality of voiceprint models to be matched, where each voiceprint model to be matched is generated in advance based on sample audio of a plurality of target characters in the target video. One target character corresponds to one voiceprint model to be matched, and the corresponding voiceprint model to be matched is obtained by training based on the sample audio of the target character.

In an embodiment of the present invention, referring to fig. 2, fig. 2 is a flowchart of a method for training a voiceprint model provided in an embodiment of the present invention, where the method may include the following steps:

s201: and acquiring the voiceprint characteristics of the preset sample audio.

S202: and training the Gaussian mixture model of the initial structure based on the expectation maximization algorithm and the voiceprint characteristics of the preset sample audio to obtain an alternative network model.

S203: and aiming at each target character, based on the self-adaptive algorithm and the voiceprint characteristics of the sample audio of the target character, adjusting the model parameters of the alternative network model to obtain the voiceprint model to be matched corresponding to the target character.

The adaptive algorithm may be, but is not limited to, the MAP (Maximum A Posteriori) algorithm or the MLLR (Maximum Likelihood Linear Regression) algorithm. The Gaussian mixture model of the initial structure may be a UBM (Universal Background Model).

In one implementation, the client may obtain a plurality of preset sample audios and extract the voiceprint features of each. The client may adjust the model parameters (e.g., the weight, mean, and variance parameters) of the Gaussian mixture model of the initial structure based on the EM (Expectation-Maximization) algorithm and the voiceprint features of each preset sample audio, until the log-likelihood of the voiceprint features of each preset sample audio under the model, computed with the adjusted parameters, reaches a maximum, indicating that the model has converged; this yields the candidate network model.

Then, for each target character, the client may obtain sample audio of the target character and extract its voiceprint features. The client may then adjust the model parameters (i.e., the weight, mean, and variance parameters) of the candidate network model based on the adaptive algorithm and the voiceprint features of the target character's sample audio, until the log-likelihood of the voiceprint features of each sample audio of the target character under the candidate network model, computed with the adjusted parameters, reaches a maximum, indicating that the candidate network model has converged; the trained GMM (Gaussian Mixture Model) corresponding to the target character is obtained as that character's voiceprint model to be matched.
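The two-stage procedure, EM training of a UBM followed by per-character MAP adaptation of the component means, can be sketched for scalar features as follows. The relevance factor `r=16` is a conventional choice, not taken from the source, and a real system adapts multivariate GMMs over MFCC vectors:

```python
import numpy as np

def em_gmm(x, K=2, iters=50):
    """Fit a 1-D Gaussian mixture (the 'UBM') with expectation-maximization."""
    w = np.full(K, 1.0 / K)
    mu = np.quantile(x, np.linspace(0.25, 0.75, K))  # deterministic init
    var = np.full(K, x.var())
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

def map_adapt_means(x, w, mu, var, r=16.0):
    """MAP-adapt only the component means toward one character's features."""
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    nk = resp.sum(axis=0)
    ex = (resp * x[:, None]).sum(axis=0) / np.maximum(nk, 1e-10)
    alpha = nk / (nk + r)  # components with more adaptation data move further
    return alpha * ex + (1 - alpha) * mu

rng = np.random.default_rng(1)
ubm_data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])
w, mu, var = em_gmm(ubm_data)
speaker_feats = rng.normal(3, 0.5, 200)   # one character's sample-audio features
mu_adapted = map_adapt_means(speaker_feats, w, mu, var)
```

Only the components responsible for the character's data move appreciably, which is the point of MAP adaptation: the character model stays close to the UBM wherever the character's sample audio provides no evidence.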

Then, for each audio to be processed, the client may determine, based on the voiceprint features of the audio to be processed, the voiceprint model that matches it (i.e., the target voiceprint model) from among the voiceprint models to be matched; that is, the client determines which voiceprint model to be matched was trained on the sample audio of the character to whom the audio to be processed belongs.

In an embodiment of the present invention, on the basis of fig. 1, referring to fig. 3, step S102 may include the following steps:

S1021: for each audio to be processed, respectively calculate the similarity between the voiceprint features of the audio to be processed and each voiceprint model to be matched, generated in advance based on the sample audio of each target character.

S1022: determine, from the voiceprint models to be matched, the voiceprint model with the greatest similarity to the audio to be processed, and take it as the target voiceprint model matched with the audio to be processed.

In one implementation, for each audio to be processed, the client determines a feature matrix containing the voiceprint features of the audio to be processed. For each voiceprint model to be matched, the client may determine the feature matrix of that voiceprint model. The client may then calculate the similarity between the feature matrix of the audio to be processed and the feature matrix of the voiceprint model to be matched, as the similarity between the audio to be processed and that voiceprint model.

In another implementation, step S1021 may include the following step: for each audio to be processed, respectively calculate the log-likelihood of the voiceprint features of the audio to be processed under each voiceprint model to be matched, generated in advance based on the sample audio of each target character, as the similarity between the voiceprint features of the audio to be processed and that voiceprint model to be matched.

The greater the similarity between an audio to be processed and a voiceprint model to be matched, the higher the probability that the voiceprint model was trained on the sample audio of the character to whom the audio to be processed belongs.

Therefore, for each audio to be processed, the client may determine, from the voiceprint models to be matched, the voiceprint model with the greatest similarity to the audio to be processed, and obtain it as the target voiceprint model matched with that audio.
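The matching step can be sketched as scoring each audio's feature frames against every character's GMM and taking the highest average per-frame log-likelihood — a minimal numpy sketch; the function names and the `(weights, means, variances)` tuple layout are illustrative assumptions, not from the source.

```python
import numpy as np

def gmm_avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of feature frames X under one
    diagonal-covariance GMM (one character's voiceprint model to be matched)."""
    const = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)
    diff = X[:, None, :] - means[None, :, :]
    comp = (np.log(weights) + const
            - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2))
    return float(np.logaddexp.reduce(comp, axis=1).mean())

def match_speaker(X, models):
    """Return the index of the voiceprint model with the greatest similarity
    (log-likelihood) to X, plus all scores."""
    scores = [gmm_avg_loglik(X, *m) for m in models]
    return int(np.argmax(scores)), scores
```

Averaging per frame keeps scores comparable when the audios to be processed have different lengths.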

For step S103 and step S104, for each audio to be processed, the client may determine that the target character to whom the audio to be processed belongs in the target video is the target character to whom the sample audio used to train the target voiceprint model belongs. Then, in the process of playing the target video, the client may mask the audio to be processed corresponding to the character to be shielded indicated by the user and play the remaining audio to be processed; that is, the client plays all audio to be processed except that corresponding to the character to be shielded indicated by the user.

In one implementation, the audio shielding instruction may carry the character identifier of the character to be shielded, and when the client receives the audio shielding instruction, it may determine the character to be shielded from the plurality of target characters in the target video. The client may then determine the audio to be processed corresponding to the character to be shielded, and when playing the target video, shield the sound of that character; that is, the client plays the other audio to be processed, excluding the audio to be processed corresponding to the character to be shielded.
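The masked-playback step can be sketched as remixing the separated per-character stems while skipping the shielded character — a minimal numpy sketch; the stem dictionary keyed by character identifier is an illustrative assumption about how the separated audio is stored, not a detail from the source.

```python
import numpy as np

def mix_without_shielded(stems, shielded_ids):
    """Sum the separated per-character audio stems, skipping every
    character identifier the user has chosen to shield."""
    kept = [audio for cid, audio in stems.items() if cid not in shielded_ids]
    if not kept:  # every character shielded: nothing but silence remains
        return np.zeros_like(next(iter(stems.values())))
    return np.sum(kept, axis=0)
```

A separated accompaniment stem could be handled the same way, so that the user can shield the accompaniment or all characters' voices as described below.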

In another implementation, on the basis of fig. 1, referring to fig. 4, before step S104, the method may further include the following steps:

S105: display the respective character identifiers of the plurality of target characters in the target video in a display interface of the client.

S106: when a character selection instruction input by the user is received, determine that the target character to whom the character identifier indicated by the character selection instruction belongs is the character to be shielded indicated by the user.

The character identifier of a character may be the name of the character, or may be an image of the character, but is not limited thereto.

After the target character to whom each audio to be processed belongs in the target video is determined, the character identifiers of the plurality of target characters in the target video can be displayed in the display interface of the client. The user can select, from the plurality of character identifiers displayed by the client, the character identifier of the target character whose sound needs to be shielded, so as to input a character selection instruction to the client.

Correspondingly, when the client receives the person selection instruction, the client can determine that the target person to which the person identification indicated by the person selection instruction belongs is the person to be shielded. Furthermore, when the client plays the target video, the sound of the character to be shielded can be shielded, that is, the client can play other audio to be processed except the audio to be processed corresponding to the character to be shielded.

In an embodiment of the present invention, the target audio may be human voice audio separated from the original audio, and the original audio further includes accompaniment audio. When the character identification is displayed, the client can also display the identification of the accompaniment audio. Further, the user may also choose to mask accompaniment sounds. Correspondingly, the client can directly play the target audio, namely the client can mask the accompaniment sound. In addition, the user may also choose to mask the sound of all target characters. Accordingly, the client can directly play the accompaniment audio separated from the original audio.

In one embodiment of the present invention, the target video may be a commentary video, for example, a movie or drama commentary video, a science commentary video, or the like. When the client plays the commentary video, the client can simultaneously play the audio file corresponding to the commentary video. The audio file corresponding to the commentary video comprises commentary audio and accompaniment audio. The commentary audio is the audio of the commentary characters in the commentary video. There may be one commentary character in the commentary video, or there may be a plurality of commentary characters.

During the process of watching the commentary video, if the user does not like the sound of the commentary character in the commentary video, the user can input an audio shielding instruction aiming at the commentary video to the client. Correspondingly, when the client receives the audio shielding instruction, the audio file corresponding to the commentary video can be separated to obtain the commentary audio and the accompaniment audio.

When the commentary video contains only one commentary character, the client can directly shield the commentary audio; that is, the client plays only the accompaniment audio in the process of playing the commentary video.

When the commentary video comprises a plurality of commentary characters, the client can separate the commentary audio to obtain a plurality of to-be-processed audios, and one to-be-processed audio is the audio of the same commentary character in the commentary video. For each audio to be processed, the client may determine, based on the voiceprint features of the audio to be processed, a voiceprint model matched with the audio to be processed as a target voiceprint model from voiceprint models to be matched, which are generated in advance based on sample audios of a plurality of explanatory characters and correspond to each explanatory character.

Then, the client may determine that the commentary character to whom the audio to be processed belongs in the commentary video is the commentary character to whom the sample audio used to train the target voiceprint model belongs, and display the respective character identifiers of the plurality of commentary characters in a display interface of the client.

The user can select, from the character identifiers displayed by the client, the identifier of the commentary character whose sound needs to be shielded, so as to input a character selection instruction to the client. Correspondingly, when the client receives the character selection instruction, it can determine that the commentary character to whom the character identifier selected by the user belongs is the character to be shielded.

Furthermore, in the process of playing the commentary video, the client may mask the audio to be processed corresponding to the character to be masked indicated by the user and play the other audio to be processed; that is, the client may play only the audio to be processed other than that corresponding to the character to be masked.

In one embodiment of the invention, when the user selects a character identifier displayed by the client, the wrong character identifier may be selected by misoperation. To avoid shielding the wrong sound due to user misoperation, when a character selection instruction input by the user is received, the client may play the audio to be processed of the target character to whom the selected character identifier belongs, and display a reminder message asking the user to confirm whether to shield the sound of that target character.

If the sound of the target character to whom the selected character identifier belongs does need to be masked, the user can input a confirm-masking instruction to the client. Correspondingly, when the confirm-masking instruction input by the user is received, the client can determine that the target character to whom the selected character identifier belongs is the character to be shielded, and play the other audio to be processed, excluding the audio to be processed of the character to be shielded.

If the user has selected the wrong character identifier, i.e., the user does not need to mask the sound of the target character to whom the selected identifier belongs, the user may input an unmasking instruction to the client. Correspondingly, when an unmasking instruction input by the user is received, the client can play the target audio.

Based on the above processing, shielding the wrong sound due to user misoperation can be avoided, improving the user experience.

Corresponding to the method embodiment of fig. 1, referring to fig. 5, fig. 5 is a block diagram of an audio processing apparatus in a video, where the apparatus is applied to a client, and the apparatus includes:

a separation module 501, configured to, in a process of playing a target video, when an audio shielding instruction for the target video is received, separate a target audio that includes sounds of multiple target characters in the target video to obtain each to-be-processed audio included in the target audio; wherein one audio to be processed represents the sound emitted by the same target character in the target video;

a first determining module 502, configured to determine, for each to-be-processed audio, a voiceprint model matched with the to-be-processed audio from voiceprint models to be matched, which are generated in advance based on sample audios of the multiple target characters and correspond to each target character, as a target voiceprint model based on a voiceprint feature of the to-be-processed audio;

a second determining module 503, configured to determine that the target character to whom the audio to be processed belongs in the target video is the target character to whom the sample audio used to train the target voiceprint model belongs;

the playing module 504 is configured to, in the process of playing the target video, mask the audio to be processed corresponding to the character to be shielded indicated by the user and play the other audio to be processed.

Optionally, the voiceprint feature of one audio to be processed includes a spectral feature of each audio frame in the audio to be processed.

Optionally, the first determining module 502 is specifically configured to, for each audio to be processed, respectively calculate similarity between a voiceprint feature of the audio to be processed and a voiceprint model to be matched, which is generated in advance based on a sample audio of each target person, and corresponds to the target person;

and determining the voiceprint model with the maximum similarity with the audio to be processed from the voiceprint models to be matched, and obtaining the voiceprint model matched with the audio to be processed as a target voiceprint model.

Optionally, the first determining module 502 is specifically configured to, for each audio to be processed, respectively calculate the log-likelihood of the voiceprint features of the audio to be processed under each voiceprint model to be matched, generated in advance based on the sample audio of each target character, as the similarity between the voiceprint features of the audio to be processed and that voiceprint model to be matched.

Optionally, the apparatus further comprises:

the training module is used for acquiring the voiceprint characteristics of the preset sample audio;

training a Gaussian mixture model of an initial structure based on the expectation-maximization algorithm and the voiceprint features of the preset sample audio, to obtain a candidate network model;

and, for each target character, adjusting the model parameters of the candidate network model based on the adaptive algorithm and the voiceprint features of the sample audio of the target character, to obtain the voiceprint model to be matched corresponding to the target character.

Optionally, the apparatus further comprises:

a processing module, configured to display, in a display interface of the client, the respective character identifiers of the plurality of target characters in the target video, before the playing module 504 masks the audio to be processed corresponding to the character to be shielded indicated by the user and plays the other audio to be processed in the process of playing the target video;

and, when a character selection instruction input by the user is received, determine that the target character to whom the character identifier indicated by the character selection instruction belongs is the character to be shielded indicated by the user.

Based on the audio processing apparatus in video provided by the embodiment of the invention, the audio to be processed corresponding to the character to be shielded indicated by the user can be masked and the other audio to be processed played according to the user's indication; that is, the sound of a specific character indicated by the user can be shielded, meeting the personalized needs of the user.

The embodiment of the present invention further provides a client, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete mutual communication through the communication bus 604,

a memory 603 for storing a computer program;

the processor 601 is configured to implement the following steps when executing the program stored in the memory 603:

in the process of playing a target video, when an audio shielding instruction for the target video is received, separating the target audio containing the sounds of a plurality of target characters in the target video to obtain each audio to be processed contained in the target audio; wherein one audio to be processed represents the sound emitted by the same target character in the target video;

for each audio to be processed, determining, based on the voiceprint features of the audio to be processed, a voiceprint model matched with the audio to be processed as a target voiceprint model, from among the voiceprint models to be matched, generated in advance based on the sample audios of the plurality of target characters and corresponding to each target character;

determining that the target character to whom the audio to be processed belongs in the target video is the target character to whom the sample audio used to train the target voiceprint model belongs;

and, in the process of playing the target video, masking the audio to be processed corresponding to the character to be shielded indicated by the user and playing the other audio to be processed.

The communication bus mentioned in the above-mentioned client terminal may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the client and other devices.

The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.

Based on the client provided by the embodiment of the invention, the audio to be processed corresponding to the character to be shielded indicated by the user can be masked and the other audio to be processed played according to the user's indication; that is, the sound of a specific character indicated by the user can be shielded, meeting the personalized needs of the user.

In still another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the audio processing method in video described in any of the above embodiments.

In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of audio processing in video as described in any of the above embodiments.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, client, computer-readable storage medium, and computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for relevant points, reference may be made to some descriptions of the method embodiments.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
