Audio playing method and electronic equipment

Document No.: 172932 · Published: 2021-10-29 · Original language: Chinese

Reading note: This technology, "Audio playing method and electronic equipment", was designed and created by Xu Zhiming on 2021-07-21. Abstract: The application discloses an audio playing method and an electronic device, belonging to the field of artificial intelligence. The method comprises: first, determining background music audio information corresponding to a target video chat scene based on face image information acquired for the target video chat scene, and determining voice characteristic parameter information of a target user in the target video chat scene based on user voice information acquired for the target video chat scene; then, adjusting audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information, and playing the background music audio information based on the adjusted audio characteristic parameter information.

1. An audio playing method, the method comprising:

determining background music audio information corresponding to a target video chat scene based on face image information acquired for the target video chat scene; and

determining voice characteristic parameter information of a target user in the target video chat scene based on user voice information acquired for the target video chat scene;

adjusting the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information, and playing the background music audio information based on the adjusted audio characteristic parameter information.

2. The method of claim 1, wherein the determining background music audio information corresponding to the target video chat scene based on the face image information obtained for the target video chat scene comprises:

determining user mouth shape information based on the face image information acquired for the target video chat scene;

determining a user voice phoneme sequence based on the user mouth shape information;

selecting, based on the lyric phoneme sequence of each candidate background music, background music audio information matching the user voice phoneme sequence.

3. The method of claim 1, wherein the determining voice characteristic parameter information of the target user in the target video chat scene based on the user voice information acquired for the target video chat scene comprises:

preprocessing the user voice information acquired for the target video chat scene to obtain preprocessed user voice information;

extracting voice characteristic parameter information of the target user in the target video chat scene from the preprocessed user voice information, wherein the voice characteristic parameter information comprises: time domain characteristic parameter information and/or frequency domain characteristic parameter information.

4. The method of claim 3, wherein the time domain characteristic parameter information comprises: voice duration information, pitch period information, and short-time energy spectrum information; and the frequency domain characteristic parameter information comprises: Mel-frequency cepstral coefficients;

the adjusting the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information includes:

determining, based on the Mel-frequency cepstral coefficients, whether the timbre of the background music audio information matches the timbre of the user voice information;

if so, adjusting, based on the voice duration information, a first audio characteristic parameter of the background music audio information that represents the playing tempo; and

adjusting, based on the pitch period information, a second audio characteristic parameter of the background music audio information that represents the playing frequency; and

adjusting, based on the short-time energy spectrum information, a third audio characteristic parameter of the background music audio information that represents the playing volume.

5. The method of claim 3, wherein preprocessing the user voice information acquired for the target video chat scene to obtain the preprocessed user voice information comprises:

determining whether the user voice information contains noise audio information;

if so, performing denoising processing on the user voice information based on the noise category of the noise audio information to obtain denoised user voice information;

determining the preprocessed user voice information based on the denoised user voice information.

6. An audio playback apparatus, comprising:

a background music determining module, configured to determine background music audio information corresponding to a target video chat scene based on face image information acquired for the target video chat scene;

a voice characteristic parameter determining module, configured to determine voice characteristic parameter information of a target user in the target video chat scene based on user voice information acquired for the target video chat scene; and

an audio characteristic parameter adjusting module, configured to adjust the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information, and to play the background music audio information based on the adjusted audio characteristic parameter information.

7. The apparatus of claim 6, wherein the background music determining module is specifically configured to:

determine user mouth shape information based on the face image information acquired for the target video chat scene;

determining a user voice phoneme sequence based on the user mouth shape information;

select, based on the lyric phoneme sequence of each candidate background music, background music audio information matching the user voice phoneme sequence.

8. The apparatus of claim 6, wherein the voice characteristic parameter determining module is specifically configured to:

preprocess the user voice information acquired for the target video chat scene to obtain preprocessed user voice information;

extract voice characteristic parameter information of the target user in the target video chat scene from the preprocessed user voice information, wherein the voice characteristic parameter information comprises: time domain characteristic parameter information and/or frequency domain characteristic parameter information.

9. The apparatus of claim 8, wherein the time domain characteristic parameter information comprises: voice duration information, pitch period information, and short-time energy spectrum information; and the frequency domain characteristic parameter information comprises: Mel-frequency cepstral coefficients;

the audio characteristic parameter adjusting module is specifically configured to:

determine, based on the Mel-frequency cepstral coefficients, whether the timbre of the background music audio information matches the timbre of the user voice information;

if so, adjust, based on the voice duration information, a first audio characteristic parameter of the background music audio information that represents the playing tempo; and

adjust, based on the pitch period information, a second audio characteristic parameter of the background music audio information that represents the playing frequency; and

adjust, based on the short-time energy spectrum information, a third audio characteristic parameter of the background music audio information that represents the playing volume.

10. The apparatus of claim 8, wherein the voice characteristic parameter determining module is further specifically configured to:

determine whether the user voice information contains noise audio information;

if so, performing denoising processing on the user voice information based on the noise category of the noise audio information to obtain denoised user voice information;

determine the preprocessed user voice information based on the denoised user voice information.

11. An electronic device, comprising: a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the audio playing method according to any one of claims 1 to 5.

12. A readable storage medium storing a program or instructions which, when executed by a processor, implement the steps of the audio playing method according to any one of claims 1 to 5.

Technical Field

The application belongs to the field of artificial intelligence, and particularly relates to an audio playing method and electronic equipment.

Background

With the rapid development of artificial intelligence, a growing number of social products support video chat, and communicating by video call has gradually become part of daily life, bringing great convenience to people who cannot meet face to face.

During a video chat, an overly plain chat scene can make for a poor user experience, so users often turn to background music to make the chat more enjoyable. In the prior art, however, scoring a video chat mainly means manually searching for and playing songs; this single approach cannot meet users' personalized needs.

Disclosure of Invention

Embodiments of the application aim to provide an audio playing method and an electronic device that solve the prior-art problem that scoring a video chat mainly relies on manually searching for and playing songs, a single approach that cannot meet users' personalized needs.

In a first aspect, an embodiment of the present application provides an audio playing method, where the method includes:

determining background music audio information corresponding to a target video chat scene based on face image information acquired for the target video chat scene; and

determining voice characteristic parameter information of a target user in the target video chat scene based on user voice information acquired for the target video chat scene;

adjusting the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information, and playing the background music audio information based on the adjusted audio characteristic parameter information.

In a second aspect, an embodiment of the present application provides an audio playing apparatus, including:

a background music determining module, configured to determine background music audio information corresponding to a target video chat scene based on face image information acquired for the target video chat scene;

a voice characteristic parameter determining module, configured to determine voice characteristic parameter information of a target user in the target video chat scene based on user voice information acquired for the target video chat scene; and

an audio characteristic parameter adjusting module, configured to adjust the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information, and to play the background music audio information based on the adjusted audio characteristic parameter information.

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory, and a program or instructions stored in the memory and executable on the processor, which, when executed by the processor, implement the steps of the audio playing method according to the first aspect.

In a fourth aspect, an embodiment of the present application provides a chip including a processor and a communication interface coupled to the processor, where the processor is configured to run a program or instructions to implement the audio playing method according to the first aspect.

According to the audio playing method and the electronic device provided in the embodiments of the application, face image information and user voice information of a target user in a target video chat scene are first acquired. Mouth shape change information of the target user can be recognized from the face image information, so the chat content of the target user can be determined and, in turn, the background music audio information corresponding to the target video chat scene. The voice characteristics of the target user can be recognized from the user voice information, so the voice characteristic parameter information of the target user can be determined. The audio characteristic parameter information of the determined background music audio information is then adjusted based on the voice characteristic parameter information, and the background music audio information is played based on the adjusted audio characteristic parameter information. In other words, background music is matched automatically from the face image information, and its audio characteristic parameters are adjusted automatically from the user voice information. This not only weaves background music matching the current chat topic into the video chat, but also intelligently adapts both the chosen music and its audio characteristic parameters to the user's chat content and voice characteristics, so the background music better fits both, improving the user experience during video chat.

Drawings

Fig. 1 is a schematic application scenario diagram of an audio playing method provided in an embodiment of the present application;

fig. 2 is a first flowchart of an audio playing method provided in an embodiment of the present application;

fig. 3 is a first schematic interface diagram of an audio playing method provided in an embodiment of the present application;

fig. 4 is a second flowchart of an audio playing method provided in an embodiment of the present application;

fig. 5 is a third flowchart illustrating an audio playing method according to an embodiment of the present application;

fig. 6 is a second schematic interface diagram of an audio playing method provided in an embodiment of the present application;

fig. 7 is a third schematic interface diagram of an audio playing method provided in an embodiment of the present application;

fig. 8 is a schematic block diagram of an audio playing apparatus according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly below with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments that a person of ordinary skill in the art can derive from the embodiments given herein fall within the protection scope of the present disclosure.

The terms "first", "second", and the like in the description and claims of the present application distinguish between similar objects and do not describe a particular order or sequence. It should be understood that data so termed may be interchanged where appropriate, so the embodiments of the application can be practiced in orders other than those illustrated or described herein. Objects distinguished by "first" and "second" are usually of one class, and their number is not limited; for example, the first object may be one object or more than one. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.

The following describes in detail an audio playing method and an electronic device provided in the embodiments of the present application with reference to the accompanying drawings through specific embodiments and application scenarios thereof.

Fig. 1 is a schematic view of an application scenario of an audio playing method provided in an embodiment of the present application. As shown in fig. 1, the scenario includes a client and a background server; the background server may be a cloud server, and it scores the video chat based on the face image information and user voice information from the client. The audio playing method is implemented as follows:

the method comprises the steps of collecting face image information and user voice information of a target user in a target video chat scene, determining background music audio information corresponding to the target video chat scene based on the collected face image information and the user voice information, adjusting audio characteristic parameter information of the background music audio information, and finally playing the adjusted background music audio information.

Specifically, determining the background music audio information based on the face image information and the user voice information, and adjusting the audio characteristic parameter information of the background music audio information, may be performed by the background server or by the client. Alternatively, the background music audio information may be determined by the client and its audio characteristic parameter information then adjusted by the background server. Any feasible variation falls within the protection scope of the present application and is not described again here.

(1) Where the client determines the background music and adjusts its audio characteristic parameter information, the audio playing method is implemented as follows:

The client collects face image information and user voice information of a target user in a target video chat scene. The client includes at least one of a video chat calling party and a video chat called party; correspondingly, the face image information may include face images of the video chat calling party and/or the video chat called party, and the user voice information may include voice information of the video chat calling party and/or the video chat called party. The face image information corresponds to the user voice information; in particular, it may include the mouth shape information of the target user while the user voice information is being uttered.

The client determines background music audio information corresponding to the target video chat scene based on the collected face image information of the target user, and determines voice characteristic parameter information of the target user in the target video chat scene based on the collected user voice information of the target user.

After determining the background music audio information matching the current chat topic and the voice characteristic parameter information of the target user, the client adjusts the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information.

The client then plays the background music audio information based on the adjusted audio characteristic parameter information.

When the client determines the background music and adjusts its audio characteristic parameter information, several arrangements are possible. The client may determine the background music audio information based on the face image information collected at the local end and adjust its audio characteristic parameter information based on the user voice information collected at the local end. Alternatively, it may determine the background music audio information based on the face image information collected at both the local end and the peer end, and adjust the audio characteristic parameter information based on the user voice information collected at both ends. It is also possible to determine the background music audio information from the face image information of both ends, while each party, whether video chat calling party or video chat called party, adjusts the audio characteristic parameter information based only on the user voice information collected at its own end. For a client acting as the video chat calling party, the local end is the calling party and the peer end is the called party; the peer end's face image information and user voice information can be forwarded to the video chat calling party or the video chat called party through the background server.

(2) Where the background server determines the background music and adjusts its audio characteristic parameter information, the audio playing method is implemented as follows:

The client collects face image information and user voice information of a target user in a target video chat scene and sends them to the background server. The client includes at least one of a video chat calling party and a video chat called party; correspondingly, the face image information may include face images of the video chat calling party and/or the video chat called party, and the user voice information may include voice information of the video chat calling party and/or the video chat called party. The face image information corresponds to the user voice information and may include the mouth shape information of the target user while the user voice information is being uttered.

The background server determines background music audio information corresponding to the target video chat scene based on the face image information of the target user uploaded by the client, and determines voice characteristic parameter information of the target user in the target video chat scene based on the user voice information of the target user uploaded by the client.

After determining the background music audio information matching the current chat topic and the voice characteristic parameter information of the target user, the background server adjusts the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information.

The background server sends the adjusted background music audio information to the video chat calling party and the video chat called party, which then play it based on the adjusted audio characteristic parameter information.

It should be noted that the background server may be the instant messaging server corresponding to the video chat application, or an independent audio playing server. Where it is the instant messaging server corresponding to the video chat application, the background server not only sends the adjusted background music audio information to both parties, but also forwards the calling party's user voice information and face image information to the called party for display at the called party's client, and forwards the called party's user voice information and face image information to the calling party for display at the calling party's client.

Fig. 2 is a first flowchart of an audio playing method according to an embodiment of the present application. The method in fig. 2 may be executed by a client, i.e., by at least one of the video chat calling party and the video chat called party in fig. 1, or jointly by the client and the background server, i.e., with at least one of the two parties in fig. 1 exchanging information with the background server. As shown in fig. 2, the method includes at least the following steps:

s101, determining background music audio information corresponding to a target video chat scene based on face image information acquired aiming at the target video chat scene;

specifically, before executing S101, as shown in fig. 3, a "start music score" button is set on an interface of a target video chat scene, a user may select whether to start an intelligent music score according to a requirement of the user, and if the user selects to start the intelligent music score, the "start music score" button is pressed, where when the method in fig. 2 is executed by a client, the client directly triggers and executes step S101 after detecting that the user selects to start the intelligent music score; correspondingly, when the method in fig. 2 is executed by a background server, the client sends a video chat music matching request to the background server to trigger the background server to execute step S101, where the video chat music matching request may be sent by the client before sending the face image information to the background server, or may be sent by the client together when sending the face image information to the background server; on the contrary, if the user does not select to start the intelligent score, the client does not send a video chat score request to the background server; specifically, after the client detects that the user presses a 'start music score' button, namely the client detects that the user starts intelligent music score, the client sends face image information acquired in the target video chat scene to the background server, and the background server determines background music audio information corresponding to the target video chat scene.

The face image information may include face image information of at least one of the video chat calling party and the video chat called party. Specifically, when determining the background music audio information matching the current chat topic, the face image information of only one party may be considered, or that of both parties may be considered at the same time.

For example, where only one party's face image information is considered, which party's information to use may be decided based on the quantity of face image information or on the mouth shape change information it carries. Specifically, the party with the most face image information collected within a preset time period may be taken as the reference party for determining the background music, and the background music audio information is then determined from that party's face image information. Alternatively, mouth shape change information may be derived from the face image information of both parties collected within the preset time period, the party whose mouth shape changes fastest is identified as the reference party, and the background music audio information is determined from that party's face image information.

For another example, where the face image information of both parties is considered at the same time, first background music audio information matching the first face image information may be determined from the calling party's collected first face image information, and second background music audio information matching the second face image information from the called party's collected second face image information. If the two are the same, either is taken as the background music audio information corresponding to the target video chat scene. If they differ, the first mouth shape change information corresponding to the first face image information is compared with the second mouth shape change information corresponding to the second face image information over the preset time period: if the degree of change of the first is higher, the first background music audio information is used as the background music audio information corresponding to the target video chat scene; otherwise, the second is used. Alternatively, the first and second face image information may be re-acquired and the first and second background music audio information re-determined from the updated information until the two are the same.

In a specific implementation, the same background music may be selected for multiple clients, or different background music for different clients; for example, the first background music audio information matching the first face image information serves as the calling party's background music, and the second background music audio information matching the second face image information serves as the called party's background music.

In the embodiment of the invention, when the same background music is selected for multiple clients and the background music is determined from the face image information of several parties, the background music matched to each party's face image information may differ. In that case, the mouth shape change information collected from each party within the preset time period can be compared, and the background music audio information determined from the party whose mouth shape changes fastest (i.e., the party who speaks most) is selected as the background music audio information for the target video chat scene, as sketched below. Alternatively, the parties' face image information can be re-acquired and the resulting background music audio information re-compared until they agree, or each client can simply use the background music matched to the face image information collected at its own end. Either way, the accuracy of the background music matched to the current chat topic is improved.
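As an illustration of this arbitration logic, here is a minimal Python sketch. The function name, the music identifiers, and the use of summed per-frame change scores are hypothetical; the patent only specifies that the party whose mouth shape changes fastest decides the shared background music.

```python
# A minimal sketch of the two-party arbitration described above: the party
# whose mouth shape changes fastest (i.e. who talks most) decides the shared
# background music. Names and score representation are hypothetical.

def pick_shared_background_music(first_music_id, second_music_id,
                                 first_mouth_changes, second_mouth_changes):
    """first_mouth_changes / second_mouth_changes: per-frame mouth-shape
    change scores collected over the preset time window for the calling
    and called party respectively."""
    if first_music_id == second_music_id:
        return first_music_id
    if sum(first_mouth_changes) >= sum(second_mouth_changes):
        return first_music_id
    return second_music_id
```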

S102, determining voice characteristic parameter information of a target user in the target video chat scene based on user voice information acquired for the target video chat scene;

The user voice information may include user voice information of at least one of the video chat calling party and the video chat called party. The voice characteristic parameter information may include at least one of voice duration information, pitch period information, short-time energy spectrum information, and Mel-frequency cepstral coefficients. Specifically, the voice characteristic parameter information of the target user can be obtained by performing voice feature recognition on the user voice information.

S103, adjusting the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information of the target user in the target video chat scene, and playing the background music audio information corresponding to the target video chat scene based on the adjusted audio characteristic parameter information;

After determining the voice characteristic parameter information of the target user, the audio characteristic parameter information of the background music audio information can be adjusted based on it, and the adjusted background music audio information is then sent to the video chat calling party and the video chat called party.

In a specific implementation, the audio characteristic parameter information of the background music audio information may be adjusted based on the user voice information by a pre-trained background sound fusion model. Specifically, the background music audio information determined in S101 and the collected user voice information are input to the pre-trained background sound fusion model; the model performs voice characteristic parameter recognition on the user voice information to determine the voice characteristic parameter information of the target user, and then adjusts the audio characteristic parameter information of the background music audio information determined in S101 based on the voice characteristic parameter information, yielding the background music audio information after adjustment (which may also be called the fusion sound).

The background sound fusion model can be obtained by training in the following way:

acquiring first training sample data, wherein the first training sample data comprises a plurality of background sound fusion model training samples, and each background sound fusion model training sample represents the corresponding relation among the historical user voice information, the historical voice characteristic parameter information and the historical audio characteristic parameter information;

iteratively training and updating preset background sound fusion model parameters with a machine learning method based on the first training sample data until the objective function of the background sound fusion model converges, thereby obtaining the trained background sound fusion model. The model recognizes voice characteristic parameter information from user voice information and adjusts the audio characteristic parameters of the background music audio information accordingly.
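As a rough illustration, a background sound fusion model along these lines could be trained as below. This is a minimal PyTorch sketch under assumed feature dimensions, a simple regression objective, and an invented sample format; the patent does not specify an architecture or loss.

```python
# A minimal PyTorch sketch: regress historical voice characteristic parameters
# onto historical audio characteristic parameters. All dimensions are assumptions.
import torch
import torch.nn as nn

class BackgroundSoundFusionModel(nn.Module):
    def __init__(self, voice_dim=16, audio_dim=3):
        super().__init__()
        # voice_dim: duration / pitch period / energy / MFCC summary features
        # audio_dim: tempo, frequency, volume of the background music
        self.net = nn.Sequential(
            nn.Linear(voice_dim, 64), nn.ReLU(), nn.Linear(64, audio_dim))

    def forward(self, voice_features):
        return self.net(voice_features)

def train(model, voice_feats, audio_params, epochs=200, lr=1e-3):
    # voice_feats: (N, voice_dim), audio_params: (N, audio_dim) tensors
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):            # iterate until the objective converges
        optimizer.zero_grad()
        loss = loss_fn(model(voice_feats), audio_params)
        loss.backward()
        optimizer.step()
    return model
```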

It should be noted that, where the method in fig. 2 is executed by the client, the background sound fusion model may be trained at the background server and then deployed at the client, so that the client can directly use the pre-trained model to recognize the voice characteristic parameter information from the user voice information and adjust the audio characteristic parameters of the background music audio information accordingly.

The user voice information may include that of at least one of the video chat calling party and the video chat called party. Specifically, when determining the user's voice characteristic parameter information, only the user voice information of the calling party or the called party collected within a preset time period after the video chat starts may be considered, or the user voice information of both parties collected within that period may be considered at the same time.

For example, where only one party's user voice information is considered, the voice characteristic parameter information may be determined from the first received user voice information that meets a preset condition; specifically, if the first received user voice information whose effective speech duration exceeds a preset threshold comes from the calling party, the voice characteristic parameter information is determined from the calling party's user voice information. Alternatively, which party's user voice information to use may be decided by the duration of effective speech it contains: the party with the longest effective speech duration within the preset time period is taken as the reference party for determining the voice characteristics, and the voice characteristic parameter information is determined from that party's user voice information.

For another example, where both parties' user voice information is considered at the same time, first voice characteristic parameter information may be determined from the calling party's collected first user voice information, and second voice characteristic parameter information from the called party's collected second user voice information. If the two are the same, either is taken as the voice characteristic parameter information of the target user. If they differ, the first effective speech duration corresponding to the first user voice information is compared with the second effective speech duration corresponding to the second user voice information over the preset time period: if the first is longer, the first voice characteristic parameter information is used; otherwise, the second is used. The audio characteristic parameter information of the background music audio information is then adjusted based on the selected voice characteristic parameter information, yielding the background music fusion sound finally to be played at the client.

In a specific implementation, the audio characteristic parameter information of the background music played at multiple clients may be adjusted from the same voice characteristic parameter information, or different clients may adjust from different voice characteristic parameter information. For example, the calling party's background music audio information may be adjusted directly from the first voice characteristic parameter information, yielding the first background music fusion sound to be played at the calling party (i.e., the audio characteristics of the calling party's background music are determined by the calling party's own voice characteristics), while the called party's background music audio information is adjusted from the second voice characteristic parameter information, yielding the second background music fusion sound to be played at the called party (i.e., the audio characteristics of the called party's background music are determined by the called party's own voice characteristics).

Specifically, after adjusting the audio characteristic parameter information based on the voice characteristic parameter information, the background server sends the adjusted background music audio information to the video chat calling party and the video chat called party, and each client receives and plays it; alternatively, the client directly plays the background music audio information based on the adjusted audio characteristic parameter information.

To avoid the background music interfering with the video chat, the background music audio information is the main melody information of the background music obtained by removing the lyrics. The main melody information is stored as a MIDI file, i.e., a file storing digitized information about at least one of the timing, position, intensity, duration, vibrato, and dynamics of a sound signal. A MIDI file usually contains a multi-track accompaniment, and the complete main melody can be extracted from the multi-track MIDI. Audio characteristic parameter information (i.e., an audio feature vector) is then extracted from the main melody information and adjusted based on the voice characteristic parameter information, and the background music audio information is played based on the adjusted audio characteristic parameter information.
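For illustration, the following sketch uses the open-source `mido` library to read a multi-track MIDI file and keep the note events of one track, as a stand-in for the main-melody extraction described above. The file name and the "track with the most notes is the melody" heuristic are assumptions, not taken from the patent.

```python
# A sketch of reading a multi-track MIDI file with `mido` and selecting the
# densest track as the main melody.
import mido

def extract_melody_notes(path):
    mid = mido.MidiFile(path)
    best_notes = []
    for track in mid.tracks:
        notes = [msg for msg in track
                 if msg.type == 'note_on' and msg.velocity > 0]
        if len(notes) > len(best_notes):
            best_notes = notes
    # Each note_on message carries pitch (msg.note), intensity (msg.velocity)
    # and delta time (msg.time), from which an audio feature vector can be built.
    return best_notes

melody = extract_melody_notes('background_music.mid')
```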

In the embodiment of the application, the face image information and the user voice information of a target user in a target video chat scene are first acquired. Mouth shape change information of the target user can be recognized from the face image information, so the chat content of the target user can be determined and, in turn, the background music audio information corresponding to the target video chat scene. The voice characteristics of the target user can be recognized from the user voice information, so the voice characteristic parameter information of the target user can be determined. The audio characteristic parameter information of the determined background music audio information is then adjusted based on the voice characteristic parameter information, and the background music audio information is played based on the adjusted audio characteristic parameter information. In other words, the background music is matched automatically from the face image information and its audio characteristic parameters are adjusted automatically from the user voice information, which not only weaves background music matching the current chat topic into the video chat, but also intelligently adapts both the chosen music and its audio characteristic parameters to the user's chat content and voice characteristics, improving the user experience during video chat.

Further, regarding how the background music audio information is determined: mouth shape recognition may be performed on the face image information, and the speech phoneme sequence corresponding to the user mouth shape information is then matched against the lyric phoneme sequences of the candidate background music, thereby determining the background music audio information matching the current video chat scene. In other words, the background music audio information can be determined by phoneme matching between the user speech phoneme sequence extracted from the face image information and the lyric phoneme sequences. As shown in fig. 4, S101 (determining background music audio information corresponding to the target video chat scene based on the face image information acquired for the target video chat scene) specifically includes:

s1011, determining user mouth shape information based on the face image information acquired aiming at the target video chat scene;

s1012, determining a user voice phoneme sequence based on the determined user mouth shape information;

In a specific implementation, a pre-trained mouth shape recognition model can be used to recognize the determined user mouth shape information and output the user speech phoneme sequence. Specifically, the mouth shape recognition model may be trained as follows:

acquiring second training sample data, wherein the second training sample data comprises a plurality of mouth shape recognition model training samples, and each mouth shape recognition model training sample represents the corresponding relation between the mouth shape information of the historical user and the voice phoneme sequence of the historical user;

iteratively training and updating preset mouth shape recognition model parameters with a machine learning method based on the second training sample data until the objective function of the mouth shape recognition model converges, thereby obtaining the trained mouth shape recognition model, which predicts the user speech phoneme sequence from the user mouth shape information.

It should be noted that, where the method in fig. 2 is executed by the client, the mouth shape recognition model may be trained at the background server and then deployed at the client, so that the client can directly use the pre-trained model to recognize the determined user mouth shape information and determine the user speech phoneme sequence.

Specifically, after the face image information of the target user is collected, the face image information acquired for the target video chat scene within a preset time interval is examined using mouth shape detection from machine vision. The face image information within the interval is a continuously changing image sequence; continuously changing mouth position information is identified from it, yielding features of the continuously changing user mouth shape (i.e., digitally encoded feature vectors). These features are input to the pre-trained mouth shape recognition model, which recognizes the pronunciations corresponding to the mouth shapes and outputs the user speech phoneme sequence, i.e., the most likely natural-language phoneme sequence for those pronunciations.
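As a rough illustration of such a model, the sketch below uses a small PyTorch LSTM that maps a sequence of mouth-shape feature vectors to per-frame phoneme logits. The feature dimension, hidden size, and phoneme inventory size are assumptions; the patent only states that the model predicts a speech phoneme sequence from mouth shape information.

```python
# A minimal sketch of a mouth shape recognition model: an LSTM over the
# continuously changing mouth-shape feature vectors, one phoneme per frame.
import torch
import torch.nn as nn

class MouthShapeRecognizer(nn.Module):
    def __init__(self, feat_dim=32, hidden=128, n_phonemes=48):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_phonemes)

    def forward(self, mouth_feats):            # (batch, time, feat_dim)
        out, _ = self.lstm(mouth_feats)
        return self.head(out)                  # (batch, time, n_phonemes)

model = MouthShapeRecognizer()
frames = torch.randn(1, 50, 32)                # 50 frames of mouth-shape vectors
phoneme_ids = model(frames).argmax(dim=-1)     # most likely phoneme per frame
```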

S1013, selecting, based on the lyric phoneme sequence of each candidate background music, background music audio information matching the user speech phoneme sequence.

Specifically, the user speech phoneme sequence is compared with the lyric phoneme sequences of the candidate background music, and it is determined whether any candidate contains at least one music fragment whose lyric phoneme sequence has a similarity to the user speech phoneme sequence greater than a preset threshold. If so, the audio information of that candidate is used as the background music audio information corresponding to the target video chat scene; if not, the face image information is re-acquired.

Further, to improve matching efficiency, all background music in the score library can be divided into several categories in advance, with the pieces under a target category treated as the candidates. The video chat topic type is determined from the user speech phoneme sequence; the target background music category corresponding to that topic type is identified among the pre-divided categories; and the user speech phoneme sequence is compared with the lyric phoneme sequences of the candidates under the target category. If at least one music fragment of a candidate under the target category has a lyric phoneme sequence whose similarity to the user speech phoneme sequence exceeds the preset threshold, that candidate's audio information is used as the background music audio information for the target video chat scene; otherwise, the face image information is re-acquired.
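A minimal sketch of this matching step follows, using Python's standard-library `difflib.SequenceMatcher` as the similarity measure. The phoneme representation and the 0.8 threshold are assumptions; the patent does not name a particular similarity metric.

```python
# Compare the recognized user phoneme sequence with each candidate's lyric
# phoneme sequence and keep the first candidate above the preset threshold.
from difflib import SequenceMatcher

def match_background_music(user_phonemes, candidates, threshold=0.8):
    # candidates: (song_id, lyric_phoneme_sequence) pairs, ideally
    # pre-filtered to the target background music category.
    for song_id, lyric_phonemes in candidates:
        similarity = SequenceMatcher(None, user_phonemes, lyric_phonemes).ratio()
        if similarity > threshold:
            return song_id
    return None  # no match: re-acquire the face image information

song = match_background_music(
    ['n', 'i', 'h', 'ao'],
    [('song_01', ['n', 'i', 'h', 'ao', 'm', 'a'])])
```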

The pre-divided background music categories may include: festival blessings, advertising promotion, color ring tones, theme publicity, light and cheerful, lyrical and sentimental, military themes, and other music categories.

In the embodiment of the application, the user speech phoneme sequence recognized from the face image information is compared with the lyric phoneme sequences of the background music in the score library, so the matched background music is consistent with the user's video chat topic, which further improves the harmony between the chat content and the background music and the user experience.

Further, since every user's voice characteristics differ, and to make the audio characteristics of the background music better match the user's voice characteristics, the audio characteristic parameters of the background music may be adjusted based on the voice characteristic parameters corresponding to the user voice information. Specifically, as shown in fig. 5, S102 (determining the voice characteristic parameter information of the target user in the target video chat scene based on the user voice information acquired for the target video chat scene) specifically includes:

S1021, preprocessing the user voice information acquired for the target video chat scene to obtain preprocessed user voice information. Specifically, the user voice information, i.e., the user's speech signal, is input into the background sound fusion model, and the speech signal is preprocessed, for example by at least one of pre-emphasis, framing, and windowing.
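For illustration, here is a minimal numpy sketch of the three preprocessing operations named above (pre-emphasis, framing, windowing). The 0.97 pre-emphasis coefficient and the 25 ms / 10 ms frame layout are conventional speech-processing defaults, not values taken from the patent.

```python
# Pre-emphasis, framing and Hamming windowing of a raw speech signal.
import numpy as np

def preprocess(signal, sr=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len:i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)      # windowed frames

frames = preprocess(np.random.randn(16000))    # one second of dummy audio
```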

S1022, extracting the voice characteristic parameter information of the target user in the target video chat scene from the preprocessed user voice information, where the voice characteristic parameter information includes: time domain characteristic parameter information and/or frequency domain characteristic parameter information.

The time domain characteristic parameter information includes voice duration information, pitch period information, and short-time energy spectrum information; the frequency domain characteristic parameter information includes Mel-frequency cepstral coefficients.
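A rough sketch of extracting these four feature families with `librosa` and numpy follows. The energy threshold used as a voice-activity proxy and the frame sizes are assumptions.

```python
# Extract effective speech duration, pitch period, short-time energy and MFCCs.
import numpy as np
import librosa

def extract_voice_features(y, sr=16000):
    frames = librosa.util.frame(y, frame_length=400, hop_length=160).T
    energy = (frames ** 2).sum(axis=1)              # short-time energy spectrum
    voiced = energy > 0.1 * energy.mean()           # crude activity mask
    duration = voiced.sum() * 160 / sr              # effective speech duration (s)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)   # fundamental frequency track
    pitch_period = 1.0 / np.median(f0)              # seconds per glottal cycle
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return duration, pitch_period, energy, mfcc
```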

Correspondingly, in S103, adjusting the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information of the target user in the target video chat scene specifically includes:

determining, based on the Mel-frequency cepstral coefficients, whether the timbre of the background music audio information matches the timbre of the user voice information;

if so, adjusting, based on the voice duration information, a first audio characteristic parameter of the background music audio information that represents the playing tempo; and

adjusting, based on the pitch period information, a second audio characteristic parameter of the background music audio information that represents the playing frequency; and

adjusting, based on the short-time energy spectrum information, a third audio characteristic parameter of the background music audio information that represents the playing volume.

Specifically, if the timbre of the background music audio information matches the timbre of the user voice information, that background music audio information is taken as the background music audio information to be adjusted. Correspondingly, adjusting the first audio characteristic parameter (playing tempo) based on the voice duration information proceeds by analyzing the target user's speaking rate from the voice duration information: if the number of user speech phonemes recognized within the preset time interval is greater than the number of lyric phonemes, the target user's speaking rate is judged faster than the playing speed of the background music, and the playing tempo of the background music is accelerated; if the number of recognized user speech phonemes is smaller than the number of lyric phonemes, the speaking rate is judged slower than the playing speed, and the playing tempo is slowed down.

Correspondingly, for the process of adjusting, based on the pitch period information, the second audio characteristic parameter of the background music audio information for representing the playing frequency: the sound frequency of the target user is analyzed based on the pitch period information; if the sound frequency of the target user is higher than the playing frequency of the background music, the playing frequency of the background music is increased; if it is lower, the playing frequency of the background music is reduced. The sound frequency of the target user can be used to distinguish sound attributes such as a male voice or a female voice, so the playing frequency of the background music can be adjusted based on the sound frequency of the target user to match the sound attribute of the target user.

Correspondingly, for the process of adjusting, based on the short-time energy spectrum information, the third audio characteristic parameter of the background music audio information for representing the playing volume: the volume of the target user is analyzed based on the short-time energy spectrum information; if the volume of the target user is higher than the playing volume of the background music, the playing volume of the background music is increased; if it is lower, the playing volume of the background music is reduced; and if no user voice information of the target user is detected, the playing volume of the background music is increased.
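A sketch of this adjustment logic under assumed representations: MFCC matrices of shape (n_mfcc, n_frames) for the timbre check, phoneme counts per interval for the tempo rule, a pitch frequency in Hz and an RMS volume for the other two rules. The `bgm` object with `tempo`, `freq_hz` and `volume` attributes, the similarity threshold and the 10% step sizes are all hypothetical.

```python
import numpy as np

def timbre_matches(user_mfcc: np.ndarray, bgm_mfcc: np.ndarray,
                   min_sim: float = 0.8) -> bool:
    # Cosine similarity between mean MFCC vectors as a crude timbre match
    u, b = user_mfcc.mean(axis=1), bgm_mfcc.mean(axis=1)
    sim = float(np.dot(u, b) / (np.linalg.norm(u) * np.linalg.norm(b)))
    return sim >= min_sim

def adjust_bgm(bgm, user_phonemes: int, lyric_phonemes: int,
               user_freq_hz: float, user_rms: float) -> None:
    if user_phonemes > lyric_phonemes:      # user speaks faster than lyrics
        bgm.tempo *= 1.1                    # accelerate playing tempo
    elif user_phonemes < lyric_phonemes:
        bgm.tempo *= 0.9                    # slow the tempo down
    if user_freq_hz > bgm.freq_hz:          # nudge playing frequency upward
        bgm.freq_hz = min(user_freq_hz, bgm.freq_hz * 1.1)
    elif user_freq_hz < bgm.freq_hz:        # or downward, toward the user
        bgm.freq_hz = max(user_freq_hz, bgm.freq_hz * 0.9)
    if user_rms == 0.0:                     # no user speech detected
        bgm.volume *= 1.2                   # raise background volume
    elif user_rms > bgm.volume:
        bgm.volume *= 1.1
    else:
        bgm.volume *= 0.9
```

Per the judgment step above, `adjust_bgm` would only be called when `timbre_matches` returns true.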

Furthermore, an equalizer (EQ) algorithm may be applied, that is, the components of the user voice information and the background music audio information in the 500 Hz-8 kHz frequency range are appropriately boosted to balance the various sound information in the audio mix (the user voice information and the background music audio information), thereby improving the overall timbre effect.
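A crude FFT-domain sketch of that band boost; a production equalizer would normally use biquad filter banks with smooth shelf/peaking curves, so this only illustrates applying a modest gain in the 500 Hz-8 kHz band. The +3 dB gain is an assumed value.

```python
import numpy as np

def eq_boost(mix: np.ndarray, sr: int = 16000, lo: float = 500.0,
             hi: float = 8000.0, gain_db: float = 3.0) -> np.ndarray:
    spec = np.fft.rfft(mix)
    freqs = np.fft.rfftfreq(len(mix), d=1.0 / sr)
    band = (freqs >= lo) & (freqs <= hi)      # presence band 500 Hz - 8 kHz
    spec[band] *= 10.0 ** (gain_db / 20.0)    # apply the boost in dB
    return np.fft.irfft(spec, n=len(mix))
```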

In the embodiment of the invention, the audio characteristic parameters of the background music are adjusted based on the voice characteristic parameters corresponding to the voice information of the user, so that the audio characteristics of the background music are more matched with the voice characteristics of the user, and the use experience of the user is further improved.

Further, considering that a voice-change setting may be enabled during the user's video chat to enhance the interest of the chat, the voice characteristic parameter information needs to be determined based on the user voice information after voice-change processing. For the case that the method in fig. 2 is executed by the client, the client may directly perform voice-change processing on the user voice information and determine the voice characteristic parameter information based on the processed voice information. For the case that the method in fig. 2 is executed by the client and the background server together, the user voice information uploaded by the client may be the voice information before or after voice change: if it is the voice information after voice change, the background server may directly determine the voice characteristic parameter information based on the received user voice information; if it is the voice information before voice change, the background server needs to perform voice-change processing on the user voice information first. Based on this, the preprocessing of the user voice information acquired for the target video chat scene to obtain the preprocessed user voice information specifically includes:

when the client is determined to select the voice change setting, carrying out voice change processing on the user voice information acquired aiming at the target video chat scene to obtain the voice information of the user after the voice change processing; specifically, the voice change type option information selected by the client is determined, and voice change processing is performed on the user voice information acquired aiming at the target video chat scene based on the voice change type option information;

determining preprocessed user voice information based on the voice information of the user after the sound changing processing; specifically, after the voice information of the user is subjected to the sound change processing, at least one of pre-emphasis, framing and windowing can be continuously performed on the voice information of the user after the sound change processing.

Specifically, as shown in fig. 6, a "change sound" button is added to the video chat interface; the "change sound" button defaults to a closed state when the video chat starts, and the user can set it to an open state according to the user's own needs. If a user wants to add interest to his or her voice information during a video chat, the "change sound" button is turned on; further, a "change sound" interface pops up in the video chat interface for the user to select, as shown in fig. 7, where the user may select "single change sound" or "both change sound" according to the user's own requirements, and may further select the sound attribute after the voice change, where the sound attribute may include: any one of an uncle voice, a mature female voice, a young lady voice, a child voice and a magnetic sweet voice.

In specific implementation, when it is determined that the user selects the voice-change setting, that is, the user sets the "change sound" button to the open state, voice-change processing is performed on the user voice information acquired in the target video chat scene based on the voice-change type option information selected by the user, so as to obtain the voice information of the user after the voice-change processing. Specifically, for the case that the method in fig. 2 is executed by the client and the background server together, the voice-change processing of the user voice information may be executed either by the client or by the background server. For the case executed by the client, the client directly uploads the voice-changed user voice information to the background server, and the background server directly inputs it into the background sound fusion model. For the case executed by the background server, after detecting that the user completes the voice-change setting, the client sends the voice-change type option information selected by the user to the background server and uploads the user voice information before voice change, and the background server performs the voice-change processing and inputs the result into the background sound fusion model. The voice information of the user after the voice-change processing is the voice signal information of the user after the voice-change processing; this signal information is preprocessed, that is, at least one of pre-emphasis, framing and windowing is performed on the voice-changed voice signal to obtain the preprocessed user voice information, and step S1022 is executed to extract the voice characteristic parameter information from the voice-changed user voice information, so that the audio characteristic parameter information is adjusted based on it. If the user feels that the voice-change effect is not good, the voice change can be turned off.
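The application does not specify a voice-change algorithm; the minimal sketch below shifts pitch by simple resampling, which also changes duration (real voice changers typically use phase-vocoder or pitch-synchronous methods to preserve timing). The ratio per voice type is a hypothetical mapping.

```python
import numpy as np

VOICE_RATIOS = {"uncle": 0.8, "child": 1.5, "sweet": 1.25}  # hypothetical

def change_voice(signal: np.ndarray, voice_type: str) -> np.ndarray:
    ratio = VOICE_RATIOS.get(voice_type, 1.0)
    # Linear-interpolation resampling: stepping through the signal faster
    # (ratio > 1) raises the pitch and shortens the clip, and vice versa
    idx = np.arange(0, len(signal) - 1, ratio)
    return np.interp(idx, np.arange(len(signal)), signal)
```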

In the embodiment of the invention, changing the sound attribute of the user can prevent the risk of the user's own voice being stolen by means such as recording, and improves the entertainment of the video chat, thereby improving the participation experience of multiple users in the video chat and protecting the privacy and safety of the user's personal voice.

Further, for the situation that there is ambient noise in the environment during the video chat, the preprocessing of the user voice information acquired for the target video chat scene to obtain the preprocessed user voice information specifically includes:

judging whether the voice information of the user contains noise audio information or not;

if the judgment result is yes, denoising the user voice information based on the noise category of the noise audio information to obtain the denoised user voice information; specifically, the noise categories include at least one of driving-scene sound, open-space reverberant sound, mechanical noise and animal sound; through a pre-trained noise recognition model, the noise category in the user voice information is first recognized automatically, and the different categories of noise are then denoised accordingly.

When a noise recognition model is trained, acquiring third training sample data, wherein the third training sample data comprises a plurality of noise recognition model training samples, and each noise recognition model training sample represents a corresponding relation between sample voice information containing certain type of noise and a noise type;

and performing iterative training and updating on preset noise recognition model parameters by adopting a machine learning method based on the third training sample data to obtain updated model parameters until model functions corresponding to the noise recognition model converge, and further obtaining a trained noise recognition model, wherein the noise recognition model is used for performing noise type recognition on the voice information of the user.
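The application trains the noise recognition model iteratively by machine learning; as a deliberately simple stand-in that still shows the sample-to-label correspondence idea from the training pairs, the sketch below fits a nearest-centroid classifier over per-clip feature vectors (for example, mean MFCCs). The feature representation is an assumption.

```python
import numpy as np

class NearestCentroidNoiseModel:
    """Toy noise-category classifier: one centroid per noise class."""

    def fit(self, feats: np.ndarray, labels: list[str]):
        self.classes = sorted(set(labels))
        # Mean feature vector of all training clips in each class
        self.centroids = np.stack(
            [feats[[l == c for l in labels]].mean(axis=0)
             for c in self.classes])
        return self

    def predict(self, feat: np.ndarray) -> str:
        # Assign the clip to the nearest class centroid
        d = np.linalg.norm(self.centroids - feat, axis=1)
        return self.classes[int(np.argmin(d))]
```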

It should be noted that, for the case that the method in fig. 2 is executed by the client, the noise recognition model may be obtained by training at the background server, and then the trained noise recognition model is deployed at the client, so that the client can directly use the pre-trained noise recognition model to automatically recognize the noise type in the user voice information, and further perform noise removal processing on the noise of different types.

Specifically, noise identification models constructed based on different types of noise are utilized to determine the noise types contained in the user voice information acquired aiming at the target video chat scene, and then the user voice information is subjected to denoising treatment by adopting a denoising treatment mode corresponding to the noise types, so that the identification efficiency of noise identification can be improved, and the effect of denoising treatment of the user voice information is improved.

Determining preprocessed user voice information based on the user voice information subjected to the denoising processing; specifically, after the denoising processing is performed on the user voice information, at least one of pre-emphasis, framing and windowing can be continuously performed on the user voice information after the denoising processing.

In specific implementation, denoising processing is performed on the user voice information acquired in the target video chat scene to obtain the denoised user voice information, and the voice characteristic parameter information of the target user is then determined based on it. For the case that the method in fig. 2 is executed by the client, the client may directly perform noise recognition and denoising on the user voice information and determine the voice characteristic parameter information based on the denoised user voice information; the specific recognition and processing may refer to the processing of the background server. Correspondingly, for the case that the method in fig. 2 is executed by the client and the background server together, the user voice information uploaded by the client may be the voice information before denoising (that is, the denoising is executed by the background server) or the voice information after denoising (that is, the denoising is executed by the client). For the case that denoising is executed by the client, the client directly uploads the denoised user voice information to the background server, which inputs it directly into the background sound fusion model; for the case that denoising is executed by the background server, the client uploads the user voice information before denoising, and the background server performs the denoising processing and inputs the result into the background sound fusion model. The denoised user voice signal information is then preprocessed, that is, at least one of pre-emphasis, framing and windowing is performed on the denoised voice signal to obtain the preprocessed user voice information, and step S1022 is executed to extract the voice characteristic parameter information from the denoised user voice information, so that the audio characteristic parameter information is adjusted based on it.
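One common denoising scheme consistent with this description is spectral subtraction; the sketch below estimates the noise floor from the first few frames, which assumes those frames are speech-free. In practice the estimator and parameters would differ per recognized noise category.

```python
import numpy as np

def spectral_subtract(frames: np.ndarray, noise_frames: int = 10,
                      floor: float = 0.02) -> np.ndarray:
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)       # noise-floor estimate
    # Subtract the noise magnitude, keeping a small spectral floor to
    # avoid "musical noise" from over-subtraction
    clean = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean * np.exp(1j * phase), n=frames.shape[1], axis=1)
```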

Further, in order to detect the effect of the denoising processing, the denoised user voice information may be input into a perceptual objective voice quality evaluation model (POLQA, Perceptual Objective Listening Quality Analysis, standardized as ITU-T P.863), which identifies the denoising effect of the denoised user voice information; whether the denoising effect reaches the expectation is then determined according to the identification result. If yes, the information is determined as the final denoised user voice information, and the preprocessed user voice information is determined based on it; if not, denoising processing is performed on the user voice information again until the denoising effect reaches the expectation. Specifically, in the process of identifying the denoising effect, the currently denoised user voice information (i.e., the evaluation reference voice signal) is sent to the opposite terminal through the background server, and the voice signal received by the opposite terminal (i.e., the voice signal to be evaluated) is compared with it; the perceptual difference between the evaluation reference voice signal and the voice signal to be evaluated is evaluated as a difference score. Since the user voice information undergoes signal distortion while being transmitted through the background server, and the quality of the denoising effect determines the severity of that distortion (the worse the denoising effect, the more serious the distortion), the perceptual objective voice quality evaluation model can be used to detect the denoising effect: when the difference score is greater than a preset threshold, denoising processing is performed on the user voice information again.
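POLQA itself is a licensed standardized algorithm and is not reproduced here; as a crude stand-in for the check-and-retry loop, the sketch below scores the received signal against the reference by segmental SNR and re-runs denoising while the score stays below a threshold. `denoise` and `transmit` are hypothetical caller-supplied callables, and the threshold is an assumption.

```python
import numpy as np

def seg_snr(ref: np.ndarray, test: np.ndarray, seg: int = 160) -> float:
    # Average per-segment SNR between reference and received signals
    n = min(len(ref), len(test)) // seg * seg
    r, t = ref[:n].reshape(-1, seg), test[:n].reshape(-1, seg)
    noise = np.sum((r - t) ** 2, axis=1) + 1e-12
    return float(np.mean(10 * np.log10(np.sum(r ** 2, axis=1) / noise + 1e-12)))

def denoise_until_ok(signal, denoise, transmit, min_snr_db=15.0, max_iter=3):
    for _ in range(max_iter):
        clean = denoise(signal)
        if seg_snr(clean, transmit(clean)) >= min_snr_db:
            return clean          # denoising effect judged acceptable
        signal = clean            # otherwise denoise again
    return clean
```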

In the embodiment of the invention, denoising the user voice information can filter out the influence of environmental noise during the video chat, and determining the voice characteristic parameter information based on the denoised user voice information can improve the accuracy of determining the voice characteristic parameter information, and further the accuracy of adjusting the audio characteristic parameter information of the background music. Moreover, the evaluation of the denoising effect of the denoised user voice information is added, and the voice characteristic parameter information is determined based on user voice information whose denoising effect reaches the expectation, which can further improve the accuracy of determining the voice characteristic parameter information and of adjusting the audio characteristic parameter information of the background music.

Further, in the process of video chatting, different background music may need to be substituted as the chat topic changes. Specifically, the face image information of the target user in the target video chat scene is acquired at a preset time interval; user mouth shape information is determined based on the currently acquired face image information; and a user voice phoneme sequence is determined based on the user mouth shape information. If the currently determined user voice phoneme sequence differs from the last determined user voice phoneme sequence, step S1012 is continued: background music audio information matched with the currently determined user voice phoneme sequence is selected based on the lyric phoneme sequence of each alternative background music, and steps S102 to S103 are executed, so that the background music audio information matched with the currently determined user voice phoneme sequence is played according to a preset background sound switching mode. Specifically, within a preset time interval, the playing volume of the background music audio information matched with the last determined user voice phoneme sequence is gradually reduced while the playing volume of the background music audio information matched with the currently determined user voice phoneme sequence is gradually increased (a crossfade, sketched below), thereby completing the transition between the last determined and the currently determined background music audio information and making the conversion between the background music more natural. Since the determined user voice phoneme sequence changes, the currently determined background music audio information also changes relative to the last determined one, that is, a background sound style conversion is required, and the played background sound can be adaptively adjusted along with the change of the chat topic, improving the user experience. Further, when the user selects the voice-change setting, the voice characteristic parameter information in the user voice information also changes, and adjusting the audio characteristic parameter information based on it yields background music audio information with new audio characteristic parameter information; at this time, the preset background sound switching mode may likewise be used to play the background music audio information with the new audio characteristic parameter information.
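A sketch of the described switching: fade the old track out and the new track in linearly over the preset interval, assuming equal sampling rates and per-sample mixing.

```python
import numpy as np

def crossfade(old_tail: np.ndarray, new_head: np.ndarray) -> np.ndarray:
    # Fade window = overlap of the outgoing tail and the incoming head
    n = min(len(old_tail), len(new_head))
    ramp = np.linspace(0.0, 1.0, n)
    return old_tail[:n] * (1.0 - ramp) + new_head[:n] * ramp
```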

Further, for a situation that a target user may have a dialect accent, which may result in a problem that background music audio information matching with a user speech phoneme sequence cannot be found, the determining a user speech phoneme sequence based on the user mouth shape information specifically includes:

when the fact that dialect accents exist in the user voice information is determined, determining a dialect phoneme sequence based on the user accent information;

converting the determined dialect phoneme sequence into a standard phoneme sequence based on the corresponding relation between the preset dialect phoneme and the standard phoneme;

and determining the user voice phoneme sequence based on the converted standard phoneme sequence.
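Once the preset correspondence table exists, the dialect-to-standard conversion in the steps above reduces to a lookup. The entries below are placeholders, not real dialect phonology.

```python
# Hypothetical dialect-phoneme -> standard-phoneme correspondences
DIALECT_TO_STANDARD = {"ln": "l", "hf": "f"}

def to_standard(dialect_seq: list[str]) -> list[str]:
    # Phonemes without a dialect-specific mapping pass through unchanged
    return [DIALECT_TO_STANDARD.get(p, p) for p in dialect_seq]
```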

In specific implementation, the mouth shape recognition model trained in advance can be used for recognizing the determined mouth shape information of the user to determine a dialect phoneme sequence; specifically, the mouth shape recognition model may be obtained by training as follows:

acquiring fourth training sample data, wherein the fourth training sample data comprises a plurality of mouth shape recognition model training samples, and each mouth shape recognition model training sample represents the corresponding relation between the mouth shape information of the historical user and the phoneme sequence of the historical dialect;

and performing iterative training and updating on preset mouth shape recognition model parameters by adopting a machine learning method based on the fourth training sample data to obtain updated model parameters until model functions corresponding to the mouth shape recognition model converge, and further obtaining a trained mouth shape recognition model, wherein the mouth shape recognition model is used for predicting the speech phoneme sequence based on the mouth shape information of the user.

It should be noted that, for the case that the method in fig. 2 is executed by the client, the mouth shape recognition model may be obtained by training at the background server, and then the trained mouth shape recognition model is deployed at the client, so that the client can directly recognize the determined mouth shape information of the user by using the mouth shape recognition model trained in advance, and determine the dialect phoneme sequence.

Specifically, after the face image information of the target user is acquired, the face image information acquired for the target video chat scene within a preset time interval is detected by using a mouth shape detection technology in machine vision, where the face image information within the preset time interval is a continuously changing image sequence; that is, continuously changing face mouth shape position information is identified from the continuously changing image sequence, so as to obtain the continuously changing features of the user's mouth shape (i.e., digitally encoded vector features). These features are then input into the pre-trained mouth shape recognition model, the dialect pronunciation corresponding to the user's mouth shape is recognized, a dialect phoneme sequence is output based on that pronunciation, and the dialect phoneme sequence is converted into a standard phoneme sequence based on the preset correspondence between dialect phonemes and standard phonemes, so as to determine the user voice phoneme sequence and obtain the most probable natural language phoneme sequence.
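Assuming the mouth shape recognition model emits per-frame phoneme probabilities (an assumption; the application does not fix the model's output form), the "most probable" sequence can be approximated by a greedy decode that collapses consecutive repeats, CTC-style.

```python
import numpy as np

def greedy_decode(frame_probs: np.ndarray, phonemes: list[str]) -> list[str]:
    # frame_probs: (num_frames, num_phonemes) probability matrix
    ids = np.argmax(frame_probs, axis=1)   # best phoneme per frame
    out, prev = [], None
    for i in ids:
        if i != prev:                      # collapse consecutive repeats
            out.append(phonemes[i])
        prev = int(i)
    return out
```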

Further, after the user voice phoneme sequence is determined based on the determined dialect phoneme sequence, background music audio information matched with the user voice phoneme sequence is selected based on the lyric phoneme sequence of each alternative background music.
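The application does not fix a matching metric between the user voice phoneme sequence and each candidate's lyric phoneme sequence; one plausible choice is a normalized longest-common-subsequence score, with the highest-scoring candidate selected.

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic O(len(a) * len(b)) dynamic program for the LCS length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def pick_bgm(user_seq: list[str], candidates: dict[str, list[str]]) -> str:
    # candidates: track name -> lyric phoneme sequence
    return max(candidates,
               key=lambda name: lcs_len(user_seq, candidates[name])
               / max(len(candidates[name]), 1))
```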

In specific implementation, the target user first selects whether to enable the intelligent soundtrack based on his or her own needs. After the target user enables it, the mouth shape information of the target user is determined based on the acquired face image information, a user voice phoneme sequence is determined based on the mouth shape information, and background music audio information matched with the user voice phoneme sequence is selected based on the lyric phoneme sequence of each alternative background music. When a dialect accent exists in the user voice information, a dialect phoneme sequence is determined based on the user accent information and converted into a standard phoneme sequence based on the preset correspondence between dialect phonemes and standard phonemes; the user voice phoneme sequence is determined based on the standard phoneme sequence, and the matched background music audio information is then selected as above. Voice characteristic parameter information of the target user in the target video chat scene is determined based on the user voice information acquired for the target video chat scene; the voice characteristic parameter information may be determined from the voice-changed user voice information, from the denoised user voice information, or from user voice information that has undergone both voice change and denoising. The audio characteristic parameter information of the background music audio information is adjusted based on the determined voice characteristic parameter information. Specifically, the user voice information and the background music audio information are simultaneously input into the background sound fusion model, the user voice information is preprocessed to obtain its voice characteristic parameter information, and the audio characteristic parameter information in the background music audio information is adjusted based on it: whether the timbres of the background music audio information and the user voice information match is judged based on the Mel-frequency cepstral coefficients; if the judgment result is yes, the first audio characteristic parameter of the background music audio information for representing the playing tempo is adjusted based on the voice duration information, the second audio characteristic parameter for representing the playing frequency is adjusted based on the pitch period information, and the third audio characteristic parameter for representing the playing volume is adjusted based on the short-time energy spectrum information; and the background music audio information is played based on the adjusted audio characteristic parameter information.

The audio playing method in the embodiment of the application comprises the steps of firstly obtaining face image information and user voice information of a target user in a target video chat scene; the mouth shape change information of the target user can be identified from the face image information, so that the chat content of the target user can be determined, and the background music audio information corresponding to the target video chat scene can be further determined; the voice characteristics of the target user can be recognized from the user voice information of the target user, so that the voice characteristic parameter information of the target user can be determined; then, based on the voice characteristic parameter information, the audio characteristic parameter information of the determined background music audio information is adjusted, so that the background music audio information is played based on the adjusted audio characteristic parameter information, namely, the background music is automatically matched based on the face image information, and meanwhile, the audio characteristic parameter information of the background music is automatically adjusted based on the voice information of the user, so that not only is the background music matched with the current chat topic integrated in the video chat realized, but also the audio and the audio characteristic parameters of the background music can be intelligently adjusted based on the chat content of the user and the voice characteristic parameters of the user, so that the background music is more matched with the chat content of the user and the voice characteristics of the user, and the use experience of the user in the video chat process is improved.

It should be noted that, in the audio playing method provided in the embodiment of the present application, the execution main body may be an audio playing device, or a control module used for executing the audio playing method in the audio playing device. The embodiment of the present application takes an audio playing device executing an audio playing method as an example, and describes an audio playing device provided in the embodiment of the present application.

According to the audio playing method provided by the embodiment of the application, in the process of video chatting between a video chatting calling party and a video chatting called party, the face image information and the user voice information of a target user in a target video chatting scene are acquired; the mouth shape change information of the target user can be identified from the face image information, so that the chat content of the target user can be determined, and the background music audio information corresponding to the target video chat scene can be further determined; the voice characteristics of the target user can be recognized from the user voice information of the target user, so that the voice characteristic parameter information of the target user can be determined; then, based on the voice characteristic parameter information, the audio characteristic parameter information of the determined background music audio information is adjusted, so that the background music audio information is played based on the adjusted audio characteristic parameter information, namely, the background music is automatically matched based on the face image information, and meanwhile, the audio characteristic parameter information of the background music is automatically adjusted based on the voice information of the user, so that not only is the background music matched with the current chat topic integrated in the video chat realized, but also the audio and the audio characteristic parameters of the background music can be intelligently adjusted based on the chat content of the user and the voice characteristic parameters of the user, so that the background music is more matched with the chat content of the user and the voice characteristics of the user, and the use experience of the user in the video chat process is improved.

It should be noted that the embodiment of the present application and the previous embodiment of the present application are based on the same inventive concept, and therefore, for specific implementation of the embodiment, reference may be made to implementation of the foregoing audio playing method, and repeated details are not repeated.

On the basis of the same technical concept, an audio playing apparatus is further provided in an embodiment of the present application corresponding to the audio playing method provided in the foregoing embodiment, and fig. 8 is a schematic diagram illustrating a module composition of the audio playing apparatus provided in the embodiment of the present application, where the audio playing apparatus is disposed at a background server or a client and is configured to execute the audio playing method described in fig. 1 to 7, and as shown in fig. 8, the audio playing apparatus includes:

a background music determining module 802, configured to determine, based on face image information acquired for a target video chat scene, background music audio information corresponding to the target video chat scene;

a voice characteristic parameter determining module 804, configured to determine, based on the user voice information acquired for the target video chat scene, voice characteristic parameter information of a target user in the target video chat scene;

an audio characteristic parameter adjusting module 806, configured to adjust the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information, and play the background music audio information based on the audio characteristic parameter information.

Optionally, the background music determining module 802 is specifically configured to:

determining user mouth shape information based on face image information acquired aiming at a target video chat scene;

determining a user voice phoneme sequence based on the user mouth shape information;

and selecting background music audio information matched with the user voice phoneme sequence based on the lyric phoneme sequence of each alternative background music.

Optionally, the speech feature parameter determining module 804 is specifically configured to:

preprocessing the user voice information acquired aiming at the target video chat scene to obtain preprocessed user voice information;

extracting voice characteristic parameter information of a target user in the target video chat scene from the preprocessed user voice information, wherein the voice characteristic parameter information comprises: time domain characteristic parameter information and/or frequency domain characteristic parameter information.

Optionally, the time-domain feature parameter information includes: the information of the voice duration, the information of the pitch period and the information of the short-time energy spectrum, and the information of the frequency domain characteristic parameters comprises: mel-frequency cepstral coefficients;

the audio characteristic parameter adjusting module 806 is further specifically configured to:

judging, based on the Mel-frequency cepstral coefficients, whether the timbre of the background music audio information matches the timbre of the user voice information;

if the judgment result is yes, adjusting a first audio characteristic parameter, used for representing the playing tempo, of the background music audio information based on the voice duration information; and,

adjusting a second audio characteristic parameter, used for representing the playing frequency, of the background music audio information based on the pitch period information; and,

and adjusting a third audio characteristic parameter of the background music audio information for representing the playing volume based on the short-time energy spectrum information.

Optionally, the speech feature parameter determining module 804 is further specifically configured to:

judging whether the user voice information contains noise audio information or not;

if so, performing denoising processing on the user voice information based on the noise category of the noise audio information to obtain denoised user voice information;

and determining the preprocessed user voice information based on the de-noised user voice information.

The audio playing device in the embodiment of the application firstly acquires the face image information and the user voice information of a target user in a target video chat scene; the mouth shape change information of the target user can be identified from the face image information, so that the chat content of the target user can be determined, and the background music audio information corresponding to the target video chat scene can be further determined; the voice characteristics of the target user can be recognized from the user voice information of the target user, so that the voice characteristic parameter information of the target user can be determined; then, based on the voice characteristic parameter information, the audio characteristic parameter information of the determined background music audio information is adjusted, so that the background music audio information is played based on the adjusted audio characteristic parameter information, namely, the background music is automatically matched based on the face image information, and meanwhile, the audio characteristic parameter information of the background music is automatically adjusted based on the voice information of the user, so that not only is the background music matched with the current chat topic integrated in the video chat realized, but also the audio and the audio characteristic parameters of the background music can be intelligently adjusted based on the chat content of the user and the voice characteristic parameters of the user, so that the background music is more matched with the chat content of the user and the voice characteristics of the user, and the use experience of the user in the video chat process is improved.

It should be noted that the embodiment of the present application and the previous embodiment of the present application are based on the same inventive concept, and therefore, for specific implementation of the embodiment, reference may be made to implementation of the foregoing audio playing method, and repeated details are not repeated.

The audio playing device in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.

The audio playing device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android (Android) operating system, an ios operating system, or other possible operating systems, and embodiments of the present application are not limited specifically.

The audio playing device provided in the embodiment of the present application can implement each process implemented by the audio playing method embodiments in fig. 1 to fig. 7, and is not described herein again to avoid repetition.

Optionally, as shown in fig. 9, an electronic device is further provided in this embodiment of the present application, and includes a processor 9011, a memory 909, and a program or an instruction that is stored in the memory 909 and is executable on the processor 9011, and when the program or the instruction is executed by the processor 9011, the processes of the foregoing embodiment of the audio playing method are implemented, and the same technical effect can be achieved, and is not described herein again to avoid repetition.

It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.

Fig. 9 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.

The electronic devices include, but are not limited to: a radio frequency unit 901, a network module 902, an audio output unit 903, an input unit 904, a sensor 905, a display unit 906, a user input unit 907, an interface unit 908, a memory 909, a processor 9011, a power supply 9010, and the like.

Those skilled in the art will appreciate that the electronic device may further include a power supply 9010 (such as a battery) for supplying power to each component, and the power supply 9010 may be logically connected to the processor 9011 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in fig. 9 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is not repeated here.

The processor 9011 is configured to determine, based on face image information acquired for a target video chat scene, background music audio information corresponding to the target video chat scene; and the number of the first and second groups,

determining voice characteristic parameter information of a target user in the target video chat scene based on the user voice information acquired aiming at the target video chat scene;

and adjusting the audio characteristic parameter information of the background music audio information based on the voice characteristic parameter information, and playing the background music audio information based on the audio characteristic parameter information.

In the embodiment of the application, background music is blended into the video chat, and the audio and the audio characteristic parameters of the background music are intelligently adjusted based on the chat content of the user and the voice characteristic parameters of the user, so that the background music better matches the chat content of the user and the voice characteristics of the user, improving the user experience during the video chat.

The electronic equipment in the embodiment of the application firstly acquires the face image information and the user voice information of a target user in a target video chat scene; the mouth shape change information of the target user can be identified from the face image information, so that the chat content of the target user can be determined, and the background music audio information corresponding to the target video chat scene can be further determined; the voice characteristics of the target user can be recognized from the user voice information of the target user, so that the voice characteristic parameter information of the target user can be determined; then, based on the voice characteristic parameter information, the audio characteristic parameter information of the determined background music audio information is adjusted, so that the background music audio information is played based on the adjusted audio characteristic parameter information, namely, the background music is automatically matched based on the face image information, and meanwhile, the audio characteristic parameter information of the background music is automatically adjusted based on the voice information of the user, so that not only is the background music matched with the current chat topic integrated in the video chat realized, but also the audio and the audio characteristic parameters of the background music can be intelligently adjusted based on the chat content of the user and the voice characteristic parameters of the user, so that the background music is more matched with the chat content of the user and the voice characteristics of the user, and the use experience of the user in the video chat process is improved.

It should be understood that, in this embodiment of the application, the radio frequency unit 901 may be used for receiving and sending signals in a process of receiving and sending information or a call, and specifically, after receiving downlink data from a base station, the downlink data is processed by the processor 9011; in addition, the uplink data is transmitted to the base station. Generally, the radio frequency unit 901 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 901 can also communicate with a network and other devices through a wireless communication system.

The electronic device provides wireless broadband internet access to the user via the network module 902, such as assisting the user in sending and receiving e-mails, browsing web pages, and accessing streaming media.

The audio output unit 903 may convert audio data received by the radio frequency unit 901 or the network module 902 or stored in the memory 909 into an audio signal and output as sound. Also, the audio output unit 903 may also provide audio output related to a specific function performed by the electronic device (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 903 includes a speaker, a buzzer, a receiver, and the like.

The input unit 904 may include a Graphics Processing Unit (GPU) 9041 and a microphone 9042; the graphics processor 9041 processes image data of a still picture or video obtained by an image capturing device (such as a camera) in a video capture mode or an image capture mode.

The electronic device also includes at least one sensor 905, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 9061 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 9061 and/or the backlight when the electronic device is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of an electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 905 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which are not described in detail herein.

The display unit 906 is used to display information input by the user or information provided to the user. The Display unit 906 may include a Display panel 9061, and the Display panel 9061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.

The user input unit 907 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 907 includes a touch panel 9071 and other input devices 9072. The touch panel 9071, also referred to as a touch screen, may collect touch operations by a user on or near the touch panel 9071 (e.g., operations by a user on or near the touch panel 9071 using a finger, a stylus, or any other suitable object or accessory). The touch panel 9071 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 9011, and receives and executes a command sent from the processor 9011. In addition, the touch panel 9071 may be implemented by using various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The user input unit 907 may include other input devices 9072 in addition to the touch panel 9071. Specifically, the other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (such as a volume control key, a switch key, and the like), a track ball, a mouse, and a joystick, which are not described herein again.

Further, the touch panel 9071 may be overlaid on the display panel 9061, and when the touch panel 9071 detects a touch operation on or near the touch panel 9071, the touch panel is transmitted to the processor 9011 to determine the type of the touch event, and then the processor 9011 provides a corresponding visual output on the display panel 9061 according to the type of the touch event. Although in fig. 9, the touch panel 9071 and the display panel 9061 are two independent components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 9071 and the display panel 9061 may be integrated to implement the input and output functions of the electronic device, which is not limited herein.

The interface unit 908 is an interface for connecting an external device to the electronic apparatus. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 908 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus or may be used to transmit data between the electronic apparatus and the external device.

The memory 909 may be used to store software programs as well as various data. The memory 909 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 909 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

The processor 9011 is the control center of the electronic device; it connects the various parts of the entire electronic device with various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 909 and calling data stored in the memory 909, thereby performing overall monitoring of the electronic device. The processor 9011 may include one or more processing units; preferably, the processor 9011 may integrate an application processor, which mainly handles the operating system, user interfaces and application programs, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may not be integrated into the processor 9011.

The electronic device may further include a power supply 9010 (such as a battery) for supplying power to each component; preferably, the power supply 9010 may be logically connected to the processor 9011 through a power management system, so that functions of charging, discharging and power consumption management are managed through the power management system.

In addition, the electronic device includes some functional modules that are not shown, and are not described in detail herein.

Preferably, an embodiment of the present application further provides an electronic device, which includes a processor 9011, a memory 909, and a program or an instruction that is stored in the memory 909 and is executable on the processor 9011, and when the program or the instruction is executed by the processor 9011, the processes of the above-mentioned embodiment of the audio playing method are implemented, and the same technical effect can be achieved, and in order to avoid repetition, details are not repeated here.

An embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored in the readable storage medium, and when the program or the instruction is executed by the processor 9011, the processes of the embodiment of the audio playing method are implemented, and the same technical effect can be achieved.

The processor 9011 is the processor in the electronic device described in the foregoing embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.

While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
