Voice processing method and device, computer readable storage medium and electronic device

Document No.: 972877 · Publication date: 2020-11-03

Note: This technology, "Voice processing method and device, computer readable storage medium and electronic device", was created by Chen Shuai on 2020-07-15. Abstract: The invention provides a voice processing method and device, a computer-readable storage medium and an electronic device. The voice processing method includes: acquiring first voice information sent by a first terminal, and superimposing the first voice information with one or more preset target emotion information to generate one or more second voice information; and sending the first voice information or the second voice information to a second terminal. The embodiments of the invention solve the problem in the related art that the voice sent by a user cannot express the emotion the user intends, which leads to a poor user experience, so that the voice sent by the user can truly express the intended emotion, thereby improving the user experience of the voice function.

1. A speech processing method applied to a server, the method comprising:

acquiring first voice information sent by a first terminal, and superimposing the first voice information with one or more preset target emotion information to generate one or more second voice information;

and sending the first voice information or the second voice information to a second terminal.

2. The method of claim 1, wherein the superimposing the first voice information with one or more preset target emotion information to generate one or more second voice information comprises:

determining original emotion information carried in the first voice information;

determining the one or more target emotion information corresponding to the original emotion information according to the original emotion information;

superimposing the one or more target emotion information with the first voice information to generate the one or more second voice information.

3. The method of claim 2, wherein the determining original emotion information carried in the first voice information comprises:

determining the original emotion information carried in the first voice information according to the first voice information and a preset neural network model;

the neural network model is obtained by training according to sample voice information and sample emotion information carried in the sample voice information.

4. The method according to claim 3, wherein the determining the original emotion information carried in the first voice information according to the first voice information and a preset neural network model comprises:

determining one or more pieces of emotion information to be confirmed according to the first voice information and the neural network model; wherein the one or more pieces of emotion information to be confirmed are emotion information output by the neural network model;

sending identifications of the one or more pieces of emotion information to be confirmed to the first terminal for confirmation by the first terminal;

determining the original emotion information among the one or more pieces of emotion information to be confirmed according to first confirmation information returned by the first terminal; wherein the first confirmation information is used for indicating the identification of the emotion information confirmed by the first terminal.

5. The method of any one of claims 1 to 4, wherein the one or more target emotion information is one or more emotion information pre-associated with the original emotion information.

6. The method according to any one of claims 1 to 4, wherein the superimposing the first voice information with one or more preset target emotion information to generate one or more second voice information comprises:

and replacing the acoustic features of the first voice information with the acoustic features corresponding to the one or more target emotion information respectively to generate one or more second voice information.

7. The method of any of claims 1 to 4, wherein the sending the first voice information or the second voice information to a second terminal comprises:

sending a first identifier corresponding to the first voice information and one or more second identifiers respectively corresponding to the one or more second voice information to the first terminal for confirmation by the first terminal;

receiving second confirmation information returned by the first terminal, and sending the first voice information or the second voice information to the second terminal according to the second confirmation information; wherein the second confirmation information is used for indicating the first identifier or a second identifier.

8. A speech processing method applied to a first terminal, the method comprising:

sending first voice information to a server so that the server can send the first voice information or second voice information to a second terminal;

wherein the second voice information is one or more pieces of voice information generated by the server by superimposing the first voice information with one or more preset target emotion information.

9. The method according to claim 8, wherein the one or more target emotion information is determined by the server according to original emotion information carried in the first voice information;

and the original emotion information is determined by the server according to the first voice information.

10. The method according to claim 9, wherein the original emotion information is determined by the server according to the first voice information and a preset neural network model;

the neural network model is obtained by training according to sample voice information and sample emotion information carried in the sample voice information.

11. The method according to claim 10, wherein the determining the original emotion information carried in the first voice information according to the first voice information and the preset neural network model comprises:

receiving, from the server, identifications of one or more pieces of emotion information to be confirmed; wherein the server determines the one or more pieces of emotion information to be confirmed according to the first voice information and the neural network model, and the one or more pieces of emotion information to be confirmed are emotion information output by the neural network model;

returning first confirmation information to the server; wherein the first confirmation information is used for indicating the identification of the confirmed emotion information.

12. The method according to any one of claims 8 to 11, wherein the one or more target emotion information is one or more emotion information pre-associated with the original emotion information.

13. The method according to any one of claims 8 to 11, wherein the second voice information is generated by the server by replacing the acoustic features of the first voice information with the acoustic features corresponding to the one or more target emotion information, respectively.

14. The method of any of claims 8 to 11, wherein after sending the first voice information to the server, the method further comprises:

receiving, for confirmation, a first identifier corresponding to the first voice information and one or more second identifiers respectively corresponding to the one or more second voice information sent by the server;

returning second confirmation information to the server for the server to send the first voice information or the second voice information to a second terminal according to the second confirmation information; wherein the second confirmation information is used for indicating the first identifier or a second identifier.

15. A speech processing apparatus provided in a server, the apparatus comprising:

the generating module is used for acquiring first voice information sent by a first terminal and superimposing the first voice information with one or more preset target emotion information to generate one or more second voice information;

and the first sending module is used for sending the first voice information or the second voice information to a second terminal.

16. A speech processing apparatus, provided in a first terminal, the apparatus comprising:

the second sending module is used for sending the first voice information to a server so that the server can send the first voice information or the second voice information to a second terminal;

wherein the second voice information is one or more pieces of voice information generated by the server by superimposing the first voice information with one or more preset target emotion information.

17. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform, when executed, the method of any one of claims 1 to 7 or 8 to 14.

18. An electronic device comprising a memory and a processor, wherein the memory has a computer program stored therein, and the processor is configured to execute the computer program to perform the method of any one of claims 1 to 7 or 8 to 14.

Technical Field

The invention relates to the field of smart home, in particular to a voice processing method and device, a computer-readable storage medium and an electronic device.

Background

Voice messaging and voice calls have become common functions of communication tools in the related art. At present, the voice function in a typical communication tool sends the sender's voice directly to the receiver with its original sound, without any rendering. However, in some scenarios, the voice sent by the sending user cannot reflect the emotion that the user wishes to express, due to psychological or wording factors, and the receiving user is prone to misunderstand the emotion that the sending user actually feels or wishes to express. In this regard, some communication tools in the related art provide a "voice change" function, but this function can only change certain characteristics of the sender's voice to some extent, for example, modifying the sound into that of an "old man"; it still cannot change the emotion expressed by the voice.

No effective solution has yet been proposed in the related art for the problem that the voice sent by a user cannot express the emotion the user intends, which leads to a poor user experience.

Disclosure of Invention

The embodiments of the invention provide a voice processing method and device, a computer-readable storage medium and an electronic device, so as to at least solve the problem in the related art that the voice sent by a user cannot express the emotion the user intends, which leads to a poor user experience.

According to an embodiment of the present invention, there is provided a speech processing method applied to a server, the method including:

acquiring first voice information sent by a first terminal, and superimposing the first voice information with one or more preset target emotion information to generate one or more second voice information;

and sending the first voice information or the second voice information to a second terminal.
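For illustration only, the following minimal sketch (in Python) shows the shape of this server-side flow; the data structures and function names here are assumptions made for the example and are not prescribed by the embodiments.

# Hypothetical sketch of the server-side flow: one second voice
# information is generated per preset target emotion information.
from dataclasses import dataclass

@dataclass
class EmotionInfo:
    """Preset target emotion information: an identifier plus whatever
    rendering parameters the superposition step needs (assumed here)."""
    emotion_id: str
    pitch_shift: float
    speed: float

def superimpose(first_voice: bytes, emotion: EmotionInfo) -> bytes:
    """Placeholder for the signal-processing step that renders the
    target emotion onto the original voice (one concrete variant,
    acoustic-feature replacement, is described later)."""
    return first_voice  # a real implementation would transform the audio

def generate_second_voices(first_voice: bytes,
                           targets: list[EmotionInfo]) -> dict[str, bytes]:
    """Generate one second voice information per target emotion."""
    return {e.emotion_id: superimpose(first_voice, e) for e in targets}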

In an optional embodiment, the superimposing the first voice information with one or more preset target emotion information to generate one or more second voice information includes:

determining original emotion information carried in the first voice information;

determining the one or more target emotion information corresponding to the original emotion information according to the original emotion information;

superimposing the one or more target emotion information with the first voice information to generate the one or more second voice information.
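As a purely illustrative example of the second step above, the pre-association between original emotions and target emotions could be as simple as a lookup table; the emotion labels below are invented for the sketch.

TARGET_EMOTIONS = {
    # hypothetical pre-associated mapping: original emotion -> target emotions
    "neutral": ["happy", "gentle"],
    "angry":   ["calm", "neutral"],
    "sad":     ["comforting"],
}

def targets_for(original_emotion: str) -> list[str]:
    """Return the one or more target emotions pre-associated with the
    original emotion carried in the first voice information."""
    return TARGET_EMOTIONS.get(original_emotion, ["neutral"])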

In an optional embodiment, the determining the original emotion information carried in the first speech information includes:

determining the original emotion information carried in the first voice information according to the first voice information and a preset neural network model;

the neural network model is obtained by training according to sample voice information and sample emotion information carried in the sample voice information.
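The embodiments do not fix a model architecture; the sketch below only illustrates the kind of classifier that could be trained on (sample voice feature, sample emotion label) pairs, here using PyTorch and pooled mel features as assumed inputs.

import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Toy emotion classifier; architecture and feature choice are
    assumptions, not taken from the embodiments."""
    def __init__(self, n_mels: int = 40, n_emotions: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 128),
            nn.ReLU(),
            nn.Linear(128, n_emotions),
        )

    def forward(self, mel_features: torch.Tensor) -> torch.Tensor:
        # mel_features: (batch, n_mels), pooled over time for simplicity
        return self.net(mel_features)  # emotion logits

# Training would minimise cross-entropy between the model's output and
# the sample emotion information carried in the sample voice information:
#   loss = nn.CrossEntropyLoss()(model(features), emotion_labels)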

In an optional embodiment, the determining, according to the first voice information and a preset neural network model, the original emotion information carried in the first voice information includes:

determining one or more pieces of emotion information to be confirmed according to the first voice information and the neural network model; wherein the one or more pieces of emotion information to be confirmed are emotion information output by the neural network model;

sending identifications of the one or more pieces of emotion information to be confirmed to the first terminal for confirmation by the first terminal;

determining the original emotion information among the one or more pieces of emotion information to be confirmed according to first confirmation information returned by the first terminal; wherein the first confirmation information is used for indicating the identification of the emotion information confirmed by the first terminal.
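A minimal sketch of this confirmation round trip, with the transport layer omitted and all names hypothetical, might look as follows.

def candidates_from_model(scores: dict[str, float], top_k: int = 3) -> list[str]:
    """Identifications of the emotion information to be confirmed, taken
    from the neural network's highest-scoring outputs."""
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def resolve_original_emotion(candidates: list[str], confirmed_id: str) -> str:
    """Apply the first confirmation information returned by the terminal."""
    if confirmed_id not in candidates:
        raise ValueError(f"{confirmed_id!r} was not among the candidates sent")
    return confirmed_id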

In an optional embodiment, the one or more target emotion information is one or more emotion information pre-associated with the original emotion information.

In an optional embodiment, the superimposing the first voice information with one or more preset target emotion information to generate one or more second voice information includes:

and replacing the acoustic features of the first voice information with the acoustic features corresponding to the one or more target emotion information respectively to generate one or more second voice information.
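One plausible reading of "replacing the acoustic features" is re-rendering prosodic features such as pitch, tempo and energy with values associated with the target emotion. The sketch below uses librosa for that; the per-emotion parameter values are invented for illustration.

import librosa
import numpy as np

EMOTION_ACOUSTICS = {
    # hypothetical acoustic presets per target emotion
    "happy": {"n_steps": 2.0, "rate": 1.10, "gain": 1.2},
    "calm":  {"n_steps": -1.0, "rate": 0.95, "gain": 0.9},
}

def render_emotion(y: np.ndarray, sr: int, emotion: str) -> np.ndarray:
    """Replace the pitch, tempo and energy of the first voice information
    with the values corresponding to the target emotion."""
    p = EMOTION_ACOUSTICS[emotion]
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=p["n_steps"])
    y = librosa.effects.time_stretch(y, rate=p["rate"])
    return np.clip(y * p["gain"], -1.0, 1.0)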

In an optional embodiment, the sending the first voice information or the second voice information to the second terminal includes:

sending a first identifier corresponding to the first voice information and one or more second identifiers respectively corresponding to the one or more second voice information to the first terminal for confirmation by the first terminal;

receiving second confirmation information returned by the first terminal, and sending the first voice information or the second voice information to the second terminal according to the second confirmation information; wherein the second confirmation information is used for indicating the first identifier or a second identifier.
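The dispatch step then reduces to a lookup keyed by the identifier carried in the second confirmation information, as in this illustrative fragment (names hypothetical).

def select_voice_to_send(first_id: str,
                         first_voice: bytes,
                         second_voices: dict[str, bytes],
                         confirmed_id: str) -> bytes:
    """Return the original voice if the first identifier was confirmed,
    otherwise the rendered variant for the confirmed second identifier."""
    if confirmed_id == first_id:
        return first_voice
    return second_voices[confirmed_id]  # KeyError if the identifier is unknown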

According to another embodiment of the present invention, there is also provided a speech processing method applied to a first terminal, the method including:

sending first voice information to a server so that the server can send the first voice information or second voice information to a second terminal;

wherein the second voice information is one or more pieces of voice information generated by the server by superimposing the first voice information with one or more preset target emotion information.

In an optional embodiment, the one or more target emotion information is determined by the server according to original emotion information carried in the first voice information;

and the original emotion information is determined by the server according to the first voice information.

In an optional embodiment, the original emotion information is determined by the server according to the first voice information and a preset neural network model;

the neural network model is obtained by training according to sample voice information and sample emotion information carried in the sample voice information.

In an optional embodiment, the determining, according to the first voice information and the preset neural network model, the original emotion information carried in the first voice information includes:

receiving, from the server, identifications of one or more pieces of emotion information to be confirmed; wherein the server determines the one or more pieces of emotion information to be confirmed according to the first voice information and the neural network model, and the one or more pieces of emotion information to be confirmed are emotion information output by the neural network model;

returning first confirmation information to the server; wherein the first confirmation information is used for indicating the identification of the confirmed emotion information.

In an optional embodiment, the one or more target emotion information is one or more emotion information pre-associated with the original emotion information.

In an optional embodiment, the second voice information is generated by the server by replacing the acoustic features of the first voice information with the acoustic features corresponding to the one or more target emotion information, respectively.

In an optional embodiment, after sending the first voice information to the server, the method further includes:

receiving, for confirmation, a first identifier corresponding to the first voice information and one or more second identifiers respectively corresponding to the one or more second voice information sent by the server;

returning second confirmation information to the server for the server to send the first voice information or the second voice information to a second terminal according to the second confirmation information; wherein the second confirmation information is used for indicating the first identifier or a second identifier.
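On the first terminal's side, the exchange described above can be sketched as below; the client-side API (upload_voice, send_confirmation) is invented for illustration, and the user-facing selection step is abstracted into a callback.

from typing import Callable, Protocol

class VoiceServer(Protocol):
    """Hypothetical client-side view of the server."""
    def upload_voice(self, voice: bytes) -> dict: ...
    def send_confirmation(self, identifier: str) -> None: ...

def send_with_confirmation(server: VoiceServer,
                           first_voice: bytes,
                           pick: Callable[[dict], str]) -> None:
    """Upload the first voice information, then confirm which version
    (original or emotion-rendered) the server should forward."""
    ids = server.upload_voice(first_voice)  # first identifier + second identifiers
    chosen = pick(ids)                      # e.g. after playing previews to the user
    server.send_confirmation(chosen)        # second confirmation information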

According to another embodiment of the present invention, there is also provided a speech processing apparatus provided in a server, the apparatus including:

the generating module is used for acquiring first voice information sent by a first terminal and superimposing the first voice information with one or more preset target emotion information to generate one or more second voice information;

and the first sending module is used for sending the first voice information or the second voice information to a second terminal.

According to another embodiment of the present invention, there is also provided a voice processing apparatus provided in a first terminal, the apparatus including:

the second sending module is used for sending the first voice information to a server so that the server can send the first voice information or the second voice information to a second terminal;

wherein the second voice information is one or more pieces of voice information generated by the server by superimposing the first voice information with one or more preset target emotion information.

According to another embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.

According to another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.

According to the embodiments of the invention, in the process of the first terminal sending voice information to the second terminal, the server can superimpose the first voice information, which serves as the original sound, with one or more preset target emotion information to generate one or more second voice information, and then send the first voice information or the second voice information to the second terminal. Therefore, the embodiments of the invention can solve the problem in the related art that the voice sent by a user cannot express the emotion the user intends, which leads to a poor user experience, so that the voice sent by the user can truly express the intended emotion, thereby improving the user experience of the voice function.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic diagram of a scenario of a speech processing method according to an embodiment of the present invention;

FIG. 2 is a functional diagram of a first terminal provided according to an embodiment of the present invention;

FIG. 3 is a flow chart (I) of a speech processing method according to an embodiment of the present invention;

FIG. 4 is a system architecture diagram of a speech processing method provided in accordance with an exemplary embodiment of the present invention;

FIG. 5 is a flow chart of a method of speech processing provided according to an exemplary embodiment of the invention;

FIG. 6 is a flow chart (II) of a speech processing method according to an embodiment of the present invention;

FIG. 7 is a block diagram (I) of a speech processing apparatus according to an embodiment of the present invention;

FIG. 8 is a block diagram (II) of a speech processing apparatus according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

To further describe the speech processing method and apparatus, the computer-readable storage medium, and the electronic device in the embodiments of the present invention, the following first describes an application scenario to which they relate.

FIG. 1 is a schematic view of a scenario of a voice processing method according to an embodiment of the present invention. As shown in FIG. 1, the application scenario of the voice processing method in the embodiment of the present invention includes a first terminal 100, a second terminal 200, and a server 300. A user of the first terminal 100 inputs voice information, which is sent through the server 300 to the second terminal 200 and output to the user of the second terminal 200.

The first terminal and the second terminal in the embodiment of the present invention may be any electronic device with a voice function, such as a mobile terminal, a tablet computer, a desktop/laptop/notebook computer, an ultra-mobile personal computer, a handheld computer, a netbook, a personal digital assistant, a wearable electronic device, or a virtual reality device, which is not limited in the embodiment of the present invention.

Taking a mobile terminal as an example of the first terminal, FIG. 2 is a functional schematic diagram of the first terminal provided according to an embodiment of the present invention. As shown in FIG. 2, the mobile terminal may include one or more processors 102 (only one is shown in FIG. 2; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. The mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. It will be understood by those skilled in the art that the structure shown in FIG. 2 is only illustrative and does not limit the structure of the mobile terminal. For example, the mobile terminal may include more or fewer components than shown in FIG. 2, or have a different configuration.

The memory 104 can be used for storing computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the voice processing method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In an example, the memory 104 may further include memory remotely located from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner, for example, to transmit voice information to a server.

The following describes the speech processing method and apparatus in the embodiments of the present invention.
