Sound source separation method, device and equipment

Document No. 958602 · Publication date: 2020-10-30

Reading note: this technique, "Sound source separation method, device and equipment", was created by 尚光双, 孙凤宇 and 陈亮 on 2019-02-27. Summary: embodiments of this application provide a sound source separation method, apparatus and device. The method includes: acquiring a first audio signal and at least one image frame corresponding to the first audio signal, the at least one image frame including image information of a target sound source; acquiring time-frequency distribution information of the target sound source in the first audio signal according to the first audio signal and the at least one image frame; and then acquiring, according to the time-frequency distribution information, a second audio signal belonging to the target sound source from the first audio signal. In this way, the second audio signal belonging to the target sound source can be extracted from the first audio signal relatively accurately.

1. A sound source separation method, comprising:

acquiring a first audio signal;

acquiring at least one image frame corresponding to the first audio signal; the at least one image frame includes image information of a target sound source;

acquiring time-frequency distribution information of the target sound source in the first audio signal according to the first audio signal and the at least one image frame;

and acquiring a second audio signal belonging to the target sound source from the first audio signal according to the time-frequency distribution information.

2. The method of claim 1, wherein acquiring time-frequency distribution information of the target sound source in the first audio signal according to the first audio signal and the at least one image frame comprises:

acquiring a first audio feature of the first audio signal;

acquiring a first image frame from the at least one image frame;

identifying a feature region in the first image frame;

acquiring a first image feature from the feature region;

and processing the feature region, the first image feature and the first audio feature with a neural network to obtain the time-frequency distribution information.
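For illustration only, the fusion in this claim can be sketched as a small network that maps the feature region, the first image feature and the first audio feature to one probability per time-frequency unit. The toy numpy network below is an assumption for exposition; the layer sizes, random untrained weights and sigmoid output are not specified by the claims:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(feature_region, image_feature, audio_feature):
    """Toy fusion network: flatten and concatenate the three inputs,
    pass them through one hidden layer, and emit one probability per
    time-frequency unit of the audio feature (a sigmoid mask)."""
    n_units = audio_feature.size  # one output per time-frequency unit
    x = np.concatenate([feature_region.ravel(),
                        image_feature.ravel(),
                        audio_feature.ravel()])
    w1 = rng.normal(0.0, 0.1, (64, x.size))   # untrained weights, shapes only
    w2 = rng.normal(0.0, 0.1, (n_units, 64))
    h = np.tanh(w1 @ x)                       # hidden representation
    probs = 1.0 / (1.0 + np.exp(-(w2 @ h)))  # per-unit probability in (0, 1)
    return probs.reshape(audio_feature.shape)
```

In a real system the weights would be trained so that each output approximates the probability that the target source is active in that time-frequency unit.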

3. The method of claim 2, wherein the first image frame is any one of the at least one image frame, or the first image frame is a center image frame of the at least one image frame.

4. The method of claim 2 or 3, wherein acquiring a first image feature from the feature region comprises:

processing the feature region with an active appearance model (AAM) to obtain the first image feature.

5. The method of any of claims 2 to 4, wherein acquiring the first audio feature of the first audio signal comprises:

performing a time-frequency transform on the first audio signal to obtain the first audio feature.
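The claimed time-frequency transform is commonly realized as a short-time Fourier transform (STFT). A minimal sketch, assuming the magnitude spectrogram serves as the first audio feature (the claims fix neither the transform nor the feature):

```python
import numpy as np
from scipy.signal import stft

def first_audio_feature(signal, sample_rate=16000, frame_len=512):
    """Time-frequency transform of the first audio signal.

    Each element of the returned matrix corresponds to one
    time-frequency unit: rows are frequency bins, columns are frames.
    """
    freqs, times, spec = stft(signal, fs=sample_rate,
                              nperseg=frame_len, noverlap=frame_len // 2)
    return np.abs(spec)  # magnitude spectrogram as the audio feature

# Example: one second of a 440 Hz tone at 16 kHz
t = np.arange(16000) / 16000.0
feature = first_audio_feature(np.sin(2 * np.pi * 440 * t))
```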

6. The method of any of claims 1 to 5, wherein the time-frequency distribution information comprises a probability value for each time-frequency unit in the first audio signal, the probability value indicating the probability that an audio signal generated by the target sound source is present in that time-frequency unit;

and wherein acquiring a second audio signal belonging to the target sound source from the first audio signal according to the time-frequency distribution information comprises:

acquiring a first audio intensity value of each time-frequency unit in the first audio signal;

obtaining a second audio intensity value of each time-frequency unit according to the first audio intensity value of each time-frequency unit and the probability value corresponding to each time-frequency unit;

and obtaining the second audio signal according to the second audio intensity value of each time-frequency unit.
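The three steps above amount to soft time-frequency masking: each unit's second audio intensity value is its first intensity value weighted by the corresponding probability. A minimal sketch, assuming an element-wise product and reuse of the mixture's phase (neither detail is mandated by the claims):

```python
import numpy as np
from scipy.signal import stft, istft

def separate(first_signal, prob_mask, fs=16000, frame_len=512):
    """Recover the second audio signal from the first via a soft mask.

    prob_mask holds, for each time-frequency unit, the probability that
    the target sound source is active there (values in [0, 1]).
    """
    _, _, spec = stft(first_signal, fs=fs, nperseg=frame_len,
                      noverlap=frame_len // 2)
    # Second intensity value = first intensity value x probability;
    # the phase of the mixture is kept unchanged.
    masked = spec * prob_mask
    _, second_signal = istft(masked, fs=fs, nperseg=frame_len,
                             noverlap=frame_len // 2)
    return second_signal
```

As a sanity check, an all-ones mask (every unit certainly belongs to the target) returns the mixture essentially unchanged, since the STFT/inverse-STFT pair is a near-perfect reconstruction.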

7. The method of any one of claims 1 to 6, further comprising, after acquiring the second audio signal belonging to the target sound source from the first audio signal:

processing the second audio signal using a speech recognition model to obtain language text information carried in the second audio signal.

8. A sound source separation apparatus, comprising:

an audio acquisition module configured to acquire a first audio signal;

an image acquisition module configured to acquire at least one image frame corresponding to the first audio signal, the at least one image frame including image information of a target sound source;

and a joint processing module configured to acquire time-frequency distribution information of the target sound source in the first audio signal according to the first audio signal and the at least one image frame, and to acquire a second audio signal belonging to the target sound source from the first audio signal according to the time-frequency distribution information.

9. The apparatus of claim 8, wherein the joint processing module is specifically configured to: acquire a first audio feature of the first audio signal; acquire a first image frame from the at least one image frame; identify a feature region in the first image frame; acquire a first image feature from the feature region; and process the feature region, the first image feature and the first audio feature with a neural network to obtain the time-frequency distribution information.

10. The apparatus of claim 9, wherein the first image frame is any one of the at least one image frame, or the first image frame is a center image frame of the at least one image frame.

11. The apparatus of claim 9 or 10, wherein the joint processing module is specifically configured to process the feature region with an active appearance model (AAM) to obtain the first image feature.

12. The apparatus of any of claims 9 to 11, wherein the joint processing module is specifically configured to perform a time-frequency transform on the first audio signal to obtain the first audio feature.

13. The apparatus of any of claims 8 to 12, wherein the time-frequency distribution information comprises a probability value corresponding to each time-frequency unit in the first audio signal, the probability value indicating the probability that an audio signal generated by the target sound source is present in that time-frequency unit;

and wherein the joint processing module is specifically configured to: acquire a first audio intensity value of each time-frequency unit in the first audio signal; obtain a second audio intensity value of each time-frequency unit according to the first audio intensity value of that time-frequency unit and the probability value corresponding to that time-frequency unit; and obtain the second audio signal according to the second audio intensity values of the time-frequency units.

14. The apparatus of any of claims 8 to 13, further comprising a speech recognition module;

the speech recognition module being configured to process the second audio signal with a speech recognition model to obtain language text information carried in the second audio signal.

15. A sound source separation device, comprising a processor and a memory;

the memory is configured to store program instructions;

the processor is configured to execute the program instructions to cause the device to perform the method of any of claims 1 to 7.

16. Sound source separation equipment, comprising the sound source separation device of claim 15, and an audio collector and/or a video collector;

the audio collector is configured to collect the first audio signal;

the video collector is configured to collect a first video signal carrying the at least one image frame.

17. The equipment of claim 16, further comprising a speaker;

the speaker is configured to convert the second audio signal into audible sound.

18. The equipment of claim 16 or 17, further comprising a display;

the display is configured to display text information recognized from the second audio signal.

19. The equipment of any of claims 16 to 18, further comprising a transceiver;

the transceiver is configured to receive the first audio signal and/or the first video signal, and/or to transmit the second audio signal and/or text information recognized from the second audio signal.
