Robot, voice data processing method, device and storage medium

Document No.: 96702    Publication date: 2021-10-12

Description: This technical disclosure, "Robot, voice data processing method, device and storage medium", was created by 禤小兵 and 黄寅 on 2021-09-08. Its main content is as follows: The invention discloses a robot, a voice data processing method, a voice data processing device and a storage medium. A detection area is detected to determine whether a target object exists in the detection area; when the target object exists in the detection area, voice data collected by the robot is acquired; and voice endpoint detection is performed on the voice data to determine whether the target object is a target sound source. Through target object detection and voice endpoint detection, the invention locates the target sound source without requiring a specific wake-up audio input, improving the user experience of the product in voice interaction.

1. A robot comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein a sensor is provided on the robot; the processor, when executing the computer readable instructions, performs the steps of:

detecting a detection area to determine whether a target object is present in the detection area;

when the target object exists in the detection area, acquiring voice data collected by the robot;

performing voice endpoint detection on the voice data to determine whether the target object is a target sound source;

the detecting the detection area to determine whether the target object is present in the detection area comprises:

detecting the detection area by the sensor to acquire all feature information in the detection area;

determining a feature similarity between the feature information and preset target feature information, and comparing the feature similarity with a preset similarity threshold;

and when the feature similarity is greater than or equal to the preset similarity threshold, determining that the target object exists in the detection area.

2. The robot of claim 1, wherein the sensor comprises a lidar; and the detecting the detection area by the sensor to acquire all feature information in the detection area comprises:

detecting the detection area by the lidar to acquire all feature information in the detection area.

3. The robot of claim 2, wherein the detection area comprises a first detection area, the first detection area being an area within a preset scanning radius of the detection range of the lidar;

the processor, when executing the computer readable instructions, further performs the steps of:

after the feature similarity is compared with a preset similarity threshold, when the feature information is the feature information in the first detection area and the feature similarity is greater than or equal to the preset similarity threshold, determining that the target object exists in the first detection area.

4. The robot of claim 3, wherein the detection area further comprises a second detection area, the second detection area being the area of the detection range of the lidar other than the first detection area; and the processor, when executing the computer readable instructions, further performs the steps of:

after comparing the feature similarity with a preset similarity threshold, when the feature information is the feature information in the second detection area and the feature similarity is greater than or equal to the preset similarity threshold, performing feature tracking on the feature information to determine whether the feature information meets a preset tracking condition;

and when the feature information is determined to meet the preset tracking condition, determining that the target object exists in the second detection area.

5. The robot of claim 1, wherein the processor, when executing the computer readable instructions, further performs the steps of:

when the target object is determined to be the target sound source, performing sound reception processing on the target sound source;

wherein the performing sound reception processing on the target sound source comprises:

acquiring target position information of the target sound source, and determining a target driving path of the robot according to the target position information;

and when the robot drives along the target driving path and approaches the target sound source, performing sound reception processing on the target sound source.

6. The robot of claim 1, wherein said performing voice endpoint detection on said voice data to determine whether said target object is a target sound source comprises:

performing voice endpoint detection on the voice data to obtain a starting point and/or an ending point of the voice data;

and if the starting point and/or the ending point of the voice data is detected, determining that the target object is a target sound source.

7. The robot of claim 6, wherein the processor when executing the computer readable instructions further performs the steps of:

after the starting point and/or the ending point of the voice data is detected, acquiring image acquisition information of the target object;

performing lip movement feature recognition on the image acquisition information to obtain a lip movement feature recognition result corresponding to the target object;

and performing voice verification on the starting point and/or the ending point according to the lip movement feature recognition result to determine whether the target object is a target sound source.

8. The robot of claim 7, wherein the lip movement feature recognition result is a lip movement time point; and the performing voice verification on the starting point and/or the ending point according to the lip movement feature recognition result to determine whether the target object is a target sound source comprises:

determining whether the lip movement time point matches the time of the starting point and/or the ending point;

and if so, determining that the target object is the target sound source.

9. A voice data processing method, comprising:

detecting a detection area to determine whether a target object is present in the detection area;

when the target object exists in the detection area, acquiring voice data acquired by the robot;

performing voice endpoint detection on the voice data to determine whether the target object is a target sound source;

the detecting the detection area to determine whether the target object is present in the detection area comprises:

detecting the detection area by a sensor disposed on the robot to acquire all feature information in the detection area;

determining a feature similarity between the feature information and preset target feature information, and comparing the feature similarity with a preset similarity threshold;

and when the feature similarity is greater than or equal to the preset similarity threshold, determining that the target object exists in the detection area.

10. A voice data processing apparatus, comprising:

a target detection module configured to detect a detection area to determine whether a target object exists in the detection area;

a voice data acquisition module configured to acquire voice data collected by the robot when the target object exists in the detection area;

a voice endpoint detection module configured to perform voice endpoint detection on the voice data to determine whether the target object is a target sound source;

the detecting the detection area to determine whether the target object is present in the detection area comprises:

detecting the detection area by a sensor disposed on the robot to acquire all feature information in the detection area;

determining a feature similarity between the feature information and preset target feature information, and comparing the feature similarity with a preset similarity threshold;

and when the feature similarity is greater than or equal to the preset similarity threshold, determining that the target object exists in the detection area.

11. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the voice data processing method according to claim 9.

Technical Field

The present invention relates to the field of voice interaction technologies, and in particular, to a robot, a voice data processing method, a voice data processing device, and a storage medium.

Background

Voice interaction is widely applied in scenarios such as intelligent conferences and intelligent customer service. In such applications, voice signals often have to be collected in noisy environments, where environmental noise and indoor reverberation interfere heavily with the voice signals, so the collected voice signals are analyzed with low accuracy.

In the prior art, a microphone array can acquire a voice signal accurately, so collecting voice signals with a microphone array is widely practiced. However, the existing microphone-array voice collection approach has the following defect: sound source localization must be performed through a wake-up operation, which makes for a poor user experience in voice interaction.

Disclosure of Invention

Embodiments of the present invention provide a robot, a voice data processing method, a voice data processing apparatus, and a storage medium, to solve the prior-art problem that user experience in voice interaction is poor because sound source localization must be performed through a wake-up operation.

A robot comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the robot having a sensor thereon; the processor, when executing the computer readable instructions, performs the steps of:

detecting a detection area to determine whether a target object is present in the detection area;

when the target object exists in the detection area, acquiring voice data acquired by the robot;

performing voice endpoint detection on the voice data to determine whether the target object is a target sound source;

the detecting the detection area to determine whether the target object is present in the detection area comprises:

detecting the detection area by the sensor to acquire all feature information in the detection area;

determining a feature similarity between the feature information and preset target feature information, and comparing the feature similarity with a preset similarity threshold;

and when the feature similarity is greater than or equal to the preset similarity threshold, determining that the target object exists in the detection area.

A voice data processing method, comprising:

detecting a detection area to determine whether a target object is present in the detection area;

when the target object exists in the detection area, acquiring voice data acquired by the robot;

performing voice endpoint detection on the voice data to determine whether the target object is a target sound source;

the detecting the detection area to determine whether the target object is present in the detection area comprises:

detecting the detection area by a sensor disposed on the robot to acquire all feature information in the detection area;

determining a feature similarity between the feature information and preset target feature information, and comparing the feature similarity with a preset similarity threshold;

and when the feature similarity is greater than or equal to the preset similarity threshold, determining that the target object exists in the detection area.

A voice data processing apparatus, comprising:

a target detection module configured to detect a detection area to determine whether a target object exists in the detection area;

a voice data acquisition module configured to acquire voice data collected by the robot when the target object exists in the detection area;

a voice endpoint detection module configured to perform voice endpoint detection on the voice data to determine whether the target object is a target sound source;

the detecting the detection area to determine whether the target object is present in the detection area comprises:

detecting the detection area by a sensor disposed on the robot to acquire all feature information in the detection area;

determining a feature similarity between the feature information and preset target feature information, and comparing the feature similarity with a preset similarity threshold;

and when the feature similarity is greater than or equal to the preset similarity threshold, determining that the target object exists in the detection area.

A computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the above-described voice data processing method.

The robot, the voice data processing method, the voice data processing device and the storage medium determine whether a target object exists in the detection area and, only on the premise that it does, perform voice endpoint detection on the voice data to determine whether the target object is a target sound source. The target sound source is thus located through target object detection and voice endpoint detection, without requiring a specific wake-up audio input for sound source localization, which improves the user experience of the product in voice interaction.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without inventive labor.

FIG. 1 is a schematic view of a robot in accordance with an embodiment of the present invention;

FIG. 2 is a flowchart of a voice data processing method according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In one embodiment, a robot is provided, the internal structure of which may be as shown in fig. 1. The robot includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the robot provides computation and control capabilities. The memory of the robot includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the operating system and for the execution of the computer readable instructions in the readable storage medium. The database of the robot stores data used by the corresponding voice data processing method. The network interface of the robot communicates with an external terminal through a network connection. The computer readable instructions, when executed by the processor, implement a voice data processing method. The readable storage media provided by this embodiment include nonvolatile readable storage media and volatile readable storage media. Further, the robot may also include an input device and a display screen, where the input device receives signals, texts, and the like sent by other devices, and the display screen may be used to display voice data and the like.

In one embodiment, a robot is provided, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer readable instructions, as shown in fig. 2:

s10: detecting the detection zone to determine whether the target is present in the detection zone.

It can be understood that the detection area represents the detection range within which the robot detects the target object. The robot is provided with a sensor, which may include one or more of a laser radar, an RGBD camera, various other cameras, and the like. Taking the laser radar as an example, in one embodiment the laser radar may be disposed at the front side of the robot, so that its detection range (i.e., the detection area) is located in front of the robot; in another embodiment the laser radar may be disposed at the rear side of the robot, so that its detection range is located behind the robot. This embodiment therefore does not limit the position at which the laser radar is disposed on the robot, provided the laser radar can detect a target object in the detection area. In yet another embodiment, the detection area may also be detected by other detection methods, such as image detection, in addition to laser detection. The target object is preferably a target human body. Since the detection area may contain obstacles other than a human body (such as a table or a trash can), it is necessary to first determine whether a target object exists in the detection area, so as to avoid collecting noise emitted by such obstacles, and then, when the target object exists, to determine whether it is a target sound source, thereby improving the accuracy of sound collection from the sound source.

In an embodiment, the robot is provided with a sensor, and the step S10 of detecting the detection area to determine whether the target object exists in the detection area includes:

detecting the detection area by the sensor to acquire all feature information in the detection area;

In a specific embodiment, the sensor includes a laser radar, and the detecting the detection area by the sensor to acquire all feature information in the detection area includes:

detecting the detection area by the laser radar to acquire all feature information in the detection area.

As can be appreciated, the laser radar is a device disposed on the robot for acquiring feature information in the detection area. The feature information refers to the physical feature information of all obstacles (such as pedestrians or tables) in the detection area. For example, when the laser radar is disposed at a lower position on the robot (e.g., at the chassis), the feature information it collects may be human-leg features of a human body, table-leg features of a table, and so on; when the laser radar is disposed at an upper position (e.g., at the head of the robot), the feature information it collects may be upper-body feature information of a human body, appearance feature information of a table, and so on.

And determining the feature similarity between the feature information and preset target feature information, and comparing the feature similarity with a preset similarity threshold.

It is to be understood that the preset target feature information is preferably human body feature information, such as human leg feature information, body feature information, head feature information, and the like of a human body. In this embodiment, in order to improve the accuracy of determining the target object, the leg detection may be performed by a laser radar. The preset similarity threshold may be set according to specific requirements, for example, the preset similarity threshold may be set to 90%, 95%, or the like.

Specifically, after the detection area is detected by the laser radar to obtain all feature information in the detection area, the feature similarity between each piece of feature information and the preset target feature information is determined, and each feature similarity is compared with the preset similarity threshold. If the feature similarity corresponding to any piece of feature information is greater than or equal to the preset similarity threshold, it is determined that a target object exists in the detection area; if the feature similarities corresponding to all pieces of feature information are smaller than the preset similarity threshold, it is determined that no target object exists in the detection area.

And when the feature similarity is greater than or equal to the preset similarity threshold, determining that the target object exists in the detection area.

Specifically, after comparing the feature similarity with a preset similarity threshold, when any one feature similarity is greater than or equal to the preset similarity threshold, determining that a target object exists in the detection area; and if the feature similarity corresponding to all the feature information is smaller than a preset similarity threshold, determining that the target object does not exist in the detection area.
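The comparison described above can be sketched as follows. This is a minimal illustration only: the vector feature representation, the cosine similarity function, and the 0.9 threshold are assumptions made for the example, not details fixed by this embodiment.

```python
import math

def cosine_similarity(a, b):
    """A stand-in similarity score between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def target_exists(feature_list, target_feature, threshold=0.9):
    """Return True if any detected feature's similarity to the preset
    target feature reaches the preset similarity threshold."""
    return any(cosine_similarity(f, target_feature) >= threshold
               for f in feature_list)
```

As in the two branches above, a single feature clearing the threshold is enough to deem the target object present; only when every feature falls below it is the area deemed empty.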

In an embodiment, the detection area includes a first detection area, where the first detection area is an area located within a preset scanning radius in a detection range of the laser radar;

the processor, when executing the computer readable instructions, further performs the steps of:

after the feature similarity is compared with a preset similarity threshold, when the feature information is the feature information in the first detection area and the feature similarity is greater than or equal to the preset similarity threshold, determining that the target object exists in the first detection area.

It is understood that the first detection area refers to an area within a preset scanning radius of the detection range of the laser radar, and the first detection area may be smaller than the detection range of the laser radar, or the first detection area may be the same as the detection range of the laser radar. The preset scanning radius can be set according to the detection requirement.

Specifically, after the feature similarity is compared with a preset similarity threshold, if the feature information corresponding to the feature similarity is the feature information in the first detection area, and the feature similarity is greater than or equal to the preset similarity threshold, it may be determined that the target object exists in the first detection area. As can be appreciated, since the distance between the first detection area and the robot is short, when detecting whether the target object exists in the first detection area through the lidar, the accuracy of the lidar detection is high, and therefore when determining that the feature similarity is greater than or equal to the preset similarity threshold, it may be determined that the target object exists in the first detection area.
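The split between the first and second detection areas can be illustrated with a small sketch; the Euclidean distance metric and the concrete radius values used below are illustrative assumptions.

```python
import math

def classify_detection(point, robot_position, scan_radius, lidar_range):
    """Classify a detected point relative to the robot.

    "first"  -- within the preset scanning radius (close; trusted directly)
    "second" -- beyond the scanning radius but inside the laser radar's
                detection range (requires feature tracking to confirm)
    """
    distance = math.dist(point, robot_position)
    if distance <= scan_radius:
        return "first"
    if distance <= lidar_range:
        return "second"
    return "outside"
```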

In an embodiment, the detection area further comprises a second detection area; the second detection area is an area except the first detection area in the detection range of the laser radar; the processor, when executing the computer readable instructions, further performs the steps of:

after comparing the feature similarity with a preset similarity threshold, when the feature information is the feature information in the second detection area and the feature similarity is greater than or equal to the preset similarity threshold, performing feature tracking on the feature information to determine whether the feature information meets a preset tracking condition.

As pointed out above, the first detection area may be smaller than or equal to the detection range of the laser radar. When the first detection area is smaller than the detection range of the laser radar, the detection area also contains a second detection area, namely the area of the laser radar's detection range outside the first detection area.

It can be understood that when the feature information is in the second detection area and the feature similarity is greater than or equal to the preset similarity threshold, it cannot be directly concluded that the feature information belongs to a human body: two table legs, for example, may also resemble human legs. Therefore, feature tracking (e.g., human-leg tracking) is performed on the feature information to determine whether it meets a preset tracking condition, and only when it does is the target object determined to exist in the second detection area. The preset tracking condition includes a preset moving-speed condition and a preset motion track. The preset moving-speed condition checks whether the average moving speed corresponding to the feature information is smaller than a preset speed threshold, which may be determined by collecting the moving speeds of a number of pedestrians. The preset motion track checks whether the feature information exhibits a crossing motion track: if the feature information is human-leg feature information, the two legs should move in a crossing (alternating) pattern. This prevents obstacles such as table legs, whose feature similarity may exceed the preset similarity threshold, from being mistaken for human legs (table legs move in parallel), so feature tracking improves the accuracy of target object detection.
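The two tracking checks just described (a plausible average moving speed, plus a crossing rather than parallel motion track) can be sketched roughly as follows. The frame representation, the 2 m/s speed threshold, and the sign-change test for crossing are all assumptions made for illustration, not details given by this embodiment.

```python
import math

def meets_tracking_condition(track, dt, speed_threshold=2.0):
    """Check a tracked pair of leg candidates against the preset
    tracking condition described above.

    track -- list of (left_xy, right_xy) positions, one entry per frame
    dt    -- time between frames, in seconds
    """
    # Preset moving-speed condition: the average speed of the pair's
    # midpoint must stay below the preset speed threshold.
    mids = [((l[0] + r[0]) / 2, (l[1] + r[1]) / 2) for l, r in track]
    path = sum(math.dist(a, b) for a, b in zip(mids, mids[1:]))
    avg_speed = path / (dt * (len(mids) - 1))
    if avg_speed >= speed_threshold:
        return False
    # Preset motion track: walking legs alternate, so the signed
    # front/back offset between the two candidates should change sign;
    # parallel-moving table legs keep a constant-sign offset.
    offsets = [l[0] - r[0] for l, r in track]
    return any(a * b < 0 for a, b in zip(offsets, offsets[1:]))
```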

And determining that the target object exists in the second detection area when the characteristic information is determined to meet the preset tracking condition.

Specifically, feature tracking is performed on the feature information to determine whether it meets the preset tracking condition; if it does, it is determined that a target object exists in the second detection area; if it does not, it is determined that the feature information is not feature information of the target object.

Further, when the target object exists in the detection area, the position information of the target object is acquired. The position information of the target object is its current position information, which may be determined according to the feature information corresponding to the target object. When the feature information is human-leg feature information, the first position information of one leg and the second position information of the other leg may be acquired, and the average of the first and second position information may be used as the position information of the target object.
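Averaging the two leg positions, as described above, amounts to a midpoint computation; the (x, y) coordinate representation here is an assumption for illustration.

```python
def target_position(leg_a, leg_b):
    """Estimate the target object's position as the average (midpoint) of
    the first and second leg position information, each an (x, y) pair."""
    return ((leg_a[0] + leg_b[0]) / 2, (leg_a[1] + leg_b[1]) / 2)
```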

S20: when the target object exists in the detection area, acquiring voice data collected by the robot.

It is understood that the voice data is data collected by a voice collecting device disposed on the robot, and it may include environmental noise, human voice (which may or may not be the sound of the target object), and the like.

S30: performing voice endpoint detection on the voice data to determine whether the target object is a target sound source.

It is understood that the voice endpoint detection is used to detect whether a human voice exists in the voice data, i.e., to detect a start point and/or an end point of the human voice in the voice data.

In one embodiment, step S30 includes:

and carrying out voice endpoint detection on the voice data to obtain a starting point and/or an end point of the voice data.

Specifically, when the target object is determined to be present in the detection area and voice data collected by the robot has been acquired, voice endpoint detection is performed on the voice data to determine whether it contains the starting point and/or the ending point of a human voice. That is, as soon as a starting point and/or an ending point is detected in the voice data, the voice data can be determined to contain human voice. The starting point is the moment at which human voice begins to be collected in the voice data; the ending point is the moment at which, after voice has been collected, voice is no longer collected.

In this embodiment, voice endpoint detection is performed based on changes in the speech energy value. The voice data is first divided into segmentation units of fixed duration, for example 20 ms, each containing the same number of speech samples, and the energy value of the speech in each unit is then calculated. If the energy values of several consecutive units at the front of the voice data are below a preset energy threshold (which may be set as required), followed by several consecutive units whose energy values are greater than or equal to the threshold, the starting point of the voice data is taken to be where the energy value rises. Similarly, if the speech energy of several consecutive units is high and the energy of subsequent units then becomes small and stays small for a certain period, the ending point of the voice data is taken to be where the energy value falls.
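The energy-based endpoint detection just described can be sketched as follows. The frame length, energy threshold, and the number of consecutive frames required to confirm a transition (`min_run`) are illustrative assumptions, not values fixed by this embodiment.

```python
def frame_energies(samples, frame_len):
    """Split samples into fixed-length frames and compute each frame's energy."""
    return [sum(s * s for s in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def detect_endpoints(samples, frame_len, energy_threshold, min_run=3):
    """Energy-based voice endpoint detection.

    Returns (start_frame, end_frame) indices, either of which may be None.
    A starting point is confirmed after min_run consecutive frames at or
    above the threshold; an ending point after min_run consecutive frames
    below it.
    """
    energies = frame_energies(samples, frame_len)
    active = [e >= energy_threshold for e in energies]
    start = end = None
    run = 0
    for i, a in enumerate(active):
        if start is None:
            run = run + 1 if a else 0
            if run >= min_run:
                start = i - min_run + 1  # first frame of the loud run
                run = 0
        else:
            run = run + 1 if not a else 0
            if run >= min_run:
                end = i - min_run + 1  # first frame of the quiet run
                break
    return start, end
```

With 10-sample frames, a signal of 30 silent samples, 30 loud samples, and 30 silent samples yields a starting point at frame 3 and an ending point at frame 6.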

And if the starting point and/or the ending point of the voice data is detected, determining that the target object is a target sound source.

Specifically, after voice endpoint detection is performed on the voice data to obtain the starting point and/or the ending point of the voice data, if the starting point and/or the ending point is detected, the target object is determined to be a target sound source; if neither the starting point nor the ending point is detected, the target object is determined not to be the target sound source.

In this embodiment, whether a target object exists in the detection area is determined first, and voice endpoint detection is performed on the voice data only on the premise that a target object exists, so as to determine whether the target object is a target sound source. The target sound source is thus located through target object detection and voice endpoint detection, without requiring a specific wake-up audio input for sound source localization, which improves the user experience of the product in voice interaction.

In one embodiment, the processor, when executing the computer readable instructions, further performs the steps of:

and when the target object is determined to be the target sound source, performing sound receiving processing on the target sound source.

Specifically, after voice endpoint detection is performed on the voice data to determine whether the target object is a target sound source: if the target object is determined to be the target sound source, sound reception processing is performed on it, that is, the collected voice information of the target object is uploaded to a voice data processing device for voice analysis and the like; in this embodiment, the sound reception processing of the target sound source is preferably directional sound reception processing. If the target object is determined not to be the target sound source, no sound reception processing is performed on it.

In an embodiment, the sound receiving processing of the target sound source includes:

and acquiring target position information of the target sound source, and determining a target running path of the robot according to the target position information.

As can be understood, the target position information is the current position information of the target sound source and can be determined from the feature information corresponding to the target sound source. Further, when the feature information is leg feature information of a human body, first position information of one leg feature and second position information of the other leg feature may be acquired, and the average of the first position information and the second position information may be used as the target position information of the target sound source. The target driving path is a path from the robot's current position to a position close to the target position information of the target sound source.

And driving along the target driving path to approach the target sound source, and performing sound reception processing on the target sound source.

Specifically, after the target position information of the target sound source is acquired, the target driving path is determined according to the target position information; after the robot has driven along the target driving path and approached the target position information, sound reception of the target sound source is performed by a microphone-array sound receiving device arranged on the robot. If the target position information of the target object changes, the moving direction and moving track of the target object are determined by tracking it, the target driving path is updated accordingly, and the robot drives in a direction approaching the target object to perform sound reception processing on it.
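A minimal sketch of the two steps above, under assumptions not fixed by this embodiment: the target position is taken as the average of the two leg-feature positions, and the end of the target driving path is a point a short standoff distance in front of the target (coordinates in the robot's frame, standoff value illustrative):

```python
import math

def target_position(first_leg, second_leg):
    """Average of the first and second leg-feature position information."""
    (x1, y1), (x2, y2) = first_leg, second_leg
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def approach_point(robot_pos, target_pos, standoff=0.5):
    """End of the target driving path: `standoff` metres short of the target."""
    dx, dy = target_pos[0] - robot_pos[0], target_pos[1] - robot_pos[1]
    dist = math.hypot(dx, dy)
    if dist <= standoff:
        return robot_pos               # already close enough to receive sound
    k = (dist - standoff) / dist
    return (robot_pos[0] + k * dx, robot_pos[1] + k * dy)
```

When the target position information changes during tracking, `approach_point` would simply be re-evaluated with the new target position to update the path.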

In one embodiment, the processor, when executing the computer readable instructions, further performs the steps of:

and acquiring the image acquisition information of the target object after detecting the starting point and/or the end point of the voice data.

Understandably, the image acquisition information is obtained by an image acquisition device arranged on the robot; the image acquisition device may be, for example, a camera, a video camera or a scanner. Further, the image acquisition information may be a face image, a lip image or the like of the target object.

And carrying out lip motion characteristic identification on the image acquisition information to obtain a lip motion characteristic identification result corresponding to the target object.

Specifically, after the image acquisition information of the target object is acquired, lip movement feature recognition is performed on it, that is, it is detected whether the lips of the target object in the image acquisition information change (for example, from closed to open), such as whether the opening degree between the upper lip and the lower lip changes, so that a lip movement feature recognition result corresponding to the target object is obtained. The lip movement feature recognition result is a lip movement time point, namely the starting time point at which the lips of the target object go from closed to open and the ending time point at which they go from open back to closed. If no change in the lip features of the target object is detected, the lip movement time point is empty.
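The derivation of the lip movement time point can be sketched as follows, assuming a per-frame measurement of the opening degree between the upper and lower lip has already been extracted from the image acquisition information; the opening threshold and the frame timestamps are illustrative assumptions:

```python
def lip_movement_times(openings, timestamps, open_threshold=0.2):
    """Return (open_time, close_time): lips closed->open and open->closed.

    Returns None when no lip movement is detected (the lip movement time
    point is empty, as described above).
    """
    open_t = close_t = None
    was_open = False
    for t, o in zip(timestamps, openings):
        is_open = o >= open_threshold
        if is_open and not was_open and open_t is None:
            open_t = t            # starting time point: closed -> open
        if not is_open and was_open and open_t is not None:
            close_t = t           # ending time point: open -> closed
        was_open = is_open
    return (open_t, close_t) if open_t is not None else None
```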

And performing voice verification on the starting point and/or the ending point according to the lip movement feature identification result so as to determine whether the target object is a target sound source.

It can be understood that the voice verification in this embodiment determines whether the lip movement time point of the target object matches the starting point and/or the end point: if they match, the target object is determined to be the target sound source; if not, the target object is determined not to be the target sound source.

In one embodiment, the lip movement feature recognition result is a lip movement time point; the voice verification of the starting point and/or the ending point according to the lip movement feature recognition result to determine whether the target object is a target sound source includes:

determining whether the lip movement time point matches the time of the start point and/or the end point.

And if so, determining that the target object is the target sound source.

It is understood that, as described above, the lip movement feature recognition result is a lip movement time point, namely the starting time point of the lips going from closed to open and the ending time point of the lips going from open to closed; if no change in the lip features of the target object is detected, the lip movement time point is empty. It can therefore be determined whether the lip movement time point matches the starting point of the voice data; if so, the voice data is characterized as the voice data of the target object. If they do not match (for example, because of errors in the image acquisition information), it can be determined whether the lip movement time point matches the end point of the voice data; if so, the voice data is again characterized as that of the target object. Further, if the lip movement time point matches neither the time of the starting point nor the time of the end point, the voice data is characterized as not being that of the target object, or the background sound in the voice data (such as the sound of another object) is louder than the sound of the target object, so that the lip movement time point cannot be matched with the starting point and/or the end point.
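The matching logic above, tried first against the starting point and then against the end point, can be sketched as follows. The tolerance window is an assumption for illustration; the embodiment does not specify how closely the times must agree:

```python
def verify_target(lip_times, start, end, tolerance=0.3):
    """True when the lip movement time point matches the start and/or end.

    `lip_times` is (open_time, close_time) or None when the lip movement
    time point is empty; `tolerance` (seconds) is an assumed match window.
    """
    if lip_times is None:          # no lip movement: not the target source
        return False
    open_t, close_t = lip_times
    if start is not None and abs(open_t - start) <= tolerance:
        return True                # lip opening matches the starting point
    if end is not None and close_t is not None and abs(close_t - end) <= tolerance:
        return True                # falls back to matching the end point
    return False
```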

In this embodiment, whether the target object is actually speaking can be better determined by the lip movement feature recognition method; compared with performing endpoint detection analysis on the voice data alone, the combined use of lip movement feature recognition and voice endpoint detection improves the accuracy of voice data detection (for example, when a non-target object makes a sound while the lip movement features of the target object do not change, that is, the target object is characterized as not speaking, the target object is determined not to be the target sound source). Furthermore, using lip movement features for auxiliary judgment is more accurate than using other human body features (such as neck features or chin features), which provides an accurate basis for the subsequent sound reception processing of the target object and improves the recording quality and efficiency of directional sound reception.

In one embodiment, a method for processing voice data is provided, which includes the following steps:

s10: detecting a detection zone to determine whether a target is present in the detection zone;

s20: when the target object exists in the detection area, acquiring voice data acquired by the robot;

s30: performing voice endpoint detection on the voice data to determine whether the target object is a target sound source;

the detecting a detection area to determine whether a target is present in the detection area comprises:

detecting the detection area through a sensor arranged on the robot to acquire all characteristic information in the detection area;

determining the feature similarity between the feature information and preset target feature information, and comparing the feature similarity with a preset similarity threshold;

and when the characteristic similarity is greater than or equal to the preset similarity threshold value, determining that the target object exists in the detection area.

In this embodiment, it is determined whether a target object exists in the detection area, and voice endpoint detection is performed on the voice data only when a target object exists, so as to determine whether the target object is a target sound source. The target sound source is thus located by means of target object detection and voice endpoint detection, sound source localization no longer requires a specific wake-up audio input, and the user experience of the product in voice interaction is improved.

In an embodiment, the sensor comprises a lidar; the detecting the detection area through a sensor arranged on the robot to acquire all characteristic information in the detection area includes:

and detecting the detection area through the laser radar so as to acquire all characteristic information in the detection area.

In an embodiment, the detection area includes a first detection area, where the first detection area is an area located within a preset scanning radius in a detection range of the laser radar;

the voice data processing method further comprises:

after the feature similarity is compared with a preset similarity threshold, when the feature information is the feature information in the first detection area and the feature similarity is greater than or equal to the preset similarity threshold, determining that the target object exists in the first detection area.

In an embodiment, the detection area further comprises a second detection area; the second detection area is an area except the first detection area in the detection range of the laser radar;

the voice data processing method further comprises:

after comparing the feature similarity with a preset similarity threshold, when the feature information is the feature information in the second detection area and the feature similarity is greater than or equal to the preset similarity threshold, performing feature tracking on the feature information to determine whether the feature information meets a preset tracking condition;

and determining that the target object exists in the second detection area when the characteristic information is determined to meet the preset tracking condition.

In one embodiment, after performing voice endpoint detection on the voice data to determine whether the target object is a target sound source, the method further includes:

when the target object is determined to be the target sound source, performing sound receiving processing on the target sound source;

the sound receiving processing of the target sound source comprises:

acquiring target position information of the target sound source, and determining a target driving path of the robot according to the target position information;

and when the robot runs along the target running path and is close to the target sound source, performing sound reception processing on the target object.

In one embodiment, the performing voice endpoint detection on the voice data to determine whether the target object is a target sound source includes:

performing voice endpoint detection on the voice data to obtain a starting point and/or an ending point of the voice data;

and if the starting point and/or the end point of the voice data are/is detected, determining that the target object is a target sound source.

In one embodiment, the voice data processing method further includes:

after the starting point and/or the end point of the voice data are/is detected, acquiring image acquisition information of the target object;

performing lip movement feature recognition on the image acquisition information to obtain a lip movement feature recognition result corresponding to the target object;

and performing voice verification on the starting point and/or the ending point according to the lip movement feature identification result so as to determine whether the target object is a target sound source.

In one embodiment, the lip movement feature recognition result is a lip movement time point; the voice verification of the starting point and/or the ending point according to the lip movement feature recognition result to determine whether the target object is a target sound source includes:

determining whether the lip movement time point matches the time of the starting point and/or the ending point;

and if so, determining that the target object is the target sound source.

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present invention.

In one embodiment, there is provided a voice data processing apparatus including:

the target detection module is used for detecting a detection area so as to determine whether a target exists in the detection area;

the voice data acquisition module is used for acquiring voice data acquired by the robot when the target object exists in the detection area;

the voice endpoint detection module is used for carrying out voice endpoint detection on the voice data so as to determine whether the target object is a target sound source;

the detecting a detection area to determine whether a target is present in the detection area comprises:

detecting the detection area through a sensor arranged on the robot to acquire all characteristic information in the detection area;

determining the feature similarity between the feature information and preset target feature information, and comparing the feature similarity with a preset similarity threshold;

and when the characteristic similarity is greater than or equal to the preset similarity threshold value, determining that the target object exists in the detection area.

For specific limitations of the voice data processing apparatus, reference may be made to the above limitations of the voice data processing method, which are not repeated here. The modules in the above voice data processing apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the speech data processing method in the above-described embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated; in practical applications, the above functions may be distributed to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.
