Multi-person conversation ordering method and system based on visual-auditory fusion


Note: This technology, "a multi-person conversation ordering method and system based on visual-auditory fusion" (一种基于视听觉融合的多人对话点餐方法及系统), was created on 2021-06-10 by 王坤朋, 卢文静, 姚娟, 刘得榜, 李文娜, 蔡景祥, 刘鹏, 张江梅 and 冯兴华. Abstract: The invention discloses a multi-person conversation ordering method based on visual-auditory fusion. A video of the multi-person conversation is collected continuously, and the face images in the video are processed to obtain a mouth image of each ordering person. Combined with the mouth images, the mixed voice is separated into a plurality of first voice segments. Each first voice segment is matched to an ordering person to obtain second voice segments with confirmed identity, so that the voice segments of the same ordering person are grouped together. A second voice segment is recognized and processed only when it contains a restaurant dish name; the ordering information in it is extracted and, after confirmation by the ordering person, transmitted to the kitchen to complete the order. The method supports long-duration ordering recognition for multiple persons, separates mixed voice more accurately, improves the stability of voice separation, and protects the privacy of ordering persons during ordering-information recognition.

1. A multi-person conversation ordering method based on visual-auditory fusion is characterized by comprising the following steps:

S1, data acquisition: continuously acquiring a conversation video containing mixed voice and face images of a plurality of ordering persons; extracting a mouth image of each ordering person from the face image of each ordering person;

S2, voice separation: separating the mixed voice in combination with the mouth images of the plurality of ordering persons to obtain a plurality of first voice segments; then identifying the first voice segments corresponding to the same ordering person and performing identity matching to obtain second voice segments with confirmed identity;

S3, keyword recognition: after feature extraction, inputting the second voice segment into a speech recognition network comprising an acoustic model and a language model sample library of restaurant dish names and ordering keywords for keyword recognition; if the second voice segment includes a dish name keyword, converting the ordering information keywords extracted from the second voice segment into text information; if it does not include a dish name keyword, the second voice segment is chat voice of the ordering person and is not processed;

S4, decision response: comparing the ordering information in the text information against a knowledge base; after the ordering information is confirmed, having the ordering person confirm it again; transmitting the confirmed ordering information to the kitchen, and converting it into voice output to finish the order;

repeating the steps S1-S4 until a plurality of ordering persons finish ordering;

the knowledge base comprises the ordering keywords; the ordering keywords comprise an ordering start keyword and an ordering end keyword; when the text information comprises both the ordering start keyword and the ordering end keyword, the ordering person has finished ordering; otherwise, ordering is not finished, and ordering information of the ordering person continues to be received.

2. The method according to claim 1, wherein step S1 includes:

S11, down-sampling the dialogue video;

S12, passing the down-sampled dialogue video through a pre-trained face detection model and face classifier to obtain a face image of each ordering person;

S13, acquiring a mouth image of each ordering person from the face image of each ordering person by using a pre-trained mouth detection model.

3. The method according to claim 1, wherein step S2 comprises:

S21, processing the mixed voice and the mouth images with a voice encoder and an image encoder respectively to obtain mixed voice features and mouth image features;

S22, inputting the mixed voice features and the mouth image features into a pre-trained fusion network to fuse the audio-visual feature sequences, obtaining a fused feature sequence;

S23, inputting the fused feature sequence and the mixed voice into a pre-trained separation network to separate the voice segments in the mixed voice, obtaining the first voice segments;

S24, matching the first voice segments with the identities of the ordering persons to obtain the second voice segments.

4. The method according to claim 3, wherein step S24 includes:

S241, extracting acoustic features of the first voice segment;

S242, calculating the similarity between the acoustic features of the first voice segment and the different acoustic features in the prior feature set;

S243, judging, by a decision logic, the relationship between a threshold value and the maximum similarity of the acoustic features of the first voice segment to the different acoustic features in the prior feature set, and determining whether the ordering person corresponding to the first voice segment is an existing ordering person in the prior feature set or a new ordering person, to obtain the second voice segment;

the prior feature set is initially empty; as new ordering persons appear, their acoustic features are added to the prior feature set.

5. The method according to claim 4, wherein in step S241, the acoustic features of the first voice segment are extracted by the MFCC feature extraction method; the acoustic feature $C_j$ of ordering person j is:

$$C_j(n) = \sqrt{\frac{2}{M}} \sum_{m=1}^{M} \log\big(y(m)\big)\, \cos\!\left(\frac{\pi n\,(m - 0.5)}{M}\right)$$

where n denotes the order of the cepstral coefficient, m indexes the m-th filter channel of the triangular filter bank, y(m) denotes the output of the m-th triangular band-pass filter, and M denotes the total number of filter channels.

6. The method according to claim 5, wherein in step S242, the normalized Euclidean distance is used to calculate the similarity between the acoustic feature $C_i$ of the i-th first voice segment and the j-th acoustic feature $C_j$ in the prior feature set:

$$\mathrm{dist}(C_i, C_j) = \sqrt{\sum_{u=1}^{U} \frac{(C_{iu} - C_{ju})^2}{s_u^2}}$$

where $C_{iu}$ denotes the u-th dimension of the feature vector of the i-th voice segment, $C_{ju}$ denotes the u-th dimension of the prior feature of the j-th ordering person, and $s_u^2$ denotes the variance of the i-th voice segment feature and the j-th prior feature in dimension u.

7. The method according to claim 6, wherein in step S243, the decision logic is:

$$S = \min_{1 \le j \le n} \mathrm{dist}(C_i, C_j)$$

where S denotes the minimum normalized Euclidean distance between the feature of the i-th first voice segment and the prior features $C_j$ of the different ordering persons in the prior feature set, j = 1, 2, ..., n, and n is the total number of acoustic features in the prior feature set; when S is greater than a set threshold θ, the i-th separated voice segment is considered to belong to a new speaker, and the acoustic feature of the i-th first voice segment is added to the prior feature set; when S is less than or equal to the threshold θ, the i-th voice segment is matched to the speaker with the identity j that attains the minimum distance, determining the identity of the ordering person of the i-th first voice segment.

8. A multi-person dialogue ordering system based on visual-auditory fusion is characterized by comprising:

the data acquisition module is used for continuously acquiring a conversation video comprising mixed voice and face images of a plurality of ordering people and processing the conversation video to obtain a mouth image of each ordering person;

the voice separation module is connected with the data acquisition module and is used for separating the mixed voice according to the mouth images and the mixed voice to obtain a plurality of first voice segments, and for matching each first voice segment with the corresponding ordering person to obtain second voice segments with confirmed ordering-person identity;

the keyword identification module is connected with the voice separation module and is used for performing keyword identification according to an ordering information sample library comprising restaurant dish names and ordering keywords, identifying whether the second voice segment includes a restaurant dish name keyword; if yes, converting the ordering keywords extracted from the second voice segment into text information; if not, outputting no text information;

the decision response module is connected with the keyword identification module and is used for comparing the text information output by the keyword identification module with a knowledge base comprising ordering keywords and judging whether the ordering person corresponding to the text information has finished ordering; if ordering is finished, synthesizing the text information into voice and playing it to the ordering person; otherwise, continuing to receive text information of the ordering person.

9. The system of claim 8, wherein the decision response module comprises a speaker for converting the ordering information in the text information into voice to be played to the ordering person.

10. An electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

Technical Field

The invention belongs to the technical field of ordering, and particularly relates to a multi-person conversation ordering method and system based on visual-auditory fusion.

Background

Artificial intelligence and human-computer interaction technologies are developing rapidly. Intelligent robots based on human-computer interaction play an important role in improving working efficiency, optimizing industrial structure, safeguarding social productivity, and improving people's quality of life; they are widely applied in service, education, medical treatment, scientific research and other fields, and effectively promote the development of high-tech industries. As an efficient mode of human-machine interaction, speech lets people obtain the services a robot provides more conveniently, and it is applied in multi-speaker scenarios such as restaurant service. However, current spoken dialogue systems in these scenarios mostly operate in a speech-only, single-modal, one-person-to-robot mode; when the speech is interfered with by multiple speakers in a complex environment, their performance can hardly meet the requirements of human-machine dialogue. Therefore, in a scenario where multiple persons converse with a robot under noise interference, constructing a method by which the multi-person conversation with the robot completes an ordering task, so that the speakers' voices can be separated stably and the voices of multiple persons can be tracked and recognized in a complex scene, is the key to accurate, efficient, real-time human-machine interactive ordering.

A dialogue system, an important application of human-machine interaction, is a system in which a human and a machine exchange information bidirectionally in a conversational manner. Dialogue systems began to emerge in the 1960s; most conversed from hand-crafted templates with limited freedom of dialogue, such as the ELIZA system developed by Weizenbaum at the Massachusetts Institute of Technology to simulate psychotherapy. In the 1980s and 1990s, the practical value of dialogue systems increased and they began to be commercialized, such as the flight ticketing system Pegasus developed by Zue et al. for aviation services. By the 21st century, computer performance had improved and the dialogue quality of dialogue systems rose markedly, such as the spoken dialogue system MUDIS designed for human-computer interaction by the Technical University of Munich, Germany, in 2008. In the last decade, with the further development of deep neural networks, intelligent dialogue systems based on deep learning have become popular, and many technology companies have introduced their own single-user intelligent dialogue products, such as Apple's voice assistant Siri for entertainment and conversation, Microsoft's chatbot XiaoIce and voice assistant Cortana, Google's Google Assistant, Baidu's voice assistant Xiaodu, and Amazon's voice assistant Alexa. However, current dialogue systems are generally applied to speech-only, single-user dialogue scenarios and lack the ability to stably separate multi-person mixed speech, so they cannot hold a dialogue with each person in multi-person conversational speech. Therefore, in a noisy multi-person conversation scene, how to make the dialogue system stably separate each person's voice from the mixed speech is the key to improving its multi-person dialogue capability.

Since the middle of the 20th century, researchers have explored speech separation in multi-speaker environments. After decades of development, speech separation technology has improved greatly; it has evolved from traditional models to deep models, its performance has risen substantially, and it has been applied in many aspects of daily life. However, most speech separation models are only suitable for dialogue scenes with weak environmental noise. When a speaker orders in a noisy, multi-speaker scene such as a restaurant, the stability with which a separation model separates the voices of multiple speakers is challenged, and the label permutation problem arises for voice frames separated over long time spans (when speech separation runs over a long span, separated voice segments are mistakenly matched to other target speakers). These problems have greatly limited the use of conversation robots in restaurant ordering scenarios.

Disclosure of Invention

The invention aims to solve the above problems by providing a multi-person conversation ordering method based on visual-auditory fusion. A conversation video of multiple persons ordering is collected continuously, the mouth image of each ordering person in the conversation video is extracted, the mixed voice in the conversation video is separated in combination with the mouth images, the identity of the ordering person is matched to each separated voice segment, and ordering keyword recognition is performed on each identity-matched voice segment; if the voice segment includes a dish name keyword, the extracted keywords are converted into text information; if not, no processing is performed. The output text information is compared with a knowledge base comprising an ordering start keyword and an ordering end keyword; if both keywords are present, ordering is finished, the ordering information is converted into voice output and confirmed again by the ordering person, and the confirmed ordering information is transmitted to the kitchen to complete the order.

The purpose of the invention is achieved by the following technical scheme. A multi-person conversation ordering method based on visual-auditory fusion comprises the following steps:

S1, data acquisition: continuously acquiring a conversation video containing mixed voice and face images of a plurality of ordering persons, and extracting a mouth image of each ordering person from the face image of each ordering person;

S2, voice separation: separating the mixed voice in combination with the mouth images of the plurality of ordering persons to obtain a plurality of first voice segments; then identifying the first voice segments corresponding to the same ordering person and performing identity matching to obtain second voice segments with confirmed identity;

S3, keyword recognition: after feature extraction, inputting the second voice segment into a speech recognition network comprising an acoustic model and a language model sample library of restaurant dish names and ordering keywords for keyword recognition; if the second voice segment includes a dish name keyword, converting the ordering information keywords extracted from the second voice segment into text information; if it does not include a dish name keyword, the second voice segment is chat voice of the ordering person and is not processed;

S4, decision response: comparing the ordering information in the text information against a knowledge base; after the ordering information is confirmed, having the ordering person confirm it again; transmitting the confirmed ordering information to the kitchen, and converting it into voice output to finish the order;

repeating the steps S1-S4 until the plurality of ordering persons finish ordering;

the knowledge base comprises the ordering keywords; the ordering keywords comprise an ordering start keyword and an ordering end keyword; when the text information comprises both the ordering start keyword and the ordering end keyword, the ordering person has finished ordering; otherwise, ordering is not finished, and ordering information of the ordering person continues to be received.

The method provided by the invention continuously collects the conversation video of multiple persons ordering and can receive ordering information from multiple ordering persons over a long period, improving the comfort and convenience of the ordering service. The mixed voice in the dialogue video is separated in combination with the mouth images, so each voice segment can be separated more accurately. Identity matching is performed on each first voice segment using the prior feature set, so that voices can be distinguished during long ordering sessions and voice segments of the same ordering person at different moments are grouped together, realizing long-duration ordering recognition. An ordering information sample library is constructed that comprises restaurant dish names and ordering keywords, the ordering keywords including ordering start keywords, ordering end keywords, and other ordering-related keywords. A second voice segment is processed only when it contains a dish name keyword; if it contains no dish name keyword, the segment is by default chat of the ordering person, unrelated to ordering, and is not processed, which guarantees the privacy of the ordering person while improving the accuracy of voice keyword recognition.

Preferably, step S1 includes:

S11, down-sampling the dialogue video;

S12, passing the down-sampled dialogue video through a pre-trained face detection model and face classifier to obtain a face image of each ordering person;

S13, acquiring a mouth image of each ordering person from the face image of each ordering person by using a pre-trained mouth detection model.

Preferably, step S2 includes:

S21, processing the mixed voice and the mouth images with a voice encoder and an image encoder respectively to obtain mixed voice features and mouth image features;

S22, inputting the mixed voice features and the mouth image features into a pre-trained fusion network to fuse the audio-visual feature sequences, obtaining a fused feature sequence;

S23, inputting the fused feature sequence and the mixed voice into a pre-trained separation network to separate the voice segments in the mixed voice, obtaining the first voice segments;

S24, matching the first voice segments with the identities of the ordering persons to obtain the second voice segments.

Performing voice separation after fusing the visual and auditory features, i.e., combining the mouth image features with the mixed voice features, allows each voice segment in the mixed voice to be separated more accurately.

Preferably, step S24 includes:

S241, extracting acoustic features of the first voice segment;

S242, calculating the similarity between the acoustic features of the first voice segment and the different acoustic features in the prior feature set;

S243, judging, by a decision logic, the relationship between a threshold value and the maximum similarity of the acoustic features of the first voice segment to the different acoustic features in the prior feature set, and determining whether the ordering person corresponding to the first voice segment is an existing ordering person in the prior feature set or a new ordering person, to obtain the second voice segment;

the prior feature set is initially empty; as new ordering persons appear, their acoustic features are added to the prior feature set.

Acoustic features are extracted from the first voice segments, and each first voice segment is identity-matched against the prior feature set formed by the acoustic features of the existing ordering persons; when a new ordering person appears, that person's acoustic features are added to the prior feature set for subsequent identity matching. In this way, multiple voice segments belonging to the same ordering person can be grouped accurately over a long ordering conversation, improving the reliability and stability of voice separation.

Preferably, in step S241, the acoustic features of the first voice segment are extracted by the MFCC feature extraction method; the acoustic feature $C_j$ of ordering person j is:

$$C_j(n) = \sqrt{\frac{2}{M}} \sum_{m=1}^{M} \log\big(y(m)\big)\, \cos\!\left(\frac{\pi n\,(m - 0.5)}{M}\right)$$

where n denotes the order of the cepstral coefficient, m indexes the m-th filter channel of the triangular filter bank, y(m) denotes the output of the m-th triangular band-pass filter, and M denotes the total number of filter channels.

Preferably, in step S242, the normalized Euclidean distance is used to calculate the similarity between the acoustic feature $C_i$ of the i-th first voice segment and the j-th acoustic feature $C_j$ in the prior feature set:

$$\mathrm{dist}(C_i, C_j) = \sqrt{\sum_{u=1}^{U} \frac{(C_{iu} - C_{ju})^2}{s_u^2}}$$

where $C_{iu}$ denotes the u-th dimension of the feature vector of the i-th voice segment, $C_{ju}$ denotes the u-th dimension of the prior feature of the j-th ordering person, and $s_u^2$ denotes the variance of the i-th voice segment feature and the j-th prior feature in dimension u.

The normalized Euclidean distance formula measures the similarity between the acoustic feature $C_i$ of the i-th first voice segment and the acoustic feature $C_j$ of the j-th ordering person: the smaller the value of $\mathrm{dist}(C_i, C_j)$, the greater the similarity between the two.

Preferably, in step S243, the decision logic is:

$$S = \min_{1 \le j \le n} \mathrm{dist}(C_i, C_j)$$

where S denotes the minimum normalized Euclidean distance between the feature of the i-th first voice segment and the prior features $C_j$ of the different ordering persons in the prior feature set, j = 1, 2, ..., n, and n is the total number of acoustic features in the prior feature set; when S is greater than a set threshold θ, the i-th separated voice segment is considered to belong to a new speaker, and the acoustic feature of the i-th first voice segment is added to the prior feature set; when S is less than or equal to the threshold θ, the i-th voice segment is matched to the speaker with the identity j that attains the minimum distance, determining the identity of the ordering person of the i-th first voice segment.

The invention also provides a multi-person dialogue ordering system based on visual and auditory fusion, which comprises:

the data acquisition module is used for continuously acquiring a conversation video comprising mixed voice and face images of a plurality of ordering people and processing the conversation video to obtain a mouth image of each ordering person;

the voice separation module is connected with the data acquisition module and is used for separating a plurality of first voice segments from the mixed voice according to the mouth images and the mixed voice, and for matching each first voice segment with the corresponding ordering person to obtain second voice segments with confirmed ordering-person identity;

the keyword identification module is connected with the voice separation module and is used for performing keyword identification according to an ordering information sample library comprising restaurant dish names and ordering keywords, identifying whether the second voice segment includes a restaurant dish name keyword; if yes, converting the ordering keywords extracted from the second voice segment into text information; if not, outputting no text information;

the decision response module is connected with the keyword identification module and is used for comparing the text information output by the keyword identification module with a knowledge base comprising ordering keywords and judging whether the ordering person corresponding to the text information has finished ordering; if ordering is finished, synthesizing the text information into voice and playing it to the ordering person; otherwise, continuing to receive text information of the ordering person.

Preferably, the decision response module further comprises a speaker, which is used for converting the ordering information in the text information into voice to be played to the ordering person.

The invention also provides an electronic device, which comprises at least one processor and a memory which is in communication connection with the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method.

The main scheme above and its further optional schemes can be freely combined to form multiple embodiments, all of which are adopted and claimed by the present invention; in the invention, the optional features (each non-conflicting selection) can be freely combined with one another. After understanding the scheme of the invention, those skilled in the art will appreciate from the prior art and common general knowledge that many such combinations exist, all of which are technical solutions to be protected by the invention; they are not exhaustively enumerated here.

The invention has the beneficial effects that:

1. The method provided by the invention collects the voice and image data of multiple ordering persons simultaneously and separates the mixed voice in combination with the image data, improving the stability of mixed-voice separation in noisy multi-speaker scenes such as restaurant ordering, improving on the traditional single-person ordering service, and increasing the comfort and convenience of the ordering service.

2. The invention combines a conventional audio-visual speech separation module with an ordering-person matching module: the voice segments in the mixed voice are first separated in combination with the mouth image features, and then the similarity between the acoustic features of each voice segment and each acoustic feature in the prior feature set is calculated, so that voice segments of the same ordering person can be grouped together accurately during long ordering sessions. This solves the label permutation problem that arises when a speaker's voice segments are separated over a long time span, improving the reliability and stability of voice separation.

3. The invention uses a speech recognition model comprising restaurant dish names to recognize each person's ordering information over long sessions, and a voice segment is processed only when its ordering information includes a dish name keyword. This protects the privacy of the ordering person and improves the real-time performance and accuracy of the multi-person voice ordering function.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a schematic flow chart of a method according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a data acquisition process according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of a speech separation process according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of a keyword recognition process according to an embodiment of the present invention.

Fig. 6 is a schematic diagram of a decision response flow according to an embodiment of the present invention.

Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following non-limiting examples serve to illustrate the invention.

Examples

Referring to fig. 1 and 2, a multi-person conversation ordering method based on audio-visual fusion specifically includes the following steps:

s1, data acquisition: continuously collecting a conversation video containing mixed voice and face images of a plurality of ordering people, and respectively extracting mouth images of the ordering people by using the face images of the ordering people;

s11, down-sampling the dialogue video;

s12, the down-sampled dialogue video is subjected to a face detection model and a face classifier which are trained in advance, and a face image of each ordering person is obtained;

and S13, acquiring mouth images of each ordering person by using a mouth detection model trained in advance according to the face images of each ordering person.

Referring to fig. 3, this embodiment has two ordering persons and continuously collects their conversation video. First, the multi-person conversation video stream is down-sampled to 25 fps; then a pre-trained face detection model performs face detection to obtain face image frames of the two ordering persons; next, a face classifier assigns the face image frames to the corresponding ordering person; finally, a pre-trained mouth detection model collects the mouth image frames of each ordering person, from which the mouth image features are extracted.
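As an illustration only, the following Python sketch mirrors this acquisition flow, with OpenCV Haar cascades standing in for the patent's pre-trained face and mouth detection models (the per-person face classifier is omitted); the function name `extract_mouth_rois` and all parameters are assumptions, not the patent's implementation.

```python
# Illustrative sketch of the S1 data-acquisition step: down-sample the video
# stream to ~25 fps, detect each ordering person's face, and crop a mouth
# region. OpenCV Haar cascades stand in for the patent's trained detectors.
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
mouth_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_smile.xml")  # stand-in mouth model

def extract_mouth_rois(video_path, target_fps=25):
    """Yield, per kept frame, a list of cropped mouth images (one per face)."""
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, round(src_fps / target_fps))   # temporal down-sampling
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            mouths = []
            for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.1, 5):
                # search the lower half of each face for the mouth region
                lower = gray[y + h // 2 : y + h, x : x + w]
                for (mx, my, mw, mh) in mouth_detector.detectMultiScale(lower, 1.5, 11):
                    mouths.append(lower[my : my + mh, mx : mx + mw])
                    break  # keep one mouth crop per face
            yield mouths
        idx += 1
    cap.release()
```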

S2, voice separation: separating the mixed voice in combination with the mouth images of the plurality of ordering persons to obtain a plurality of first voice segments; then identifying the first voice segments corresponding to the same ordering person and performing identity matching to obtain second voice segments with confirmed identity;

S21, processing the mixed voice and the mouth images with a voice encoder and an image encoder respectively to obtain mixed voice features and mouth image features;

S22, inputting the mixed voice features and the mouth image features into a pre-trained fusion network to fuse the audio-visual feature sequences, obtaining a fused feature sequence;

S23, inputting the fused feature sequence and the mixed voice into a pre-trained separation network to separate the voice segments in the mixed voice, obtaining the first voice segments;

S24, matching the first voice segments with the identities of the ordering persons to obtain the second voice segments;

S241, extracting acoustic features of the first voice segment;

S242, calculating the similarity between the acoustic features of the first voice segment and the different acoustic features in the prior feature set;

S243, judging, by a decision logic, the relationship between a threshold value and the maximum similarity of the acoustic features of the first voice segment to the different acoustic features in the prior feature set, and determining whether the ordering person corresponding to the first voice segment is an existing ordering person in the prior feature set or a new ordering person, to obtain the second voice segment;

the prior feature set is initially empty; as new ordering persons appear, their acoustic features are added to the prior feature set.

Referring to fig. 4, feature extraction is performed on the mouth images and the mixed voice using an image encoder and a voice encoder respectively; the extracted mouth image features and mixed voice features are fused by the fusion network; and the fused features are input into the separation network to separate the mixed voice, obtaining a plurality of first voice segments, i.e., the voice segments of the different ordering persons in the mixed voice.
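The patent does not disclose the internals of the encoders, fusion network, or separation network; the PyTorch sketch below shows one plausible mask-based arrangement of the same three stages, with all layer shapes and the spectrogram-mask output being assumptions.

```python
# Minimal PyTorch sketch of the S2 stage: a voice encoder and a mouth-image
# encoder feed a fusion network, and a separation head predicts one spectral
# mask per ordering person. All layer shapes are illustrative assumptions.
import torch
import torch.nn as nn

class AVSeparator(nn.Module):
    def __init__(self, n_freq=257, n_speakers=2, emb=256):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, emb), nn.ReLU())
        # mouth crops assumed flattened to 32x32 grayscale per speaker
        self.video_enc = nn.Sequential(nn.Linear(32 * 32, emb), nn.ReLU())
        self.fusion = nn.LSTM(emb * (1 + n_speakers), emb,
                              batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(2 * emb, n_speakers * n_freq)
        self.n_speakers, self.n_freq = n_speakers, n_freq

    def forward(self, mix_spec, mouth_frames):
        # mix_spec: (B, T, n_freq) magnitude spectrogram of the mixed voice
        # mouth_frames: (B, T, n_speakers, 32*32) mouth crop per speaker/frame
        a = self.audio_enc(mix_spec)                       # (B, T, emb)
        v = self.video_enc(mouth_frames).flatten(2)        # (B, T, S*emb)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))  # fused sequence
        masks = torch.sigmoid(self.mask_head(fused))
        masks = masks.view(mix_spec.size(0), mix_spec.size(1),
                           self.n_speakers, self.n_freq)
        # one masked spectrogram per speaker -> the "first voice segments"
        return masks * mix_spec.unsqueeze(2)
```

A real system would train such a network end-to-end on mixtures with known sources and reconstruct waveforms from the masked spectrograms.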

The separated first voice segments are then matched against the prior feature set, and the identity of the ordering person corresponding to each first voice segment is determined using the similarity and the decision logic. The prior feature set is initially empty; as new ordering persons appear, their acoustic features are added to the prior feature set. The added acoustic feature of ordering person j is obtained by the MFCC feature extraction method and denoted $C_j$:

$$C_j(n) = \sqrt{\frac{2}{M}} \sum_{m=1}^{M} \log\big(y(m)\big)\, \cos\!\left(\frac{\pi n\,(m - 0.5)}{M}\right)$$

where n denotes the order of the cepstral coefficient, m indexes the m-th filter channel of the triangular filter bank, y(m) denotes the output of the m-th triangular band-pass filter, and M denotes the total number of filter channels.
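A direct numpy rendering of this cepstral formula follows, assuming the triangular filter-bank outputs y(m) are already computed; `mfcc_from_filterbank` is an illustrative name.

```python
# Direct rendering of the cepstral formula above: y holds the positive
# outputs y(m) of the M triangular band-pass filters for one voice frame.
import numpy as np

def mfcc_from_filterbank(y, n_coeffs=13):
    """C(n) = sqrt(2/M) * sum_m log(y(m)) * cos(pi*n*(m-0.5)/M), n=1..n_coeffs."""
    M = len(y)
    m = np.arange(1, M + 1)
    log_y = np.log(y)
    return np.array([np.sqrt(2.0 / M)
                     * np.sum(log_y * np.cos(np.pi * n * (m - 0.5) / M))
                     for n in range(1, n_coeffs + 1)])
```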

Acoustic features of the i-th first voice segment are extracted by the MFCC method described above and denoted $C_i$. To match a prior feature $C_j$ in the prior feature set with the acoustic feature of the i-th first voice segment, the similarity of the feature vectors $C_j$ and $C_i$ is calculated. This embodiment uses the normalized Euclidean distance:

$$\mathrm{dist}(C_i, C_j) = \sqrt{\sum_{u=1}^{U} \frac{(C_{iu} - C_{ju})^2}{s_u^2}}$$

where $C_{iu}$ denotes the u-th dimension of the feature vector of the i-th voice segment, $C_{ju}$ denotes the u-th dimension of the j-th prior feature, and $s_u^2$ denotes the variance of the i-th voice segment feature and the j-th prior feature in dimension u.
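A numpy sketch of this distance is given below; the patent does not specify how the per-dimension variance $s_u^2$ is estimated, so here it is passed in as a parameter (for instance, computed over all features observed so far).

```python
# Normalized (standardized) Euclidean distance between segment feature C_i
# and prior feature C_j, per the formula above. How s2 (the per-dimension
# variance s_u^2) is estimated is an assumption, e.g. over all observed
# features, since the patent leaves it unspecified.
import numpy as np

def normalized_euclidean(ci, cj, s2):
    ci, cj, s2 = map(np.asarray, (ci, cj, s2))
    return float(np.sqrt(np.sum((ci - cj) ** 2 / s2)))
```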

The feature of the i-th first voice segment is compared for similarity with each prior feature in the prior feature set; from the maximum similarity (i.e., the minimum normalized Euclidean distance $\mathrm{dist}(C_i, C_j)$), the identity of the ordering person of the first voice segment can be determined. The decision logic is:

$$S = \min_{1 \le j \le n} \mathrm{dist}(C_i, C_j)$$

where S denotes the minimum normalized Euclidean distance between the feature of the i-th first voice segment and the prior features $C_j$ of the different ordering persons in the prior feature set, j = 1, 2, ..., n, and n is the total number of acoustic features in the prior feature set; when S is greater than a set threshold θ, the i-th separated voice segment is considered to belong to a new speaker, and the acoustic feature of the i-th first voice segment is added to the prior feature set; when S is less than or equal to the threshold θ, the i-th voice segment is matched to the speaker with the identity j that attains the minimum distance, determining the identity of the ordering person of the i-th first voice segment.

In other words, the Euclidean distances between the acoustic feature of the i-th first voice segment and the different prior features in the prior feature set are calculated; if the minimum normalized Euclidean distance is less than or equal to the threshold θ, the i-th first voice segment and the prior feature attaining that minimum belong to the same ordering person; if the minimum normalized Euclidean distance is greater than the threshold θ, the ordering person corresponding to the i-th first voice segment is a new ordering person with no matching acoustic feature in the prior feature set, and the acoustic feature of the new ordering person is added to the prior feature set.
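Putting the threshold rule together with the prior feature set gives a short matching routine, sketched below; it reuses `normalized_euclidean` from the sketch above, and the default threshold value is an assumption to be tuned.

```python
# Sketch of the S24 identity-matching decision logic: compare the i-th
# separated segment's feature with every prior feature; S > theta means a
# new ordering person (feature appended), S <= theta matches an existing one.
import numpy as np

def match_identity(ci, prior_features, s2, theta=3.0):
    """Return the ordering-person index for feature ci, updating prior_features."""
    if prior_features:
        dists = [normalized_euclidean(ci, cj, s2) for cj in prior_features]
        j = int(np.argmin(dists))
        if dists[j] <= theta:        # S <= theta: existing ordering person j
            return j
    prior_features.append(ci)        # S > theta (or empty set): new person
    return len(prior_features) - 1
```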

S3, keyword recognition: after feature extraction, the second voice segment is input into a speech recognition network comprising an acoustic model and a language model sample library of restaurant dish names and ordering keywords for keyword recognition; if the second voice segment includes a dish name keyword, the ordering information keywords extracted from the second voice segment are converted into text information; if it does not include a dish name keyword, the second voice segment is chat voice of the ordering person and is not processed.

referring to fig. 5, firstly, feature extraction is performed on the second voice segment matched with the person ordering food, and in this embodiment, pre-emphasis and framing are performed on the second voice segment; then, a corresponding frequency spectrum is obtained through FFT (fast Fourier transform), namely, the frequency spectrum is subjected to Mel filter bank to obtain Mel frequency spectrum, and a voice feature vector is obtained through DCT (discrete cosine transform).

The ordering voice keywords are recognized by a speech recognition network comprising an acoustic model and a language model sample library of restaurant dish names and ordering keywords; the recognized keywords are the restaurant dish names and/or ordering keywords contained in the sample library. If the keywords recognized in the second voice segment include a dish name keyword, the keywords extracted from the voice segment are converted into text format and output as text information; if no dish name keyword is included, no text information is output. The speech recognition network converts the input speech feature sequence into a word sequence using acoustic and linguistic information and outputs it in text format.

When the second voice segment includes a dish name keyword, the voice segment is processed; otherwise, the voice segment is by default chat of the ordering person and, to protect the ordering person's privacy, is not processed.
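This gating rule reduces to a membership test against the sample library, as in the sketch below; the keyword strings are invented placeholders, not the patent's library contents.

```python
# Sketch of the dish-name gating rule: a recognized transcript yields text
# output only when it contains a dish-name keyword from the sample library;
# otherwise the segment is treated as private chat and dropped.
DISH_NAMES = {"kung pao chicken", "mapo tofu"}   # placeholder sample library
ORDER_KEYWORDS = {"order", "that's all"}         # placeholder ordering keywords

def gate_transcript(transcript: str):
    """Return extracted keyword text, or None for chat segments."""
    words = transcript.lower()
    if not any(dish in words for dish in DISH_NAMES):
        return None                   # chat: not processed, privacy preserved
    hits = [k for k in DISH_NAMES | ORDER_KEYWORDS if k in words]
    return "; ".join(sorted(hits))
```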

S4, decision response: the ordering information in the text information is compared against the knowledge base; after the ordering information is confirmed, the ordering person confirms it again; the confirmed ordering information is transmitted to the kitchen and converted into voice output to finish the order.

Steps S1-S4 are repeated until the plurality of ordering persons finish ordering.

The knowledge base comprises the ordering keywords; the ordering keywords comprise an ordering start keyword and an ordering end keyword; when the text information comprises both the ordering start keyword and the ordering end keyword, the ordering person has finished ordering; otherwise, ordering is not finished, and ordering information of the ordering person continues to be received.

Referring to fig. 6, the knowledge base is used to confirm the ordering information in the text information; after the order is confirmed, the ordering person confirms the ordering information and the confirmed ordering information is transmitted to the kitchen. At the same time, the confirmed text information serves as the response text, which is synthesized into voice and output to the ordering person to finish the order.
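The completion test itself reduces to checking the accumulated text for both keyword classes, as sketched below with invented placeholder keywords.

```python
# Sketch of the S4 decision rule: an order is complete only when the
# accumulated text contains both an ordering-start and an ordering-end
# keyword from the knowledge base. Keyword strings are placeholders.
START_KEYWORDS = {"i would like", "i want"}
END_KEYWORDS = {"that's all", "that is all"}

def order_finished(text_info: str) -> bool:
    t = text_info.lower()
    return (any(k in t for k in START_KEYWORDS)
            and any(k in t for k in END_KEYWORDS))

# e.g. order_finished("I would like mapo tofu, that's all") -> True
```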

In conclusion, the method can separate the mixed voice of multiple ordering persons over long sessions, accurately match the separated voices to the corresponding ordering persons, match the identity of each voice segment of the same ordering person, and perform keyword recognition on each voice segment, processing a voice segment only when it includes a restaurant dish name, thereby protecting the privacy of the ordering persons. The method can thus take orders from multiple ordering persons in a multi-person conversation scene.

The embodiment also provides a multi-person dialogue ordering system based on visual and auditory fusion, which comprises:

the data acquisition module is used for continuously acquiring a conversation video comprising mixed voice and face images of a plurality of ordering people and processing the conversation video to obtain a mouth image of each ordering person;

the data acquisition module can continuously acquire the conversation videos of a plurality of ordering people, simultaneously acquire the mixed voice and the face images of the plurality of ordering people, and process the acquired face images to obtain the mouth images of each ordering people.

The voice separation module is connected with the data acquisition module and is used for separating a plurality of first voice segments from the mixed voice according to the mouth images and the mixed voice, and for matching each first voice segment with the corresponding ordering person to obtain second voice segments with confirmed ordering-person identity;

the keyword identification module is connected with the voice separation module and is used for performing keyword identification according to an ordering information sample library comprising restaurant dish names and ordering keywords, identifying whether the second voice segment includes a restaurant dish name keyword; if yes, converting the ordering keywords extracted from the second voice segment into text information; if not, outputting no text information;

the decision response module is connected with the keyword identification module and is used for comparing the text information output by the keyword identification module with the knowledge base containing the ordering keywords and confirming whether the text information contains both the ordering start keyword and the ordering end keyword; if so, ordering is finished; otherwise, the ordering person has not finished ordering and text information of the ordering person continues to be received; after ordering is confirmed to be finished, the confirmed text information is synthesized into voice and played to the ordering person.

The decision response module described in this embodiment includes a speaker for converting the ordering information into voice and playing it to the ordering person.

Referring to fig. 7, a schematic structural diagram of the electronic device provided by the invention, this embodiment discloses an electronic device comprising at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the preceding embodiments. An input/output interface may comprise a display, a keyboard, a mouse, and a USB interface for inputting and outputting data; a power supply supplies electric energy to the electronic device.

As will be understood by those skilled in the art, all or part of the steps of the above method embodiments may be implemented by hardware under the control of program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disk.

When an integrated unit of the present invention is implemented as a software functional unit and sold or used as an independent product, it may likewise be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the invention, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the invention. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.

The foregoing basic embodiments of the invention and their various further alternatives can be freely combined to form multiple embodiments, all of which are contemplated and claimed herein. In the scheme of the invention, each optional example can be combined at will with any other basic example and optional example. Such combinations are numerous and will be apparent to those skilled in the art; they are not exhaustively enumerated here.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
