Role identification method, device and system in conversation scene

Document No.: 1923546    Publication date: 2021-12-03

Reading note: This invention, "Role identification method, device and system in conversation scene", was designed and created by 曾然然, 杨杰, and 林悦 on 2020-05-29. One aspect of the invention relates to a method, an apparatus, and a system for role identification in conversation scenes. Specifically disclosed is a method for identifying roles in a conversation, comprising: collecting audio of the conversation; performing voice role separation on speakers based on speech features of the speakers in the audio; performing semantic role separation on the speakers based on scene- and/or industry-related information in the conversation content of the audio, and determining role categories; and obtaining a role classification result based on a result of the voice role separation and a result of the semantic role separation.

1. A method for identifying roles in a conversation, comprising:

collecting audio of the conversation;

performing voice role separation on speakers based on speech features of the speakers in the audio;

performing semantic role separation on the speakers based on scene- and/or industry-related information in the conversation content of the audio, and determining role categories; and

obtaining a role classification result based on a result of the voice role separation and a result of the semantic role separation.

2. The method of claim 1, further comprising:

identifying voiceprint features of a speaker from the audio using a voiceprint recognition algorithm; and

identifying an identity of the speaker from a database of registered voiceprints based on the role classification result and the voiceprint features of the speaker.

3. The method of claim 1, wherein obtaining the role classification result based on the result of the voice role separation and the result of the semantic role separation comprises:

comparing the result of the voice role separation with the result of the semantic role separation for each speech segment;

if the result of the voice role separation is inconsistent with the result of the semantic role separation, adopting the result of the semantic role separation; and

if the result of the semantic role separation is null, adopting the result of the voice role separation.

4. The method of claim 2, further comprising: transcribing the audio to obtain audio text,

wherein the semantic role separation comprises: inputting the audio text into a semantic role separation model, wherein the semantic role separation model generates a semantic role separation result based on scene- and/or industry-related information in the audio text.

5. The method of claim 4, wherein the semantic role separation model is constructed based on a deep learning neural network algorithm using semantic clustering of word vectors or sentence vectors.

6. The method of claim 1, wherein the scene- and/or industry-related information comprises one or more of: conversation scene features, conversation role features, industry-specific terms, and enterprise stock phrases.

7. The method of claim 4, wherein identifying the voiceprint features of the speaker from the audio and transcribing the audio are performed in real time while the audio of the conversation is being collected.

8. An apparatus for identifying roles in a conversation, comprising:

a speech processing module configured to perform voice role separation on speakers based on speech features of the speakers in collected audio of the conversation;

a semantic processing module configured to perform semantic role separation on the speakers and determine role categories based on scene- and/or industry-related information in the conversation content of the audio; and

a processing module configured to obtain a role classification result based on a result of the voice role separation and a result of the semantic role separation.

9. The apparatus of claim 8, wherein the speech processing module is further configured to identify voiceprint features of a speaker from the audio using a voiceprint recognition algorithm; and wherein the processing module is further configured to identify an identity of the speaker from a database of registered voiceprints based on the role classification result and the voiceprint features of the speaker.

10. The apparatus of claim 8, wherein obtaining the role classification result based on the result of the voice role separation and the result of the semantic role separation comprises:

comparing the result of the voice role separation with the result of the semantic role separation for each speech segment;

if the result of the voice role separation is inconsistent with the result of the semantic role separation, adopting the result of the semantic role separation; and

if the result of the semantic role separation is null, adopting the result of the voice role separation.

11. The apparatus of claim 9, further comprising a speech recognition module configured to transcribe the audio to obtain audio text,

wherein the semantic role separation comprises: inputting the audio text into a semantic role separation model, and generating a semantic role separation result based on scene- and/or industry-related information in the audio text.

12. The apparatus of claim 11, wherein the semantic role separation model is constructed using semantic clustering of word vectors or sentence vectors based on a deep learning neural network algorithm.

13. The apparatus of claim 8, wherein the scene- and/or industry-related information comprises one or more of: conversation scene features, conversation role features, industry-specific terms, and enterprise stock phrases.

14. The apparatus of claim 11, wherein identifying the voiceprint features of the speaker from the audio and transcribing the audio are performed in real time while the audio of the conversation is being collected.

15. A system for identifying roles in a conversation, comprising:

a sound receiving device for collecting audio of the conversation; and

an apparatus for identifying roles in a conversation according to any one of claims 8 to 14.

Technical Field

The invention relates to the field of intelligent speech technology, and in particular to role recognition in conversation scenes.

Background

In intelligent speech applications, identifying the identity of a speaker in a voice conversation is a typical and common need: distinguishing speaker roles in a teleconference, separating customer-service and customer speech for quality inspection in intelligent customer service, and identifying and matching suspects in fraud and harassment calls handled by public security, among others. The defining business feature of these application scenarios is that multiple speakers take turns speaking.

The mainstream role identification approach in the industry builds a voiceprint model directly from the data and performs recognition against it. In scenarios such as teleconferences, interrogations, and customer-service calls, multiple speakers speak in turn. When speaker identities are recognized in real time amid rapid turn-taking of ultra-short utterances, voiceprint recognition frequently misjudges short sentences or fragments (speech segments shorter than 2-3 seconds), and the misjudgment rate is very high. Taking current customer-service systems as an example, the role separation accuracy of the prior art is only about 70%.

In addition, recognition that relies only on voiceprint features cannot determine the scene and/or industry of the conversation, so the speaker's role, and hence the speaker's identity, cannot be recognized when the scene and/or industry is unknown. The invention patent application with publication number CN108074576A, entitled "Speaker role separation method and system in an interrogation scene", uses time periods and/or durations as role recognition features in addition to speech features. However, that application targets only the interrogation scene and still cannot be applied to role recognition across diverse scenes.

Disclosure of Invention

Considering that in many scenarios enterprise clients have accumulated a wealth of industry experience in the form of professional technical data or semantic template data, the invention provides a method, an apparatus, and a system that improve the accuracy of role recognition by combining scene- and/or industry-related information in a conversation with role separation. To meet the role recognition needs of typical conversation scenarios in intelligent speech, the disclosed method, apparatus, and system make full use of key semantic information in the conversation, such as conversation scene features, conversation role features, industry-specific terms, and an enterprise's accumulated stock phrases, to raise role recognition accuracy to the level of usability required by businesses.

According to one aspect of the present invention, there is provided a method for identifying roles in a conversation, comprising: collecting audio of the conversation; performing voice role separation on speakers based on speech features of the speakers in the audio; performing semantic role separation on the speakers based on scene- and/or industry-related information in the conversation content of the audio, and determining role categories; and obtaining a role classification result based on a result of the voice role separation and a result of the semantic role separation.

According to another aspect of the present invention, there is provided an apparatus for identifying roles in a conversation, comprising: a speech processing module configured to perform voice role separation on speakers based on speech features of the speakers in collected audio of the conversation; a semantic processing module configured to perform semantic role separation on the speakers and determine role categories based on scene- and/or industry-related information in the conversation content of the audio; and a processing module configured to obtain a role classification result based on a result of the voice role separation and a result of the semantic role separation.

According to still another aspect of the present invention, there is provided a system for identifying roles in a conversation, comprising a sound receiving device for collecting audio of the conversation; and an apparatus for identifying roles in a conversation as described above.

Drawings

The present disclosure may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:

FIG. 1 illustrates a block diagram of a system for identifying roles in a conversation, according to one embodiment of the present invention;

FIG. 2 illustrates a flow diagram of a method for identifying roles in a conversation, according to one embodiment of the invention;

FIG. 3 illustrates a flow diagram of a method of determining the identity of a role based on the role classification result, according to one embodiment of the invention; and

FIG. 4 illustrates an example of determining the identity of a user based on the role classification result and voiceprints.

Detailed Description

Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.

FIG. 1 shows a block diagram of a system for identifying roles in a conversation according to one embodiment of the invention. As shown in FIG. 1, the role recognition system 1 includes a sound receiving device 200 and a role recognition apparatus 100. The sound receiving device 200 is used to collect audio and may be any sound pickup device commonly used in the art, such as a microphone. The role recognition apparatus 100 includes a processing module 110, a speech recognition module 120, a speech processing module 130, and a semantic processing module 140. According to one embodiment, the role recognition apparatus may further include a storage device that stores intermediate processing results. In another embodiment, the intermediate processing results may also be stored remotely.

The functions of the above-described modules are described below with reference to FIG. 1.

The speech recognition module 120 is used to transcribe the captured audio of the conversation. Transcription here means converting the speech content of the audio into text by means of speech recognition technology.

The speech processing module 130 is configured to perform voice role separation on the speakers based on their speech features in the captured audio of the conversation. In addition, the speech processing module 130 is configured to identify the voiceprint features of the speakers from the audio using a voiceprint recognition algorithm.

The semantic processing module 140 is configured to perform semantic role separation on the speakers and determine role categories based on scene- and/or industry-related information in the conversation content of the audio.

The processing module 110 is configured to store the intermediate processing results of the above modules in memory or at a remote end. The processing module 110 is further configured to obtain a role classification result based on the voice role separation result and the semantic role separation result. In one embodiment, the processing module may also identify the identity of a speaker from a database of registered voiceprints based on the role classification result and the speaker's voiceprint features.

It should be understood that the above modules are merely logic modules divided according to the specific functions implemented by the modules, and are not used for limiting the specific implementation manner. In actual implementation, the above modules may be implemented as separate physical entities, or may also be implemented by a single entity (e.g., a processor (CPU or DSP, etc.), an integrated circuit, etc.).
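As a rough illustration of this module layout, a minimal Python sketch follows; the class and method names are ours and are not prescribed by the patent:

```python
# Minimal sketch of the FIG. 1 module layout. All names are illustrative;
# the patent does not prescribe an implementation.
class RoleRecognitionApparatus:
    """Apparatus 100: coordinates the three processing modules of FIG. 1."""

    def __init__(self, speech_recognizer, speech_processor, semantic_processor):
        self.speech_recognizer = speech_recognizer    # module 120: transcription
        self.speech_processor = speech_processor      # module 130: voice separation
        self.semantic_processor = semantic_processor  # module 140: semantic separation

    def classify_roles(self, audio_segments):
        texts = [self.speech_recognizer.transcribe(s) for s in audio_segments]
        voice_labels = self.speech_processor.separate(audio_segments)
        semantic_labels = self.semantic_processor.separate(texts)
        # Fusion rule described under step 204 below: prefer the semantic
        # result; fall back to the voice result when it is null.
        return [sem if sem is not None else voc
                for voc, sem in zip(voice_labels, semantic_labels)]
```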

A method for identifying roles in a conversation according to one embodiment of the present invention is described below with reference to FIG. 2.

In step 201, audio of a conversation is captured by the sound receiving device 200, and the captured audio is sent to the processing module 110. The processing module 110 stores it in local memory or at a remotely located memory or server.

In one embodiment, the sound receiving device 200 transmits the audio data stream to the processing module 110 in real-time while audio is being captured.

After the conversation has ended and the audio has been collected, in step 202 the processing module 110 sends the collected conversation audio to the speech processing module 130, which performs voice role separation on the speakers based on their speech features in the audio. Speech features are characteristics specific to a speaker's voice, such as voice quality and timbre. Because human vocal organs differ in size and shape, voice quality and timbre differ from person to person. The speech processing module exploits this specificity to distinguish speakers.

According to one embodiment, the speech processing module uses a voice role separation algorithm to cluster the speech segments in the audio into, for example, class A and class B, and returns the result to the processing module 110 for storage. The clustering result is not necessarily two classes; depending on the scenario, there may be more than two. The voice role separation algorithm can be implemented with algorithms known in the art, for example neural-network-based algorithms, as actual needs dictate.
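As an illustration of this step, here is a minimal sketch of clustering-based voice role separation; the per-segment speaker embeddings and the choice of clustering algorithm are assumptions, since the invention does not prescribe a particular implementation:

```python
# Minimal sketch of voice role separation by clustering per-segment
# speaker embeddings. The embedding extractor is assumed to exist
# elsewhere; the patent does not prescribe a particular algorithm.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def separate_voice_roles(segment_embeddings: np.ndarray, n_roles: int = 2):
    """Cluster fixed-size speaker embeddings (one row per speech segment)
    into n_roles classes, e.g. class A / class B."""
    clusterer = AgglomerativeClustering(
        n_clusters=n_roles, metric="cosine", linkage="average")
    return clusterer.fit_predict(segment_embeddings)  # one label per segment
```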

In step 203, the semantic processing module 140 performs semantic role separation on the speakers and determines role categories based on scene- and/or industry-related information in the conversation content of the audio.

In one embodiment, the semantic processing module 140 utilizes the audio text transcribed from the audio to identify scene- and/or industry-related information.

The audio text may be obtained by the following process. While the sound receiving device 200 is capturing audio, the audio data stream is transmitted to the processing module 110 in real time. Each time the processing module 110 receives a speech-unit segment, it sends the segment to the speech recognition module 120 for transcription into the corresponding text. The text is sent back to the processing module 110 for storage until capture is complete and all speech-unit segments have been processed. The stored texts of all speech segments are then merged and sent to the semantic processing module 140 for processing.
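A minimal sketch of this streaming flow follows; the `transcribe` callable is a placeholder for whatever speech recognition backend module 120 uses:

```python
# Sketch of the real-time transcription flow described above: each
# speech-unit segment is transcribed as it arrives, the per-segment
# texts are stored, and they are merged once capture completes.
def collect_audio_text(segment_stream, transcribe):
    texts = []
    for segment in segment_stream:         # pushed by the sound receiver
        texts.append(transcribe(segment))  # speech recognition module 120
    return "\n".join(texts)                # merged text for module 140
```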

The transcription need not be performed in real time. In one embodiment, after capture is complete, the processing module 110 sends the complete conversation audio to the speech recognition module 120 for transcription, thereby obtaining the text of the conversation content. The audio text may be sent back to the processing module 110 for storage or passed directly to the semantic processing module 140.

The process of semantic role separation is described below.

In one embodiment, the semantic processing module 140 may input the audio text into a semantic role separation model and separate each piece of content into roles again according to the scene- and/or industry-related information in the audio text, obtaining a binary 0/1 classification. The semantic processing module 140 can then determine directly, from the scene- and/or industry-related information, what the class-0 and class-1 roles are; for example, role 0 is the customer-service agent and role 1 is the customer. Semantic role separation is described here with two classes as an example, but in practice there may be more than two. In one implementation, the customer role may be further refined into old customer, new customer, and VIP customer; role 1 might then be recognized directly as a VIP customer.

In one embodiment, the semantic role separation model is constructed based on a deep learning neural network algorithm using semantic clustering of word vectors or sentence vectors.
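The following hedged sketch combines the two ideas above: utterances are embedded as sentence vectors and clustered, the clusters are named from scene/industry cue phrases, and a sentence that carries no cues is given a null result (the behavior relied on in step 204 below). The cue lists and the `embed` callable are illustrative assumptions, not the patent's model:

```python
# Hedged sketch of semantic role separation via sentence-vector
# clustering. CUE_PHRASES and `embed` are illustrative assumptions.
from sklearn.cluster import KMeans

CUE_PHRASES = {  # hypothetical cues for a customer-service scenario
    "customer_service": ["how may i help", "your service request number"],
    "customer": ["my bill", "i want to cancel"],
}

def semantic_role_separation(utterances, embed):
    """embed: callable mapping a list of strings to an (n, d) float array."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(embed(utterances))
    roles = {}  # cluster id -> role name, chosen by cue-phrase hit counts
    for cid in (0, 1):
        text = " ".join(u for u, l in zip(utterances, labels) if l == cid).lower()
        hits = {r: sum(text.count(p) for p in ps)
                for r, ps in CUE_PHRASES.items()}
        roles[cid] = max(hits, key=hits.get)
    results = []
    for u, label in zip(utterances, labels):
        has_cue = any(p in u.lower()
                      for ps in CUE_PHRASES.values() for p in ps)
        results.append(roles[label] if has_cue else None)  # null: no cues
    return results
```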

In practical applications, conversations occur across different industries, and one industry may involve one or more different scenarios. For example, conversations arise in the finance, telecommunications, and education industries, and within the telecommunications industry there may be customer-service scenarios, meeting scenarios, and so on. The customer-service scenario may be further refined into, for example, old-customer, new-customer, and VIP-customer service scenarios. Obtaining scene- and/or industry-related information is therefore important for identifying which roles the speakers in a conversation hold.

Scene- and/or industry-related information can be extracted using word-vector and/or sentence-vector processing methods known in the art. This is just one embodiment; other extraction methods can be chosen according to actual needs.

Scene-related information is information that expresses the characteristics of a scene, i.e., information that can be used to distinguish different scenes. In one example, the scene-related information includes one or more of conversation scene features and conversation role features. Common scenarios for role recognition include customer-service calls, teleconferences, interrogations, and fraudulent or harassing calls. Different scenarios employ different vocabulary, sentences, and templates. In one example, the scene-related information may be the template information a conversation employs.

Industry-related information is information that expresses the characteristics of an industry, i.e., information that can distinguish different industries. In one example, the industry-related information includes one or more of industry-specific terms and enterprise stock phrases. Enterprise stock phrases are a set of scripted speaking patterns specific to an enterprise. Common industries for role recognition include finance, insurance, telecommunications, education, and public security. Taking the telecommunications industry as an example, industry-specific terms include data traffic, line speed, 4G, and the like.

In step 204, the processing module 110 determines a role classification result based on the voice role separation result output by the speech processing module 130 and the semantic role separation result output by the semantic processing module 140. The role classification result includes not only the role separation result but also the category of each separated role.

Voice role separation is not sufficiently accurate, but it can output a role separation result for every speech segment regardless of what is said. Semantic role separation based on scene- and/or industry-related information is highly accurate, but it cannot separate roles for sentences that contain no scene- or industry-related information; for such sentences its output is null.

In step 204, the processing module 110 combines the two role separation results to obtain a separation result that is more accurate than voice role separation alone and more complete than semantic role separation alone. Moreover, voice role separation has a high misjudgment rate for ultra-short utterances (speech segments shorter than 2-3 seconds), whereas semantic role separation is largely insensitive to utterance length. After the two are combined, an accurate separation result can therefore be obtained even for ultra-short utterances.

In one embodiment, the processing module 110 may compare the result of the voice role separation with the result of the semantic role separation for each speech segment. If the two results are inconsistent, the result of the semantic role separation is adopted. If the result of the semantic role separation for a segment is null, that is, the segment could not be classified semantically, the result of the voice role separation is adopted directly.
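A minimal sketch of this fusion rule, assuming the voice cluster labels (class A/B) have already been mapped onto the same role names as the semantic results:

```python
# Sketch of the fusion rule just described: per speech segment, prefer
# the semantic result; fall back to the voice result when the semantic
# result is null (the segment carried no scene/industry cues).
def fuse_role_results(voice_labels, semantic_labels):
    return [sem if sem is not None else voc
            for voc, sem in zip(voice_labels, semantic_labels)]

# Example: fuse_role_results(
#     ["customer", "customer", "customer_service"],
#     ["customer", None, "customer_service"])
# -> ["customer", "customer", "customer_service"]
```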

Since the semantic processing module determines role categories while performing semantic role separation (e.g., class 0 is the customer-service agent and class 1 is the customer), integrating the two role separation results yields a role separation result with the role categories made explicit, which is also referred to herein as the role recognition result.

The role recognition apparatus and method of the invention employ a dual role separation mechanism of voice role separation and semantic role separation, taking into account not only speech features but also the scene and/or industry information of the conversation. The invention therefore has the following advantages:

1. In the prior art, the mainstream role recognition mechanism separates roles relying only on speech features, so its accuracy is only about 70%; the misjudgment rate is especially high under rapid turn-taking of ultra-short utterances (speech segments shorter than 2-3 seconds). The invention introduces a semantic role separation mechanism on top of the voice role separation mechanism, greatly improving the accuracy of role separation.

2. In the prior art, the mainstream role recognition mechanism generally uses voice role separation alone and cannot recognize scenes or industries. Prior-art role recognition mechanisms can therefore only be applied to a single scene and/or industry (i.e., where the scene and/or industry is known). The invention uses scene- and/or industry-related information to determine the scene and/or industry of the conversation, and can therefore be applied across many scenes.

3. Voice role separation and semantic role separation are both computed uniformly after the conversation has ended. Compared with the prior art, which makes decisions on the basis of a single speech segment, this improves the accuracy of role separation and recognition.

After the role classification result with improved accuracy has been obtained, the identity of each role can be further identified. A method of determining a role's identity based on the role classification result is described below with reference to FIG. 3.

In step 205, the speech processing module 130 identifies the voiceprint features of the speakers from the audio using a voiceprint recognition algorithm. A voiceprint recognition algorithm recognizes the voiceprint, i.e., the acoustic spectrum, unique to each person based on differences in voice quality and timbre. Like a fingerprint, a voiceprint can be used to establish a person's identity.

In one embodiment, voiceprint recognition can be performed on each speech-unit segment in real time while the audio is being captured. This is not required, however; voiceprint recognition can also be performed on the speech segments of the entire audio after capture has finished.

In step 206, the processing module 110 identifies the identity of the speaker from the database of registered voiceprints based on the role classification results and the voiceprint characteristics of the speaker.

The database of registered voiceprints is a database prepared in advance that stores the voiceprints and identities of candidate speakers in association with each other. The database may be organized by role category. For example, in a customer-service call scenario, the customer-service voiceprint database records each agent's service number and the corresponding voiceprint, while the customer database records each customer's identity (e.g., mobile phone number, bank account, ID) and the corresponding voiceprint. The customer database may be further divided into an old-customer database, a new-customer database, a VIP-customer database, and so on.

FIG. 4 illustrates an example of determining the identity of a user based on the role classification result and voiceprints.

In this example, based on the role classification result (e.g., class 0 is customer service and class 1 is customer), the processing module may compare the voiceprint identified from the class-0 speech segments against each registered voiceprint in the customer-service voiceprint database and score each one. For example, a score of 0-10 is assigned according to voiceprint similarity, 0 being the least similar and 10 the most similar. The highest-scoring agent (e.g., the agent with service number 3579) is then selected as the customer-service agent in the conversation. A similar scoring method can be used to identify the customer: the voiceprint identified from the class-1 speech segments is compared and scored against each registered voiceprint in the customer voiceprint database, and the highest-scoring customer is selected. Where class 1 has been refined directly to a VIP customer, the voiceprint can be compared against the registered voiceprints in the VIP-customer database, and the highest-scoring VIP customer (e.g., VIP customer "bee") is selected as the customer in the conversation.
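A minimal sketch of this per-category matching follows; the cosine-similarity scoring, rescaled to the 0-10 range used above, the voiceprint embeddings, and the database layout are all assumptions:

```python
# Sketch of voiceprint matching within one role category, as in the
# FIG. 4 example: score each registered voiceprint against the query
# and return the best-matching identity with its score.
import numpy as np

def match_identity(query_voiceprint, registered):
    """registered: dict mapping an identity (e.g. a customer-service
    number) to a voiceprint embedding of the same dimension."""
    def score(v):
        cos = np.dot(query_voiceprint, v) / (
            np.linalg.norm(query_voiceprint) * np.linalg.norm(v))
        return 10.0 * (cos + 1.0) / 2.0  # map cosine [-1, 1] to [0, 10]
    best = max(registered, key=lambda ident: score(registered[ident]))
    return best, score(registered[best])
```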

Once the speaker's identity has been finally determined, a final result can be produced for storage, synchronization, or display.

The steps described above with reference to FIGS. 2 and 3 are merely exemplary and need not occur in the order described. Those skilled in the art can adjust the order of the steps according to actual needs; for example, semantic role separation may be performed before voice role separation. As another example, the voiceprint recognition step may be performed before the voice role separation or before the semantic role separation.

Unlike the prior art, the dual role separation mechanism does not operate wholly apart from voiceprint recognition: after the outputs of the two separation mechanisms are integrated, they are combined with the voiceprint recognition result for the final identity determination. The invention can therefore not only distinguish speakers' role categories but also identify speakers' identities. Because an accurate role classification result has already been obtained, each speaker's identity can be identified by comparing voiceprints only within the database of the corresponding category. Compared with prior-art voiceprint matching performed with the role category unknown, this is more targeted, more accurate, and requires less computation.

The many features and advantages of the invention are apparent from the detailed specification, and thus, it is intended by the appended claims to cover all such features and advantages of the invention which fall within the true spirit and scope of the invention. Further, since numerous modifications and variations will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation illustrated and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.
