Voice interaction method and device and related equipment

Document No.: 191272 | Publication date: 2021-11-02

Reading note: This technology, "Voice interaction method and device and related equipment", was designed and created by Li Shaojun and Yang Jie on 2021-07-30. The application relates to data processing technology and provides a voice interaction method, apparatus, computer device, and storage medium, the method comprising: analyzing an initial explanation text and an evaluation text based on a target character judgment model to judge whether a character is a target character; acquiring the initial explanation text set of the target character to obtain several clusters; extracting target features from the initial explanation texts in the clusters to obtain a first explanation text set, and combining the first explanation texts to obtain a target explanation text; parsing the target explanation text to obtain a business process text; generating business process voice, constructing a virtual character, and acquiring the facial features and audio features of the virtual character; parsing a voice instruction to obtain business process node information; and obtaining, according to the business process node information, the target business process voice matching that information. The application can improve business explanation efficiency and can be used in the various functional modules of a smart city, promoting the rapid development of smart cities.

1. A voice interaction method, characterized in that the voice interaction method comprises:

acquiring an initial explanation text and an evaluation text of a participant on the initial explanation text, and automatically analyzing the initial explanation text and the evaluation text based on a pre-trained target character judgment model to judge whether a character is a target character;

acquiring an initial explanation text set of the target character and preprocessing the initial explanation text set to obtain a plurality of clustering clusters, wherein each clustering cluster comprises an initial explanation text meeting a threshold condition;

extracting target features from the initial explanation texts in each cluster to obtain a first explanation text set, and combining each first explanation text in the first explanation text set according to a preset text sequence to obtain a target explanation text;

analyzing the target explanation text to obtain a service flow text;

generating business process voice corresponding to the business process text, constructing a virtual character according to a preset mathematical model, and acquiring the facial features of the virtual character and the audio features which are output by the virtual character and correspond to the business process voice;

when a voice instruction is received, analyzing the voice instruction to obtain service flow node information;

and obtaining target business process voice matched with the business process node information according to the business process node information.

2. The voice interaction method of claim 1, wherein the obtaining and preprocessing the initial explanation text set of the target character to obtain a plurality of clusters comprises:

acquiring an explanation theme corresponding to each initial explanation text in the initial explanation text set;

calculating text similarity between the explaining subjects;

and taking the explanation subjects with the text similarity exceeding a preset similarity threshold as a clustering center, and forming clustering clusters corresponding to the clustering center by the initial explanation texts corresponding to the explanation subjects.

3. The method of claim 1, wherein the extracting target features from the initial explanation text in each cluster to obtain a first explanation text set comprises:

acquiring an initial explanation text in the cluster, and splitting the initial explanation text data into a plurality of paragraphs by adopting a sequential segmentation method;

calling a pre-trained feature positioning model to screen out a plurality of target paragraphs with most information;

calling a pre-trained feature extraction model to respectively extract word-level, sentence-level, and paragraph-level hierarchical features of the target paragraphs to obtain a first explanation text;

and combining the first explanation texts corresponding to each cluster to obtain a first explanation text set.

4. The method of claim 1, wherein the combining each first explanation text in the first explanation text set according to a preset text order to obtain a target explanation text comprises:

acquiring a target cluster to which the first explanation text belongs and a target explanation theme corresponding to the target cluster;

acquiring a logical relation among the target explanation themes, and determining a theme sequence among the target explanation themes according to the logical relation;

and acquiring a preset text sequence according to the theme sequence, and combining each first explanation text in the first explanation text set according to the preset text sequence to obtain a target explanation text.

5. The method of claim 1, wherein the obtaining the facial features of the virtual character and the audio features output by the virtual character and corresponding to the business process voice comprises:

determining a phoneme sequence corresponding to the voice, wherein the phoneme sequence comprises phonemes corresponding to all time points;

determining lip key point information and eye key point information corresponding to each phoneme in the phoneme sequence;

searching a pre-established lip shape library and an eye shape library according to the determined lip shape key point information and the eye key point information respectively to obtain a lip shape image and an eye shape image of each phoneme;

and respectively corresponding the searched lip shape image and eye shape image of each phoneme to each time point to obtain a lip shape image sequence and an eye shape image sequence corresponding to the voice.

6. The voice interaction method of claim 1, wherein the parsing the target explanation text to obtain a business process text comprises:

determining candidate subject terms;

acquiring the word frequency of the candidate subject term in the target explanation text and the semantic similarity between the candidate subject term and the text word in the target explanation text;

and determining the correlation between the text and each candidate subject term according to the word frequency and the semantic similarity, and filling the candidate subject terms with the correlation higher than a preset correlation threshold value into the target explanation text to obtain a business process text.

7. The voice interaction method according to claim 1, wherein the generating the business process voice corresponding to the business process text comprises:

acquiring a preset mapping table of a text and a voice, wherein the mapping table stores the corresponding relation between characters or character strings and pronunciation phonemes;

identifying characters or character strings corresponding to the business process texts;

traversing the mapping table to retrieve the pronunciation phonemes corresponding to the characters or the character strings, and concatenating the pronunciation phonemes to obtain the business process voice corresponding to the business process text.

8. A voice interaction apparatus, comprising:

the target judgment module is used for acquiring an initial explanation text and an evaluation text of a participant to the initial explanation text, and automatically analyzing the initial explanation text and the evaluation text based on a pre-trained target character judgment model so as to judge whether a character is a target character;

the cluster analysis module is used for acquiring an initial explanation text set of the target character and preprocessing the initial explanation text set to obtain a plurality of cluster clusters, wherein each cluster comprises an initial explanation text meeting a threshold condition;

the feature extraction module is used for extracting target features from the initial explanation texts in each cluster to obtain a first explanation text set, and combining each first explanation text in the first explanation text set according to a preset text sequence to obtain a target explanation text;

the text analysis module is used for analyzing the target explanation text to obtain a service flow text;

the voice generation module is used for generating business process voice corresponding to the business process text, constructing a virtual character according to a preset mathematical model, and acquiring the facial features of the virtual character and the audio features which are output by the virtual character and correspond to the business process voice;

the instruction analysis module is used for analyzing the voice instruction to obtain service flow node information when receiving the voice instruction;

and the voice determining module is used for obtaining the target business process voice matched with the business process node information according to the business process node information.

9. A computer device, characterized in that the computer device comprises a processor for implementing the voice interaction method according to any one of claims 1 to 7 when executing a computer program stored in a memory.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method for voice interaction according to any one of claims 1 to 7.

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a voice interaction method, apparatus, computer device, and medium.

Background

Under the guidance of the insurance industry's "finance + technology" and "finance + ecology" strategies, and at the key point where technology is accelerating the digital transformation of insurance, quickly integrating AI capabilities to empower the large population of business agents and thereby improve business explanation efficiency is an important task.

In the process of implementing the present application, the inventors found the following technical problems in the prior art: to realize a business explanation, the prior art outputs a video to be broadcast in which the business explanation text is broadcast as voice while a virtual character performing the broadcast is displayed, so that the video satisfies the user's visual and auditory needs at the same time. However, in the prior art the business explanation text to be broadcast is mostly edited manually by relevant personnel, so its generation cost is high and its generation efficiency is low, which makes business explanation inefficient; moreover, because manual editing is used, the accuracy of the generated business explanation text cannot be guaranteed, and thus the accuracy of the business explanation cannot be guaranteed.

Therefore, it is necessary to provide a method for voice interaction of virtual characters, which can improve the efficiency and accuracy of service explanation.

Disclosure of Invention

In view of the foregoing, there is a need for a voice interaction method, a voice interaction apparatus, a computer device and a medium, which can improve the efficiency and accuracy of service explanation.

A first aspect of an embodiment of the present application provides a voice interaction method, where the voice interaction method includes:

acquiring an initial explanation text and an evaluation text of a participant on the initial explanation text, and automatically analyzing the initial explanation text and the evaluation text based on a pre-trained target character judgment model to judge whether a character is a target character;

acquiring an initial explanation text set of the target character and preprocessing the initial explanation text set to obtain a plurality of clustering clusters, wherein each clustering cluster comprises an initial explanation text meeting a threshold condition;

extracting target features from the initial explanation texts in each cluster to obtain a first explanation text set, and combining each first explanation text in the first explanation text set according to a preset text sequence to obtain a target explanation text;

analyzing the target explanation text to obtain a service flow text;

generating business process voice corresponding to the business process text, constructing a virtual character according to a preset mathematical model, and acquiring the facial features of the virtual character and the audio features which are output by the virtual character and correspond to the business process voice;

when a voice instruction is received, analyzing the voice instruction to obtain service flow node information;

and obtaining target business process voice matched with the business process node information according to the business process node information.

Further, in the voice interaction method provided by the present application, the obtaining and preprocessing the initial explanation text set of the target character to obtain a plurality of clusters includes:

acquiring an explanation theme corresponding to each initial explanation text in the initial explanation text set;

calculating text similarity between the explaining subjects;

and taking the explanation subjects with the text similarity exceeding a preset similarity threshold as a clustering center, and forming clustering clusters corresponding to the clustering center by the initial explanation texts corresponding to the explanation subjects.

Further, in the above voice interaction method provided by the present application, the extracting target features from the initial explanation text in each cluster to obtain a first explanation text set includes:

acquiring an initial explanation text in the cluster, and splitting the initial explanation text data into a plurality of paragraphs by adopting a sequential segmentation method;

calling a pre-trained feature positioning model to screen out a plurality of target paragraphs with most information;

calling a pre-trained feature extraction model to respectively extract word-level, sentence-level and paragraph-level hierarchical features of the target paragraph to obtain a first explanation text;

and combining the first explanation texts corresponding to each cluster to obtain a first explanation text set.

Further, in the voice interaction method provided by the present application, the combining each first explanation text in the first explanation text set according to a preset text order to obtain a target explanation text includes:

acquiring a target cluster to which the first explanation text belongs and a target explanation theme corresponding to the target cluster;

acquiring a logical relation among the target explanation themes, and determining a theme sequence among the target explanation themes according to the logical relation;

and acquiring a preset text sequence according to the theme sequence, and combining each first explanation text in the first explanation text set according to the preset text sequence to obtain a target explanation text.

Further, in the above voice interaction method provided in the present application, after the determining of a business process framework corresponding to the target business, the method further includes:

determining a parent-child relationship among a plurality of business items in the business process framework;

setting adjustment attributes and constraint conditions among a plurality of service items;

and determining the self-adaptive adjustment relation between the service items according to the adjustment attribute and the constraint condition.

Further, in the voice interaction method provided by the present application, the analyzing the target explanation text to obtain a service flow text includes:

determining candidate subject terms;

acquiring the word frequency of the candidate subject term in the target explanation text and the semantic similarity between the candidate subject term and the text word in the target explanation text;

and determining the correlation between the text and each candidate subject term according to the word frequency and the semantic similarity, and filling the candidate subject terms with the correlation higher than a preset correlation threshold value into the target explanation text to obtain a business process text.

Further, in the voice interaction method provided by the present application, the generating a business process voice corresponding to the business process text includes:

acquiring a preset mapping table of a text and a voice, wherein the mapping table stores the corresponding relation between characters or character strings and pronunciation phonemes;

identifying characters or character strings corresponding to the business process texts;

traversing the mapping table to retrieve the pronunciation phonemes corresponding to the characters or the character strings, and concatenating the pronunciation phonemes to obtain the business process voice corresponding to the business process text.

A second aspect of the embodiments of the present application further provides a voice interaction apparatus, where the voice interaction apparatus includes:

the target judgment module is used for acquiring an initial explanation text and an evaluation text of a participant to the initial explanation text, and automatically analyzing the initial explanation text and the evaluation text based on a pre-trained target character judgment model so as to judge whether a character is a target character;

the cluster analysis module is used for acquiring an initial explanation text set of the target character and preprocessing the initial explanation text set to obtain a plurality of cluster clusters, wherein each cluster comprises an initial explanation text meeting a threshold condition;

the feature extraction module is used for extracting target features from the initial explanation texts in each cluster to obtain a first explanation text set, and combining each first explanation text in the first explanation text set according to a preset text sequence to obtain a target explanation text;

the text analysis module is used for analyzing the target explanation text to obtain a service flow text;

the voice generation module is used for generating business process voice corresponding to the business process text, constructing a virtual character according to a preset mathematical model, and acquiring the facial features of the virtual character and the audio features which are output by the virtual character and correspond to the business process voice;

the instruction analysis module is used for analyzing the voice instruction to obtain service flow node information when receiving the voice instruction;

and the voice determining module is used for obtaining the target business process voice matched with the business process node information according to the business process node information.

The third aspect of the embodiments of the present application further provides a computer device, where the computer device includes a processor, and the processor is configured to implement the voice interaction method according to any one of the above when executing the computer program stored in the memory.

The fourth aspect of the embodiments of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements any one of the above-mentioned voice interaction methods.

According to the voice interaction method, apparatus, computer device, and computer-readable storage medium provided by the embodiments of the present application, the initial explanation text and the evaluation text are automatically analyzed based on a pre-trained target character judgment model to judge whether a character is a target character, and the target features in the initial explanation texts corresponding to the target characters are then collated by feature extraction to obtain a target explanation text, so that the explanation style of the business content is unified. Because the target explanation text does not need to be edited manually, manual editing cost is saved and business explanation efficiency is improved. In the method, the initial explanation text sets of a plurality of target characters are cluster-analyzed to obtain a plurality of clusters, and the useful features in each cluster are extracted, which ensures the comprehensiveness of the target explanation text. In addition, the method constructs a virtual character and matches the facial features of the virtual character with the audio features of the business process voice. When a voice instruction triggered by a participant is received, the instruction is parsed to obtain business process node information and the target business process voice is output, realizing business explanation by a virtual character and improving business explanation efficiency. The application can be applied to various functional modules of smart cities, such as smart government affairs and smart transportation, for example a virtual-character-based voice interaction module for smart government affairs, and can promote the rapid development of smart cities.

Drawings

Fig. 1 is a flowchart of a voice interaction method according to an embodiment of the present application.

Fig. 2 is a structural diagram of a voice interaction apparatus according to a second embodiment of the present application.

Fig. 3 is a schematic structural diagram of a computer device provided in the third embodiment of the present application.

The following detailed description will further illustrate the present application in conjunction with the above-described figures.

Detailed Description

In order that the above objects, features and advantages of the present application can be more clearly understood, a detailed description of the present application will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth to provide a thorough understanding of the present application, and the described embodiments are a part, but not all, of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.

The voice interaction method provided by the embodiments of the present application is executed by a computer device, and correspondingly, the voice interaction apparatus runs in the computer device.

Fig. 1 is a flowchart of a voice interaction method according to a first embodiment of the present application. As shown in fig. 1, the voice interaction method may include the following steps, and the order of the steps in the flowchart may be changed and some may be omitted according to different requirements.

S11, obtaining an initial explanation text and an evaluation text of the initial explanation text by a participant, and automatically analyzing the initial explanation text and the evaluation text based on a pre-trained target character judgment model to judge whether a character is a target character.

In at least one embodiment of the present application, the initial explanation text refers to an explanation text for each item of business content in the target business. The target business is a business whose content needs to be explained; for example, the target business may be an insurance business, a financial reimbursement business, or a mail sending and receiving business. The target character refers to an agent with excellent business performance for the target business. Taking the target business as an insurance business as an example, the corresponding initial explanation text is the text an agent uses to explain the relevant information of an insurance product to a user. It can be understood that, influenced by the language habits, work experience, and so on of different agents, the initial explanation texts of different agents may differ. The evaluation text refers to the evaluation content that different participants give for the initial explanation text of a corresponding agent; a participant may be a person who takes part in studying the initial explanation text. The evaluation text may include contents such as evaluation levels, and the evaluation levels may include a level A, a level B, a level C, and the like.

In one embodiment, the target character may be determined by comprehensively considering the agent's explanation behavior, which may include dimensions such as the proficiency of explaining the text, the logic degree of the explanation text, and the standard degree of Mandarin. For example, higher proficiency, logic degree, and Mandarin standard degree indicate a more excellent agent, who is identified as a target character; conversely, an agent whose explanation quality is low is not identified as a target character. The logic degree of the explanation text can be determined by detecting whether the logic of the initial explanation text meets a preset logic requirement, and the proficiency and the standard degree of Mandarin can be determined by parsing the evaluation text. The present application can train a target character judgment model through a deep learning network model and call the target character judgment model to automatically analyze the initial explanation text and the evaluation text so as to judge whether the agent is a target character.

Wherein, determining the logic degree of the explanation text by detecting whether the logic of the initial explanation text meets the preset logic requirement may include: acquiring the logic keywords of the explanation text; constructing a logic architecture to be checked from the logic keywords; calculating the architecture similarity between the logic architecture to be checked and a preset reference logic architecture, and detecting whether the architecture similarity exceeds a preset architecture similarity threshold; when the architecture similarity exceeds the preset threshold, determining that the logic of the initial explanation text meets the preset logic requirement; and when it does not, determining that the logic of the initial explanation text does not meet the preset logic requirement. The reference logic architecture is the logic architecture of an explanation text that meets the preset logic requirement. A logic architecture is composed of a plurality of logic keywords; each logic keyword may be the explanation theme of a paragraph in the explanation text, and the explanation theme of a paragraph can be determined from the occurrence frequency of each logic keyword: generally, the logic keyword with the highest occurrence frequency is selected as the paragraph's explanation theme. Parallel relations and/or inclusion relations exist among the logic keywords. For example, if logic keyword A contains two logic keywords B and C below it, then A and B, as well as A and C, are in an inclusion relation, while B and C are in a parallel relation. Calculating the architecture similarity between the logic architecture to be checked and the preset reference logic architecture amounts to determining whether the logic architecture to be checked contains erroneous inclusion relations or parallel relations: for a logic architecture with many such errors, the logic of the initial explanation text is determined not to meet the preset logic requirement; for one with few or no such errors, the logic is determined to meet the requirement.
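As a non-limiting illustration, this check can be sketched in a few lines of Python. The representation (a logic architecture as a set of directed (parent, child) inclusion pairs, with parallel relations derived from keywords sharing a parent), the Jaccard overlap metric, and the threshold value are all assumptions made for the example, not requirements of the method:

```python
# Minimal sketch of the architecture-similarity check, assuming a logic
# architecture is a set of directed (parent, child) inclusion pairs.
from itertools import combinations

def relations(inclusions):
    """Expand inclusion pairs into labelled inclusion + parallel relations."""
    children = {}
    for parent, child in inclusions:
        children.setdefault(parent, set()).add(child)
    rels = {("inclusion", p, c) for p, c in inclusions}
    for sibs in children.values():  # siblings under one parent are parallel
        rels |= {("parallel",) + pair for pair in combinations(sorted(sibs), 2)}
    return rels

def architecture_similarity(candidate, reference):
    """Jaccard overlap of the two relation sets (an assumed metric)."""
    a, b = relations(candidate), relations(reference)
    return len(a & b) / len(a | b) if a | b else 1.0

# Reference: keyword A contains B and C (so B and C are parallel).
reference = {("A", "B"), ("A", "C")}
candidate = {("A", "B")}  # the candidate misses "A contains C"
meets_requirement = architecture_similarity(candidate, reference) >= 0.6
```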

The proficiency of the explanation text and the standard degree of Mandarin may be determined by parsing the evaluation text, which may include: the participants separately evaluate the proficiency of the explanation text and the standard degree of Mandarin, and the evaluation results are stored in a preset data format to form an evaluation text, where the evaluation text may include contents such as evaluation levels, and the evaluation levels may include a level A, a level B, a level C, and the like. When there are multiple participants, the evaluation levels given by the participants may be averaged, and the average taken as the proficiency of the explanation text and the standard degree of Mandarin.

The training of the target character judgment model through the deep learning network model may include: taking the logic degree corresponding to the initial explanation text and the proficiency and Mandarin standard degree corresponding to the evaluation text as input data, and taking the judgment result of whether the agent is a target character as output data, to construct training samples and test samples; calling an initial neural network model to process the training samples to obtain the target character judgment model; and calling the target character judgment model to process the test samples and calculating the model accuracy, where training of the target character judgment model is determined to be finished when the model accuracy exceeds a preset model accuracy threshold. The preset model accuracy threshold is a preset value and is not limited herein.

The calling of the target character judgment model to automatically analyze the initial explanation text and the evaluation text to determine whether the agent is a target character may include: acquiring the logic degree corresponding to the initial explanation text; acquiring the proficiency and the Mandarin standard degree corresponding to the evaluation text; taking the logic degree, the proficiency, and the Mandarin standard degree as input data; and calling the target character judgment model to process the input data to obtain the judgment result of whether the agent is a target character.
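A minimal sketch of such a judgment model follows: three numeric scores in, a binary target/non-target label out. The patent fixes neither the network shape nor the score scale, so the scikit-learn MLP, the [0, 1] scores, and the sample data below are all illustrative assumptions:

```python
# Illustrative target character judgment model: [logic degree, proficiency,
# Mandarin standard degree] -> 1 (target character) or 0 (not).
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical labelled agents; scores assumed normalized to [0, 1].
X = np.array([[0.9, 0.8, 0.9], [0.4, 0.5, 0.6],
              [0.8, 0.9, 0.7], [0.3, 0.2, 0.5]])
y = np.array([1, 0, 1, 0])  # 1 = excellent agent (target character)

model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X, y)

# In practice the accuracy would be computed on held-out test samples and
# compared against the preset model accuracy threshold before deployment.
is_target = model.predict([[0.85, 0.9, 0.8]])[0] == 1
```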

In the present application, parsing the evaluation text to obtain the proficiency and the standard degree of Mandarin replaces training additional models to analyze the explanation behavior for the same purpose, which avoids the large computation cost of labeling training texts during model training.

S12, acquiring the initial explanation text set of the target character and preprocessing the initial explanation text set to obtain a plurality of cluster clusters, wherein each cluster comprises the initial explanation text meeting the threshold condition.

In at least one embodiment of the present application, the number of target characters may be one or more; when there are multiple target characters there are also multiple initial explanation texts, and these are combined into an initial explanation text set. Preprocessing the initial explanation text set of the target characters may include deleting the irrelevant information in each initial explanation text of the set, where the irrelevant information includes stop words (such as "and" and "also"), repeated words, punctuation marks, and the like. Deleting the irrelevant information in the initial explanation texts reduces its interference and improves the accuracy of the cluster analysis.

In an embodiment, the initial explanation text set includes the initial explanation texts of a plurality of different target characters. Each initial explanation text includes a plurality of distinct explanation themes; an explanation theme can be understood as a separate explanation unit within the initial explanation text, and each explanation theme has a corresponding explanation fragment. The explanation fragments of different target characters for the same explanation theme may differ only slightly. The explanation themes in the initial explanation text set are analyzed by clustering, and the explanation fragments of different target characters under explanation themes whose text similarity exceeds a preset similarity threshold are divided into the same cluster, the preset similarity threshold being a preset similarity value. Each cluster thus contains several explanation fragments of different target characters corresponding to the same or similar explanation themes.

Optionally, the obtaining and preprocessing the initial explanation text set of the target person to obtain a plurality of clustering clusters includes:

acquiring an explanation theme corresponding to each initial explanation text in the initial explanation text set;

calculating text similarity between the explaining subjects;

and taking the explanation subjects with the text similarity exceeding a preset similarity threshold as a clustering center, and forming clustering clusters corresponding to the clustering center by the initial explanation texts corresponding to the explanation subjects.

An initial explanation text is divided into a plurality of explanation themes, so each initial explanation text corresponds to multiple explanation themes. Arranging the explanation themes of one initial explanation text in vector form yields a first explanation theme vector; acquiring the explanation themes of all initial explanation texts in the initial explanation text set and arranging them in vector form yields a second explanation theme vector, a third explanation theme vector, and so on up to an nth explanation theme vector. The similarities among the explanation themes of the first through nth explanation theme vectors are calculated respectively, and the explanation fragments of different target characters corresponding to explanation themes whose similarity exceeds the preset similarity threshold are taken as one cluster, thereby obtaining a plurality of clusters.
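The clustering step can be illustrated as follows, with TF-IDF vectors and cosine similarity standing in for the unspecified text-similarity measure; the topic strings and the threshold value are hypothetical:

```python
# Sketch of S12's clustering: explanation themes whose pairwise text
# similarity exceeds a threshold fall into the same cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topics = ["product information", "product details", "operation mode",
          "online and offline operation", "payment management"]
sim = cosine_similarity(TfidfVectorizer().fit_transform(topics))

THRESHOLD = 0.3  # the preset similarity threshold (illustrative value)
clusters: list[set[int]] = []
for i in range(len(topics)):
    for c in clusters:                 # join the first sufficiently similar cluster
        if any(sim[i, j] > THRESHOLD for j in c):
            c.add(i)
            break
    else:                              # otherwise start a new cluster
        clusters.append({i})

# Each cluster then collects the explanation fragments of the different
# target characters belonging to its themes.
print([{topics[i] for i in c} for c in clusters])
```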

S13, extracting target features from the initial explanation texts in each cluster to obtain a first explanation text set, and combining each first explanation text in the first explanation text set according to a preset text sequence to obtain a target explanation text.

In at least one embodiment of the present application, the first explanation text set refers to a set formed by the key texts extracted from each cluster, and the target explanation text refers to the key text formed by combining the first explanation texts corresponding to the plurality of clusters.

Optionally, the extracting target features from the initial explanation text in each cluster to obtain a first explanation text set includes:

acquiring an initial explanation text in the cluster, and splitting the initial explanation text data into a plurality of paragraphs by adopting a sequential segmentation method;

calling a pre-trained feature positioning model to screen out a plurality of target paragraphs with most information;

calling a pre-trained feature extraction model to respectively extract word-level, sentence-level and paragraph-level hierarchical features of the target paragraph to obtain a first explanation text;

and combining the first explanation texts corresponding to each cluster to obtain a first explanation text set.

Because a cluster contains the explanation fragments of different target characters under the same or similar explanation themes, feature extraction over the fragments in a cluster may yield multiple first explanation texts, and the same or similar features may exist among them. The first explanation texts extracted from each cluster can therefore be de-duplicated to remove those repeated features, finally obtaining first explanation texts with non-repeating features.

The initial explanation text can be preprocessed, for example by deleting irregular tokens (such as special symbols and stray punctuation) in it. The feature positioning model is used to locate useful information in a paragraph, where useful information is information that has a preset positive effect on the explanation. When training the feature positioning model, the initial neural network is trained with pieces of information as input vectors and with labels indicating whether each piece is useful information as output vectors, so that the feature positioning model is obtained.

The feature extraction model may comprise a convolutional neural network, a bidirectional long short-term memory network, and a sentence-level attention layer, and it extracts the features of a target paragraph layer by layer using a hierarchical structure. The sentence-level attention layer first obtains the local features of each sentence through the convolutional neural network, then relates the preceding and following context features of each sentence through the bidirectional long short-term memory network; a soft attention layer is introduced to calculate the weight of each sentence, and the sentence-level features are weighted and summed to form a feature vector for each paragraph, which can serve as a first explanation text.
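A minimal PyTorch sketch of this hierarchy is given below; the embedding size, convolution width, pooling choice, and hidden size are assumptions, since the text fixes only the CNN + bidirectional LSTM + soft attention structure:

```python
import torch
import torch.nn as nn

class SentenceAttentionEncoder(nn.Module):
    """CNN for local sentence features, BiLSTM for context, soft attention
    to weight sentences into one paragraph-level feature vector."""

    def __init__(self, emb_dim=64, conv_dim=32, hidden=32):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(conv_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)  # soft attention scorer

    def forward(self, paragraph):  # paragraph: (sentences, words, emb_dim)
        # Convolution + max-pooling: one local feature vector per sentence.
        local = self.conv(paragraph.transpose(1, 2)).amax(dim=2)
        # BiLSTM over the sentence sequence links preceding/following context.
        ctx, _ = self.bilstm(local.unsqueeze(0))        # (1, sentences, 2h)
        weights = torch.softmax(self.attn(ctx), dim=1)  # one weight per sentence
        return (weights * ctx).sum(dim=1)               # weighted paragraph vector

encoder = SentenceAttentionEncoder()
# A paragraph of 5 sentences, 12 words each, with assumed 64-d word embeddings.
paragraph_vector = encoder(torch.randn(5, 12, 64))  # shape: (1, 64)
```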

According to the method and the device, clustering analysis is carried out on the initial explanation text set of the target character to obtain a plurality of clustering clusters, and useful features in each clustering cluster are extracted, so that the useful features can be extracted, and the comprehensiveness of the target explanation text is ensured.

Optionally, the combining each first explanation text in the first explanation text set according to a preset text sequence to obtain a target explanation text includes:

acquiring a target cluster to which the first explanation text belongs and a target explanation theme corresponding to the target cluster;

acquiring a logical relation among the target explanation themes, and determining a theme sequence among the target explanation themes according to the logical relation;

and acquiring a preset text sequence according to the theme sequence, and combining each first explanation text in the first explanation text set according to the preset text sequence to obtain a target explanation text.

Logical relations exist among the target explanation themes and can be determined by traversing the logic keywords in the logic architecture, among which parallel relations and/or inclusion relations exist. For example, if logic keyword A contains two logic keywords B and C below it, then A and B, as well as A and C, are in an inclusion relation, while B and C are in a parallel relation. The theme order among the target explanation themes is determined according to these parallel and/or inclusion relations; the theme order has a mapping relation with a preset text order, and the preset text order corresponding to a theme order can be obtained by querying that mapping relation.
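One plausible reading of this ordering rule, sketched in Python: a depth-first walk places each theme before the themes it contains, while parallel themes keep their listed order. The rule and all names here are illustrative, not mandated by the text:

```python
# Sketch of S13's ordering step: inclusion relations give the theme order,
# and the first explanation texts are concatenated accordingly.
def theme_order(roots, children):
    """Depth-first walk: a theme precedes the themes it contains."""
    order = []
    def visit(theme):
        order.append(theme)
        for child in children.get(theme, []):  # parallel themes: listed order
            visit(child)
    for root in roots:
        visit(root)
    return order

children = {"A": ["B", "C"]}  # A contains B and C; B and C are parallel
texts = {"A": "overview...", "B": "online operation...", "C": "offline..."}
target_text = " ".join(texts[t] for t in theme_order(["A"], children))
```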

And S14, analyzing the target explanation text to obtain a service flow text.

In at least one embodiment of the present application, a business process framework is a framework for explaining business content. Taking the target business as an insurance business as an example, the business process framework may include frames such as product information, business team, operation mode, and payment management, where each frame in the business process framework may correspond to one or more sub-frames; for example, the "operation mode" frame further includes the two sub-frames "online operation" and "offline operation". In an embodiment, the business process framework may be preset by business personnel. In other embodiments, in order to improve the accuracy and efficiency of setting the business process framework, it is set by machine learning instead of manually.

Optionally, when the business process framework is set in a machine learning manner, the determining the business process framework corresponding to the target business includes:

acquiring a service system corresponding to the target service;

determining explanation themes in the business system and business items corresponding to each explanation theme;

and constructing a business relation tree according to the explanation theme and the business items, and taking the business relation tree as a business process framework.

The target business has a corresponding business system, and the business system includes a plurality of explanation themes; for example, an explanation theme may be product information, business team, operation mode, or payment management. An explanation theme is a general name for a group of business items: one explanation theme may correspond to a single business item or to several. For example, the explanation theme "operation mode" may correspond to the two business items "online operation" and "offline operation". The business relation tree is constructed with the explanation themes as parent nodes and the business items as child nodes, and the business process framework may take the form of this relation tree.
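A sketch of such a business relation tree follows, using a simple dataclass representation; the node type and the sample business system are illustrative, not part of the method:

```python
# Sketch of the business relation tree: explanation themes as parent nodes,
# business items as their children.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)

def build_framework(system: dict[str, list[str]]) -> Node:
    root = Node("target business")
    for theme, items in system.items():
        root.children.append(Node(theme, [Node(i) for i in items]))
    return root

# The insurance example from the text: "operation mode" has two sub-items.
framework = build_framework({
    "product information": [],
    "operation mode": ["online operation", "offline operation"],
    "payment management": [],
})
```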

Optionally, after determining the business process framework corresponding to the target business, the method further includes:

determining a parent-child relationship among a plurality of business items in the business process framework;

setting adjustment attributes and constraint conditions among a plurality of service items;

and determining the self-adaptive adjustment relation between the service items according to the adjustment attribute and the constraint condition.

The parent-child relationship may be one-to-one or one-to-many. The constraint conditions may be, for example, that a child object moves and/or resizes correspondingly as its parent object moves and/or resizes. The adjustment attributes may include setting a maximum or minimum width and/or height, and/or setting adjustable features such as width-adjustable, height-adjustable, or proportionally adjustable.

It can be understood that after the business process framework is constructed in a machine learning manner, the business process framework can be displayed on a front-end page for system personnel to confirm whether the business process framework needs to be adjusted, and when the business process framework needs to be adjusted, the system personnel can adjust the framework.

Optionally, the analyzing the target explanation text to obtain a service flow text includes:

determining candidate subject terms;

acquiring the word frequency of the candidate subject term in the target explanation text and the semantic similarity between the candidate subject term and the text word in the target explanation text;

and determining the correlation between the text and each candidate subject term according to the word frequency and the semantic similarity, and filling the candidate subject terms with the correlation higher than a preset correlation threshold value into the target explanation text to obtain a business process text.

A candidate subject word may be a manually set subject word that has a mapping relation with a node identifier; it may also be generated automatically by a topic expansion algorithm on the basis of manually set candidate subject words, or extracted automatically from a corpus by a topic discovery algorithm. For a target explanation text in Chinese, the words it contains can be obtained through Chinese word segmentation, the process of recombining a continuous character sequence into a word sequence according to certain specifications. Since Chinese word segmentation is a mature prior art, it is not described here again.
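The relevance computation might be combined as below; the linear weighting of word frequency and semantic similarity, the toy similarity function, and the threshold are all assumptions made for the example:

```python
# Sketch of the relevance score in S14: combine a candidate subject word's
# frequency in the target explanation text with its semantic similarity to
# the text's words.
def relevance(candidate, text_tokens, similarity, alpha=0.5):
    """alpha weights word frequency against the best semantic similarity."""
    freq = text_tokens.count(candidate) / max(len(text_tokens), 1)
    sem = max((similarity(candidate, w) for w in set(text_tokens)), default=0.0)
    return alpha * freq + (1 - alpha) * sem

# Hypothetical similarity; in practice this would come from word embeddings
# (e.g. cosine similarity over word vectors).
def toy_similarity(a, b):
    common = set(a) & set(b)  # character overlap as a crude stand-in
    return len(common) / max(len(set(a) | set(b)), 1)

tokens = "insurance product payment process payment node".split()
THRESHOLD = 0.2  # the preset correlation threshold (illustrative)
kept = [w for w in ["payment", "refund"]
        if relevance(w, tokens, toy_similarity) > THRESHOLD]
```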

In the present application, a deep learning model is called to learn from the initial explanation texts of the target characters, and the important sentences in those texts are collated to obtain the target explanation text, so that the explanation style of the business content is unified; and because the target explanation text does not need to be edited manually, manual editing cost is saved and business explanation efficiency is improved.

S15, generating business process voice corresponding to the business process text, constructing a virtual character according to a preset mathematical model, and acquiring the facial features of the virtual character and the audio features output by the virtual character and corresponding to the business process voice.

In at least one embodiment of the present application, the business process voice corresponding to the business process text is generated according to a preset voice requirement, where the preset voice requirement is set in advance by system personnel. For example, the preset voice requirement may include a timbre requirement and a language requirement, where the timbre requirement includes a male voice or a female voice, and the language requirement includes Chinese, English, and the like, which is not limited herein. By generating unified business process voice corresponding to the business process text, the application avoids problems such as unclear voice explanation and non-standard content explanation caused by agents with low explanation quality, and raises the level of business explanation.

Optionally, the generating the business process voice corresponding to the business process text includes:

acquiring a preset mapping table of a text and a voice, wherein the mapping table stores the corresponding relation between characters or character strings and pronunciation phonemes;

identifying characters or character strings corresponding to the business process texts;

traversing the mapping table to retrieve the pronunciation phonemes corresponding to the characters or the character strings, and concatenating the pronunciation phonemes to obtain the business process voice corresponding to the business process text.

Attributes such as the timbre, tone, and strength of the converted voice data depend on the pronunciation phonemes stored in the text-to-voice mapping table. The same text data can be converted into voice data in different people's voices through different text-to-voice mapping tables.
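A sketch of this table-driven conversion is given below, with a greedy longest-match lookup and a made-up phoneme table; a real table would carry the timbre, tone, and strength information mentioned above:

```python
# Sketch of S15's synthesis: look up each character or string of the
# business process text in the mapping table, then concatenate the phonemes.
MAPPING_TABLE = {  # character/string -> pronunciation phonemes (illustrative)
    "pay": ["p", "eɪ"],
    "payment": ["p", "eɪ", "m", "ə", "n", "t"],
    "node": ["n", "oʊ", "d"],
}

def to_phonemes(text: str) -> list[str]:
    """Greedy longest-match lookup, then phoneme concatenation."""
    phonemes, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in MAPPING_TABLE:
                phonemes += MAPPING_TABLE[text[i:j]]
                i = j
                break
        else:
            i += 1  # unknown character: skip (a real system would fall back)
    return phonemes

speech = to_phonemes("payment node")  # fed to a concatenative voice engine
```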

Optionally, the constructing the virtual character according to the preset mathematical model includes:

acquiring a plurality of element characteristics, wherein the element characteristics comprise language elements, behavior elements, image elements and scene elements of a human body;

establishing an element database according to a plurality of element characteristics;

and selecting target element features from the element database, combining the target element features to establish a virtual character model, and calling VR equipment to render the virtual character model as a virtual character.

The element features of the human body may be acquired from pre-stored video clips, or collected by a collection device within a preset time period. Acquiring the element features, which cover the language, behavior, image, and scene of the human body, includes: collecting the average speech rate, average wording, and habitual expressions of the human body when speaking within a preset time period; collecting the facial expressions of the human body, including expressions of happiness, sadness, anger, fear, disgust, and surprise; and collecting the habitual actions of the human body, including frowning, resting the forehead on a hand, biting the lips, shaking the legs, touching the nose, and adjusting glasses. The language, behavior, and image elements of the human body are collected through a microphone, a camera device, a scanner, and sensors.

In this way the virtual character image is endowed with a specific personality, language, habitual actions, corresponding scenes, and the like; after the collected data is processed by AI technology, it is stored on the device and displayed by the VR equipment.

Optionally, the obtaining the facial feature of the virtual character and the audio feature output by the virtual character and corresponding to the business process speech includes:

determining a phoneme sequence corresponding to the voice, wherein the phoneme sequence comprises phonemes corresponding to all time points;

determining lip key point information and eye key point information corresponding to each phoneme in the phoneme sequence;

searching a pre-established lip shape library and an eye shape library according to the determined lip shape key point information and the eye key point information respectively to obtain a lip shape image and an eye shape image of each phoneme;

and respectively corresponding the searched lip shape image and eye shape image of each phoneme to each time point to obtain a lip shape image sequence and an eye shape image sequence corresponding to the voice.
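The synchronization step can be sketched as below, where the keypoint table and the lip/eye libraries are illustrative stand-ins for the pre-established libraries:

```python
# Sketch of the lip/eye synchronisation: map each timed phoneme to keypoint
# info, look the keypoints up in the shape libraries, and keep the timestamps
# so the images align with the speech.
PHONEME_KEYPOINTS = {"p": "lips_closed", "eɪ": "lips_open_wide", "m": "lips_closed"}
LIP_LIBRARY = {"lips_closed": "lip_img_01.png", "lips_open_wide": "lip_img_07.png"}
EYE_LIBRARY = {"lips_closed": "eye_img_02.png", "lips_open_wide": "eye_img_03.png"}

def face_sequences(phoneme_sequence):
    """phoneme_sequence: list of (time_point, phoneme) pairs."""
    lips, eyes = [], []
    for t, ph in phoneme_sequence:
        key = PHONEME_KEYPOINTS.get(ph, "lips_closed")  # assumed fallback shape
        lips.append((t, LIP_LIBRARY[key]))
        eyes.append((t, EYE_LIBRARY[key]))
    return lips, eyes

lip_seq, eye_seq = face_sequences([(0.00, "p"), (0.08, "eɪ"), (0.20, "m")])
```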

The method and the device can effectively avoid the problem that the voice output state and the face state of the virtual character are inconsistent in display, improve the accuracy of voice synthesis, and further improve the user experience.

And S16, when receiving the voice command, analyzing the voice command to obtain the service process node information.

In at least one embodiment of the present application, the service flow framework includes a plurality of service flow nodes, and the service flow node information refers to information of a certain node in the service flow framework.

Optionally, the analyzing the voice instruction to obtain the service flow node information includes:

acquiring a voice instruction input by a participant, and performing semantic recognition on the voice instruction to obtain a conversation intention of the participant;

and inquiring a service flow framework according to the session intention to obtain service flow node information.

The voice instruction may include an instruction specifying a custom starting section, and the session intent is the intent containing that custom starting section. The session intent may include indication information of a specific business process node in the business process framework; for example, it may include the name or identifier of the node. The business process framework is queried according to the session intent, thereby obtaining the relevant information of that specific business process node.
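A sketch of this parsing step follows, with simple keyword matching standing in for real semantic recognition (ASR plus intent understanding) and a hypothetical two-node framework:

```python
# Sketch of S16: recognise the participant's session intent from the voice
# instruction, then query the business process framework for the node.
FLOW_NODES = {  # hypothetical business process framework (name -> node info)
    "product information": {"id": "node-1", "speech": "product_info.wav"},
    "payment management": {"id": "node-3", "speech": "payment.wav"},
}

def parse_instruction(transcript: str):
    """Return the node whose name appears in the recognised text, if any."""
    for name, node in FLOW_NODES.items():
        if name in transcript.lower():
            return node
    return None

node = parse_instruction("Please start from the payment management section")
# S17 then plays node["speech"] as the target business process voice.
```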

And S17, obtaining the target business process voice matched with the business process node information according to the business process node information.

In at least one embodiment of the present application, the service flow framework includes a plurality of service flow nodes, and the service flow node information refers to information of a certain node in the service flow framework. Traversing the business process frame according to the business process node information, determining the position of the business process node information in the business process frame, and acquiring the business process voice at the position as the target business process voice.

Optionally, traversing the business process framework according to the business process node information to obtain a target business process voice matched with the business process node information includes:

traversing the service flow framework according to the service flow node information to obtain a target position of the service flow node in the service flow framework;

and acquiring the business process voice at the target position as the target business process voice matched with the business process node information.

In at least one embodiment of the present application, the controlling the virtual character to output the target business process speech includes:

acquiring a phoneme sequence corresponding to the target business process voice;

determining a lip shape image sequence corresponding to the virtual character according to the phoneme sequence;

and calling the VR equipment to control the virtual character to output the target business process voice.

According to the voice interaction method provided by the embodiments of the present application, the initial explanation text and the evaluation text are automatically analyzed based on a pre-trained target character judgment model to judge whether an agent is a target character, and the important sentences in the initial explanation texts corresponding to the target characters are then collated by feature extraction to obtain a target explanation text, so that the explanation style of the business content is unified. Because the target explanation text does not need to be edited manually, manual editing cost is saved and business explanation efficiency is improved. In the method, the initial explanation text sets of a plurality of target characters are cluster-analyzed to obtain a plurality of clusters, and the useful features in each cluster are extracted, which ensures the comprehensiveness of the target explanation text. In addition, the method constructs a virtual character and matches the facial features of the virtual character with the audio features of the business process voice. When a voice instruction triggered by a participant is received, the instruction is parsed to obtain business process node information, and the virtual character is controlled to output the target business process voice corresponding to that node information, realizing business explanation by a virtual character and improving business explanation efficiency. The application can be applied to various functional modules of smart cities, such as smart government affairs and smart transportation, for example a virtual-character-based voice interaction module for smart government affairs, and can promote the rapid development of smart cities.

Fig. 2 is a structural diagram of a voice interaction apparatus according to a second embodiment of the present application.

In some embodiments, the voice interaction device 20 may include a plurality of functional modules composed of computer program segments. The computer program of each program segment in the voice interaction apparatus 20 can be stored in a memory of a computer device and executed by at least one processor to perform the functions of voice interaction (described in detail in fig. 1).

In this embodiment, the voice interaction apparatus 20 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a target determination module 201, a cluster analysis module 202, a feature extraction module 203, a text analysis module 204, a voice generation module 205, an instruction parsing module 206, and a voice determination module 207. A module referred to herein is a series of computer program segments, stored in a memory, that can be executed by at least one processor to perform a fixed function. The functions of the modules are described in detail in the following embodiments.

The target determination module 201 is configured to acquire an initial explanation text and the participants' evaluation text of that initial explanation text, and to automatically analyze the initial explanation text and the evaluation text based on a pre-trained target character determination model so as to determine whether a character is a target character.

In at least one embodiment of the present application, the initial explanation text refers to an explanation text for each service content in the target business. The target business refers to a business whose content needs to be explained; for example, the target business may be an insurance business, a financial reimbursement business, or a mail sending and receiving business. The target character refers to an agent with excellent business performance for the target business. Taking the target business as an insurance business as an example, the corresponding initial explanation text is the text used by an agent to explain the relevant information of an insurance product to a user. It can be understood that, due to the influence of language habits, work experience and the like, the initial explanation texts of different agents may differ. The evaluation text refers to the evaluation content that different participants give to the initial explanation text of the corresponding agent; the participants may be people who take part in learning the initial explanation text, and the evaluation text may include content such as an evaluation level, where the evaluation levels may include level A, level B, level C, and the like.

In one embodiment, the target character may be determined by comprehensively considering the explanation behavior of the agent. The explanation behavior may include dimensions such as the proficiency of the explanation, the logic level of the explanation text, and the standard level of the agent's Mandarin. For example, the higher the agent's proficiency, logic level and Mandarin standard level, the more excellent the agent is, and the agent is identified as a target character; otherwise, the explanation quality of the agent is low, and the agent is not identified as a target character. The logic level of the explanation text can be determined by detecting whether the logic of the initial explanation text meets a preset logic requirement, while the proficiency level and the Mandarin standard level can be determined by parsing the evaluation text. In the present application, a target character determination model can be trained through a deep learning network model, and the model is called to automatically analyze the initial explanation text and the evaluation text so as to determine whether the agent is a target character.

Determining the logic level of the explanation text by detecting whether its logic meets the preset logic requirement may include: acquiring the logic keywords of the explanation text; constructing a logic architecture to be checked according to the logic keywords; calculating the architecture similarity between the logic architecture to be checked and a preset reference logic architecture, and detecting whether the architecture similarity exceeds a preset architecture similarity threshold; when it does, determining that the logic of the initial explanation text meets the preset logic requirement; when it does not, determining that the logic does not meet the requirement. The reference logic architecture is the logic architecture corresponding to an explanation text that meets the preset logic requirement. A logic architecture is composed of a plurality of logic keywords; each logic keyword can be the explanation theme of a certain paragraph of the explanation text, and the theme of a paragraph can be determined by the occurrence frequency of each logic keyword, generally by selecting the keyword with the highest frequency. Parallel relations and/or inclusion relations exist among the logic keywords: for example, if two logic keywords B and C are included under logic keyword A, then A and B, as well as A and C, are in an inclusion relation, while B and C are in a parallel relation. Calculating the architecture similarity between the architecture to be checked and the reference architecture amounts to determining whether the architecture to be checked contains erroneous inclusion or parallel relations: for an architecture with many such errors, the logic of the initial explanation text is determined not to meet the preset logic requirement; for an architecture with few or no such errors, the logic is determined to meet the requirement.
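
As an illustration of this similarity check, the sketch below models a logic architecture as parent-to-children mappings over logic keywords and scores a candidate by the fraction of inclusion and parallel relations it shares with the reference; the data, the scoring rule and the 0.8 threshold are illustrative assumptions:

```python
# Hedged sketch of the architecture-similarity check. The similarity is the
# fraction of inclusion (parent-child) and parallel (sibling) relations that
# the candidate shares with the reference architecture.
def relations(arch):
    inclusion = {(p, c) for p, cs in arch.items() for c in cs}
    parallel = {tuple(sorted((a, b)))
                for cs in arch.values()
                for a in cs for b in cs if a != b}
    return inclusion, parallel

def architecture_similarity(candidate, reference):
    c_inc, c_par = relations(candidate)
    r_inc, r_par = relations(reference)
    total = len(r_inc) + len(r_par)
    return (len(c_inc & r_inc) + len(c_par & r_par)) / max(total, 1)

reference = {"A": ["B", "C"]}          # A contains B and C; B and C are parallel
candidate = {"A": ["B"], "B": ["C"]}   # C wrongly nested under B
similarity = architecture_similarity(candidate, reference)
print(similarity > 0.8)                # False -> logic requirement not met
```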

The proficiency level of the explanation text and the standard level of Mandarin may be determined by parsing the evaluation text, which may include: the participants separately rate the proficiency of the explanation and the standard level of Mandarin, and the rating results are stored in a preset data format to form the evaluation text, which may include content such as evaluation levels (for example level A, level B, level C). When there are multiple participants, the evaluation levels of all participants can be averaged, and the average is taken as the proficiency level of the explanation text and the standard level of Mandarin.
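
A tiny sketch of the averaging step follows; the grade-to-score mapping is an assumed convention rather than something specified by the application:

```python
# Hedged sketch of averaging participant evaluation levels.
GRADE_SCORE = {"A": 3, "B": 2, "C": 1}
SCORE_GRADE = {3: "A", 2: "B", 1: "C"}

def average_grade(grades):
    avg = round(sum(GRADE_SCORE[g] for g in grades) / len(grades))
    return SCORE_GRADE[avg]

print(average_grade(["A", "A", "B"]))  # -> "A" (average 2.67 rounds to 3)
```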

The training of the target character determination model by the deep learning network model may include: taking the logic degree corresponding to the initial explanation text, the proficiency degree corresponding to the evaluation text and the standard degree of Mandarin as input data, and taking the judgment result of whether the agent is the target character as output data to construct a training sample and a testing sample; calling an initial neural network model to process the training sample to obtain a target character judgment model; and calling the target character judgment model to process the test sample, calculating the model accuracy, and determining that the training of the target character judgment model is finished when the model accuracy exceeds a preset model accuracy threshold. The preset model accuracy threshold is a preset value, and is not limited herein.
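
The following hedged sketch illustrates this training and validation loop with synthetic data, using scikit-learn's MLPClassifier to stand in for the unspecified initial neural network model:

```python
# Hedged sketch of training and testing the target character determination
# model. Features and labels are synthetic; MLPClassifier is a stand-in.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# columns: logic level, proficiency, Mandarin standard level (scaled to [0, 1])
X = rng.random((200, 3))
y = (X.mean(axis=1) > 0.6).astype(int)        # 1 = agent is a target character

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

ACCURACY_THRESHOLD = 0.9                      # preset model accuracy threshold
accuracy = model.score(X_test, y_test)
print("training finished" if accuracy > ACCURACY_THRESHOLD else "continue training")
```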

Invoking the target character determination model to automatically analyze the initial explanation text and the evaluation text so as to determine whether the agent is a target character may include: acquiring the logic level corresponding to the initial explanation text; acquiring the proficiency level and the Mandarin standard level corresponding to the evaluation text; taking the logic level, the proficiency level and the Mandarin standard level as input data; and calling the target character determination model to process the input data to obtain a determination result of whether the agent is a target character.

In the present application, the proficiency level and the Mandarin standard level are obtained by parsing the evaluation text rather than by training additional models to analyze the explanation behavior, which avoids the large amount of computation that labeling training texts would require during model training.

The cluster analysis module 202 is configured to acquire the initial explanation text set of the target characters and preprocess it to obtain a plurality of clusters, where each cluster includes initial explanation texts that meet a threshold condition.

In at least one embodiment of the present application, the number of determined target characters may be one or more. When there is more than one target character, there is also more than one initial explanation text, and the initial explanation text set is obtained by combining the plurality of initial explanation texts. Preprocessing the initial explanation text set of the target characters may include deleting irrelevant information in each initial explanation text, where the irrelevant information includes stop words (such as "and", "also" and the like), repeated words, punctuation marks and so on. Deleting the irrelevant information reduces its interference and improves the accuracy of the cluster analysis.
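
A minimal preprocessing sketch follows; the stop-word list and the tokenization are simplified assumptions (the source's examples are Chinese function words such as "和" ("and") and "也" ("also")):

```python
# Minimal sketch: remove stop words, punctuation marks and immediately
# repeated words from an initial explanation text.
import re

STOP_WORDS = {"and", "also", "the", "a"}

def preprocess(text):
    tokens = re.findall(r"\w+", text.lower())   # drops punctuation marks
    cleaned, prev = [], None
    for tok in tokens:
        if tok in STOP_WORDS or tok == prev:    # stop word or repeated word
            continue
        cleaned.append(tok)
        prev = tok
    return " ".join(cleaned)

print(preprocess("The product, the product also covers accidents and illness."))
# -> "product covers accidents illness"
```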

In an embodiment, the initial explanation text set includes the initial explanation texts of a plurality of different target characters. Each initial explanation text contains a plurality of different explanation themes; an explanation theme can be understood as one explanation unit of the initial explanation text, and each theme has a corresponding explanation segment. The explanation segments of different target characters for the same explanation theme may differ only slightly. The explanation themes in the initial explanation text set are analyzed by clustering, and the explanation segments of different target characters under themes whose text similarity exceeds a preset similarity threshold (a preset similarity value) are divided into the same cluster. Each cluster therefore contains several explanation segments of different target characters corresponding to the same or similar explanation themes.

Optionally, the acquiring and preprocessing of the initial explanation text set of the target characters to obtain a plurality of clusters includes:

acquiring the explanation theme corresponding to each initial explanation text in the initial explanation text set;

calculating the text similarity between the explanation themes;

and taking the explanation themes whose text similarity exceeds a preset similarity threshold as a clustering center, with the initial explanation texts corresponding to those themes forming the cluster of that center.

One initial explanation text corresponds to a plurality of explanation themes. The explanation themes of one initial explanation text are arranged in vector form to obtain a first explanation theme vector; by acquiring and arranging the explanation themes of all the initial explanation texts in the set in the same way, a second explanation theme vector through an nth explanation theme vector are obtained. The similarity between explanation themes is calculated across the first to the nth explanation theme vectors, and the explanation segments of different target characters whose themes have a similarity exceeding the preset similarity threshold are taken as one cluster, thereby obtaining a plurality of clusters.
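
As one possible realization of this clustering step, the sketch below uses TF-IDF cosine similarity as the text-similarity measure; the themes and the 0.5 threshold are illustrative assumptions:

```python
# Hedged sketch of grouping explanation themes whose similarity exceeds a
# preset threshold. TF-IDF cosine similarity stands in for the unspecified
# text-similarity measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

themes = ["product information", "product info overview",
          "payment management", "premium payment management"]
sim = cosine_similarity(TfidfVectorizer().fit_transform(themes))

THRESHOLD = 0.5
clusters, assigned = [], set()
for i in range(len(themes)):
    if i in assigned:
        continue
    cluster = [j for j in range(len(themes)) if j not in assigned
               and sim[i, j] > THRESHOLD]
    assigned.update(cluster)
    clusters.append([themes[j] for j in cluster])
print(clusters)   # each inner list is one cluster of similar themes
```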

The feature extraction module 203 is configured to extract target features from the initial explanation texts in each cluster to obtain a first explanation text set, and to combine each first explanation text in the first explanation text set according to a preset text sequence to obtain a target explanation text.

In at least one embodiment of the present application, the first explanation text set refers to the set formed by the key texts extracted from each cluster, and the target explanation text refers to the key text formed by combining the first explanation texts corresponding to the plurality of clusters.

Optionally, the extracting target features from the initial explanation text in each cluster to obtain a first explanation text set includes:

acquiring the initial explanation texts in the cluster, and splitting each initial explanation text into a plurality of paragraphs by a sequential segmentation method;

calling a pre-trained feature positioning model to screen out a plurality of target paragraphs that carry the most information;

calling a pre-trained feature extraction model to respectively extract word-level, sentence-level and paragraph-level hierarchical features of the target paragraph to obtain a first explanation text;

and combining the first explanation texts corresponding to each cluster to obtain a first explanation text set.

Because a cluster includes the explanation segments of different target characters under the same or similar explanation themes, feature extraction over the segments in a cluster may produce multiple first explanation texts, and the same or similar features may exist among them. The first explanation texts extracted from each cluster can therefore be deduplicated to remove the same or similar features, finally yielding first explanation texts with non-repeated features.

The initial explanation text can be preprocessed, for example by deleting irregular tokens (such as special symbols and stray punctuation). The feature positioning model is used to locate useful information in a paragraph, where useful information is information with a preset positive effect on the explanation script. When training the feature positioning model, the initial neural network is trained with such information as input vectors and with labels indicating whether the information is useful as output vectors, so that the feature positioning model is obtained.

The feature extraction model can comprise a convolutional neural network and a bidirectional long-short term memory network, together with a sentence-level attention layer, and extracts the features of a target paragraph layer by layer using a hierarchical structure. The sentence-level attention layer first obtains the local features of each sentence through the convolutional neural network; the bidirectional long-short term memory network then relates each sentence to its preceding and following context; a soft attention layer is introduced to calculate the weight of each sentence, and the weighted sum of the sentence-level features forms the feature vector of the paragraph, which can be used as a first explanation text.
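
A hedged PyTorch sketch of such a hierarchical model follows; all dimensions are illustrative assumptions, and the convolution, bidirectional LSTM and soft attention correspond to the layers just described:

```python
# Hedged sketch: a convolution yields local features per sentence, a
# bidirectional LSTM relates each sentence to its context, and a soft
# attention layer weights the sentences into one paragraph feature vector
# (usable as a first explanation text representation).
import torch
import torch.nn as nn

class SentenceAttentionEncoder(nn.Module):
    def __init__(self, emb_dim=64, conv_dim=128, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)            # soft attention scores

    def forward(self, sentences):
        # sentences: (num_sentences, num_words, emb_dim) word embeddings
        x = self.conv(sentences.transpose(1, 2))        # local sentence features
        sent_vecs = x.max(dim=2).values                 # one vector per sentence
        ctx, _ = self.lstm(sent_vecs.unsqueeze(0))      # contextualize sentences
        weights = torch.softmax(self.attn(ctx), dim=1)  # per-sentence weights
        return (weights * ctx).sum(dim=1)               # paragraph feature vector

paragraph = torch.randn(5, 20, 64)  # 5 sentences x 20 words x 64-dim embeddings
print(SentenceAttentionEncoder()(paragraph).shape)      # torch.Size([1, 256])
```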

In the present application, cluster analysis is performed on the initial explanation text sets of the target characters to obtain a plurality of clusters, and the useful features in each cluster are extracted, which ensures both that the useful features are captured and that the target explanation text is comprehensive.

Optionally, the combining each first explanation text in the first explanation text set according to a preset text sequence to obtain a target explanation text includes:

acquiring a target cluster to which the first explanation text belongs and a target explanation theme corresponding to the target cluster;

acquiring a logical relation among the target explanation themes, and determining a theme sequence among the target explanation themes according to the logical relation;

and acquiring the preset text sequence according to the theme sequence, and combining each first explanation text in the first explanation text set according to the preset text sequence to obtain the target explanation text.

A logical relation exists among the target explanation themes and can be determined by traversing the logic keywords in the logic architecture, where parallel relations and/or inclusion relations exist among the keywords: for example, if two logic keywords B and C are included under logic keyword A, then A and B, as well as A and C, are in an inclusion relation, while B and C are in a parallel relation. The theme sequence among the target explanation themes is determined according to these parallel and/or inclusion relations; the theme sequence has a mapping relation with the preset text sequence, and the preset text sequence corresponding to a theme sequence can be obtained by querying that mapping relation.

The text analysis module 204 is configured to analyze the target explanation text to obtain a service flow text.

In at least one embodiment of the present application, the business process framework is the framework used for explaining the business content. Taking the target business as an insurance business as an example, the business process framework may include frameworks such as product information, business team, operation mode and payment management, and each framework may correspond to one or more subframes; for example, the "operation mode" framework further includes the two subframes "online operation" and "offline operation". In an embodiment, the business process framework may be preset by business personnel. In other embodiments, in order to improve the accuracy and efficiency of setting the business process framework, it is set by machine learning instead of manually.

Optionally, when the business process framework is set in a machine learning manner, the determining the business process framework corresponding to the target business includes:

acquiring a service system corresponding to the target service;

determining explanation themes in the business system and business items corresponding to each explanation theme;

and constructing a business relation tree according to the explanation theme and the business items, and taking the business relation tree as a business process framework.

The target business has a corresponding business system, and the business system includes a plurality of explanation themes, for example product information, business team, operation mode and payment management. An explanation theme is the general name of a group of business items; one theme may correspond to one business item or to several, for example the "operation mode" theme may correspond to the two items "online operation" and "offline operation". The business relation tree is constructed with the explanation themes as parent nodes and the business items as child nodes, and the business process framework may take the form of this relation tree.
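
For illustration, the sketch below builds such a business relation tree from a theme-to-items mapping; the insurance themes mirror the example in the text, and the field names are assumptions:

```python
# Minimal sketch of building the business relation tree: explanation themes
# become parent nodes and business items become child nodes.
business_system = {
    "product information": ["coverage", "term"],
    "operation mode": ["online operation", "offline operation"],
    "payment management": ["premium payment"],
}

def build_relation_tree(system, root="insurance business"):
    return {"name": root,
            "children": [{"name": theme,
                          "children": [{"name": item, "children": []}
                                       for item in items]}
                         for theme, items in system.items()]}

tree = build_relation_tree(business_system)  # usable as the process framework
```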

Optionally, after determining the business process framework corresponding to the target business, the method further includes:

determining a parent-child relationship among a plurality of business items in the business process framework;

setting adjustment attributes and constraint conditions among a plurality of service items;

and determining the self-adaptive adjustment relation between the service items according to the adjustment attribute and the constraint condition.

The parent-child relationship may be one-to-one or one-to-many. The constraint condition may be: the child object moves and/or resizes correspondingly when the parent object moves and/or resizes. The adjustment attributes may be: a maximum and minimum width and/or height; and/or adjustable features, including width-adjustable, height-adjustable, or proportionally adjustable. A minimal sketch of such an adjustment is given below.
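
The following sketch assumes hypothetical field names (width, min_width, max_width, adjustable) and a proportional adjustment rule:

```python
# Hedged sketch of adaptive adjustment between business items: a child item
# follows its parent's proportional resize, clamped by its own adjustment
# attributes. All field names and the proportional rule are assumptions.
def adjust_child(child, parent_scale):
    if child["adjustable"] == "proportional":
        new_width = child["width"] * parent_scale
        # constraint: respect the preset minimum and maximum width
        child["width"] = max(child["min_width"],
                             min(child["max_width"], new_width))
    return child

child = {"width": 100, "min_width": 80, "max_width": 150,
         "adjustable": "proportional"}
print(adjust_child(child, 2.0))   # width is clamped to max_width = 150
```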

It can be understood that after the business process framework is constructed in a machine learning manner, the business process framework can be displayed on a front-end page for system personnel to confirm whether the business process framework needs to be adjusted, and when the business process framework needs to be adjusted, the system personnel can adjust the framework.

Optionally, the analyzing the target explanation text to obtain a service flow text includes:

determining candidate subject terms;

acquiring the word frequency of the candidate subject term in the target explanation text and the semantic similarity between the candidate subject term and the text word in the target explanation text;

and determining the correlation between the text and each candidate subject term according to the word frequency and the semantic similarity, and filling the candidate subject terms with the correlation higher than a preset correlation threshold value into the target explanation text to obtain a business process text.

The candidate subject terms can be subject terms that are manually set and have a mapping relation with node identifiers; or subject terms automatically generated by a topic expansion algorithm on the basis of manually set candidates; or subject terms automatically extracted from a corpus by a topic discovery algorithm. For a target explanation text in Chinese, the words it contains can be obtained by Chinese word segmentation, i.e., the process of recombining a continuous character sequence into a word sequence according to certain specifications. Since Chinese word segmentation is a mature prior art, it is not described here again.
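
As an illustration of the scoring described above, the sketch below combines a normalized word frequency with a maximum semantic similarity under an assumed equal weighting; the similarity values and the 0.6 threshold are illustrative:

```python
# Hedged sketch of scoring candidate subject terms: relevance combines the
# term's frequency in the target explanation text with its semantic
# similarity to the text's own words. Weighting and threshold are assumed.
from collections import Counter

def relevant_terms(candidates, text_words, similarity, threshold=0.6):
    freq = Counter(text_words)
    max_freq = max(freq.values())
    selected = []
    for term in candidates:
        tf = freq[term] / max_freq                      # normalized word frequency
        sim = max(similarity.get((term, w), 0.0) for w in text_words)
        if 0.5 * tf + 0.5 * sim > threshold:
            selected.append(term)                       # fill into the text
    return selected

words = ["insurance", "premium", "payment", "premium"]
similarity = {("premium", "payment"): 0.7, ("claim", "insurance"): 0.4}
print(relevant_terms(["premium", "claim"], words, similarity))  # -> ['premium']
```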

In the present application, a deep learning model is called to learn the initial explanation texts of the target characters, and the important sentences in them are collated to obtain the target explanation text, so that the explanation style of the business content is uniform; and since the target explanation text does not need to be edited manually, manual editing cost is saved and service explanation efficiency is improved.

The voice generation module 205 is configured to generate the business process voice corresponding to the business process text, construct a virtual character according to a preset mathematical model, and acquire the facial features of the virtual character and the audio features, output by the virtual character, that correspond to the business process voice.

In at least one embodiment of the present application, the business process voice corresponding to the business process text is generated according to a preset voice requirement, which is set in advance by system personnel. For example, the preset voice requirement may include a timbre requirement (such as a male voice or a female voice) and a language requirement (such as Chinese or English), which is not limited herein. By generating a unified business process voice for the business process text, problems such as unclear voice explanation and non-standard content explanation caused by the low explanation quality of individual agents can be avoided, and the level of business explanation is improved.

Optionally, the generating the business process voice corresponding to the business process text includes:

acquiring a preset mapping table of a text and a voice, wherein the mapping table stores the corresponding relation between characters or character strings and pronunciation phonemes;

identifying characters or character strings corresponding to the business process texts;

traversing the mapping table to retrieve the pronunciation phonemes corresponding to the characters or character strings, and splicing the pronunciation phonemes to obtain the business process voice corresponding to the business process text.

The tone, intonation and intensity of the converted voice data depend on the pronunciation phonemes stored in the text/voice mapping table. The same text data can therefore be converted into voice data in different voices through different text/voice mapping tables.
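
A minimal sketch of this table-driven conversion follows; the phoneme entries are illustrative, and a different mapping table would yield a different voice:

```python
# Hedged sketch: look up each character string in a text/voice mapping
# table and splice the retrieved pronunciation phonemes in order.
MAPPING_TABLE = {
    "业务": ["y", "e4", "w", "u4"],       # "business"
    "流程": ["l", "iu2", "ch", "eng2"],   # "process"
}

def text_to_phonemes(text, table):
    phonemes, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # greedy longest match
            if text[i:j] in table:
                phonemes.extend(table[text[i:j]])
                i = j
                break
        else:
            i += 1                          # skip characters without an entry
    return phonemes

print(text_to_phonemes("业务流程", MAPPING_TABLE))
# ['y', 'e4', 'w', 'u4', 'l', 'iu2', 'ch', 'eng2']
```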

Optionally, the constructing the virtual character according to the preset mathematical model includes:

acquiring a plurality of element characteristics, wherein the element characteristics comprise language elements, behavior elements, image elements and scene elements of a human body;

establishing an element database according to a plurality of element characteristics;

and selecting target element features from the element database, combining the target element features to establish a virtual character model, and calling the VR equipment to restore the virtual character model into a virtual character.

The element features of the human body may be acquired from pre-stored video clips, or collected by acquisition devices within a preset time period. Acquiring the element features, which cover the language, behavior, image and scene of a human body, includes: collecting the average speech rate, average intonation and habitual expressions of the human body when speaking within the preset time period; collecting the facial expressions of the human body, including expressions of happiness, sadness, anger, fear, disgust and surprise; and collecting the habitual actions of the human body, including frowning, propping the forehead, biting the lips, shaking the legs, touching the nose and adjusting glasses. The language, behavior and image elements of the human body are collected by a microphone, a camera device, a scanner and sensors.

In the present application, the virtual character image is thus endowed with a specific personality, language, habitual actions, corresponding scenes and the like. After the collected data is processed by AI technology, it is stored on the device and displayed by the VR equipment.
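
For illustration, the sketch below models the element database as element types mapped to candidate features and assembles a character model from selected target features; the four categories follow the text, while the concrete values and field names are assumptions:

```python
# Hedged sketch of the element database and virtual character assembly.
ELEMENT_DATABASE = {
    "language": ["calm, medium speech rate", "lively, fast speech rate"],
    "behavior": ["habitually adjusts glasses", "habitually nods"],
    "image":    ["formal suit", "business casual"],
    "scene":    ["service counter", "conference room"],
}

def build_character_model(choices):
    # choices maps each element type to an index into the element database
    return {etype: ELEMENT_DATABASE[etype][i] for etype, i in choices.items()}

model = build_character_model({"language": 0, "behavior": 1,
                               "image": 0, "scene": 0})
# The model would then be restored into a virtual character by VR equipment.
print(model)
```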

Optionally, the obtaining the facial feature of the virtual character and the audio feature output by the virtual character and corresponding to the business process speech includes:

determining a phoneme sequence corresponding to the voice, wherein the phoneme sequence comprises phonemes corresponding to all time points;

determining lip key point information and eye key point information corresponding to each phoneme in the phoneme sequence;

searching a pre-established lip shape library and an eye shape library according to the determined lip shape key point information and the eye key point information respectively to obtain a lip shape image and an eye shape image of each phoneme;

and respectively corresponding the searched lip shape image and eye shape image of each phoneme to each time point to obtain a lip shape image sequence and an eye shape image sequence corresponding to the voice.
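
The sketch below illustrates the lookup just described; for brevity the keypoint derivation is collapsed into libraries keyed directly by phoneme, and all entries are illustrative assumptions:

```python
# Hedged sketch of mapping a phoneme sequence to lip and eye image
# sequences by library lookup, keyed per time point.
LIP_LIBRARY = {"a": "lip_open.png", "m": "lip_closed.png"}
EYE_LIBRARY = {"a": "eyes_neutral.png", "m": "eyes_neutral.png"}

def face_sequences(phoneme_sequence):
    # phoneme_sequence: list of (time_point, phoneme) pairs
    lips = [(t, LIP_LIBRARY[p]) for t, p in phoneme_sequence]
    eyes = [(t, EYE_LIBRARY[p]) for t, p in phoneme_sequence]
    return lips, eyes

lips, eyes = face_sequences([(0.0, "m"), (0.1, "a")])
print(lips)   # [(0.0, 'lip_closed.png'), (0.1, 'lip_open.png')]
```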

The present application can thereby effectively avoid inconsistency between the voice output state and the facial state of the virtual character, improve the accuracy of speech synthesis, and further improve the user experience.

The instruction parsing module 206 is configured to parse the voice instruction to obtain the service flow node information when receiving the voice instruction.

In at least one embodiment of the present application, the service flow framework includes a plurality of service flow nodes, and the service flow node information refers to information of a certain node in the service flow framework.

Optionally, the analyzing the voice instruction to obtain the service flow node information includes:

acquiring a voice instruction input by a participant, and performing semantic recognition on the voice instruction to obtain a conversation intention of the participant;

and inquiring a service flow framework according to the session intention to obtain service flow node information.

The voice instruction may include an instruction that specifies a custom starting section. The session intention refers to the intention containing this custom starting section, and it may include indication information of a specific business process node in the business process framework, for example the name or identifier of that node. The business process framework is queried according to the session intention, thereby obtaining the related information of the specific business process node.

The voice determination module 207 is configured to obtain, according to the business process node information, the target business process voice matched with the business process node information.

In at least one embodiment of the present application, the business process framework includes a plurality of business process nodes, and the business process node information refers to information of a certain node in that framework. The framework is traversed according to the business process node information to determine the position of the corresponding node, and the business process voice at that position is acquired as the target business process voice.

Optionally, traversing the business process framework according to the business process node information to obtain a target business process voice matched with the business process node information includes:

traversing the service flow framework according to the service flow node information to obtain a target position of the service flow node in the service flow framework;

and acquiring the business process voice at the target position as the target business process voice matched with the business process node information.

In at least one embodiment of the present application, the controlling the virtual character to output the target business process speech includes:

acquiring a phoneme sequence corresponding to the target business process voice;

determining a lip shape image sequence corresponding to the virtual character according to the phoneme sequence;

and calling the VR equipment to control the virtual character to output the target business process voice.

Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present application. In the preferred embodiment of the present application, the computer device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.

It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 3 is not a limitation of the embodiments of the present application, and may be a bus-type configuration or a star-type configuration, and that the computer device 3 may include more or less hardware or software than those shown, or a different arrangement of components.

In some embodiments, the computer device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 3 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.

It should be noted that the computer device 3 is only an example; other existing or future electronic products that can be adapted to the present application also fall within its protection scope and are incorporated herein by reference.

In some embodiments, the memory 31 stores a computer program which, when executed by the at least one processor 32, implements all or part of the steps of the voice interaction method as described. The memory 31 includes a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), a one-time programmable read-only memory (OTPROM), an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.

Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.

The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

In some embodiments, the at least one processor 32 is the control unit of the computer device 3; it connects the various components of the computer device 3 using various interfaces and lines, and executes the functions and processes the data of the computer device 3 by running or executing the programs or modules stored in the memory 31 and calling the data stored there. For example, when executing the computer program stored in the memory, the at least one processor 32 implements all or part of the steps of the voice interaction method described in the embodiments of the present application, or all or part of the functions of the voice interaction apparatus. The at least one processor 32 may be composed of integrated circuits, for example a single packaged integrated circuit or multiple integrated circuits with the same or different functions packaged together, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips.

In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.

Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.

The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that it may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description; all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other elements, and the singular does not exclude the plural. A plurality of units or means recited in the specification may also be implemented by one unit or means through software or hardware. The terms first, second, and so on are used to denote names and do not imply any particular order.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present application and not for limiting, and although the present application is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.
