Intelligent outbound voice splicing method, device, equipment, medium and program product

Document No.: 1923523. Publication date: 2021-12-03.

Reading note: This technology, "Intelligent outbound voice splicing method, device, equipment, medium and program product", was created by 牛伯宇, 陈永录, 刘浩 and 韩萌 on 2021-07-29. The present disclosure provides an intelligent outbound voice splicing method that can be applied in the technical field of artificial intelligence. The method comprises: obtaining a call log of an intelligent outbound process, a plurality of client voice segments, and a plurality of preset script texts that respectively interact with the plurality of client voice segments; converting the plurality of preset script texts into interactive voice segments, respectively; and splicing the plurality of client voice segments and each interactive voice segment based on the interaction timing recorded in the call log to obtain the complete dialogue voice of the intelligent outbound process. The present disclosure also provides an intelligent outbound voice splicing apparatus, device, storage medium and program product. By converting the preset script texts used in the intelligent outbound process into interactive voice segments and splicing them with the client voice segments into the complete dialogue voice, the disclosure helps improve both the efficiency and the accuracy of quality inspection analysis of the intelligent outbound process.

1. An intelligent outbound voice splicing method, comprising:

obtaining a call log of an intelligent outbound process, a plurality of client voice segments, and a plurality of preset script texts that respectively interact with the plurality of client voice segments;

converting the plurality of preset script texts into interactive voice segments, respectively; and

splicing the plurality of client voice segments and each interactive voice segment based on the interaction timing recorded in the call log to obtain the complete dialogue voice of the intelligent outbound process.

2. The intelligent outbound voice splicing method according to claim 1, wherein the call log includes a first playing order of the preset script texts and a start-stop time of each of the client voice segments, and the splicing of the plurality of client voice segments and each of the interactive voice segments based on the interaction timing recorded in the call log to obtain the complete dialogue voice of the intelligent outbound process comprises:

obtaining a second playing order of the client voice segments according to the start-stop times;

sorting the plurality of client voice segments based on the second playing order; and

inserting, according to the first playing order, one interactive voice segment between every two adjacent sorted client voice segments to form the complete dialogue voice.

3. The intelligent outbound voice splicing method of claim 2, the method further comprising:

after the interactive voice segment is inserted, inserting a silence segment between the interactive voice segment and the client voice segment adjacent to it.

4. The intelligent outbound voice splicing method of claim 3, the method further comprising:

calculating the time interval between every two adjacent client voice segments of the plurality of client voice segments according to the start-stop times; and

after the interactive voice segment is inserted into the time interval between two adjacent client voice segments, inserting the silence segment between the interactive voice segment and the client voice segment adjacent to it.

5. The intelligent outbound voice splicing method according to claim 1, wherein the call log further includes a play record of each preset script text, and the splicing of the plurality of client voice segments and each interactive voice segment based on the interaction timing recorded in the call log to obtain the complete dialogue voice of the intelligent outbound process further comprises:

determining, based on the play record, whether each preset script text was played completely during the intelligent outbound process;

when a preset script text was not played completely, clipping the interactive voice segment corresponding to that preset script text according to its playing progress in the play record to obtain a played voice segment; and

when splicing the plurality of client voice segments with the interactive voice segments, replacing the interactive voice segment corresponding to the incompletely played preset script text with the played voice segment, and splicing the played voice segment with the client voice segment adjacent to it.

6. The intelligent outbound voice splicing method of claim 1, wherein the plurality of preset script texts are respectively converted into interactive voice segments by TTS speech synthesis.

7. The intelligent outbound voice splicing method of claim 1, the method further comprising:

after the complete dialogue voice is formed, performing quality inspection and analysis on the complete dialogue voice.

8. An intelligent outbound voice splicing apparatus, comprising:

an acquisition module configured to obtain a call log of an intelligent outbound process, a plurality of client voice segments, and a plurality of preset script texts that respectively interact with the plurality of client voice segments;

a conversion module configured to convert the plurality of preset script texts into interactive voice segments, respectively; and

a voice splicing module configured to splice the plurality of client voice segments and each interactive voice segment based on the interaction timing recorded in the call log to obtain the complete dialogue voice of the intelligent outbound process.

9. The intelligent outbound voice splicing apparatus according to claim 8, wherein the call log includes a first playing order of the preset script texts and a start-stop time of each of the client voice segments, and the voice splicing module comprises:

a second playing order unit configured to obtain a second playing order of the client voice segments according to the start-stop times;

a sorting unit configured to sort the plurality of client voice segments based on the second playing order; and

a first inserting unit configured to insert, according to the first playing order, one interactive voice segment between every two adjacent sorted client voice segments to form the complete dialogue voice.

10. The intelligent outbound voice splicing apparatus of claim 9, the voice splicing module further comprising:

a second inserting unit configured to insert, after the interactive voice segment is inserted, a silence segment between the interactive voice segment and the client voice segment adjacent to it.

11. The intelligent outbound voice splicing apparatus of claim 10 further comprising:

a calculation module configured to calculate the time interval between every two adjacent client voice segments of the plurality of client voice segments according to the start-stop times; and

a third inserting module configured to insert, after the interactive voice segment is inserted into the time interval between two adjacent client voice segments, the silence segment between the interactive voice segment and the client voice segment adjacent to it.

12. The intelligent outbound voice splicing apparatus according to claim 8, wherein the call log further includes a play record of each of the preset script texts, and the voice splicing module further comprises:

a judging unit configured to determine, based on the play record, whether each preset script text was played completely during the intelligent outbound process; and

a played voice unit configured to, when a preset script text was not played completely, clip the interactive voice segment corresponding to that preset script text according to its playing progress in the play record to obtain a played voice segment, and, when the plurality of client voice segments are spliced with the interactive voice segments, replace the interactive voice segment corresponding to the incompletely played preset script text with the played voice segment and splice the played voice segment with the client voice segment adjacent to it.

13. The intelligent outbound voice splicing apparatus according to claim 8, wherein the conversion module comprises a TTS synthesis unit configured to convert the plurality of preset script texts into interactive voice segments, respectively.

14. The intelligent outbound voice splicing apparatus of claim 8 further comprising:

an analysis module configured to perform quality inspection and analysis on the complete dialogue voice after the complete dialogue voice is formed.

15. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.

16. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.

17. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 7.

Technical Field

The present disclosure relates to the field of artificial intelligence, in particular to the field of intelligent outbound, and more particularly to a method, an apparatus, a device, a medium, and a program product for intelligent outbound voice splicing.

Background

With the rapid expansion of the domestic credit market and the growing competition within it, a large number of financial credit institutions, such as credit card issuers and small-loan providers, have emerged, and problems such as non-performing credit have appeared. In response, banking financial institutions have strengthened the handling and write-off of non-performing credit assets, including the collection of overdue debts, which is generally carried out by telephoning the client. At present, an intelligent outbound platform is often used to call the client and conduct human-machine interaction with the client; quality inspection analysis is then performed on the call voice to identify the client's needs and improve service processing efficiency. However, the call voice currently stored by the intelligent outbound platform is only segmented audio that contains only the client voice segments. The banking financial institution therefore cannot obtain the complete dialogue voice for quality inspection analysis, which results in low efficiency and low accuracy of that analysis.

Disclosure of Invention

In view of the foregoing, the present disclosure provides intelligent outbound voice splicing methods, apparatus, devices, media and program products that form complete conversational voice.

According to a first aspect of the present disclosure, an intelligent outbound voice splicing method is provided, including: obtaining a call log of an intelligent outbound process, a plurality of client voice segments, and a plurality of preset script texts that respectively interact with the plurality of client voice segments; converting the plurality of preset script texts into interactive voice segments, respectively; and splicing the plurality of client voice segments and each interactive voice segment based on the interaction timing recorded in the call log to obtain the complete dialogue voice of the intelligent outbound process.

According to an embodiment of the present disclosure, the call log includes a first playing order of the preset script texts and a start-stop time of each of the client voice segments, and the splicing of the plurality of client voice segments and each of the interactive voice segments based on the interaction timing recorded in the call log to obtain the complete dialogue voice of the intelligent outbound process includes:

obtaining a second playing order of the client voice segments according to the start-stop times;

sorting the plurality of client voice segments based on the second playing order; and

inserting, according to the first playing order, one interactive voice segment between every two adjacent sorted client voice segments to form the complete dialogue voice.

According to an embodiment of the present disclosure, the method further comprises: after the interactive voice segment is inserted, inserting a silence segment between the interactive voice segment and the client voice segment adjacent to it.

According to an embodiment of the present disclosure, the method further comprises: calculating the time interval between every two adjacent client voice segments of the plurality of client voice segments according to the start-stop times; and, after the interactive voice segment is inserted into the time interval between two adjacent client voice segments, inserting the silence segment between the interactive voice segment and the client voice segment adjacent to it.
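The interval calculation can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the function names, the use of seconds, the 8 kHz 16-bit PCM format, and the choice to clamp the silence to zero when the interactive voice segment is longer than the interval are all assumptions made for illustration.

```python
from typing import List, Tuple


def silence_after_insertion(
    client_times: List[Tuple[float, float]],
    interactive_durations: List[float],
) -> List[float]:
    """Return the silence (seconds) left in each interval after insertion.

    client_times holds (start, end) pairs already sorted by start time;
    interactive_durations[i] is the length of the interactive voice segment
    placed between client segment i and client segment i + 1.
    """
    if len(interactive_durations) != len(client_times) - 1:
        raise ValueError("need exactly one interactive segment per interval")
    silences = []
    for (_, prev_end), (next_start, _), duration in zip(
        client_times, client_times[1:], interactive_durations
    ):
        gap = next_start - prev_end  # interval between adjacent client segments
        # Assumption: never emit negative silence if the segment overruns the gap.
        silences.append(max(0.0, gap - duration))
    return silences


def silence_pcm(seconds: float, sample_rate: int = 8000, sample_width: int = 2) -> bytes:
    """Render a silence segment as zeroed PCM bytes (audio format is assumed)."""
    return b"\x00" * (int(seconds * sample_rate) * sample_width)
```

Padding each interval this way keeps the spliced audio aligned with the original call timeline, which matters if the quality inspection later refers back to timestamps in the call log.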

According to an embodiment of the present disclosure, the call log further includes a play record of each preset script text, and the splicing of the plurality of client voice segments and each interactive voice segment based on the interaction timing recorded in the call log to obtain the complete dialogue voice of the intelligent outbound process further includes: determining, based on the play record, whether each preset script text was played completely during the intelligent outbound process; when a preset script text was not played completely, clipping the interactive voice segment corresponding to that preset script text according to its playing progress in the play record to obtain a played voice segment; and, when splicing the plurality of client voice segments with the interactive voice segments, replacing the interactive voice segment corresponding to the incompletely played preset script text with the played voice segment and splicing the played voice segment with the client voice segment adjacent to it.
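As a rough illustration of the clipping step, the sketch below trims a synthesized segment to the portion that was actually played. The disclosure does not state how the playing progress is encoded; here it is assumed to be a fraction between 0 and 1, and the audio is assumed to be raw 16-bit PCM. The function name `clip_played_segment` is hypothetical.

```python
def clip_played_segment(pcm: bytes, played_fraction: float, sample_width: int = 2) -> bytes:
    """Keep only the leading part of an interactive voice segment that was played.

    played_fraction is an assumed encoding of the playing progress:
    1.0 means the segment was played completely, 0.5 means half of it.
    The cut is aligned to whole samples so the result remains valid PCM.
    """
    if not 0.0 <= played_fraction <= 1.0:
        raise ValueError("played_fraction must be within [0, 1]")
    total_samples = len(pcm) // sample_width
    kept_samples = int(total_samples * played_fraction)
    return pcm[: kept_samples * sample_width]
```

The clipped result then takes the place of the full interactive voice segment during splicing, so the reconstructed audio reflects an interruption by the client instead of a script the client never heard to the end.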

According to an embodiment of the present disclosure, the plurality of preset script texts are respectively converted into interactive voice segments by TTS speech synthesis.

According to an embodiment of the present disclosure, the method further comprises: after the complete dialogue voice is formed, performing quality inspection and analysis on the complete dialogue voice.

A second aspect of the present disclosure provides an intelligent outbound voice splicing apparatus, including: an acquisition module configured to obtain a call log of an intelligent outbound process, a plurality of client voice segments, and a plurality of preset script texts that respectively interact with the plurality of client voice segments; a conversion module configured to convert the plurality of preset script texts into interactive voice segments, respectively; and a voice splicing module configured to splice the plurality of client voice segments and each interactive voice segment based on the interaction timing recorded in the call log to obtain the complete dialogue voice of the intelligent outbound process.

According to an embodiment of the present disclosure, the call log includes a first playing order of the preset script texts and a start-stop time of each of the client voice segments, and the voice splicing module includes: a second playing order unit configured to obtain a second playing order of the client voice segments according to the start-stop times; a sorting unit configured to sort the plurality of client voice segments based on the second playing order; and a first inserting unit configured to insert, according to the first playing order, one interactive voice segment between every two adjacent sorted client voice segments to form the complete dialogue voice.

According to an embodiment of the present disclosure, the voice splicing module further includes: a second inserting unit configured to insert, after the interactive voice segment is inserted, a silence segment between the interactive voice segment and the client voice segment adjacent to it.

According to an embodiment of the present disclosure, the apparatus further comprises: a calculation module configured to calculate the time interval between every two adjacent client voice segments of the plurality of client voice segments according to the start-stop times; and a third inserting module configured to insert, after the interactive voice segment is inserted into the time interval between two adjacent client voice segments, the silence segment between the interactive voice segment and the client voice segment adjacent to it.

According to an embodiment of the present disclosure, the call log further includes a play record of each of the preset script texts, and the voice splicing module further includes: a judging unit configured to determine, based on the play record, whether each preset script text was played completely during the intelligent outbound process; and a played voice unit configured to, when a preset script text was not played completely, clip the interactive voice segment corresponding to that preset script text according to its playing progress in the play record to obtain a played voice segment, and, when the plurality of client voice segments are spliced with the interactive voice segments, replace the interactive voice segment corresponding to the incompletely played preset script text with the played voice segment and splice the played voice segment with the client voice segment adjacent to it.

According to an embodiment of the present disclosure, the conversion module includes a TTS synthesis unit configured to convert the plurality of preset script texts into interactive voice segments, respectively.

According to an embodiment of the present disclosure, the apparatus further comprises: an analysis module configured to perform quality inspection and analysis on the complete dialogue voice after the complete dialogue voice is formed.

A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the intelligent outbound voice splicing method described above.

A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-mentioned intelligent outbound voice splicing method.

A fifth aspect of the present disclosure also provides a computer program product comprising a computer program, which when executed by a processor, implements the intelligent outbound voice splicing method described above.

At least one technical solution adopted in the embodiments of the present disclosure can achieve the following beneficial effects:

The preset script texts used in the intelligent outbound process are converted into interactive voice segments, and the interactive voice segments and the client voice segments are spliced to form the complete dialogue voice of the intelligent outbound process, which improves both the efficiency and the accuracy of quality inspection analysis of the intelligent outbound process.

Drawings

The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which proceeds with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates a system architecture diagram of an intelligent outbound voice splicing method according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of an intelligent outbound voice splicing method according to an embodiment of the present disclosure;

FIG. 3 schematically shows a detailed flowchart of step S230 of the intelligent outbound voice splicing method according to an embodiment of the present disclosure;

FIG. 4 schematically illustrates a timeline for an intelligent outbound voice splicing method according to an embodiment of the present disclosure;

FIG. 5 schematically illustrates a flow chart of an intelligent outbound voice splicing method according to an embodiment of the present disclosure;

FIG. 6 schematically illustrates a flow chart of an intelligent outbound voice splicing method according to another embodiment of the present disclosure;

FIG. 7 schematically shows a block diagram of an intelligent outbound voice splicing apparatus according to an embodiment of the present disclosure;

FIG. 8 schematically illustrates a block diagram of the voice splicing module 730 of the intelligent outbound voice splicing apparatus according to an embodiment of the present disclosure;

FIG. 9 schematically illustrates a block diagram of an intelligent outbound voice splicing apparatus according to an embodiment of the present disclosure; and

FIG. 10 schematically illustrates a block diagram of an electronic device adapted to implement an intelligent outbound voice splicing method in accordance with an embodiment of the present disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).

In the technical solution of the present disclosure, the acquisition, storage, use and other processing of the personal information of the users involved all comply with the relevant laws and regulations, the necessary security measures are taken, and public order and good morals are not violated.

An embodiment of the present disclosure provides an intelligent outbound voice splicing method, which relates to the field of artificial intelligence and in particular to the field of intelligent outbound calling. The method comprises: obtaining a call log of an intelligent outbound process, a plurality of client voice segments, and a plurality of preset script texts that respectively interact with the plurality of client voice segments; converting the plurality of preset script texts into interactive voice segments, respectively; and splicing the plurality of client voice segments and each interactive voice segment based on the interaction timing recorded in the call log to obtain the complete dialogue voice of the intelligent outbound process.

The embodiments of the present disclosure build on an intelligent outbound call initiated by an intelligent outbound platform. During the intelligent outbound process, the platform plays interactive voice to the client according to the configured preset script texts, for example by converting each preset script text into a corresponding interactive voice segment through a TTS engine, and records the content of each preset script text it invokes in the call log. After hearing the interactive voice, the client replies with client voice. On receiving the client voice, the platform records the start time and end time of the client's speech in the call log and calls an ASR engine over the MRCP protocol to recognize the client voice and convert it into text. After receiving the text returned by the ASR engine, the platform calls the service flow over the HTTP protocol to request a dialogue state update; on receiving that request, the dialogue management of the service flow calls a semantic analysis module to analyze the text converted from the client voice. Based on the result of the semantic analysis, the platform continues to play to the client the TTS voice corresponding to the configured preset script text. These operations are repeated until the intelligent outbound process is complete.

Fig. 1 schematically shows a system architecture diagram of an intelligent outbound voice splicing method according to an embodiment of the present disclosure.

As shown in fig. 1, the system architecture 100 according to this embodiment may include a terminal device 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

A client may use terminal device 101 to interact with server 103 over network 102. The server 103 initiates a call request to the client through the network 102. Various messaging client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on terminal device 101.

The terminal device 101 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 103 may be a server providing various services, such as an intelligent outbound platform server (for example only) that may store and edit an intelligent outbound procedure, provide ASR service and TTS voice service, and the like during the interaction of the client with the server 103 using the terminal device 101.

It should be noted that the intelligent outbound voice splicing method provided by the embodiment of the present disclosure may be generally executed by the server 103. Accordingly, the intelligent outbound voice splicing apparatus provided by the embodiment of the present disclosure may be generally disposed in the server 103. The intelligent outbound voice splicing method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 103 and can communicate with the terminal device 101 and/or the server 103. Correspondingly, the intelligent outbound voice splicing apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster that is different from the server 103 and can communicate with the terminal device 101 and/or the server 103.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The intelligent outbound voice splicing method of the disclosed embodiment will be described in detail through fig. 2 to 6 based on the system architecture described in fig. 1.

Fig. 2 schematically shows a flow chart of an intelligent outbound voice splicing method according to an embodiment of the present disclosure.

As shown in fig. 2, the intelligent outbound voice splicing method of this embodiment includes operations S210 to S230, and the intelligent outbound voice splicing method may be executed by the server 103.

In operation S210, a call log of the intelligent outbound process, a plurality of client voice segments, and a plurality of preset script texts that respectively interact with the plurality of client voice segments are obtained.

According to an embodiment of the present disclosure, the preset script texts are general-purpose scripts preset in the intelligent outbound platform and used to interact with the client during the intelligent outbound process, for example: "Hello, this is the outbound call platform of the industrial bank. Is it convenient for you to take this call now?", "Regarding the service you have handled, we would like to remind you that your credit card payment for this month is outstanding; please visit the relevant service platform as soon as possible", "Thank you for answering, and we wish you a pleasant life", and so on. The preset script texts comprise an initial script text, an ending script text, and intermediate script texts, where the initial script text and the ending script text serve as the start identifier and the end identifier of the intelligent outbound process, respectively.

According to an embodiment of the present disclosure, each obtained preset script text is the complete, pre-configured text. During the intelligent outbound process, the interactive voice segment converted from a preset script text may be played either completely or only partially. The initial script text and the ending script text are always played completely, because they carry content such as the service introduction for the client and cannot be interrupted by the client.

According to an embodiment of the present disclosure, the call log includes a first playing order of the preset dialog text and a start-stop time of each client voice segment.

According to the embodiment of the disclosure, the call log further includes a play record of each preset dialect text, and the play record is used for judging whether each preset dialect text is completely played in the intelligent outbound process.
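The call log fields described above (first playing order, start-stop times, and play records) can be pictured as a small data model. The following is only a sketch: the class names, field names, use of seconds as the time unit, and the `tolerance` parameter are assumptions made for illustration and do not come from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClientSegment:
    """Start-stop time of one client voice segment (assumed: seconds from call start)."""
    segment_id: str
    bos: float  # begin of speech
    eos: float  # end of speech


@dataclass
class PlayRecord:
    """Play record of one preset dialect text (hypothetical layout)."""
    text_id: str
    full_duration: float    # duration at the preset speech speed
    played_duration: float  # duration actually played to the client

    def fully_played(self, tolerance: float = 0.01) -> bool:
        # Treat the text as completely played if it ran to within `tolerance` seconds.
        return self.played_duration + tolerance >= self.full_duration


@dataclass
class CallLog:
    first_play_order: List[str]  # ids of preset dialect texts, in playing order
    client_segments: List[ClientSegment]
    play_records: List[PlayRecord] = field(default_factory=list)


log = CallLog(
    first_play_order=["greeting", "reminder", "goodbye"],
    client_segments=[ClientSegment("c1", 3.0, 4.5), ClientSegment("c2", 9.0, 10.0)],
    play_records=[
        PlayRecord("greeting", 2.5, 2.5),
        PlayRecord("reminder", 6.0, 3.2),
        PlayRecord("goodbye", 2.0, 2.0),
    ],
)
# Ids of the preset dialect texts that were interrupted by the client.
interrupted = [r.text_id for r in log.play_records if not r.fully_played()]
```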

In operation S220, a plurality of preset dialog texts are respectively converted into interactive voice segments.

According to the embodiment of the disclosure, text-to-speech (TTS) synthesis is used to convert each preset dialect text into an interactive voice segment.
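A minimal sketch of operation S220 follows, with the TTS engine passed in as a callable so that no particular engine is implied. The function `placeholder_tts` is a hypothetical stand-in that only emits silence of an estimated duration; the sample rate, the 16-bit mono PCM format, and the speech speed in characters per second are all assumptions.

```python
from typing import Callable, Dict

SAMPLE_RATE = 8000       # assumed sample rate; audio is assumed to be 16-bit mono PCM
CHARS_PER_SECOND = 5.0   # assumed preset speech speed


def placeholder_tts(text: str) -> bytes:
    """Stand-in for a real TTS engine: returns silence of the estimated duration."""
    n_samples = round(len(text) / CHARS_PER_SECOND * SAMPLE_RATE)
    return b"\x00\x00" * n_samples


def convert_texts(texts: Dict[str, str],
                  synthesize: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Convert each preset dialect text into an interactive voice segment."""
    return {text_id: synthesize(text) for text_id, text in texts.items()}


segments = convert_texts({"greeting": "Hello", "goodbye": "Thank you"}, placeholder_tts)
```

In a real deployment the `synthesize` argument would wrap whichever TTS service the platform uses.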

In operation S230, based on the interaction timing recorded in the call log, the multiple client voice segments and each interactive voice segment are spliced to obtain a complete conversation voice in the intelligent outbound process.
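Once the segments are in their final order, splicing amounts to concatenating audio. The sketch below uses Python's standard `wave` module and assumes that every segment is raw 16-bit mono PCM at the same sample rate; the disclosure does not specify an audio format, so these are illustrative choices.

```python
import io
import wave
from typing import Iterable


def splice_to_wav(segments: Iterable[bytes], sample_rate: int = 8000) -> bytes:
    """Concatenate raw 16-bit mono PCM segments into one in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)          # mono (assumed)
        wav.setsampwidth(2)          # 16-bit samples (assumed)
        wav.setframerate(sample_rate)
        for pcm in segments:
            wav.writeframes(pcm)     # append segments in the given order
    return buffer.getvalue()


# Two dummy segments of 800 and 1600 frames.
wav_bytes = splice_to_wav([b"\x00\x00" * 800, b"\x01\x00" * 1600])
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    total_frames = check.getnframes()
```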

FIG. 3 schematically shows a detailed flowchart of step S230 of the intelligent outbound voice splicing method according to an embodiment of the present disclosure; operation S230 may further include operations S231 through S233 according to an embodiment of the present disclosure.

As shown in fig. 3, according to the embodiment of the present disclosure, in operation S231, a second playing order of all the customer voice segments is obtained according to the start-stop time of each customer voice segment. The start-stop time of each client voice segment is a specific moment, and the second playing sequence of all the client voice segments in the intelligent outbound process can be obtained according to the sequence of the moments.

In operation S232, the plurality of client voice clips are sorted based on the second play order.

Fig. 4 schematically shows a time axis of an intelligent outbound voice splicing method according to an embodiment of the present disclosure.

As shown in fig. 4, for example, the client voice segments of the intelligent outbound process are arranged on a time axis in chronological order. On the time axis, 1 denotes an interactive voice segment and 2 denotes a client voice segment; bos1 and bos2 are the start times of two client voice segments, and eos1 and eos2 are their respective end times. The client voice segments are sorted according to the positions of bos1, bos2, eos1 and eos2 on the time axis.
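Operations S231 and S232 can be sketched as a sort on the start time of each client voice segment. The tuple layout, the sample times, and the overlap check are assumptions added for illustration.

```python
from typing import List, Tuple

# (segment_id, bos, eos), times in seconds; the values used below are illustrative only.
Segment = Tuple[str, float, float]


def second_play_order(segments: List[Segment]) -> List[Segment]:
    """Sort client voice segments by start time, breaking ties by end time."""
    ordered = sorted(segments, key=lambda s: (s[1], s[2]))
    # Assumed sanity check: consecutive client segments should not overlap.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[1] < prev[2]:
            raise ValueError(f"segments {prev[0]} and {cur[0]} overlap")
    return ordered


unsorted = [("client_2", 9.0, 10.0), ("client_1", 3.0, 4.5)]
ordered = second_play_order(unsorted)
```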

In operation S233, an interactive voice segment is inserted between the sequenced client voice segments according to the first play order, respectively, to form a complete conversational voice. By inserting the interactive voice segments between the client voice segments with the determined time lines, complete call voice is formed, and quality control analysis is facilitated.
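Operation S233 can be illustrated by interleaving the two ordered lists. This sketch assumes the simplest case, in which the call opens with the initial dialect text, closes with the end dialect text, and exactly one interactive voice segment falls between each pair of adjacent client voice segments; the disclosure does not restrict the method to this case.

```python
from itertools import zip_longest
from typing import List


def interleave(interactive: List[str], client: List[str]) -> List[str]:
    """Alternate interactive and client segments, starting and ending with the platform."""
    if len(interactive) != len(client) + 1:
        raise ValueError("expected one more interactive segment than client segments")
    timeline: List[str] = []
    for platform_seg, client_seg in zip_longest(interactive, client):
        timeline.append(platform_seg)
        if client_seg is not None:  # the last interactive segment has no client reply
            timeline.append(client_seg)
    return timeline


timeline = interleave(["greeting", "reminder", "goodbye"], ["client_1", "client_2"])
```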

FIG. 5 schematically illustrates a flow chart of an intelligent outbound voice splicing method according to an embodiment of the present disclosure; according to an embodiment of the present disclosure, the intelligent outbound voice splicing method of this embodiment further includes operation S240.

As shown in fig. 5, after the interactive voice segment is inserted, a mute segment is inserted between the interactive voice segment and the client voice segment adjacent thereto in operation S240.

According to the embodiment of the disclosure, according to the start-stop time, calculating the time interval between each two adjacent customer voice segments in the plurality of customer voice segments; and respectively inserting interactive voice segments in the time intervals, and respectively inserting mute segments between the interactive voice segments and the client voice segments adjacent to the interactive voice segments.

Referring to fig. 4, 3 on the time axis denotes a mute segment. The time interval between two client voice segments on the time axis, i.e., the interval between eos1 and bos2, is calculated, and the corresponding interactive voice segment 1 is inserted in that interval. The duration of interactive voice segment 1 is the duration obtained by converting the preset dialect text at the preset speech speed, i.e., the preset playing speed. The remaining part of the time interval is then filled with mute segments inserted between the interactive voice segment and the adjacent client voice segments.

According to the embodiment of the disclosure, the mute segments on either side of an interactive voice segment, between it and each adjacent client voice segment, are equal in duration.

For example, when mute segments are inserted, the mute segments 3 on both sides of an interactive voice segment 1 on the time axis are made equal in duration.
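The mute-segment arithmetic can be sketched as follows: the gap between eos1 and bos2, minus the duration of the interactive voice segment, is split equally on both sides. Clamping a negative remainder to zero and rendering silence as zero-valued 16-bit mono PCM are assumptions of this sketch, as are the sample values.

```python
def silence_padding(eos_prev: float, bos_next: float, interactive_duration: float) -> float:
    """Return the duration of each of the two equal mute segments, in seconds."""
    gap = bos_next - eos_prev
    if gap < 0:
        raise ValueError("client voice segments overlap")
    # Assumed behaviour: if the interactive segment fills the whole gap, add no silence.
    remaining = max(gap - interactive_duration, 0.0)
    return remaining / 2.0


def silence_pcm(duration: float, sample_rate: int = 8000) -> bytes:
    """Render silence as zero-valued 16-bit mono PCM samples."""
    return b"\x00\x00" * int(duration * sample_rate)


# eos1 = 4.5 s, bos2 = 9.0 s, interactive segment lasts 3.5 s.
pad = silence_padding(eos_prev=4.5, bos_next=9.0, interactive_duration=3.5)
mute = silence_pcm(pad)
```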

According to the embodiment of the disclosure, in the actual intelligent outbound process, the intelligent outbound platform waits for a period of time to identify whether the client has finished speaking; after identifying that the client has finished, the platform selects the next preset dialect text and converts it into the corresponding interactive voice segment for playing. To restore the actual intelligent outbound process, a mute segment representing this waiting time is inserted between the interactive voice segment and each adjacent client voice segment. This makes it easier for a quality inspection analysis platform or personnel to distinguish client voice segments from interactive voice segments in the complete dialogue voice, improves quality inspection efficiency, and yields more accurate quality inspection results from the complete dialogue voice.

FIG. 6 schematically illustrates a flow chart of an intelligent outbound voice splicing method according to another embodiment of the present disclosure; according to the embodiment of the present disclosure, the intelligent outbound voice splicing method of the embodiment includes operation S601 to operation S606.

As shown in fig. 6, in operation S601, a call log of an intelligent outbound call process, a plurality of client voice segments, and a plurality of preset speech texts correspondingly interacting with the plurality of client voice segments are obtained.

According to the embodiment of the disclosure, the call log further includes a play record of each preset dialect text, the play record is used for judging whether each preset dialect text is completely played in the intelligent outbound process, and the play progress of each preset dialect text is recorded in the play record.

In operation S602, a plurality of preset dialog texts are respectively converted into interactive voice segments.

In operation S603, it is determined whether the preset dialect text is completely played in the intelligent outbound process based on the play record.

In operation S604, if a preset speech text is not completely played to the client in the intelligent outbound process, that is, interrupted by the client in the actual intelligent outbound process, the interactive voice segment corresponding to the preset speech text is edited according to the playing progress of the preset speech text in the playing record, so as to obtain the interactive voice segment that has been played to the client in the actual intelligent outbound process, that is, the played voice segment.

For example, the complete playing duration of the preset dialect text is compared with its actual playing duration in the play record, where the complete playing duration is the duration converted according to the preset speech speed. Suppose the voice that the intelligent outbound platform should play to the client according to the preset dialect text is "today weather is fine, 36 ℃", but because the client interrupts during the actual outbound process, the platform only plays "today weather". This is the case in which the preset dialect text is not completely played.
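The clipping in operation S604 can be sketched by keeping only the fraction of the synthesized audio that corresponds to the recorded playing progress. The function name, the representation of progress as two durations, and the 16-bit PCM assumption are illustrative and not taken from the disclosure.

```python
def clip_played(pcm: bytes, full_duration: float, played_duration: float,
                sample_width: int = 2) -> bytes:
    """Keep the leading part of an interactive voice segment that was actually played."""
    if full_duration <= 0:
        raise ValueError("full_duration must be positive")
    # Fraction of the segment that reached the client, limited to the range [0, 1].
    ratio = min(max(played_duration / full_duration, 0.0), 1.0)
    total_frames = len(pcm) // sample_width
    keep_frames = int(total_frames * ratio)
    # Cut on a frame boundary so no sample is split in half.
    return pcm[: keep_frames * sample_width]


# A 2-second segment at 8 kHz (16000 frames) interrupted after 0.5 seconds.
full_segment = b"\x00\x00" * 16000
played_segment = clip_played(full_segment, full_duration=2.0, played_duration=0.5)
```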

In operation S605, each incompletely played interactive voice segment is replaced with the corresponding played voice segment. If only the preset dialect texts and the client voice segments obtained from the intelligent outbound process were used, the interaction that actually took place could not be analyzed accurately. By determining which preset dialect texts were not completely played and replacing the corresponding interactive voice segments with the played voice segments during splicing, the conversation content of the actual intelligent outbound process is restored, which improves the accuracy of quality inspection analysis and facilitates client demand analysis and service progress tracking.

In operation S606, based on the interaction timing recorded in the call log, the multiple client voice segments are spliced with each interactive voice segment or the played voice segment to obtain a complete conversation voice in the intelligent outbound process.

And if all the preset speech texts are completely played to the client in the intelligent outbound process, directly splicing a plurality of client speech segments with the interactive speech segments corresponding to all the preset speech texts to obtain the complete conversational speech in the intelligent outbound process.

According to an embodiment of the present disclosure, after a complete dialogue speech is formed, the complete dialogue speech is subjected to quality inspection and analysis. The quality inspection and analysis are carried out on the complete conversation voice, the actual requirements of the customers are obtained, and the service progress is effectively tracked, so that the service processing efficiency is improved. By analyzing the tone and the semantics of the client in the complete conversation voice, whether the business execution in the intelligent outbound process is effective and whether the subsequent business follow-up is needed to be carried out on the client is judged, so that the quality inspection cost is reduced, and the resource waste caused by repeatedly calling the same client is avoided.

Based on the intelligent outbound voice splicing method, the disclosure also provides an intelligent outbound voice splicing device. The apparatus 700 will be described in detail below with reference to fig. 7.

Fig. 7 schematically shows a block diagram of an intelligent outbound voice splicing apparatus according to an embodiment of the present disclosure.

As shown in fig. 7, the intelligent outbound voice splicing apparatus 700 of this embodiment includes an obtaining module 710, a converting module 720 and a voice splicing module 730.

The obtaining module 710 is configured to obtain a call log of the intelligent outbound call process, a plurality of client voice segments, and a plurality of preset speech texts that are respectively and correspondingly interacted with the plurality of client voice segments. In one embodiment, the obtaining module 710 may be configured to perform the operation S210 described above, and store the client speech segment in the device 700 before the intelligent outbound platform converts the client speech segment into text through the ASR engine; storing each preset dialect text in the device 700 before text-to-speech by the intelligent outbound platform; and extracting a call log of the intelligent outbound process.

The conversion module 720 is configured to convert the preset dialog texts into interactive voice segments respectively. In an embodiment, the converting module 720 may be configured to perform the operation S220 described above, which is not described herein again.

The voice splicing module 730 is used for splicing the plurality of client voice segments and each interactive voice segment based on the interaction time sequence recorded by the call log to obtain the complete conversation voice of the intelligent outbound process. In an embodiment, the voice splicing module 730 can be configured to perform the operation S230 described above, which is not described herein again.

Fig. 8 schematically shows a block diagram of the voice concatenation module 730 of the intelligent outbound voice concatenation apparatus according to an embodiment of the present disclosure.

As shown in fig. 8, according to an embodiment of the present disclosure, the voice concatenation module 730 includes: a second playing sequence unit 731, configured to obtain a second playing sequence of the client voice segments according to the start-stop time; a sorting unit 732 for sorting the plurality of client voice clips based on the second playing order; the first inserting unit 733, configured to insert an interactive voice segment between each sequenced client voice segment according to a first playing order, to form a complete conversational voice.

Fig. 9 schematically shows a block diagram of an intelligent outbound voice splicing apparatus according to an embodiment of the present disclosure.

As shown in fig. 9, according to an embodiment of the present disclosure, the voice concatenation module 730 further includes: a second inserting unit 734, configured to insert a mute section between the interactive voice section and the client voice section adjacent to the interactive voice section after inserting the interactive voice section.

According to an embodiment of the present disclosure, the apparatus 700 further comprises: a calculating module 740, configured to calculate a time interval between each two adjacent client voice segments in the plurality of client voice segments according to the start-stop time; and a third inserting module 750, configured to insert the interactive voice segments in the time intervals, and then insert mute segments between the interactive voice segments and the client voice segments adjacent to them.

According to the embodiment of the present disclosure, the call log further includes a play record of each preset dialect text, and the voice splicing module 730 further includes: a judging unit 735, configured to judge whether each preset dialect text is completely played in the intelligent outbound process based on the play record; and a playing voice unit 736, configured to, when a preset dialect text is not completely played, clip the interactive voice segment corresponding to that preset dialect text according to its playing progress in the play record to obtain a played voice segment. When the plurality of client voice segments are spliced with the interactive voice segments, the interactive voice segment corresponding to the incompletely played preset dialect text is replaced by the played voice segment, and the played voice segment is then spliced with the client voice segments adjacent to it.

According to an embodiment of the present disclosure, the conversion module 720 includes a TTS synthesis unit 760 for converting a plurality of preset dialect texts into interactive voice segments, respectively.

According to an embodiment of the present disclosure, the apparatus 700 further comprises: and the analysis module 770 is used for performing quality inspection and analysis on the complete dialogue voice after the complete dialogue voice is formed.

The intelligent outbound voice splicing apparatus provided by the present disclosure, through the functions of the obtaining module 710, the conversion module 720, and the voice splicing module 730, converts the preset dialect texts of the intelligent outbound process into interactive voice segments and splices the interactive voice segments with the client voice segments to form the complete dialogue voice of the intelligent outbound process, which helps improve the efficiency and accuracy of quality inspection analysis of the intelligent outbound process.

According to the embodiment of the present disclosure, any plurality of the obtaining module 710, the converting module 720, the voice splicing module 730, the calculating module 740, the third inserting module 750, and the analyzing module 770 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 710, the converting module 720, the voice splicing module 730, the calculating module 740, the third inserting module 750, and the analyzing module 770 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of three implementations of software, hardware, and firmware, or implemented by a suitable combination of any several of them. Alternatively, at least one of the obtaining module 710, the converting module 720 and the speech splicing module 730, the calculating module 740, the third inserting module 750, the TTS synthesizing module 760, the analyzing module 770 may be at least partially implemented as a computer program module which, when executed, may perform the corresponding functions.

Fig. 10 schematically illustrates a block diagram of an electronic device adapted to implement an intelligent outbound voice splicing method in accordance with an embodiment of the present disclosure.

As shown in fig. 10, an electronic device 1000 according to an embodiment of the present disclosure includes a processor 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. Processor 1001 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 1001 may also include onboard memory for caching purposes. The processor 1001 may include a single processing unit or multiple processing units for performing different actions of a method flow according to embodiments of the present disclosure.

In the RAM 1003, various programs and data necessary for the operation of the electronic device 1000 are stored. The processor 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. The processor 1001 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1002 and/or the RAM 1003. Note that the programs may also be stored in one or more memories other than the ROM 1002 and the RAM 1003. The processor 1001 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.

According to an embodiment of the present disclosure, the electronic device 1000 may also include an input/output (I/O) interface 1005, which is also connected to the bus 1004. The electronic device 1000 may also include one or more of the following components connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is installed into the storage section 1008 as necessary.

The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer readable storage medium carries one or more programs which, when executed, implement the intelligent outbound voice splicing method according to embodiments of the present disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM1002 and/or the RAM1003 described above and/or one or more memories other than the ROM1002 and the RAM 1003.

Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code is used for causing the computer system to realize the intelligent outbound voice splicing method provided by the embodiment of the disclosure.

The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 1001. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of a signal on a network medium, distributed, downloaded and installed via the communication part 1009, and/or installed from the removable medium 1011. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011. The computer program performs the above-described functions defined in the system of the embodiment of the present disclosure when executed by the processor 1001. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming language includes, but is not limited to, programming languages such as Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that various combinations and/or associations of the features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or associations are not expressly recited in the present disclosure. In particular, various combinations and/or associations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.

The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.
