Voice data processing method, device, equipment, storage medium and program product

Document No.: 1923544  Publication date: 2021-12-03

Note: This technology, "Voice data processing method, device, equipment, storage medium and program product", was designed and created by Zhao Weiwei and Jiang Di on 2021-09-17. Its main content is as follows: the application provides a voice data processing method, apparatus, device, storage medium and program product, the method comprising: acquiring collected voice data to be processed; acquiring an updated recognition model, wherein the updated recognition model is obtained by a terminal by updating an initial recognition model according to an individual training data set, the initial recognition model is obtained by a server through training based on public training data, and the individual training data set at least comprises collected voice data; inputting the voice data to be processed into the updated recognition model for recognition to obtain a recognition result; and determining and executing a control instruction corresponding to the recognition result. In this way, a personalized speech recognition service with high recognition accuracy is realized on the premise that the user's voice data never leaves the device and the user's privacy is not leaked.

1. A voice data processing method, applied to a terminal, the method comprising the following steps:

acquiring collected voice data to be processed;

acquiring an updated recognition model, wherein the updated recognition model is obtained by the terminal by updating an initial recognition model according to an individual training data set, the initial recognition model is obtained by a server through training based on public training data, and the individual training data set at least comprises collected voice data;

inputting the voice data to be processed into the updated recognition model for recognition to obtain a recognition result;

and determining a control instruction corresponding to the recognition result, and executing the control instruction.

2. The method of claim 1, wherein obtaining the updated recognition model comprises:

acquiring the initial recognition model sent by the server, and acquiring a training data set, wherein the training data set further comprises text data corresponding to the voice data;

performing transfer learning on the initial recognition model based on the voice data and the text data to obtain a transfer model;

and updating the initial recognition model based on the transfer model to obtain the updated recognition model.

3. The method of claim 2, wherein the obtaining a training data set comprises:

acquiring collected first voice data;

when target reference data corresponding to the first voice data does not exist in a reference data set, acquiring second voice data collected within a preset time length, wherein the reference data set is determined by the server and sent to the terminal;

when target reference data corresponding to the second voice data exists in the reference data set, determining a group of training data based on the first voice data and the second voice data;

and constructing a training data set based on multiple groups of training data obtained through multiple such determinations.

4. The method of claim 3, wherein after the acquiring the collected first voice data, the method further comprises:

recognizing the first voice data based on the initial recognition model to obtain first text data;

determining the matching degree of the first text data and each reference text data, wherein each reference data in the reference data set comprises reference voice data and reference text data;

when reference text data with the matching degree larger than a preset matching degree threshold value does not exist in the reference data set, determining that target reference data corresponding to the first voice data does not exist in the reference data set;

when reference text data with the matching degree larger than a preset matching degree threshold exists in the reference data set, determining that target reference data corresponding to the first voice data exists in the reference data set; the target reference data is reference data comprising target reference text data, and the target reference text data is reference text data with the matching degree larger than a preset matching degree threshold value.

5. The method of claim 4, wherein the determining a group of training data based on the first voice data and the second voice data comprises:

recognizing the second voice data based on the initial recognition model to obtain second text data;

and determining the first voice data, the first text data, the second voice data, and the second text data as a group of training data.

6. The method of claim 2, wherein the performing transfer learning on the initial recognition model based on the voice data and the text data to obtain a transfer model comprises:

acquiring the number of training data included in the training data set;

when the number of the training data reaches a first number threshold, preprocessing the training data set to obtain a target training data set, wherein the target training data set comprises target training data;

and performing transfer learning on the initial recognition model according to the target training data set to obtain at least one transfer model.

7. The method of claim 6, wherein preprocessing the training data set to obtain a target training data set comprises:

acquiring state information of the terminal, wherein the state information comprises an operating state and remaining power, and the operating state comprises an idle state and a working state;

when the operating state is the idle state and the remaining power is greater than a preset power threshold, preprocessing each group of training data in the training data set to obtain target training data corresponding to each group of training data;

and determining a target training data set based on the target training data corresponding to each group of training data.

8. The method of claim 7, wherein the preprocessing a group of training data in the training data set to obtain target training data corresponding to the group of training data comprises:

respectively determining a similarity between each first text data included in the group of training data and second text data included in the group of training data;

determining each first text data with a similarity greater than a preset similarity threshold as target first text data;

and determining each target first text data and the first voice data corresponding to each target first text data as the target training data corresponding to the group of training data.

9. The method of claim 8, wherein the determining a target training data set based on the target training data corresponding to the respective groups of training data comprises:

acquiring the number of target training data according to the target training data corresponding to each group of training data;

when the number of the target training data is larger than a second number threshold, determining training data to be deleted from the target training data according to the acquisition time of each first voice data;

and deleting the training data to be deleted from the target training data, and forming a target training data set from the remaining target training data.

10. The method of claim 6, wherein said performing transfer learning on said initial recognition model according to said target training data set to obtain at least one transfer model comprises:

performing transfer learning on the initial recognition model according to each target training data to obtain a first transfer model;

inputting each first voice data included in each target training data into the first transfer model to obtain a recognition text corresponding to each first voice data;

cleaning the target training data set based on the recognition texts corresponding to the respective first voice data to obtain an updated target training data set;

continuing to perform transfer learning on the initial recognition model according to the updated target training data set until a transfer end condition is reached;

and storing a plurality of transfer models obtained by performing transfer learning a plurality of times.

11. The method according to claim 10, wherein the cleaning the target training data set based on the recognition text corresponding to each first voice data to obtain an updated target training data set comprises:

determining a word error rate of the recognition text corresponding to each first voice data based on the target first text data corresponding to each first voice data;

and deleting, from the target training data, the first voice data and the target first text data whose word error rate is greater than a preset threshold, to obtain updated target training data.

12. The method of claim 10, wherein the updating the initial recognition model based on the transfer model to obtain an updated recognition model comprises:

fusing at least one transfer model and the initial recognition model to obtain the updated recognition model; or,

when an updated initial recognition model is received from the server, fusing the at least one transfer model, the initial recognition model and the updated initial recognition model to obtain the updated recognition model, wherein the updated initial recognition model is obtained by the server by updating the initial recognition model.

13. The method of claim 12, wherein the fusing at least one transfer model and the initial recognition model to obtain the updated recognition model comprises:

updating the reference data set according to the updated target training data set to obtain an updated reference data set;

and fusing the at least one transfer model and the initial recognition model according to the updated reference data set to obtain the updated recognition model.

14. The method of claim 13, wherein updating the reference data set according to the updated target training data set to obtain an updated reference data set comprises:

acquiring the number of reference data included in a reference data set;

selecting a plurality of target training data from all target training data included in the updated target training data set according to the number of the reference data;

and adding the plurality of target training data, as reference data, into the reference data set to obtain the updated reference data set.

15. A voice data processing apparatus, characterized in that the apparatus comprises:

the first acquisition module is used for acquiring the acquired voice data to be processed;

the second acquisition module is used for acquiring an updated recognition model, wherein the updated recognition model is obtained by a terminal by updating an initial recognition model according to an individual training data set, the initial recognition model is obtained by a server through training based on public training data, and the individual training data set at least comprises collected voice data;

the recognition module is used for inputting the voice data to be processed into the updated recognition model for recognition to obtain a recognition result;

and the execution module is used for determining a control instruction corresponding to the recognition result and executing the control instruction.

16. An electronic device, characterized in that the device comprises:

a memory for storing executable instructions;

a processor, configured to implement the voice data processing method of any one of claims 1 to 14 when executing the executable instructions stored in the memory.

17. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the voice data processing method of any one of claims 1 to 14.

18. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the voice data processing method of any one of claims 1 to 14.

Technical Field

The present application relates to the field of artificial intelligence technology, and relates to, but is not limited to, a method, an apparatus, a device, a storage medium, and a program product for processing voice data.

Background

With the development of artificial intelligence, intelligent hardware and related fields, human-machine interaction based on speech recognition has been increasingly accepted by users. Especially in in-vehicle scenarios, a driver can wake up the in-vehicle intelligent interactive system by voice and issue control commands by voice, which is both convenient and safe.

In the related art, the speech recognition service in a vehicle-mounted intelligent interactive system is provided in one of two modes. The first is a cloud speech recognition service: the vehicle-mounted terminal uploads the user's speech to a cloud service provider's server, where speech recognition software translates the speech into text and returns it to the user. Its advantages are strong computing power and continuously updated models, so recognition improves without the user noticing. However, it requires a network connection and cannot be used offline; and because the user's speech must be uploaded, there is a risk of leaking the user's voiceprint information, so the security of private information such as user identity cannot be guaranteed. The second is a speech recognition service privately deployed on the vehicle-mounted terminal, which translates the user's speech into text locally. Its advantages are that no network connection is needed and there is no privacy-leakage problem. However, limited by the speech recognition technology, the terminal's computing and storage capacity, complex instruction logic, complex background sound, user accents and other factors, its recognition capability is weak and its recognition success rate is low; recognition succeeds only with the user's cooperation (speaking standard Mandarin, speaking loudly, and keeping the background quiet), which is inconvenient for users.

Disclosure of Invention

The embodiments of the present application provide a voice data processing method, apparatus, device, computer-readable storage medium and computer program product, which can protect user privacy and security while providing a personalized speech recognition service with high recognition accuracy.

The technical scheme of the embodiment of the application is realized as follows:

the embodiment of the application provides a voice data processing method, which is applied to a terminal and comprises the following steps:

acquiring collected voice data to be processed;

acquiring an updated recognition model, wherein the updated recognition model is obtained by the terminal by updating an initial recognition model according to an individual training data set, the initial recognition model is obtained by a server through training based on public training data, and the individual training data set at least comprises collected voice data;

inputting the voice data to be processed into the updated recognition model for recognition to obtain a recognition result;

and determining a control instruction corresponding to the recognition result, and executing the control instruction.

An embodiment of the present application provides a voice data processing apparatus, the apparatus includes:

the first acquisition module is used for acquiring the acquired voice data to be processed;

the second acquisition module is used for acquiring an updated recognition model, wherein the updated recognition model is obtained by the terminal by updating an initial recognition model according to an individual training data set, the initial recognition model is obtained by a server through training based on public training data, and the individual training data set at least comprises collected voice data;

the recognition module is used for inputting the voice data to be processed into the updated recognition model for recognition to obtain a recognition result;

and the execution module is used for determining a control instruction corresponding to the recognition result and executing the control instruction.

An embodiment of the present application provides a voice data processing apparatus, including:

a memory for storing executable instructions;

and the processor is used for realizing the voice data processing method provided by the embodiment of the application when the processor executes the executable instructions stored in the memory.

The embodiments of the present application provide a computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the voice data processing method provided by the embodiments of the present application.

The embodiment of the present application provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method for processing voice data provided by the embodiment of the present application is implemented.

The embodiment of the application has the following beneficial effects:

In the voice data processing method provided by the embodiments of the present application, the server trains an initial recognition model based on public training data and sends it to the terminal, and the terminal updates the initial recognition model according to an individual training data set to obtain an updated recognition model, wherein the individual training data set at least comprises collected voice data. The terminal acquires collected voice data to be processed, inputs the voice data to be processed into the updated recognition model for recognition to obtain a recognition result, and finally determines and executes the control instruction corresponding to the recognition result. In this way, a personalized speech recognition service with high recognition accuracy is realized on the premise that the user's voice data never leaves the device and the user's privacy is not leaked.

Drawings

FIG. 1 is a schematic diagram of a network architecture of a voice data processing system according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;

FIG. 3 is a schematic flow chart of an implementation of a voice data processing method according to an embodiment of the present application;

FIG. 4 is a schematic flow chart of another implementation of a voice data processing method according to an embodiment of the present application;

FIG. 5 is a flow chart of an implementation of a self-learning method for speech recognition services provided by an embodiment of the present application;

FIG. 6 is a schematic flow chart of an implementation process in which the self-learning engine performs transfer learning according to user data, according to an embodiment of the present application.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

In the following description, references to the terms "first/second/third" are only used to distinguish similar objects and do not denote a particular order. It should be understood that "first/second/third" may be interchanged in a specific order or sequence where permissible, so that the embodiments of the present application described herein can be practiced in an order other than that shown or described herein.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.

Before the embodiments of the present application are described in further detail, the terms and expressions referred to in the embodiments of the present application are explained as follows.

1) Transfer learning: a machine learning method in which a model developed for a task A is used as the starting point and reused in the process of developing a model for a task B.

2) Weakly supervised learning: a branch of machine learning which, in contrast to traditional supervised learning, trains model parameters with limited, noisy, or inaccurately labeled data.

Based on the above explanation of the terms involved in the embodiments of the present application, a voice data processing system provided in the embodiments of the present application is first described. Referring to FIG. 1, FIG. 1 is a schematic network architecture diagram of a voice data processing system provided in the embodiments of the present application. The voice data processing system includes at least one terminal 100, a server 200, and a network 300, where FIG. 1 shows one terminal 100 as an example. The terminal 100 is connected to the server 200 through the network 300, which uses wireless links to implement data transmission.

In some embodiments, the terminal 100 may be, but is not limited to, a smart phone, a vehicle-mounted terminal, a laptop computer, a tablet computer, a desktop computer, a dedicated messaging device, a portable gaming device, a smart speaker, a smart watch, and the like. The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform. The network 300 may be a wide area network or a local area network, or a combination of both. The terminal 100 and the server 200 may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited thereto.

The server 200 is configured to train an initial recognition model based on public training data, where the public training data includes public voice data and text data corresponding to the public voice data. After training is completed, the server 200 sends the trained initial recognition model to the terminal 100.

The terminal 100 is configured to receive the initial recognition model from the server 200; determine an individual training data set from the collected user voice data, and update the initial recognition model according to the individual training data set to obtain an updated recognition model; after acquiring collected voice data to be processed, input the voice data to be processed into the updated recognition model for recognition to obtain a recognition result; and determine and execute the control instruction corresponding to the recognition result. Since the user's voice data never needs to be sent to the server 200, user privacy and security are protected; and since the terminal 100 receives the trained initial recognition model from the server 200 and then performs transfer learning locally on the terminal 100 using the user's voice data to obtain an updated recognition model adapted to the end user, a personalized speech recognition service with high recognition accuracy can be realized.

Referring to FIG. 2, FIG. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application. In practical applications, the electronic device 10 may be implemented as the terminal 100 or the server 200 in FIG. 1; here, the electronic device implementing the voice data processing method of the embodiments of the present application is described by taking the electronic device 10 as the terminal 100 shown in FIG. 1 as an example. The electronic device 10 shown in FIG. 2 includes: at least one processor 110, a memory 150, at least one network interface 120, and a user interface 130. The various components in the electronic device 10 are coupled together by a bus system 140. It will be appreciated that the bus system 140 is used to enable communications among these components. In addition to a data bus, the bus system 140 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 140 in FIG. 2.

The processor 110 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.

The user interface 130 includes one or more output devices 131, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 130 also includes one or more input devices 132 including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

The memory 150 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 150 optionally includes one or more storage devices physically located remotely from processor 110.

The memory 150 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 150 described in embodiments herein is intended to comprise any suitable type of memory.

In some embodiments, memory 150 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.

An operating system 151 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;

a network communication module 152 for reaching other computing devices via one or more (wired or wireless) network interfaces 120, exemplary network interfaces 120 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;

a presentation module 153 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 131 (e.g., display screens, speakers, etc.) associated with the user interface 130;

an input processing module 154 for detecting one or more user inputs or interactions from one of the one or more input devices 132 and translating the detected inputs or interactions.

In some embodiments, the voice data processing apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows the voice data processing apparatus 155 stored in the memory 150, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: a first obtaining module 1551, a second obtaining module 1552, an identifying module 1553 and an executing module 1554, which are logical, and thus can be arbitrarily combined or further separated according to the implemented functions. The functions of the respective modules will be explained below.

In other embodiments, the voice data processing apparatus provided in this embodiment may be implemented in hardware, and for example, the voice data processing apparatus provided in this embodiment may be a processor in the form of a hardware decoding processor, which is programmed to execute the voice data processing method provided in this embodiment, for example, the processor in the form of the hardware decoding processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.

The following describes the voice data processing method provided in the embodiments of the present application. In some embodiments, the method may be implemented by the terminal or the server of the network architecture shown in FIG. 1 alone, or by the terminal and the server cooperatively. Taking implementation by the terminal as an example, referring to FIG. 3, FIG. 3 is a schematic implementation flow diagram of the voice data processing method provided in the embodiments of the present application, which will be described with reference to the steps shown in FIG. 3.

Step S301, acquiring the collected voice data to be processed.

The voice data to be processed is voice data of a user holding the terminal, the terminal comprises a voice acquisition device, when the user wants the terminal to execute a certain operation, the voice corresponding to the operation is spoken to the terminal, and the voice acquisition device acquires the voice to obtain the voice data to be processed. For example, the terminal is a vehicle-mounted terminal, and when a user wants to listen to music, the user speaks "play music" to the vehicle-mounted terminal, and a voice acquisition device of the vehicle-mounted terminal acquires to-be-processed voice data "play music".

Step S302, obtaining the updated recognition model.

Here, obtaining the updated recognition model may mean that the terminal updates the initial recognition model according to the individual training data set to obtain the updated recognition model. The individual training data set comprises collected voice data and text data corresponding to the voice data. The initial recognition model is obtained by the server through training based on public training data and is sent to the terminal.

After receiving the initial recognition model, the terminal collects user voice data and recognizes it with the initial recognition model to obtain a recognition result (i.e., text data). If a control instruction corresponding to the recognition result exists among all control instructions, the initial recognition model has successfully recognized the voice data, and the control instruction is executed. If no control instruction corresponding to the recognition result exists, the initial recognition model has failed to recognize the voice data, and it cannot be determined which control instruction the user wants executed; in this case the user's voice data is collected again and the recognition process is repeated on the newly collected voice data until recognition succeeds. The successfully recognized voice data, the previously unrecognized voice data, and the text data obtained by recognizing them are then stored as a group of training data. When the data is stored, it can first be encrypted to strengthen data security. The above recording is performed multiple times, and the resulting groups of training data form a training data set. Because the training data set is collected from the user holding the terminal, the collected voice data carries the user's individual characteristics, such as usage habits, accent and usage scenarios, so the training data set is the individual training data set of this terminal. On the premise of protecting the user's voice data, it never needs to be sent to another terminal or to the server, which ensures that the user's private information is not leaked.

Step S303, inputting the voice data to be processed into the updated recognition model for recognition to obtain a recognition result.

After the updated recognition model is obtained, the collected voice data to be processed is input into the updated recognition model for speech recognition, and a recognition result is obtained. For example, the collected voice data input by the user is "go to place xx", and the recognition result obtained after recognition is "go to place xx".

Step S304, determining a control instruction corresponding to the recognition result, and executing the control instruction.

The corresponding control instruction is determined and executed according to the recognition result obtained in step S303. Continuing the example above, the control instruction corresponding to the recognition result is determined to be "navigate to destination xx", so a navigation application (App) is opened and a route to destination xx is navigated.
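To make the instruction dispatch concrete, here is a minimal Python sketch of looking up and executing a control instruction for a recognition result; the instruction table and handler actions are illustrative assumptions, not the mapping actually used by the embodiments.

```python
# A minimal sketch of determining and executing the control instruction for a
# recognition result; the instruction table and handlers are hypothetical.
INSTRUCTIONS = {
    "play music": lambda: print("opening music player..."),
    "go to place xx": lambda: print("opening navigation App, routing to destination xx..."),
}

def execute(recognition_result: str) -> bool:
    """Look up and run the control instruction; return False when no
    instruction corresponds to the recognition result."""
    handler = INSTRUCTIONS.get(recognition_result.strip().lower())
    if handler is None:
        return False
    handler()
    return True

execute("go to place xx")  # opens navigation and starts routing
```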

According to the voice data processing method provided by the embodiments of the present application, the terminal acquires collected voice data to be processed; acquires an updated recognition model, wherein the updated recognition model is obtained by the terminal by updating an initial recognition model according to an individual training data set, the initial recognition model is obtained by a server through training based on public training data, and the individual training data set at least comprises collected voice data; inputs the voice data to be processed into the updated recognition model for recognition to obtain a recognition result; and determines and executes the control instruction corresponding to the recognition result. In this way, a personalized speech recognition service with high recognition accuracy is realized on the premise that the user's voice data never leaves the device and the user's privacy is not leaked.

In some embodiments, the step S302 "obtaining the updated recognition model" can be implemented as the following steps:

step S3021, acquiring an initial recognition model sent by the server.

The server obtains public training data, constructs an original recognition model, and trains the original recognition model according to the public training data to obtain a trained initial recognition model; the server then sends the initial recognition model to the terminal so that the terminal can update it using the terminal's private voice data.

Step S3022, a training data set is acquired.

The training data set includes a plurality of sets of training data, each set of training data including speech data and text data corresponding to the speech data. And the text data corresponding to the voice data is obtained by recognition according to the initial recognition model. In the embodiment of the present application, the training data set may be obtained through the following steps:

step S30221, acquiring the collected first voice data.

For a clearer description of the training data, playing music is taken as an example below; the first voice data may be "play xxx music (dialect)".

In step S30222, it is determined whether target reference data corresponding to the first voice data exists in the reference data set.

The reference data set is determined by the server and sent to the terminal. The server sets at least one corresponding reference datum for each control instruction, and all reference data form the reference data set sent to the terminal, where each reference datum comprises reference voice data and reference text data. The first voice data is recognized based on the initial recognition model to obtain first text data, and the matching degree between the first text data and each reference text data is determined. When no reference text data whose matching degree is greater than a preset matching degree threshold exists in the reference data set, it is determined that no target reference data corresponding to the first voice data exists in the reference data set, and the flow proceeds to step S30223. When reference text data whose matching degree is greater than the preset matching degree threshold exists in the reference data set, it is determined that target reference data corresponding to the first voice data exists in the reference data set; the control instruction corresponding to the target reference data is then determined and executed. The target reference data is the reference datum that includes the target reference text data, i.e., the reference text data whose matching degree is greater than the preset matching degree threshold.

Continuing the example, the first voice data "play xxx music (dialect)" is recognized based on the initial recognition model, yielding the first text data "do not play xxx music". The matching degree between "do not play xxx music" and the reference text data of each reference datum in the reference data set is determined. When no reference text data whose matching degree is greater than the preset matching degree threshold exists in the reference data set, it is determined that no target reference data corresponding to "play xxx music (dialect)" exists in the reference data set, i.e., determining a control instruction for "play xxx music (dialect)" has failed, and the flow proceeds to step S30223 to collect data again. When reference text data whose matching degree is greater than the preset matching degree threshold exists in the reference data set, target reference data corresponding to "play xxx music (dialect)" is determined to exist, i.e., a control instruction has been determined successfully; if the control instruction determined from the target reference data is "turn on the music player and start playing xxx music", that control instruction is executed.
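The matching-degree check against the reference data set can be sketched as follows; since the embodiments do not pin down a specific matching-degree metric, the normalized string similarity from Python's difflib is used here as an assumption.

```python
# A sketch of the target-reference-data check; the difflib similarity ratio
# standing in for the matching degree is an assumption, not the patent's metric.
from difflib import SequenceMatcher

def find_target_reference(first_text, reference_set, threshold=0.6):
    """reference_set: iterable of (reference_voice_data, reference_text_data).
    Returns the reference datum with the highest matching degree if it exceeds
    the threshold, else None (triggering re-collection, step S30223)."""
    best, best_degree = None, 0.0
    for ref_voice, ref_text in reference_set:
        degree = SequenceMatcher(None, first_text, ref_text).ratio()
        if degree > best_degree:
            best, best_degree = (ref_voice, ref_text), degree
    return best if best_degree > threshold else None
```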

Step S30223, acquiring second voice data acquired within a preset duration.

Here, the second voice data may correspond to the same control instruction as the first voice data; that is, when no control instruction could be determined from the user's first voice data and the terminal executed nothing, the user issues second voice data intended to trigger the same control instruction, for example "play xxx music (Mandarin)", or repeats "play xxx music (dialect)" at a higher volume, improving the chance of successful recognition by adjusting the content or volume of the utterance.

The second voice data may also correspond to a different control instruction from the first voice data; for example, when the user initially wants to listen to xxx music but it fails to play, the user may switch to listening to the yyy radio station, in which case the first voice data and the second voice data correspond to different control instructions.

It should be noted that the interval between the first collection time of the first voice data and the second collection time of the second voice data is within a preset time length; the preset time length may be set to any value between 30 s (seconds) and 2 min (minutes), or to a longer or shorter value according to the practical application. When the user inputs voice again after the preset time length has elapsed, the newly input voice is treated as first voice data.

In step S30224, it is determined whether target reference data corresponding to the second voice data exists in the reference data set.

Here, whether or not the target reference data corresponding to the second voice data exists in the reference data set is determined in the same manner as the determination of whether or not the target reference data corresponding to the first voice data exists in the reference data set, as described in detail in step S30222 above.

When target reference data corresponding to the second voice data exists in the reference data set, the flow proceeds to step S30225; when it does not, the second voice data is saved as additional first voice data, and the flow returns to step S30223 to collect new second voice data. For example, the first voice data is "play xxx music (dialect)"; the user merely raises the volume, so the collected second voice data is again "play xxx music (dialect)", and neither collection finds target reference data in the reference data set. The stored first voice data then includes both utterances of "play xxx music (dialect)", and collection continues until second voice data for which target reference data exists in the reference data set is collected, at which point collection stops and the flow proceeds to step S30225. Alternatively, if after the N-th piece of voice data is collected no (N+1)-th piece is collected within the preset time length, all N pieces of voice data are deleted and the flow returns to step S30221 to start collection again, where N is a positive integer.

In step S30225, a set of training data is determined based on the first speech data and the second speech data.

When target reference data corresponding to the second voice data exists in the reference data set, the N pieces of first voice data are recognized based on the initial recognition model to obtain N pieces of first text data, the second voice data is recognized based on the initial recognition model to obtain second text data, and the N pieces of first voice data, the N pieces of first text data, the second voice data and the second text data are determined as a group of training data.
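A condensed Python sketch of the collection loop in steps S30221 to S30225 follows; capture_voice(), recognize() and find_target_reference() are stand-ins for the microphone, the initial recognition model and the reference-set check above, and the preset-duration timeout between utterances is omitted for brevity.

```python
# A simplified sketch of steps S30221-S30225 under the assumptions above.
def collect_training_group(capture_voice, recognize, find_target_reference):
    pending = []  # first voice data for which no control instruction was found
    while True:
        voice = capture_voice()             # next user utterance
        text = recognize(voice)             # initial recognition model
        if find_target_reference(text) is None:
            pending.append((voice, text))   # recognition failed: keep as first voice data
            continue
        # recognition succeeded: the N failed utterances plus the successful
        # one form one group of training data
        return {"first": pending, "second": (voice, text)}
```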

Step S30226, a training data set is constructed based on the multiple sets of training data determined multiple times.

Each group of training data determined as above is added to the training data set, and after a certain amount of training data has been accumulated, the initial recognition model is updated.

Step S3023, performing transfer learning on the initial recognition model based on the voice data and the text data to obtain a transfer model.

Multiple rounds of transfer learning are performed on the initial recognition model using the voice data and text data in the training data set, obtaining multiple transfer models.

Step S3024, updating the initial recognition model based on the transfer model to obtain an updated recognition model.

In one implementation, obtaining the updated recognition model may be implemented as fusing the at least one transfer model and the initial recognition model to obtain the updated recognition model. Specifically, all the transfer models obtained in step S3023 may be fused with the initial recognition model, and the new model obtained by the fusion is determined as the updated recognition model; alternatively, a subset of transfer models may first be screened from all the transfer models obtained in step S3023, the screened transfer models fused with the initial recognition model, and the resulting new model determined as the updated recognition model.

In another implementation, if the server has also updated the initial recognition model, that is, the server has updated the previously trained initial recognition model based on new public training data to obtain an updated initial recognition model and has sent it to the terminal, obtaining the updated recognition model may be implemented as fusing the at least one transfer model, the initial recognition model and the updated initial recognition model. Specifically, all the transfer models obtained in step S3023 may be fused with the initial recognition model received from the server and the updated initial recognition model; or a subset of transfer models may first be screened from all the transfer models obtained in step S3023 and fused with the initial recognition model received from the server and the updated initial recognition model to obtain the updated recognition model.

In the embodiments of the present application, the fusion may be unconstrained, such as average fusion, or may be performed under constraint conditions. The following describes the fusion methods by taking as an example the terminal fusing at least one transfer model and the initial recognition model to obtain an updated recognition model.

In one implementation of unconstrained fusion, the weights corresponding to the models may be averaged, fusing the at least one transfer model and the initial recognition model into the updated recognition model.
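Assuming each model is represented as a dict mapping parameter names to arrays of identical shapes, average fusion can be sketched as a parameter-wise mean; this is a minimal illustration, not the embodiments' concrete fusion code.

```python
# A minimal sketch of average fusion over same-architecture models.
import numpy as np

def average_fuse(initial_weights, transfer_weights_list):
    """Parameter-wise mean over the initial model and all transfer models."""
    models = [initial_weights, *transfer_weights_list]
    return {name: np.mean([m[name] for m in models], axis=0)
            for name in initial_weights}

fused = average_fuse({"w": np.ones(2)}, [{"w": np.zeros(2)}])  # {"w": [0.5, 0.5]}
```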

In one implementation of fusion under constraints, the reference data set may be used as the constraint condition, and the at least one transfer model and the initial recognition model are fused based on the reference data set to obtain the updated recognition model, where the updated recognition model is the optimal model that satisfies the constraint condition.
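Under the assumption that fusion again means a parameter-wise weighted average and that the constraint is selecting the weighting with the lowest word error rate on the reference data set, the constrained fusion can be sketched as follows; evaluate_wer() and the candidate weighting grid are hypothetical.

```python
# A sketch of fusion under the reference-set constraint; the weighted-average
# form and the WER-based selection criterion are assumptions for illustration.
import numpy as np

def weighted_fuse(weight_dicts, coeffs):
    """Parameter-wise weighted average of same-architecture models."""
    coeffs = np.asarray(coeffs, dtype=float)
    coeffs = coeffs / coeffs.sum()
    return {name: sum(c * m[name] for c, m in zip(coeffs, weight_dicts))
            for name in weight_dicts[0]}

def constrained_fuse(initial, transfer_models, reference_set, evaluate_wer, grid):
    """Evaluate each candidate weighting on the reference data set and keep
    the fused model that best satisfies the constraint (lowest WER)."""
    models = [initial, *transfer_models]
    candidates = [weighted_fuse(models, coeffs) for coeffs in grid]
    return min(candidates, key=lambda m: evaluate_wer(m, reference_set))
```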

In another implementation of fusion under constraints, the reference data set may first be updated according to the updated target training data set to obtain an updated reference data set; the updated reference data set is then used as the constraint condition, and the at least one transfer model and the initial recognition model are fused according to the updated reference data set to obtain the updated recognition model, which is the optimal model that satisfies the constraint condition.

Wherein, updating the reference data set can be realized as: acquiring the number of reference data included in a reference data set; selecting a plurality of target training data from all target training data included in the updated target training data set according to the number of the reference data; and adding the plurality of target training data serving as reference data into the reference data set to obtain an updated reference data set.

Here, the number of selected target training data is not greater than one half of the number of reference data in the reference data set; that is, at each update, the number of reference data in the updated reference data set is at most 1.5 times the number before the update. The reference data set therefore does not grow sharply, which ensures the accuracy and stability of the verification data used in the fusion.
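The update rule can be sketched in a few lines; which target training data are promoted to reference data is not specified, so taking the first entries is an assumption.

```python
# A sketch of the reference-set update rule: add at most half as many new
# reference data as the set already contains, so the updated set is at most
# 1.5x its previous size.
def update_reference_set(reference_set, updated_target_training_data):
    quota = len(reference_set) // 2
    promoted = updated_target_training_data[:quota]  # selection rule is an assumption
    return reference_set + promoted
```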

In some embodiments, step S3023 "performing transfer learning on the initial recognition model based on the voice data and the text data to obtain a transfer model" may be implemented as the following steps:

In step S30231, the number of training data included in the training data set is acquired.

When there is too little training data, a stable model cannot be obtained. In the embodiments of the present application, the initial recognition model is updated only after the training data in the training data set reaches a certain amount, which ensures the stability of the updated recognition model.

In step S30232, it is determined whether the amount of training data reaches a first amount threshold.

When the number of training data reaches the first number threshold, indicating that there is enough training data, the flow proceeds to step S30233; when the number of training data does not reach the first number threshold, indicating that there is not enough training data, the flow returns to step S30221 to continue collecting new training data.

Step S30233, preprocessing the training data set to obtain a target training data set.

The target training data set includes target training data. The process of preprocessing the training data set is described in step S2331 to step S2335 below.

Step S30234, performing transfer learning on the initial recognition model according to the target training data set to obtain at least one transfer model.

In the embodiments of the present application, the training data set is preprocessed to remove training data that does not meet the requirements, yielding the target training data set. Performing transfer learning on the initial recognition model according to the target training data set produces a transfer model with higher recognition accuracy than transfer learning performed on the raw training data set.

In some embodiments, step S30233 "preprocessing the training data set to obtain a target training data set, where the target training data set includes target training data" may be implemented as the following steps:

In step S2331, the state information of the terminal is acquired.

Here, the state information of the terminal includes an operating state and remaining power, and the operating state includes an idle state and a working state. When the terminal is executing a control instruction, it is determined to be in the working state; when it is not executing a control instruction, for example in standby, it is determined to be in the idle state. In the embodiments of the present application, a control instruction refers to an instruction for executing an operation according to the user's voice control, and does not include system instructions that control the terminal's standby. The remaining power can be understood as the terminal's current remaining electric energy; when the remaining power is low, it may not be enough to complete the update of the initial recognition model. To ensure the update completes, the terminal's remaining power is obtained before updating, and when it is lower than the power required to update the initial recognition model, the update is not performed.

In step S2332, it is determined whether the operating state is an idle state.

When the operating state is the idle state, the flow proceeds to step S2333; when the operating state is the working state, the flow returns to step S2331 to re-acquire the state information of the terminal. In the embodiments of the present application, the initial recognition model is updated when the user is not using the terminal, which on the one hand does not affect normal use of the terminal and on the other hand shortens the update time and improves update efficiency.

In step S2333, it is determined whether the remaining power is greater than a preset power threshold.

When the remaining power is greater than the preset power threshold, it is determined that the terminal can update the model, and the flow proceeds to step S2334; when the remaining power is less than or equal to the preset power threshold, the terminal's current power is low and may not be enough to complete the model update, and the flow returns to step S2331 to re-acquire the state information of the terminal.
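A minimal sketch of this gating, with the state-dict layout and the threshold as illustrative assumptions:

```python
# A minimal sketch of the checks in steps S2331-S2333.
def may_update_model(state: dict, power_threshold: float) -> bool:
    """Allow the model update only when the terminal is idle and has enough
    remaining power to complete the update."""
    return (state["operating_state"] == "idle"
            and state["remaining_power"] > power_threshold)

print(may_update_model({"operating_state": "idle", "remaining_power": 0.8}, 0.5))  # True
```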

Step S2334, preprocessing each group of training data in the training data set to obtain target training data corresponding to each group of training data.

Preprocessing one group of training data, that is, "preprocessing a group of training data in the training data set to obtain target training data corresponding to the group of training data", may be implemented as follows: respectively determining the similarity between each first text data included in the group of training data and the second text data included in the group; determining each first text data whose similarity is greater than a preset similarity threshold as target first text data; and determining each target first text data and the first voice data corresponding to it as the target training data corresponding to the group of training data.

When several pieces of voice data input successively by the user within the preset time period correspond to different control instructions, such a group of training data would distort the recognition result. For example, the first voice data includes "play xxx music (dialect)" and "open yyy radio station (dialect)", the corresponding first text data are "do not play xxx music" and "open yy tower", the second voice data is "open yyy radio station", and the second text data is "open yyy radio station". The similarity between "do not play xxx music" and "open yyy radio station" is determined to be 0, and the similarity between "open yy tower" and "open yyy radio station" is determined to be 0.8. With a preset similarity threshold of 0.6, "open yy tower" (greater than 0.6) is determined as target first text data, so "open yyy radio station (dialect)" and "open yy tower" are determined as a group of target training data, and the last text data in the training data, i.e., the second text data, is used as the labeled text data of this group of target training data.

More generally, when, among the N first text data corresponding to the N first voice data included in a group of training data, the similarity between M first text data and the second text data is greater than the preset similarity threshold, M target first text data are obtained, and the M target first text data together with their M corresponding first voice data are determined as M groups of target training data, where M is a natural number not greater than N. That is, one group of training data can correspond to M groups of target training data, and the labeled text data of these M groups is the same, namely the second text data.
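A sketch of this similarity filter for one group of training data, reusing difflib as the assumed similarity measure and the group layout from the collection sketch above:

```python
# A sketch of step S2334 for one group of training data.
from difflib import SequenceMatcher

def preprocess_group(group, threshold=0.6):
    """Keep the M first voice data whose text is similar enough to the second
    (labeled) text; every kept entry is labeled with the second text data."""
    _, label_text = group["second"]
    return [{"voice": voice, "text": text, "label": label_text}
            for voice, text in group["first"]
            if SequenceMatcher(None, text, label_text).ratio() > threshold]
```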

Step S2335, determining a target training data set based on the target training data corresponding to each group of training data.

All groups of training data in the training data set are cleaned to obtain the target training data corresponding to each group, and these groups of target training data form the target training data set. Too much target training data slows model updating, so in the embodiments of the present application, on the premise of ensuring recognition accuracy, target training data is deleted when its amount is greater than a second number threshold. This may be implemented as follows: acquiring the number of target training data according to the target training data corresponding to each group of training data; when the number of target training data is greater than the second number threshold, determining training data to be deleted from the target training data according to the acquisition time of each first voice data; and deleting the training data to be deleted from the target training data, the remaining target training data forming the target training data set.

When determining the training data to be deleted, the target training data whose first voice data was acquired longest before the current time, that is, the oldest target training data, may be determined as the training data to be deleted.

In some embodiments, step S30234, "performing migration learning on the initial recognition model according to the target training data set to obtain at least one migration model", may be implemented through the following steps:

step S2341, according to each target training data, performing migration learning on the initial recognition model to obtain a first migration model.

A first round of migration learning is performed on the initial recognition model according to each target training data in the target training data set, obtaining the first migration model.

Step S2342, inputting each first speech data included in each target training data to the first migration model, to obtain a recognition text corresponding to each first speech data.

Step S2343, cleaning the target training data set based on the recognition texts corresponding to the first voice data to obtain an updated target training data set.

A word error rate of the recognition text corresponding to each first voice data is determined based on the target first text data corresponding to that first voice data; the first voice data whose word error rate is greater than a preset threshold, together with its target first text data, is deleted from the target training data to obtain updated target training data, and the updated target training data form the updated target training data set.

After the first round of migration is completed, the first migration model is used to predict the voice data included in each target training data; that is, each first voice data included in each target training data is input into the first migration model to obtain the recognition text corresponding to that first voice data. The word error rate of each recognition text is then determined according to the labeled text data: the labeled text data and the recognition text are compared word by word, and the quotient of the number of differing words and the total number of words of the labeled text (or, alternatively, the total number of words of the recognition text) is taken as the word error rate. Whether the word error rate of the recognition text corresponding to each first voice data is greater than the preset threshold is then judged; the first voice data whose word error rate is greater than the preset threshold, together with the text data corresponding to it, is deleted, and the target training data remaining after deletion constitute the updated target training data.
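
A minimal sketch of this word-by-word error rate, assuming whitespace-separated words (the embodiment does not prescribe a tokenization):

```python
def word_error_rate(labeled_text, recognized_text, denominator="label"):
    """Compare the labeled text and the recognition text word by word and
    return (number of differing words) / (total number of words).

    A position present in one text but missing from the other also counts
    as a differing word.
    """
    label_words = labeled_text.split()
    rec_words = recognized_text.split()
    longest = max(len(label_words), len(rec_words))
    diffs = sum(
        1 for i in range(longest)
        if i >= len(label_words) or i >= len(rec_words)
        or label_words[i] != rec_words[i]
    )
    total = len(label_words) if denominator == "label" else len(rec_words)
    return diffs / total if total else 0.0
```

First voice data whose rate exceeds the preset threshold would then be deleted from the target training data, as described above.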

Step S2344, continuing to perform migration learning on the initial recognition model according to the updated target training data set to obtain a second migration model.

Migration learning is performed again on the initial recognition model according to the target training data included in the updated target training data set, obtaining the second migration model.

In step S2345, it is determined whether or not the migration end condition is met.

Here, the migration end condition may be that the target training data included in the updated target training data set no longer changes, or that the cumulative number of rounds of migration learning reaches a preset number threshold. When the migration end condition is reached, the migration learning is stopped and the process proceeds to step S2346; when the migration end condition is not met, the process returns to step S2342 to continue updating the target training data and continue the migration learning. Each round of migration learning yields one migration model; that is, the k-th round of migration learning yields the k-th migration model, where k is a positive integer.

Step S2346, storing the plurality of migration models obtained by performing migration learning a plurality of times.

The k migration models obtained through k rounds of migration learning are stored, and the process then proceeds to step S3024 for model fusion.

Through the steps S2341 to S2346, the initial recognition model is subjected to migration learning according to the target training data set, and a plurality of migration models are obtained for updating the initial model.
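
Putting steps S2341 to S2346 together, the iterative loop might be sketched as follows, reusing the `word_error_rate` sketch above; `fine_tune` and `recognize` are hypothetical stand-ins for the actual migration-learning and decoding routines, which the embodiment does not specify:

```python
def migration_learning_loop(initial_model, targets, wer_threshold,
                            max_rounds, fine_tune, recognize):
    """Iteratively migrate the initial model, cleaning the target training
    data set after each round, and return all stored migration models.

    targets: list of (first_voice_data, labeled_text) pairs
    fine_tune(model, targets) -> migration model   (hypothetical)
    recognize(model, voice) -> recognition text    (hypothetical)
    """
    models = []
    for _ in range(max_rounds):  # end condition 2: round count cap
        model = fine_tune(initial_model, targets)
        models.append(model)
        # Clean: drop pairs whose recognition text has a high word error rate.
        kept = [(v, t) for v, t in targets
                if word_error_rate(t, recognize(model, v)) <= wer_threshold]
        if len(kept) == len(targets):  # end condition 1: set unchanged
            break
        targets = kept
    return models
```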

Based on the foregoing embodiments, an embodiment of the present application further provides a voice data processing method. Fig. 4 is a schematic flowchart of another implementation of the voice data processing method provided in the embodiment of the present application, applied to the network architecture shown in Fig. 1. As shown in Fig. 4, the voice data processing method includes the following steps:

step S401, the server side obtains public training data and an original recognition model.

The public training data is public speech data.

And S402, training the original recognition model by the server based on the public training data to obtain an initial recognition model.

And step S403, the server side sends the initial recognition model to the terminal.

The server side obtains the public training data, builds the original recognition model, and trains the original recognition model according to the public training data to obtain the trained initial recognition model; the initial recognition model is then sent to the terminal, so that the terminal can update it using the terminal's own private voice data.

Step S404, the terminal acquires a training data set.

The acquisition of the training data set by the terminal may be implemented as: acquiring collected first voice data; when target reference data corresponding to the first voice data do not exist in the reference data set, second voice data collected within a preset time length are obtained, and the reference data set is determined by the server and sent to the terminal; when target reference data corresponding to the second voice data exists in the reference data set, determining a set of training data based on the first voice data and the second voice data; and constructing a training data set based on the multiple groups of training data obtained by multiple determinations.

In the embodiment of the application, in the process of acquiring the training data by the terminal, whether the current training data set meets the training condition is judged, and when the training data in the training data set reaches a certain amount, the initial recognition model is updated, so that the stability of the updated initial recognition model can be ensured.

In some embodiments, determining whether the current training data set satisfies the training condition may be implemented as: acquiring the number of training data included in a training data set; judging whether the quantity of the training data reaches a first quantity threshold value or not; when the number of training data reaches the first number threshold, determining that a training condition is satisfied, and then entering step S405; when the number of training data does not reach the first number threshold, it is determined that the training condition is not satisfied, and the step S404 is continuously performed to obtain more training data.

Step S405, the terminal preprocesses the training data set to obtain a target training data set.

The target training data set includes target training data. Acquiring the target training data set may be implemented as: acquiring state information of a terminal, wherein the state information comprises an operating state and residual electric energy, and the operating state comprises an idle state and a working state; when the running state is an idle state and the residual electric energy is greater than a preset electric energy threshold value, preprocessing each group of training data in the training data set to obtain target training data corresponding to each group of training data; and determining a target training data set based on the target training data corresponding to each group of training data.

Preprocessing one set of training data in the training data set to obtain the target training data corresponding to that set may be implemented as follows: respectively determining the similarity between each first text data included in the set of training data and the second text data included in the same set; determining each first text data whose similarity is greater than a preset similarity threshold as target first text data; and determining each target first text data and the first voice data corresponding to it as the target training data corresponding to the set of training data.

Determining the target training data set based on the target training data corresponding to each set of training data may be implemented as: acquiring the number of target training data according to the target training data corresponding to each set of training data; when the number of target training data is greater than a second number threshold, determining the training data to be deleted from the target training data according to the acquisition time of each first voice data; and deleting the training data to be deleted from the target training data, with the remaining target training data forming the target training data set.

Step S406, the terminal performs transfer learning on the initial recognition model according to the target training data set to obtain at least one transfer model.

Performing multiple rounds of migration learning on the initial recognition model to obtain multiple migration models, which can be realized as follows: performing migration learning on the initial recognition model according to each target training data to obtain a first migration model; inputting each first voice data included in each target training data into a first migration model to obtain a recognition text corresponding to each first voice data; cleaning the target training data set based on the recognition texts corresponding to the first voice data to obtain an updated target training data set; according to the updated target training data set, continuously performing migration learning on the initial recognition model until a migration finishing condition is reached; and storing a plurality of migration models obtained by performing migration learning for a plurality of times.

After each round of migration learning, the target training data set is cleaned, and the target training data with high recognition error rate are removed, which can be specifically realized as follows: determining a word error rate of the recognition text corresponding to each first voice data based on the target first text data corresponding to each first voice data; and deleting the first voice data and the target first text data with the word error rate larger than a preset threshold value from the target training data to obtain updated target training data.

Step S407, the terminal updates the initial recognition model based on the migration model to obtain an updated recognition model.

When the server does not update the initial recognition model, this step may be implemented as: fusing the at least one migration model and the initial recognition model to obtain the updated recognition model. When the server also updates the initial recognition model using updated public training data, the terminal receives the updated initial recognition model from the server, and this step may be implemented as: fusing the at least one migration model, the initial recognition model and the updated initial recognition model to obtain the updated recognition model, where the updated initial recognition model is obtained by the server updating the initial recognition model.

In the embodiment of the present application, a fusion manner will be described by taking an example in which a terminal fuses at least one migration model and an initial recognition model to obtain an updated recognition model.

In one implementation, the at least one migration model and the initial recognition model may be fused by taking the average of the models' corresponding weights, obtaining the updated model.
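
As an illustration of this weight-averaging fusion, the sketch below averages the parameters of models with identical architectures; representing model weights as a dict of numpy arrays is an assumption made for the example:

```python
import numpy as np

def fuse_by_averaging(models):
    """Fuse models with identical architectures by averaging their weights.

    models: list of dicts mapping parameter name -> numpy array,
            e.g. the initial recognition model plus each migration model.
    Returns the weight dict of the updated recognition model.
    """
    return {
        name: np.mean([m[name] for m in models], axis=0)
        for name in models[0]
    }
```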

In another implementation, the reference data set may be used as a constraint condition to fuse the at least one migration model and the initial recognition model to obtain an updated model.

In yet another implementation, the reference data set may first be updated according to the updated target training data set to obtain an updated reference data set, and the at least one migration model and the initial recognition model are then fused according to the updated reference data set to obtain the updated recognition model. Updating the reference data set may be implemented as: acquiring the number of reference data included in the reference data set; selecting a plurality of target training data from all target training data included in the updated target training data set according to the number of reference data; and adding the plurality of target training data to the reference data set as reference data to obtain the updated reference data set.
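
A minimal sketch of this reference-set refresh; capping the number of added samples at half the existing set size mirrors the constraint stated later for step S605 and is otherwise an assumption:

```python
def update_reference_set(reference_set, updated_targets):
    """Add selected target training data to the reference data set as new
    reference data, capping the additions at half the current set size."""
    cap = len(reference_set) // 2
    selected = updated_targets[:cap]  # one simple selection strategy
    return reference_set + selected
```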

Step S408, the terminal acquires the collected voice data to be processed.

And step S409, the terminal inputs the voice data to be processed into the updated recognition model for recognition to obtain a recognition result.

And step S410, the terminal determines a control instruction corresponding to the identification result and executes the control instruction.

After the updated recognition model is obtained, the collected voice data to be processed is input into the updated recognition model for voice recognition to obtain the recognition result, and the corresponding control instruction is determined and executed according to the recognition result. A personalized voice recognition service with high recognition accuracy is thus realized on the premise that the user's voice data never leaves the local device and the user's privacy is not leaked.

Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.

With the development of fields such as artificial intelligence and intelligent hardware, human-computer interaction based on voice recognition is increasingly accepted by users. Especially in the vehicle-mounted scenario, the driver wakes up the vehicle-mounted intelligent interactive system by voice and can issue control commands by voice, which is both convenient and safe.

A voiceprint, like a fingerprint, is a physiological feature unique to each person, and voiceprint technology can accurately locate a specific individual. Current voiceprint technology can generate a high-quality voiceprint from a short recording of arbitrary text content (typically around 10 s, such as speech spoken to a speech recognition service), which may cause the user's identity to be compromised. It can be seen that while enjoying the convenience brought by the technology, the user also bears the risk of privacy disclosure. Speech recognition service providers therefore make their best efforts to protect the security of users' voices, including but not limited to deleting voice data in time and storing it encrypted.

Speech recognition services in the related art are provided in two ways. One is the cloud speech recognition service: the user-side device uploads the user's speech over the network to the cloud provider's server, where speech recognition software translates the speech into text and returns it to the user. Its advantages are strong computing capability and continuously updated models, so that the effect improves without the user noticing and the functionality is powerful; its disadvantages are that networking is required (the service is unavailable when the network is down) and that the user's speech, or specially processed information derived from it, must be uploaded, carrying the potential risk of leaking the user's voiceprint information and thus the user's privacy. The other is the privately deployed end-side speech recognition service, which translates the user's speech into text on the device itself. Its advantages are that no networking is needed and the service is completed on the user equipment, so there is no privacy leakage problem; however, a model deployed on the end side is limited by the end-side hardware capability, its functionality is limited, it is generally pruned on the condition of not affecting basic indicators, its recognition capability is weak, recognition requires the user's cooperation (including speaking standard Mandarin, speaking loudly and keeping a quiet background), its generalization capability is affected to a certain extent, and its effect varies with the user and the usage scenario.

Cloud speech recognition and locally privatized speech recognition may also be deployed on the end side simultaneously, using the cloud service in a normal network environment and the local private service under a weak or broken network. Although the user's end-side device then carries a fully runnable privatized speech recognition model, it is still constrained by the speech recognition technology and by the end-side computing and storage capability, and factors such as complex instruction logic, complex background sound and the user's voice affect the experience to a certain extent. Moreover, the user's voice data still needs to be sent to the cloud provider's server, where the model is trained and from which the end-side model is updated periodically, so privacy leakage may still occur.

Based on this, from the viewpoints of privacy security and improving the user-side voice recognition experience, the embodiment of the application provides a method for performing self-learning of a privatized voice recognition service on the user side.

Fig. 5 is a flowchart illustrating an implementation of a self-learning method for speech recognition service according to an embodiment of the present application, and as shown in fig. 5, the method includes the following steps:

Step S501, starting the program.

The user installs the voice recognition service on the end side (a vehicle, mobile phone, computer, smart speaker, etc., corresponding to the terminal above). The installed contents include but are not limited to: a speech recognition service model (corresponding to the initial recognition model above), a benchmark effect verification data set (comprising speech and the corresponding text, where the text corresponds to the reference data set above), and a self-learning engine. The benchmark effect verification data set covers all instructions (corresponding to the control instructions above) supported by the speech recognition service model.

Step S502, receiving the voice input by the user and translating the voice into text.

Once the user starts using the service, the self-learning engine records the usage process: it receives the voice input by the user, and the voice is recognized by the speech recognition service model and translated into text.

Step S503, judging whether to trigger the weak supervision scene.

If it is detected that the user uses the end-side voice recognition service multiple times in succession and finally hits an instruction related to the service, the weak supervision scenario is considered to be triggered, and the process proceeds to step S504. If it is detected that the user uses the end-side voice recognition service and hits an instruction related to the service at the first attempt, the weak supervision scenario is considered not to be triggered, the voice data does not need to be cached, and the process proceeds to step S507.

Here, whether a voice hits an instruction related to the service is determined based on the benchmark effect verification data set: when the text of the voice matches the text of corresponding verification data in the benchmark effect verification data set, it is determined that the voice hits the instruction related to the service, and the end side controls the application program corresponding to the instruction on the end side to execute the instruction.
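
A minimal sketch of this hit test, assuming the benchmark effect verification data set is available as a simple mapping from instruction text to a handler; an exact text match is used for simplicity, though a fuzzy matching degree as described earlier could equally be used:

```python
def hits_instruction(decoded_text, verification_set):
    """Return True if the decoded text matches the text of some verification
    data; in that case the end side also executes the instruction.

    verification_set: dict mapping instruction text -> handler callable
    """
    handler = verification_set.get(decoded_text)
    if handler is None:
        return False
    handler()  # the end side controls the corresponding application program
    return True
```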

Step S504, the voice and the text which meet the requirements are cached.

The consecutive voice recordings (corresponding to the first voice data and the second voice data) of the user detected in step S503 as using the end-side voice recognition service multiple times in succession, together with the texts decoded on the end side (corresponding to the first text data and the second text data), are recorded in an encrypted manner and form one original training data unit (corresponding to one set of training data above).

In this way, the user's voice and the corresponding text, which can be used for migration learning, are recorded during the user's use of the service, generating weakly supervised training data.

Step S505, determining whether to trigger self-learning.

It is judged whether the number of recorded original training data units reaches a certain number; if so, the process proceeds to step S506; otherwise, the process proceeds to step S507.

Step S506, the self-learning engine is used for carrying out transfer learning on all the current original training data units.

Step S507, determining whether to end the self-learning.

When the condition for ending the self-learning is satisfied, the process proceeds to step S508; otherwise, the process returns to step S502 to continue serving the user.

And step S508, ending.

Fig. 6 is a schematic diagram of the implementation process of the migration learning performed by the self-learning engine according to the user data in the embodiment of the present application; this self-learning process is one implementation of step S506. As shown in Fig. 6, the self-learning process includes the following steps:

step S601, waiting for a self-learning trigger signal.

Step S602, judging whether a signal exists and the self-learning requirement is met.

The engine waits for a request signal that triggers the self-learning module to start. If a request signal arrives while the self-learning module is already in the working state, the current request is ignored and the process returns to step S601. If a request signal is received and the self-learning module is in the idle state, the self-learning module is set to the working state and the end-side environment is detected: if the device is idle and power is abundant (for example, during a night charging period), the process proceeds to step S603; if the device is busy or power is insufficient, the self-learning engine sleeps for a period of time and then returns to step S601.
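
The gating logic of step S602 might be sketched as follows; `device_idle` and `power_abundant` are hypothetical stand-ins for the end-side environment checks:

```python
import time

def handle_trigger(state, device_idle, power_abundant, sleep_seconds=600):
    """Decide what to do with one self-learning request signal.

    state: mutable dict with key "busy" (True while self-learning runs)
    device_idle, power_abundant: zero-argument callables (hypothetical)
    Returns "start" to proceed to step S603, or "wait" to return to S601.
    """
    if state["busy"]:
        return "wait"               # module already working: ignore request
    state["busy"] = True            # set the module to the working state
    if device_idle() and power_abundant():
        return "start"
    state["busy"] = False           # device busy or power low:
    time.sleep(sleep_seconds)       # sleep for a while, then retry
    return "wait"
```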

Step S603, cleaning the original training data unit set to generate a training set.

In one implementation, cleansing the set of raw training data units may be implemented as:

step S6031, sequentially fetching a most recent original training data unit (including a plurality of audio records and corresponding decoded texts) in time order.

Step S6032, calculating the similarity (an edit distance may be used) between the text of each sentence in the current original training data unit and the text of the last sentence, and keeping the speech and text whose similarity is greater than the threshold; the labeled text of these sentences is the decoded text of the last sentence, and they are stored into the formal training set.

Step S6033, repeating steps S6031 and S6032 to generate the formal training set; if the formal training set is too large, the older voices may be deleted as appropriate, avoiding an excessively large training set.
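
For the similarity in step S6032, one concrete choice is a similarity normalized from the edit distance; a minimal sketch (the normalization is an assumption):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def text_similarity(a, b):
    """Map the edit distance into [0, 1]; 1.0 means identical texts."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest
```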

Step S604, performing migration learning on the end-side formal speech recognition model using the formal training set, generating a plurality of migration models.

The model is fine-tuned; each round of learning generates one migrated model, and finally a model set is obtained. The specific implementation process is as follows:

and step S6041, performing migration learning on the formal model once by using the current formal model and the formal training data set, iterating for multiple rounds, and storing the current migration model after completion.

Step S6042, the speech of the formal training set is predicted by using the current migration model, and the speech with high word error rate is removed from the formal training set.

Step S6043, if the size of the formal training set no longer changes or the maximum number of learning rounds is reached, proceeding to step S605; otherwise, returning to step S6041.

By pruning the formal training set in this way, the training data included in the formal training set is updated, so that migration learning can be performed multiple times and multiple migration models are generated.

Step S605, taking part of the data from the formal training set and combining it with the preset verification set to form the current verification set.

Here, the amount of data taken does not exceed half of the preset verification set.

Step S606, the current formal model, the plurality of migration models and the model updated from the server are fused into an optimal model under the constraint of the current verification set.

The migration models generated by the rounds of migration learning, the current formal model and, possibly, a model updated from the server are collected into a model set, and a model fusion method (including but not limited to an evolutionary algorithm) is used to fuse them into an optimal model according to the current verification set generated in step S605, which then replaces the current formal model.
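
As one simple instance of fusion under the verification-set constraint (not necessarily the evolutionary algorithm mentioned above), the sketch below tries random convex combinations of the models' weights and keeps the best-scoring one; `evaluate` is a hypothetical routine returning accuracy on the current verification set:

```python
import numpy as np

def fuse_under_constraint(models, evaluate, trials=50, seed=0):
    """Search random convex combinations of model weights and return the
    combination scoring highest on the current verification set.

    models: list of weight dicts (parameter name -> numpy array)
    evaluate(weights) -> accuracy on the current verification set
    """
    rng = np.random.default_rng(seed)
    best_weights, best_score = None, -1.0
    for _ in range(trials):
        coeffs = rng.dirichlet(np.ones(len(models)))  # random convex weights
        candidate = {
            name: sum(c * m[name] for c, m in zip(coeffs, models))
            for name in models[0]
        }
        score = evaluate(candidate)
        if score > best_score:
            best_weights, best_score = candidate, score
    return best_weights
```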

Step S607, replacing the currently working formal model with the optimal model.

In this way, the self-learning engine performs migration learning according to the user data and generates a new model that replaces the working formal model.

In step S608, it is determined whether the process is finished.

If the current self-learning process is finished, the process is determined to end and the state of the self-learning module is set back to idle; otherwise, the process returns to step S601 to continue training.

The embodiment of the application provides a method for enabling the end-side voice recognition model to perform migration learning from the user's voice, which can improve the effect of the end-side voice recognition service in recognizing the current user's voice while avoiding leakage of the user's audio data. The service is deployed on the end side and self-learns there according to the user's usage habits, so the user's voice data is not leaked and the usage experience is improved.

Continuing with the exemplary structure of the voice data processing apparatus provided by the embodiment of the present application implemented as a software module, in some embodiments, as shown in fig. 2, the voice data processing apparatus 155 stored in the memory 150 is applied to a terminal, and the software module in the voice data processing apparatus 155 may include:

the first acquisition module 1551 is used for acquiring the acquired voice data to be processed;

a second obtaining module 1552, configured to obtain an updated recognition model, where the updated recognition model is obtained by updating, by the terminal, an initial recognition model according to a personalized training data set, where the initial recognition model is obtained by a server through training based on public training data, and the personalized training data set at least includes collected voice data;

the recognition module 1553 is used for inputting the voice data to be processed into the updated recognition model for recognition to obtain a recognition result;

an executing module 1554, configured to determine the control instruction corresponding to the identification result, and execute the control instruction.

In some embodiments, the second obtaining module 1552 comprises:

a first obtaining unit, configured to obtain the initial recognition model sent by the server;

a second obtaining unit, configured to obtain a training data set, where the training data set further includes text data corresponding to the voice data;

the transfer learning unit is used for carrying out transfer learning on the initial recognition model based on the voice data and the text data to obtain a transfer model;

and the updating unit is used for updating the initial recognition model based on the migration model to obtain an updated recognition model.

In some embodiments, the second obtaining unit is further configured to: acquiring collected first voice data; when target reference data corresponding to the first voice data do not exist in a reference data set, second voice data collected within a preset time length are obtained, and the reference data set is determined by the server and sent to the terminal; determining a set of training data based on the first speech data and the second speech data when there is target reference data corresponding to the second speech data in the reference data set; and constructing a training data set based on the multiple groups of training data obtained by multiple determinations.

In some embodiments, the second obtaining unit is further configured to: recognizing the first voice data based on the initial recognition model to obtain first text data; determining the matching degree of the first text data and each reference text data, wherein each reference data in the reference data set comprises reference voice data and reference text data; when reference text data with the matching degree larger than a preset matching degree threshold value does not exist in the reference data set, determining that target reference data corresponding to the first voice data does not exist in the reference data set; when reference text data with the matching degree larger than a preset matching degree threshold exists in the reference data set, determining that target reference data corresponding to the first voice data exists in the reference data set; the target reference data is reference data comprising target reference text data, and the target reference text data is reference text data with the matching degree larger than a preset matching degree threshold value.

In some embodiments, the second obtaining unit is further configured to: recognizing the second voice data based on the initial recognition model to obtain second text data; determining the first speech data, the first text data, the second speech data, and the second text data as a set of training data.

In some embodiments, the migration learning unit is further configured to: acquiring the number of training data included in the training data set; when the number of the training data reaches a first number threshold, preprocessing the training data set to obtain a target training data set, wherein the target training data set comprises target training data; and performing transfer learning on the initial recognition model according to the target training data set to obtain at least one transfer model.

In some embodiments, the migration learning unit is further configured to: acquiring state information of the terminal, wherein the state information comprises an operating state and residual electric energy, and the operating state comprises an idle state and a working state; when the operating state is an idle state and the residual electric energy is greater than a preset electric energy threshold value, preprocessing each group of training data in the training data set to obtain target training data corresponding to each group of training data; and determining a target training data set based on the target training data corresponding to each group of training data.

In some embodiments, the migration learning unit is further configured to: respectively determining the similarity between each first text data included in the training data group and the second text data included in the training data group; determining each first text data with the similarity larger than a preset similarity threshold as a target first text data; and determining each target first text data and the first voice data corresponding to each target first text data as the target training data corresponding to the group of training data.

In some embodiments, the migration learning unit is further configured to: acquiring the quantity of target training data according to the target training data corresponding to each group of training data; when the number of the target training data is larger than a second number threshold, determining training data to be deleted from the target training data according to the acquisition time of each first voice data; and deleting the training data to be deleted from the target training data, and forming a target training data set by using the residual target training data.

In some embodiments, the migration learning unit is further configured to: performing migration learning on the initial recognition model according to each target training data to obtain a first migration model; inputting each first voice data included in each target training data into the first migration model to obtain a recognition text corresponding to each first voice data; cleaning the target training data set based on the recognition texts corresponding to the first voice data to obtain an updated target training data set; according to the updated target training data set, continuously performing migration learning on the initial recognition model until a migration finishing condition is reached; and storing a plurality of migration models obtained by performing migration learning for a plurality of times.

In some embodiments, the migration learning unit is further configured to: determining a word error rate of the recognition text corresponding to each first voice data based on the target first text data corresponding to each first voice data; and deleting the first voice data and the target first text data with the word error rate larger than a preset threshold from the target training data to obtain updated target training data.

In some embodiments, the update unit is further configured to: fusing at least one migration model and the initial recognition model to obtain an updated recognition model; or, when receiving an updated initial recognition model from the server, fusing at least one migration model, the initial recognition model and the updated initial recognition model to obtain an updated recognition model, wherein the updated initial recognition model is obtained by updating the initial recognition model by the server.

In some embodiments, the update unit is further configured to: updating the reference data set according to the updated target training data set to obtain an updated reference data set; and fusing at least one migration model and the initial recognition model according to the updated reference data set to obtain an updated recognition model.

In some embodiments, the update unit is further configured to: acquiring the number of reference data included in a reference data set; selecting a plurality of target training data from all target training data included in the updated target training data set according to the number of the reference data; and adding the target training data serving as reference data into the reference data set to obtain an updated reference data set.

Here, it should be noted that: the above description of the embodiment of the voice data processing apparatus is similar to the above description of the method, and has the same advantageous effects as the embodiment of the method. For technical details not disclosed in the embodiments of the speech data processing device of the present application, a person skilled in the art should understand with reference to the description of the embodiments of the method of the present application.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the voice data processing method described in the embodiment of the present application.

Embodiments of the present application provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform the methods provided by embodiments of the present application, for example, the methods as illustrated in fig. 3 to 6.

In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.

In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).

By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.

The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.
