Audio data evaluation method and device, electronic equipment and storage medium

Document No.: 190524    Publication date: 2021-11-02

Reading note: This technology, "Audio data evaluation method and device, electronic equipment and storage medium", was designed and created by Lin Binghuai and Wang Liyuan on 2021-02-23. Its main content is as follows: The embodiment of the application provides an audio data evaluation method and device, an electronic device and a storage medium, relates to the technical field of artificial intelligence, and can be used in scenarios such as spoken language evaluation. The method comprises the following steps: acquiring audio data and text data corresponding to the audio data; performing uncertainty analysis based on the audio data and the text data, and determining an uncertainty analysis result of the result obtained by evaluating the audio data with an evaluation model; and determining, based on the uncertainty analysis result, the evaluation result obtained by evaluating the audio data with the evaluation model or with another evaluation mode as the final evaluation result. Implementing the solution of the application can effectively improve the accuracy of audio data evaluation.

1. A method for evaluating audio data, comprising:

acquiring audio data and text data corresponding to the audio data;

performing uncertainty analysis based on the audio data and the text data, and determining uncertainty analysis results of results obtained by evaluating the audio data by adopting an evaluation model;

and determining an evaluation result for evaluating the audio data by adopting an evaluation model or other evaluation modes as a final evaluation result based on the uncertainty analysis result.

2. The method according to claim 1, wherein the determining uncertainty analysis results of the evaluation results of the audio data using an evaluation model based on the uncertainty analysis of the audio data and the text data comprises:

performing voice recognition based on the audio data and the text data, and determining time information of voice and text alignment;

and performing uncertainty analysis based on the audio data and the time information, and determining an uncertainty analysis result of a result obtained by evaluating the audio data by adopting an evaluation model.

3. The method according to claim 2, wherein the performing uncertainty analysis based on the audio data and the time information and determining uncertainty analysis results of the evaluation of the audio data using an evaluation model comprises:

extracting acoustic characteristic information in the audio data;

determining a feature representation of the audio data based on the acoustic feature information and the temporal information;

determining an uncertainty parameter of the audio data based on the feature representation of the audio data and training data for training the evaluation model;

and determining an uncertainty analysis result of a result obtained by evaluating the audio data by adopting an evaluation model based on the uncertainty parameters.

4. The method of claim 3, wherein determining the feature representation of the audio data based on the acoustic feature information and the temporal information comprises:

determining label information of the audio data based on the acoustic feature information by adopting a pre-constructed acoustic feature extractor;

determining the time length corresponding to each vocabulary based on the label information and the time information, and averaging the features of the corresponding number of frames based on the time length to obtain the feature representation of each vocabulary;

and averaging the feature representations of all the vocabularies to obtain the feature representation of the corresponding audio data.

5. The method of claim 4, wherein the step of training the acoustic feature extractor comprises:

acquiring training data, wherein the training data comprises acoustic feature information at a frame level and corresponding real label information;

training the acoustic feature extractor with the training data such that network parameters of the acoustic feature extractor are adjusted based on a cross-entropy loss function; the cross-entropy loss function is determined based on the probability of predicting the label information corresponding to the acoustic feature information of each frame during training and the real label information.

6. The method according to claim 3, wherein the determining the uncertainty parameter of the audio data based on the feature representation of the audio data and training data for training the evaluation model comprises:

determining training feature representations included under each training label in training data used for training the evaluation model;

calculating the similarity between training feature representations included under each training label, and determining the aggregation degree metric of each training label;

calculating the similarity between the feature representation of the audio data and the training feature representation of the training data, and determining the similarity value of the audio data and the training data under each training label;

and normalizing the similarity value based on the aggregation degree metric, and determining the result of the normalization processing as an uncertainty parameter of the audio data.

7. The method according to claim 3, wherein the determining, based on the uncertainty parameter, an uncertainty analysis result of the result of evaluating the audio data using an evaluation model comprises any one of:

sorting the uncertainty parameters of all the audio data in descending order, determining the uncertainty analysis results of the audio data in the lowest-ranked preset percentage as uncertain, and determining the uncertainty analysis results of the other audio data as certain;

calculating the mean value and the standard deviation of the uncertainty parameters of all the audio data, determining a threshold value based on the mean value and the standard deviation, determining the uncertainty analysis results of the audio data whose uncertainty parameters are lower than or equal to the threshold value as uncertain, and determining the uncertainty analysis results of the other audio data as certain.

8. The method according to claim 1, wherein the determining, based on the uncertainty analysis result, an evaluation result for evaluating the audio data by using an evaluation model or other evaluation methods as a final evaluation result includes:

when the uncertainty analysis result is certain, determining an evaluation result obtained by evaluating the audio data by using an evaluation model as the final evaluation result;

and when the uncertainty analysis result is uncertain, determining an evaluation result obtained by evaluating the audio data by adopting other evaluation modes as a final evaluation result.

9. The method according to claim 1, wherein the evaluating the audio data using an evaluation model comprises:

performing voice recognition based on the audio data and the text data, and determining voice characteristic information;

and determining an evaluation result of the audio data by adopting the evaluation model based on the voice characteristic information.

10. The method of claim 1, further comprising:

and feeding back the final evaluation result to a corresponding user side so as to display the final evaluation result at the user side.

11. An apparatus for evaluating audio data, comprising:

the acquisition module is used for acquiring audio data and text data corresponding to the audio data;

the analysis module is used for carrying out uncertainty analysis on the audio data and the text data and determining an uncertainty analysis result of a result obtained by evaluating the audio data by adopting an evaluation model;

and the determining module is used for determining an evaluation result for evaluating the audio data by adopting an evaluation model or other evaluation modes as a final evaluation result based on the uncertainty analysis result.

12. An electronic device, characterized in that the electronic device comprises:

one or more processors;

a memory;

one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: carrying out the method according to any one of claims 1 to 10.

13. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the method of any of claims 1 to 10.

Technical Field

The application relates to the technical field of artificial intelligence, in particular to an audio data evaluation method and device, electronic equipment and a storage medium.

Background

With the development of artificial intelligence technology, artificial intelligence has come to play an increasingly important role in various fields. In the field of computer-aided teaching, automatic spoken language evaluation plays an important part, and its implementation can effectively improve the efficiency of spoken language evaluation.

However, automatic spoken language evaluation targets a large population that includes people of different ages and different spoken language levels. Meanwhile, the training scoring data for spoken language evaluation often needs to be annotated manually, which is time-consuming and demands a high level of expertise from the annotators. As a result, the training data of a spoken language evaluation model often cannot fully cover the characteristics of every person to be evaluated, so the final score output by the spoken language evaluation model is uncertain or erroneous, that is, its accuracy is low.

Disclosure of Invention

The technical solution provided by the application aims to solve at least one of the above technical defects, in particular the technical defect that the accuracy of the audio data evaluation result is low. The technical solution is as follows:

in a first aspect of the present application, a method for evaluating audio data is provided, including:

acquiring audio data and text data corresponding to the audio data;

performing uncertainty analysis based on the audio data and the text data, and determining uncertainty analysis results of results obtained by evaluating the audio data by adopting an evaluation model;

and determining an evaluation result for evaluating the audio data by adopting an evaluation model or other evaluation modes as a final evaluation result based on the uncertainty analysis result.

In one embodiment, the determining an uncertainty analysis result of a result obtained by evaluating the audio data by using an evaluation model based on uncertainty analysis of the audio data and the text data includes:

performing voice recognition based on the audio data and the text data, and determining time information of aligning voice and text;

and performing uncertainty analysis based on the audio data and the time information, and determining an uncertainty analysis result of a result obtained by evaluating the audio data by using an evaluation model.

In another embodiment, the determining uncertainty analysis result of the result obtained by evaluating the audio data by using the evaluation model based on the uncertainty analysis performed on the audio data and the time information includes:

extracting acoustic characteristic information in the audio data;

determining a feature representation of the audio data based on the acoustic feature information and the time information;

determining uncertainty parameters of the audio data based on the feature representation of the audio data and the training data used for training the evaluation model;

and determining an uncertainty analysis result of the result obtained by evaluating the audio data by adopting the evaluation model based on the uncertainty parameters.

In a further embodiment, determining a feature representation of the audio data based on the acoustic feature information and the temporal information comprises:

determining label information of the audio data based on the acoustic feature information by adopting a pre-constructed acoustic feature extractor;

determining the time length corresponding to each vocabulary based on the label information and the time information, and averaging the characteristics of the corresponding frame number based on the time length to obtain the characteristic representation of each vocabulary;

and averaging the feature representations of all the vocabularies to obtain the feature representation of the corresponding audio data.

In one embodiment, the step of training the acoustic feature extractor comprises:

acquiring training data, wherein the training data comprises acoustic characteristic information at a frame level and corresponding real label information;

training an acoustic feature extractor by using the training data, so that network parameters of the acoustic feature extractor are adjusted based on a cross-entropy loss function; the cross-entropy loss function is determined based on the probability of predicting the label information corresponding to the acoustic feature information of each frame during training and the real label information.

In one embodiment, determining an uncertainty parameter of the audio data based on the feature representation of the audio data and training data for training the evaluation model comprises:

determining training feature representations included under each training label in training data used for training an evaluation model;

calculating the similarity between training feature representations included under each training label, and determining the aggregation degree metric of each training label;

calculating the similarity between the feature representation of the audio data and the training feature representation of the training data, and determining the similarity value between the audio data and the training data under each training label;

and normalizing the similarity value based on the aggregation degree metric, and determining the result of the normalization processing as an uncertainty parameter of the audio data.

In one embodiment, the determining, based on the uncertainty parameter, of the uncertainty analysis result of the result of evaluating the audio data by using the evaluation model includes any one of the following:

sorting the uncertainty parameters of all the audio data in descending order, determining the uncertainty analysis results of the audio data in the lowest-ranked preset percentage as uncertain, and determining the uncertainty analysis results of the other audio data as certain;

calculating the mean value and the standard deviation of the uncertainty parameters of all the audio data, determining a threshold value based on the mean value and the standard deviation, determining the uncertainty analysis results of the audio data whose uncertainty parameters are lower than or equal to the threshold value as uncertain, and determining the uncertainty analysis results of the other audio data as certain.

In an embodiment, determining, as a final evaluation result, an evaluation result for evaluating the audio data by using an evaluation model or other evaluation methods based on the uncertainty analysis result includes:

when the uncertainty analysis result is certain, determining an evaluation result obtained by evaluating the audio data by using an evaluation model as the final evaluation result;

and when the uncertainty analysis result is uncertain, determining an evaluation result obtained by evaluating the audio data in other evaluation modes as a final evaluation result.

In one embodiment, the evaluating the audio data by using an evaluating model includes:

performing voice recognition based on the audio data and the text data, and determining voice characteristic information;

and determining an evaluation result of the audio data by adopting an evaluation model based on the voice characteristic information.

In one embodiment, the method further comprises:

and feeding back the final evaluation result to the corresponding user side so as to display the final evaluation result at the user side.

In a second aspect of the present application, there is provided an apparatus for evaluating audio data, comprising:

the acquisition module is used for acquiring audio data and text data corresponding to the audio data;

the analysis module is used for carrying out uncertainty analysis based on the audio data and the text data and determining an uncertainty analysis result of a result obtained by evaluating the audio data by adopting an evaluation model;

and the determining module is used for determining an evaluation result for evaluating the audio data by adopting an evaluation model or other evaluation modes as a final evaluation result based on the uncertainty analysis result.

In a third aspect of the present application, there is provided an electronic device including:

one or more processors;

a memory;

one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: the method provided by the first aspect is performed.

In a fourth aspect of the present application, a computer-readable storage medium is provided for storing computer instructions which, when executed on a computer, cause the computer to perform the method provided by the first aspect.

The beneficial effect that technical scheme that this application provided brought is:

in the method and the device, uncertainty analysis is performed based on the acquired audio data and the text data corresponding to the audio data, and the uncertainty analysis result of the result obtained by evaluating the audio data with an evaluation model is determined; based on the uncertainty analysis result, it can then be decided whether the evaluation result obtained with the evaluation model or with another evaluation mode is used as the final evaluation result. By performing uncertainty analysis on the acquired audio data and determining the uncertainty of the result the evaluation model would produce, audio data whose evaluation result may be inaccurate can be screened out. The evaluation result obtained with the evaluation model or with another evaluation mode is then chosen as the final evaluation result according to the uncertainty analysis result, which effectively reduces the cases in which evaluating the audio data with the evaluation model yields an inaccurate score and thereby improves the accuracy of audio data evaluation.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.

Fig. 1 is a flowchart of an audio data evaluation method according to an embodiment of the present application;

fig. 2 is a schematic operation flow diagram of an acoustic feature extractor in an audio data evaluation method according to an embodiment of the present application;

fig. 3 is a schematic flowchart illustrating calculation of uncertainty parameters in an audio data evaluation method according to an embodiment of the present application;

fig. 4 is a flowchart illustrating an implementation of a method for evaluating audio data according to an embodiment of the present application;

fig. 5 is a schematic interaction environment diagram of an evaluation method applied to audio data according to an embodiment of the present application;

fig. 6 is a schematic frame diagram of an evaluation system applied to an evaluation method of audio data according to an embodiment of the present application;

fig. 7a is a schematic view of a corresponding display interface when applying the audio data evaluation method according to the embodiment of the present application;

fig. 7b is a schematic view of a corresponding display interface when applying the audio data evaluation method according to the embodiment of the present application;

fig. 8 is a schematic view of a corresponding display interface when applying the audio data evaluation method according to the embodiment of the present application;

fig. 9 is a schematic structural diagram of an audio data evaluation device according to an embodiment of the present application;

fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include being wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The following is a description of the technology and nomenclature involved in this application:

AI (Artificial Intelligence) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

The present application may involve directions such as speech technology and machine learning/deep learning.

The key technologies of Speech Technology include ASR (Automatic Speech Recognition), speech synthesis technology (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising interaction modes. ASR technology converts speech into text; in the embodiment of the present application, a speech recognition model may be constructed by using ASR technology to process the acquired audio data.

ML (Machine Learning) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence, is the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction. In the embodiment of the application, the evaluation model, the uncertainty analysis module and the like can be constructed by using machine-learning-related technology.

With the development of artificial intelligence technology, artificial intelligence has come to play an increasingly important role in various fields. In the field of computer-aided teaching, automatic spoken language evaluation plays an important part, and its implementation can effectively improve the efficiency of spoken language evaluation. However, a spoken language evaluation model constructed with machine learning technology carries uncertainties, such as aleatoric uncertainty (caused by random noise in the data) and epistemic (cognitive) uncertainty, so the evaluation result output by the final spoken language evaluation model is uncertain or erroneous, that is, its accuracy is low.

In the related art, in order to address the above uncertainty problem, schemes for modeling uncertainty have been proposed. However, such a scheme imposes a requirement on the underlying spoken language evaluation model: the model itself must output the uncertainty of its evaluation result, which increases the complexity of the model to a certain extent and correspondingly reduces the efficiency of model processing.

In order to solve at least one of the above problems, the present application provides an audio data evaluation method, apparatus, electronic device, and computer-readable storage medium. Specifically, uncertainty analysis is performed on the result obtained by evaluating the audio data with the evaluation model, and based on the uncertainty analysis result it can then be determined whether the evaluation result obtained with the evaluation model is used as the final evaluation result, which effectively reduces the cases in which the model's evaluation of the audio data is erroneous and improves the accuracy of audio data evaluation.

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

An embodiment of the present application provides an audio data evaluation method. As shown in fig. 1, fig. 1 is a schematic flow diagram of an audio data evaluation method provided in an embodiment of the present application. The method may be executed by any electronic device, such as a user terminal or a server. The user terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms; the present application is not limited in this respect. Specifically, the method includes the following steps S101 to S103:

step S101: audio data and text data corresponding to the audio data are acquired.

Specifically, the audio data may be voice data entered by a person, or may be data corresponding to an audio track in recorded multimedia data (e.g., video).

Optionally, in the embodiment of the present application, a batch of (multiple) audio data may be acquired and processed at the same time, or a single audio data may be processed; for example, when the method is applied to a spoken language assessment scene, multiple voices input by a batch of users may be acquired and processed at the same time, and a certain voice input by a certain user may also be acquired and processed.

In one embodiment, the text data is the basis on which the user enters the audio data; that is, the user can enter the audio data through a microphone of the terminal based on the text data, and therefore the content represented by the audio data corresponds to the text data. When a plurality of audio data are acquired at one time for processing, the plurality of audio data correspond to the same text data. For example, when the audio data evaluation method provided by the embodiment of the present application is used for spoken language evaluation at time 1, 500 pieces of audio data and 1 piece of text data A corresponding to the audio data are acquired; when spoken language evaluation is performed at time 2, 1 piece of audio data and 1 piece of text data B corresponding to the audio data are acquired.

Alternatively, when a plurality of audio data are acquired and processed simultaneously, although the plurality of audio data correspond to the same text data, the time length of each audio data may be different because the situation of pronunciation of each user is different (e.g., the speech rate is different).

Step S102: and performing uncertainty analysis based on the audio data and the text data, and determining an uncertainty analysis result of a result obtained by evaluating the audio data by adopting an evaluation model.

Specifically, the uncertainty analysis in the embodiment of the present application may refer to estimating and studying the various external factors and influences that cannot be controlled in advance in the process of evaluating audio data with the evaluation model; in other words, the learning condition of the evaluation model during training is analyzed: for data the model has not learned, the confidence of the evaluation result it produces is low, and that evaluation result falls into the uncertain category.

As shown in fig. 4, in the embodiment of the present application, a neural network may be used to construct an uncertainty analysis module to perform uncertainty analysis based on currently acquired audio data and text data, and a specific process of the uncertainty analysis will be described in the following embodiments.

Step S103: and determining an evaluation result for evaluating the audio data by adopting an evaluation model or other evaluation modes as a final evaluation result based on the uncertainty analysis result.

Specifically, the uncertainty analysis result may be either certain or uncertain. When the uncertainty analysis result corresponding to a certain piece of audio data is uncertain, that is, when the uncertainty of the evaluation result obtained by evaluating the audio data with the evaluation model is high, the audio data is evaluated in another evaluation mode and that evaluation result is output as the final evaluation result. When the uncertainty analysis result corresponding to a certain piece of audio data is certain, that is, when the certainty of the evaluation result obtained by evaluating the audio data with the evaluation model is high, that evaluation result is output as the final evaluation result.

The other evaluation modes may include manual evaluation (manual review) and the like: the audio data is transmitted to a client used for manual evaluation and displayed there, and a user with professional evaluation capability evaluates the audio data.

In the embodiment of the application, uncertainty analysis is performed based on the acquired audio data and the text data corresponding to the audio data, and the uncertainty analysis result of the result obtained by evaluating the audio data with an evaluation model is determined; based on the uncertainty analysis result, the evaluation result obtained by evaluating the audio data with the evaluation model or with another evaluation mode can then be determined as the final evaluation result. By performing uncertainty analysis on the acquired audio data and determining the uncertainty of the result the evaluation model would produce, audio data whose evaluation result may be inaccurate can be screened out. The evaluation result obtained with the evaluation model or with another evaluation mode is then chosen as the final evaluation result according to the uncertainty analysis result, which effectively reduces the cases in which evaluating the audio data with the evaluation model yields an inaccurate score and thereby improves the accuracy of audio data evaluation.

The following describes the specific procedure of the uncertainty analysis. In the uncertainty analysis, feature analysis is performed on the audio data (the data to be predicted), the extracted features are compared with the features of the training data used to train the evaluation model, and the similarity between them is calculated. Audio data with low similarity may be data that was not covered when the evaluation model was trained, so its uncertainty is high. Based on the uncertainty analysis, the uncertainty parameter corresponding to each piece of audio data can be obtained, and it can then be judged, based on the uncertainty parameter, whether the uncertainty analysis result corresponding to a certain piece of audio data is uncertain.

In one embodiment, the step S102 of performing uncertainty analysis based on the audio data and the text data and determining uncertainty analysis results of the evaluation results of the audio data by using the evaluation model includes the following steps a1-a 2:

step A1: and performing voice recognition based on the audio data and the text data, and determining time information of aligning the voice and the text.

Specifically, in the embodiment of the application, the voice recognition takes audio data as a research object, and the machine automatically recognizes and understands the content orally input by the user through voice signal processing and pattern recognition processing; that is, by means of speech recognition, a machine converts speech signals in audio data into corresponding text or commands through a recognition and understanding process.

Since users of different ages and different proficiency levels read the same text at different speeds and pitches, the time information that aligns the speech and the text needs to be determined for the audio data corresponding to each user. For example, for the text "I like apple", in the audio data entered by user A the pronunciation time corresponding to "I" is 1s-1.5s, the pronunciation time corresponding to "like" is 1.6s-2s, and the pronunciation time corresponding to "apple" is 2s-3s.
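To make the alignment concrete, the following minimal sketch (in Python, purely illustrative; the field names are assumptions and not part of this application) shows one possible way to represent the speech/text alignment time information and to derive the duration of each word from it:

    # Illustrative only: a possible representation of the speech/text alignment
    # produced during speech recognition. Field names are assumptions.
    alignment = [
        {"word": "I",     "start": 1.0, "end": 1.5},
        {"word": "like",  "start": 1.6, "end": 2.0},
        {"word": "apple", "start": 2.0, "end": 3.0},
    ]

    # Duration of each word, later used to decide how many frames belong to it.
    durations = {a["word"]: a["end"] - a["start"] for a in alignment}
    print(durations)  # {'I': 0.5, 'like': 0.4, 'apple': 1.0}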

Step A2: and performing uncertainty analysis based on the audio data and the time information, and determining an uncertainty analysis result of a result obtained by evaluating the audio data by using an evaluation model.

Specifically, as shown in fig. 4, the acquired audio data and the processed time information of aligning the speech and the text may be used as input data of the uncertainty analysis module, and the uncertainty analysis module performs an uncertainty analysis operation based on the input data to output an uncertainty analysis result. The specific operation of the uncertainty analysis module will be described in the following embodiments.

In a possible embodiment, the step a2 of performing uncertainty analysis based on the audio data and the time information to determine uncertainty analysis results of the evaluation results of the audio data by using the evaluation model includes the following steps B1-B4:

step B1: acoustic feature information in the audio data is extracted.

Optionally, the acoustic feature information in the audio data may be extracted through a DNN (Deep Neural Network) acoustic model, or may be obtained by performing speech recognition processing on the audio data with ASR technology, for example as MFCCs (Mel Frequency Cepstrum Coefficients). The Mel frequency scale is derived from human auditory characteristics and has a nonlinear correspondence with the Hz frequency scale; MFCCs are spectral features computed on the Hz spectrum by using this correspondence. The step of extracting the acoustic feature information may be implemented in the uncertainty analysis module, or may be implemented in another module before the data is input into the uncertainty analysis module (for example, an independent speech recognition model is used to extract the acoustic features, or the speech recognition model shown in fig. 4 is used to implement this step).

When the audio data is analyzed, the audio data may be framed, that is, divided into a number of short segments, each of which is called a frame; accordingly, the extracted acoustic feature information may be frame-level feature information.
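As an illustration only, the following sketch shows how frame-level MFCC features could be extracted with the open-source librosa library, assuming 25 ms frames with a 10 ms hop at a 16 kHz sampling rate; neither this library nor these values are mandated by this application:

    # A minimal sketch of frame-level MFCC extraction, assuming librosa.
    import librosa

    def extract_mfcc(path, sr=16000, n_mfcc=13):
        y, sr = librosa.load(path, sr=sr)
        # n_fft ~ 25 ms window and hop_length ~ 10 ms shift at 16 kHz (assumed values)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=400, hop_length=160)
        return mfcc.T  # shape (M frames, N coefficients)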

Step B2: based on the acoustic feature information and the time information, a feature representation of the audio data is determined.

Specifically, as shown in fig. 4 and 6, step B2 may operate using a trained acoustic feature extractor (also referred to as an acoustic feature extraction model, which is part of the uncertainty analysis module).

In one embodiment, the determining of the feature representation of the audio data in step B2 based on the acoustic feature information and the time information includes the following steps C1-C3 (the implementation of steps C1-C3 may be understood as operations performed for each audio data):

step C1: and determining the label information of the audio data based on the acoustic feature information by adopting a pre-constructed acoustic feature extractor.

Specifically, as shown in fig. 2, the acoustic feature extractor includes a feature extraction model based on a deep neural network. The model may be a stack of multiple model structures, such as a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory network) and multilayer networks, or a stack of multiple layers of the same or different structures, such as a stack of 3 convolutional layers. The extracted depth features are then non-linearly transformed by a fully connected layer included in the acoustic feature extractor to obtain the label information corresponding to the audio data.

Alternatively, the acoustic feature information input to the acoustic feature extractor may be frame-level; accordingly, the prediction result that the acoustic feature extractor outputs for each frame of acoustic features may be a probability distribution over frame-level senones (sub-phonetic units that take the phoneme context into account, which may be triphones or other context-dependent phones), which can also be understood as the probability of a certain senone label (the label information).
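The following is a minimal sketch, assuming PyTorch, of an acoustic feature extractor that stacks a convolutional layer and an LSTM and maps each frame to a senone probability distribution through a fully connected layer; the layer types, sizes and number of senones are illustrative assumptions, since the application leaves the concrete architecture open:

    # A minimal sketch of the acoustic feature extractor, assuming PyTorch.
    import torch
    import torch.nn as nn

    class AcousticFeatureExtractor(nn.Module):
        def __init__(self, n_mfcc=13, hidden=128, n_senones=3000):
            super().__init__()
            self.conv = nn.Conv1d(n_mfcc, hidden, kernel_size=3, padding=1)
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, n_senones)  # fully connected output layer

        def forward(self, x):                 # x: (batch, frames, n_mfcc)
            h = self.conv(x.transpose(1, 2)).transpose(1, 2)
            h, _ = self.lstm(h)               # frame-level deep features
            return self.fc(h)                 # (batch, frames, n_senones) logits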

Step C2: and determining the time length corresponding to each vocabulary based on the label information and the time information, and averaging the characteristics of the corresponding frame number based on the time length to obtain the characteristic representation of each vocabulary.

Specifically, the words contained in the current audio data can be known from the label information; for example, when three labels are included, it can be known that the current audio data contains three words. Accordingly, the duration corresponding to each word can be determined by combining this with the time information that aligns the speech and the text. Frame averaging is then performed based on the duration corresponding to each word to obtain the feature representation of each word: the duration of each word determines the corresponding number of frames (each frame generally covers 10ms-25ms, so each word may correspond to multiple frames), and the frame-level features of those frames are averaged.

Step C3: and averaging the feature representations of all the vocabularies to obtain the feature representation of the corresponding audio data.

Specifically, on the basis of step C2, the feature representation of each word in the current audio data may be determined, and on the basis of this, a feature averaging process is performed, that is, the feature representations of all words are averaged, so that the feature representation of the audio data may be obtained.

The above steps C1-C3 are described in a specific application example with reference to fig. 3.

Assume that the acquired audio data (a certain piece of speech) is "I like apple". After the audio data is input into the uncertainty analysis module, the frame-level acoustic features of the audio data are extracted. The process of extracting the acoustic features may include: framing the audio data, where one frame generally covers about 10ms-25ms, so that the audio data is divided into a number of speech frames (e.g., M speech frames {f1, f2, ..., fM}); when the MFCCs are computed, a column of DCT coefficients (assume the dimension is N) is output for each frame, so the MFCCs of the audio data can be represented as an N × M matrix. The acoustic feature information "I [[1,0.2,3], [1.2,3,0.5]] like [[0.3,2,3], [0.2,2,3]] apple [[1,1.2,3], [2,3.5,4]]" input to the acoustic feature extractor shown in fig. 3 may be a representation obtained by taking, on the basis of this matrix, the variance or the standard deviation over all frames.

After the acoustic feature extractor performs feature extraction based on the acoustic feature information, the label information of the audio data is output. Since the acoustic feature extractor processes the acoustic features at the frame level, the output label information may also be called the frame feature representation, which is shown in fig. 3 as "I [[1, 2, 3], [2, 3, 4]] like [[1, 1, 3], [1, 3, 4]] apple [[4, 2, 3], [2, 3, 4]]".

After the frame feature representation is obtained, frame averaging may be performed in combination with the time information of the speech and text alignment, i.e., the operation corresponding to step C2. Assume the time information is as follows: the pronunciation time corresponding to "I" is 1s-1.5s, the pronunciation time corresponding to "like" is 1.6s-2s, and the pronunciation time corresponding to "apple" is 2s-3s. After frame averaging is performed on this basis, the content shown in fig. 3 can be obtained: the feature representation of each word is "I [1.5, 2.5, 3.5] like [1, 2, 3.5] apple [3, 2.5, 3.5]".

After the feature representation of each word is obtained, word averaging is performed, that is, the feature representations of all the words are averaged, and the content shown in fig. 3 can be obtained: the feature representation of the audio data (the feature representation of this piece of speech) is "[1.83, 2.33, 3.5]".
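The frame averaging of step C2 and the word averaging of step C3 can be sketched as follows (illustrative only; the 10 ms frame hop used to map pronunciation times to frame indices is an assumption):

    # A minimal sketch of steps C2-C3: average the frame-level features that fall
    # inside each word's time span, then average the word representations to get
    # one feature vector for the whole utterance.
    import numpy as np

    def pool_features(frame_feats, alignment, hop_s=0.010):
        """frame_feats: (M frames, D) array; alignment: list of word dicts."""
        word_vecs = []
        for a in alignment:
            start = int(a["start"] / hop_s)
            end = max(start + 1, int(a["end"] / hop_s))
            word_vecs.append(frame_feats[start:end].mean(axis=0))  # frame averaging
        return np.mean(word_vecs, axis=0)  # word averaging -> utterance feature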

In the above embodiment, the application scenario corresponds to spoken English evaluation, so the vocabularies are described in units of words; for example, the feature representation of each vocabulary corresponds to the feature representation of each word. Optionally, the method can also be applied to spoken language evaluation scenarios of various languages, such as Chinese, Japanese, Korean and the like, which is not limited in the embodiments of the present application.

The following is a description of a specific process of constructing the acoustic feature extractor.

In one embodiment, training the acoustic feature extractor includes the following steps D1-D2:

step D1: and acquiring training data, wherein the training data comprises acoustic characteristic information at a frame level and corresponding real label information.

Specifically, each training sample in the training data includes frame-level acoustic feature information and the real label information corresponding to each feature; the real label information contains a specific label, and the label may be a senone label.

Step D2: training an acoustic feature extractor by using training data so as to adjust network parameters of the acoustic feature extractor based on a cross loss function; and determining the cross loss function based on the probability of predicting the label information corresponding to the acoustic characteristic information of each frame during training and the real label information.

Specifically, the cross-entropy loss function can be expressed as shown in the following formula (1):

L = -y * log(p)    …… formula (1)

In formula (1), y is the real senone label corresponding to a certain frame of acoustic feature information, and p is the probability with which the acoustic feature extractor predicts the corresponding senone label.
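For illustration, a single training step using this cross-entropy loss could look like the following sketch, which reuses the AcousticFeatureExtractor class from the earlier sketch; the tensor shapes, the number of senones and the choice of the Adam optimizer are assumptions, not requirements of this application:

    # A minimal sketch, assuming PyTorch, of one training step with the
    # cross-entropy loss of formula (1).
    import torch
    import torch.nn as nn

    model = AcousticFeatureExtractor()               # class from the earlier sketch
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()                # L = -y * log(p) per frame

    feats = torch.randn(8, 200, 13)                  # (batch, frames, MFCC dim), dummy data
    labels = torch.randint(0, 3000, (8, 200))        # real senone label per frame

    logits = model(feats)                            # (batch, frames, n_senones)
    loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # adjust network parameters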

Step B3: and determining the uncertainty parameter of the audio data based on the feature representation of the audio data and the training data for training the evaluation model.

Specifically, the training data used for training the evaluation model may represent the performance of the evaluation model, and the uncertainty of the evaluation model when processing the audio data, that is, the confidence of the prediction result, may be determined by processing the feature representation of the audio data and the training data.

The following is a description of a specific process for determining the uncertainty parameter.

In one embodiment, the determining of the uncertainty parameter of the audio data based on the feature representation of the audio data and the training data for training the evaluation model in step B3 includes the following steps E1-E4:

Step E1: training feature representations included under each training label in the training data used to train the evaluation model are determined.

Specifically, part of step E1 may be carried out by referring to the operations shown in steps B1-B2 for determining the feature representation of the audio data: for each piece of training data, the acoustic feature information in the training data is first extracted, and the feature representation of the training data is then determined based on the acoustic feature information and the time information aligning the speech and the text, so that the training feature representation of each piece of training data can be obtained. Furthermore, the training data distributed under each training label can be determined based on the training feature representations corresponding to that training label. For example: the training data includes 4 pieces of voice data A, B, C and D; the training feature representations corresponding to training label 1 may correspond to the voice data A, C and D, so 3 pieces of voice data are distributed under training label 1.

Step E2: and calculating the similarity between training feature representations included under each training label, and determining the aggregation degree metric of each training label.

Specifically, the similarity between two training feature representations in the training data may be calculated with various distance functions (such as the cosine distance), and based on the similarities between the training feature representations a similarity feature set corresponding to each training label can be obtained. For example: if training label 1 corresponds to 100 pieces of training data, a similarity feature set containing 100 × 99 similarity features can be obtained; alternatively, if training label 1 corresponds to the voice data A, C and D, the similarities between the voice data, that is, between A and C, A and D, C and A, C and D, D and A, and D and C, can be calculated respectively, giving a similarity feature set containing 6 similarity features (since AC and CA, AD and DA, CD and DC are the same similarity feature, the duplicate features in the set can be deleted during processing in order to reduce the computational complexity of subsequent steps). Furthermore, for the similarity feature set of each training label, the average or mode of the set is taken as the aggregation degree metric sim(inner) of that training label.

Step E3: and calculating the similarity between the feature representation of the audio data and the training feature representation of the training data, and determining the similarity value of the audio data and the training data under each training label.

Specifically, assuming that the training label 1 includes 10 training data, the similarity between the feature representation of the audio data and the training feature representation of each training data under the training label 1 may be calculated to obtain a similarity feature set including 10 similarity features, and a result of calculating an average or mode of the set may be used as a similarity value sim (outer) between the audio data and the training data under the training label 1.

Step E4: and normalizing the similarity value based on the aggregation degree metric, and determining the result of the normalization processing as an uncertainty parameter of the audio data.

Specifically, the normalization process can be expressed as shown in the following formula (2):

s_l = sim(outer)_l / sim(inner)_l    …… formula (2)

As shown in formula (2), the normalization process calculates the ratio of the similarity value to the aggregation degree metric: for each training label l, the similarity value sim(outer)_l between the audio data and the training data under that label is divided by the aggregation degree metric sim(inner)_l of that label. Based on the normalization process, the normalized similarity value s_l between the audio data and each training label can be obtained, and the result of the normalization process can be determined as the uncertainty parameter of the audio data. For example, if there are 5 training labels, 5 normalized similarities corresponding to the audio data, that is, the corresponding uncertainty parameters, can finally be obtained: [0.3, 0.5, 0.3, 0.6, 0.1].
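Steps E1-E4 can be sketched as follows; the use of cosine similarity and of the mean (rather than the mode) of each similarity feature set are illustrative choices among those the application allows:

    # A minimal sketch of steps E1-E4; assumes at least two training samples per label.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def aggregation_metric(train_feats):
        """sim(inner): mean pairwise similarity of one label's training features."""
        sims = [cosine(train_feats[i], train_feats[j])
                for i in range(len(train_feats))
                for j in range(i + 1, len(train_feats))]
        return float(np.mean(sims))

    def uncertainty_parameters(audio_feat, feats_by_label):
        """One normalized similarity value s_l per training label (formula (2))."""
        params = {}
        for label, train_feats in feats_by_label.items():
            sim_outer = np.mean([cosine(audio_feat, f) for f in train_feats])
            sim_inner = aggregation_metric(train_feats)
            params[label] = sim_outer / sim_inner      # normalization of formula (2)
        return params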

Alternatively, since the evaluation model is an already trained model in actual application, steps E1 and E2 may be processed offline in advance; the processing of steps E1-E2 can be understood as determining the aggregation degree of the data under each training label based on the training data corpus and the labels. Correspondingly, step B3 may only include steps E3-E4: step E3 can be implemented by directly obtaining the precomputed training feature representations, and step E4 by directly obtaining the precomputed aggregation degree metric of each training label corresponding to the trained evaluation model.

Step B4: and determining an uncertainty analysis result of the result obtained by evaluating the audio data by adopting the evaluation model based on the uncertainty parameters.

Specifically, if a plurality of audio data are currently acquired, the uncertainty parameters obtained in step B3 for each piece of audio data may be processed to obtain the uncertainty result corresponding to each piece of audio data. If a single piece of audio data is currently acquired, its uncertainty parameter can be compared with a preset threshold: if it is lower than or equal to the preset threshold, the uncertainty analysis result of the audio data is determined to be uncertain; if it is higher than the preset threshold, the uncertainty analysis result of the audio data is determined to be certain.

The following is a description of a specific process for processing multiple uncertainty parameters in the uncertainty analysis result.

In one embodiment, the step B4 of determining the uncertainty analysis result of the evaluation result of the audio data by using the evaluation model based on the uncertainty parameter includes any one of the following steps F1-F2:

step F1: and sequencing the uncertainty parameters of all the audio data in a descending order, determining the uncertainty analysis result of the audio data corresponding to the preset percentage with the lowest sequencing as uncertain, and determining the uncertainty analysis results of other audio data as confirmed.

Specifically, for example: if 10 pieces of audio data are included, 10 uncertainty parameters are correspondingly included; after sorting them in descending order, the uncertainty analysis result of the audio data falling in the lowest-ranked 10% can be determined as uncertain, and the uncertainty analysis results of the remaining 9 pieces of audio data can be determined as certain.

Step F2: calculating the mean value and the standard deviation of the uncertainty parameters of all the audio data, determining a threshold value based on the mean value and the standard deviation, determining the uncertainty analysis results of the audio data whose uncertainty parameters are lower than or equal to the threshold value as uncertain, and determining the uncertainty analysis results of the other audio data as certain.

Specifically, in step F2 the threshold is dynamically adjusted according to the uncertainty parameters of the different audio data, which improves the adaptability and accuracy of determining the uncertainty analysis result of the audio data based on the uncertainty parameter.
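Both decision rules of steps F1 and F2 can be sketched as follows, assuming that each piece of audio data has already been reduced to a single scalar uncertainty parameter (for example, its largest normalized similarity over all training labels); the bottom percentage and the "mean minus one standard deviation" threshold are illustrative choices that the application leaves open:

    # A minimal sketch of the two decision rules in steps F1 and F2.
    import numpy as np

    def flag_by_percentage(params, bottom_ratio=0.10):
        order = np.argsort(params)                     # ascending: smallest values first
        k = max(1, int(len(params) * bottom_ratio))
        uncertain = set(order[:k])                     # lowest-ranked preset percentage
        return ["uncertain" if i in uncertain else "certain"
                for i in range(len(params))]

    def flag_by_threshold(params):
        params = np.asarray(params, dtype=float)
        threshold = params.mean() - params.std()       # assumed mean/std combination
        return ["uncertain" if p <= threshold else "certain" for p in params]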

The following is a description of a specific procedure for determining a final evaluation result based on the uncertainty analysis result.

In an embodiment, as shown in fig. 4, the determining, in step S103, an evaluation result for evaluating the audio data by using an evaluation model or other evaluation methods as a final evaluation result based on the uncertainty analysis result includes the following steps G1-G2:

step G1: and when the uncertainty analysis result is determined, determining an evaluation result obtained by evaluating the audio data by using the evaluation model as a final evaluation result.

Specifically, the operation of evaluating the audio data with the evaluation model may be performed after the corresponding uncertainty analysis result has been determined, or may be performed synchronously with the uncertainty analysis. When the operations run synchronously, the final evaluation result can be output directly once the uncertainty analysis result is determined to be certain, which can effectively improve the efficiency of audio data evaluation.

Step G2: and when the uncertainty analysis result is uncertain, determining an evaluation result obtained by evaluating the audio data in other evaluation modes as a final evaluation result.

Specifically, to reduce the additional resources associated with other evaluation modes, the audio data is evaluated with another evaluation mode (for example, manual evaluation) only after the uncertainty analysis result has been determined to be uncertain, which reduces wasted resources and lowers the evaluation cost of the audio data.
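The routing described in steps G1-G2 can be summarized in a short sketch; `model_score` and `manual_score` are hypothetical callables standing in for the evaluation model and the other evaluation mode (for example, manual scoring).

```python
# Minimal sketch of steps G1-G2; the callables are hypothetical stand-ins.
def final_evaluation(audio, text, uncertainty_result, model_score, manual_score):
    if uncertainty_result == "certain":
        # G1: the model's result is trusted and returned as the final evaluation result.
        return model_score(audio, text)
    # G2: only uncertain samples consume the extra resource (e.g., a human rater).
    return manual_score(audio, text)
```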

The following describes a specific processing procedure for evaluating audio data by using an evaluation model.

In one embodiment, the evaluation of the audio data using the evaluation model includes the following steps H1-H2:

step H1: and performing voice recognition based on the audio data and the text data, and determining voice characteristic information.

Step H2: and determining an evaluation result of the audio data by adopting an evaluation model based on the voice characteristic information.

In the embodiment of the application, the evaluation module can automatically evaluate the pronunciation of the user. It generally comprises two parts: 1. the operation corresponding to step H1: extracting pronunciation confidence features (namely, the voice characteristic information) based on speech recognition; 2. the operation corresponding to step H2: constructing an evaluation model based on the pronunciation confidence features, so that the evaluation result obtained by the evaluation model matches the scoring of a professional evaluator. With the trained evaluation model, the speech and the corresponding follow-up reading text are input to the spoken language evaluation module, which outputs the evaluation score of the corresponding pronunciation.

Alternatively, as shown in fig. 4, a separate model may be constructed for the speech recognition operation, and the voice characteristic information extracted by the speech recognition model is then input to the evaluation model for processing, so that the evaluation result of the audio data is obtained.
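A minimal sketch of this two-stage arrangement follows; `recognizer` and `scorer` are hypothetical objects standing in for the speech recognition model and the evaluation model, and the method names are assumptions.

```python
# Minimal sketch of steps H1-H2 as a two-stage pipeline (hypothetical interfaces).
class SpokenEvaluationPipeline:
    def __init__(self, recognizer, scorer):
        self.recognizer = recognizer  # speech recognition model / forced aligner
        self.scorer = scorer          # evaluation model trained to match expert scores

    def evaluate(self, audio, text):
        # H1: recognition against the follow-up reading text yields pronunciation
        # confidence features (the voice characteristic information).
        features = self.recognizer.confidence_features(audio, text)
        # H2: the evaluation model maps these features to an evaluation score.
        return self.scorer.predict(features)
```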

The following describes the visualization process of the final evaluation result.

In a possible embodiment, the method for evaluating audio data provided by the embodiment of the present application further includes the following step I1:

step I1: and feeding back the final evaluation result to the corresponding user side so as to display the final evaluation result on a display interface of the user side.

Specifically, in the embodiment of the present application, a user may read text data (i.e., a follow-up reading text) aloud at a user side (as shown in fig. 7a and 7b). The user side uploads the collected audio data and the text data to a server, and the server passes them to the evaluation system, which applies the audio data evaluation method provided in the embodiment of the present application; after the evaluation system determines the final evaluation result, the result is fed back to the user side through the server and displayed on a display interface of the user side (as shown in fig. 8).

An application example of the method for evaluating audio data according to the embodiment of the present application is described below with reference to fig. 4 to 8.

In one possible embodiment, N users (students) use the terminal 400 to read aloud according to a given follow-up reading text, whose content is shown in fig. 7a as "I knock the fact, do you knock?". The user may click or long-press the "start reading" detection control, so that the terminal 400 turns on the microphone to collect the audio data (here, voice data) of the user's pronunciation; when finished reading, the user may click or release the "finish reading" detection control, so that the terminal 400 stops collecting the audio data.

After reading is finished, the terminal 400 uploads the collected audio data and text data to the server 200 through the network 300, and the server 200 calls the evaluation system 500 to evaluate the audio data.

Specifically, the server 200 may send the audio data to the uncertainty analysis module and simultaneously send the audio data and the text data to the speech recognition model.

Further, in the evaluation system 500, the speech-text alignment result (the time information aligning speech and text) output by the speech recognition model is sent to the uncertainty analysis module, and the voice characteristic information output by the speech recognition model is sent to the evaluation model. The alignment result and the voice characteristic information may be output by the same speech recognition model or by different speech recognition models.

After the uncertainty analysis module acquires the audio data, the frame-level acoustic features are extracted based on the audio data, the acoustic features are input into an acoustic feature extractor, and an uncertainty analysis result of the audio data is determined through the acoustic feature extractor and other network architectures in the uncertainty analysis module.
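For illustration, frame-level acoustic features could be extracted as in the sketch below, assuming the librosa library is available; MFCCs and a 16 kHz sampling rate are illustrative assumptions, not necessarily the features used in this embodiment.

```python
# Minimal sketch of frame-level acoustic feature extraction (MFCCs assumed).
import librosa

def frame_level_features(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)                 # load audio at an assumed 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, n_frames)
    return mfcc.T                                            # one feature vector per frame
```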

When the uncertainty result is certain, the evaluation result output by the evaluation model is returned to the user; when the uncertainty result is uncertain, the audio data is passed to a manual evaluation module, a teacher performs the score evaluation, and the teacher's score is finally returned to the user.

The evaluation system 500 returns the final evaluation result to the server 200, and the server 200 feeds back the final evaluation result to the terminals 400-1 to 400-N used by each user through the network 300.

After the terminal 400 obtains the final evaluation result, it displays the result on its display interface (for example, on the display interfaces 400-11 to 400-N1 respectively). The display effect is shown in fig. 8: the follow-up reading text is displayed on the display interface, and the quality of the spoken reading is expressed by up to 5 stars. In the evaluation result shown in fig. 8 the user obtains 4 stars, which on a 100-point scale corresponds to 80 points. Further, the final evaluation result not only includes the evaluation score but can also indicate the weaknesses in the user's spoken reading; for example, the word "knock" indicated by the gesture in fig. 8 is a word the user read poorly.
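The star display corresponds to a simple scale conversion, sketched below; the rounding rule is an assumption.

```python
# Minimal sketch of the 5-star display: 4 stars on a 5-star scale ~ 80 on a 100-point scale.
def score_to_stars(score, max_score=100, max_stars=5):
    return round(score / max_score * max_stars)

print(score_to_stars(80))  # 4
```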

In the embodiment of the present application, in one case the evaluation system 500 may be part of the server 200, in which case the execution subject of the audio data evaluation method provided in the above embodiments is the server 200; in another case the evaluation system 500 may be carried by another independent computer device (a terminal or a server), in which case the execution subject of the audio data evaluation method provided in the above embodiments is that terminal or server, respectively.

In order to further explain the technical effects that can be achieved by the embodiments of the present application, the following provides corresponding experimental data.

The experiment of the present application was carried out based on a trained evaluation model. A total of 1500 test data (corresponding to the audio data in the above embodiments) and the corresponding expert annotation scores were used. The test data were input into the uncertainty analysis module and the evaluation model, which output the uncertainty results and the evaluation results of the evaluation model. Samples whose evaluation results differed substantially from the actual expert scores were taken as the ground-truth uncertain samples, and the output of the uncertainty analysis module was taken as the predicted uncertainty value, so that the accuracy of the uncertainty analysis module could be calculated. The experimental results show an accuracy of 80% and a recall of 30%. Although the recall is low, the accuracy on the recalled test data is high, and the recalled test data can be returned to professionals so that they can further correct the evaluation results.
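The reported figures could be computed as in the following sketch; the deviation threshold used to label a sample as truly uncertain is a hypothetical value, since the experiment only states that samples with a larger model-expert difference were taken as the uncertain labels.

```python
# Minimal sketch of evaluating the uncertainty analysis module against expert scores.
def precision_recall(model_scores, expert_scores, flagged_uncertain, deviation=1.0):
    """flagged_uncertain: list of booleans output by the uncertainty analysis module."""
    truth = [abs(m - e) > deviation for m, e in zip(model_scores, expert_scores)]
    tp = sum(1 for t, f in zip(truth, flagged_uncertain) if t and f)
    precision = tp / max(1, sum(flagged_uncertain))  # accuracy of the recalled samples
    recall = tp / max(1, sum(truth))
    return precision, recall
```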

An embodiment of the present application provides an audio data evaluation apparatus, as shown in fig. 9, the audio data evaluation apparatus 900 may include: an acquisition module 901, an analysis module 902 and a determination module 903; the acquiring module 901 is configured to acquire audio data and text data corresponding to the audio data; the analysis module 902 is configured to perform uncertainty analysis based on the audio data and the text data, and determine an uncertainty analysis result of a result obtained by evaluating the audio data by using an evaluation model; a determining module 903, configured to determine, based on the uncertainty analysis result, an evaluation result obtained by evaluating the audio data by using an evaluation model or other evaluation methods as a final evaluation result.

In an embodiment, the analysis module 902 is configured to perform the following steps when performing uncertainty analysis based on the audio data and the text data and determining an uncertainty analysis result of a result obtained by evaluating the audio data by using the evaluation model:

performing voice recognition based on the audio data and the text data, and determining time information of aligning voice and text;

and performing uncertainty analysis based on the audio data and the time information, and determining an uncertainty analysis result of a result obtained by evaluating the audio data by using an evaluation model.

In another embodiment, when performing the step of performing uncertainty analysis based on the audio data and the time information and determining an uncertainty analysis result of the result obtained by evaluating the audio data with the evaluation model, the analysis module 902 is further configured to perform the following steps:

extracting acoustic characteristic information in the audio data;

determining a feature representation of the audio data based on the acoustic feature information and the time information;

determining uncertainty parameters of the audio data based on the feature representation of the audio data and training data of a training evaluation model;

and determining an uncertainty analysis result of the result obtained by evaluating the audio data by adopting the evaluation model based on the uncertainty parameters.

In a further embodiment, the analysis module 902, when performing the step of determining the characteristic representation of the audio data based on the acoustic characteristic information and the time information, is further configured to perform the steps of:

determining label information of the audio data based on the acoustic feature information by adopting a pre-constructed acoustic feature extractor;

determining the time length corresponding to each word based on the label information and the time information, and averaging the features of the corresponding frames within that time length to obtain the feature representation of each word;

and averaging the feature representations of all the words to obtain the feature representation of the corresponding audio data.
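A minimal sketch of this pooling follows; the frame features, the (start_frame, end_frame) spans from the speech-text alignment, and the use of plain averaging all follow the description above, while the array shapes are assumptions.

```python
# Minimal sketch: frame features -> per-word features -> audio-level feature representation.
import numpy as np

def utterance_representation(frame_features, word_spans):
    """frame_features: (n_frames, dim) array; word_spans: list of (start_frame, end_frame)."""
    word_vectors = []
    for start, end in word_spans:
        # Average the frames aligned to this word to obtain its feature representation.
        word_vectors.append(frame_features[start:end].mean(axis=0))
    # Average all word representations to obtain the representation of the audio data.
    return np.stack(word_vectors).mean(axis=0)
```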

In an embodiment, the apparatus 900 further comprises a training module for training the acoustic feature extractor, and specifically, the training module is further configured to perform the following steps:

acquiring training data, wherein the training data comprises acoustic characteristic information at a frame level and corresponding real label information;

training the acoustic feature extractor with the training data so as to adjust the network parameters of the acoustic feature extractor based on a cross-entropy loss function; the cross-entropy loss function is determined based on the predicted probability of the label information corresponding to each frame of acoustic characteristic information during training and the real label information.
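A minimal PyTorch sketch of this training step is given below; the network architecture, feature dimension, and label-set size are assumptions, and only the use of a frame-level cross-entropy loss against the real labels follows the description.

```python
# Minimal sketch of training the acoustic feature extractor with a cross-entropy loss.
import torch
import torch.nn as nn

extractor = nn.Sequential(nn.Linear(13, 128), nn.ReLU(), nn.Linear(128, 40))  # shapes assumed
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(extractor.parameters(), lr=1e-3)

def train_step(frame_features, frame_labels):
    """frame_features: (n_frames, 13) float tensor; frame_labels: (n_frames,) long tensor."""
    optimizer.zero_grad()
    logits = extractor(frame_features)      # per-frame scores over the label set
    loss = criterion(logits, frame_labels)  # cross-entropy against the real label information
    loss.backward()                         # gradients used to adjust the network parameters
    optimizer.step()
    return loss.item()
```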

In one embodiment, when performing the step of determining the uncertainty parameter of the audio data based on the feature representation of the audio data and the training data used for training the evaluation model, the analysis module 902 is further configured to perform the following steps:

determining training feature representations included under each training label in training data used for training an evaluation model;

calculating the similarity between training feature representations included under each training label, and determining the aggregation degree metric of each training label;

calculating the similarity between the feature representation of the audio data and the training feature representation of the training data, and determining the similarity value between the audio data and the training data under each training label;

and normalizing the similarity value based on the aggregation degree metric, and determining the result of the normalization processing as an uncertainty parameter of the audio data.
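A minimal sketch of this computation follows; cosine similarity, mean pooling, and taking the maximum over labels are assumed choices, while the normalization of the audio-to-training similarity by each label's aggregation degree follows the description.

```python
# Minimal sketch of the uncertainty parameter based on similarity to the training data.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def uncertainty_parameter(audio_repr, training_reprs_by_label):
    """training_reprs_by_label: {label: (n_samples, dim) array of training representations}."""
    scores = []
    for label, reprs in training_reprs_by_label.items():
        # Aggregation degree metric: mean pairwise similarity of this label's training features.
        pairwise = [cosine(reprs[i], reprs[j])
                    for i in range(len(reprs)) for j in range(i + 1, len(reprs))]
        aggregation = np.mean(pairwise) if pairwise else 1.0
        # Similarity value between the audio data and this label's training data.
        similarity = np.mean([cosine(audio_repr, r) for r in reprs])
        # Normalize by the aggregation degree; higher values indicate a more certain sample.
        scores.append(similarity / aggregation)
    return max(scores)
```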

In an embodiment, when performing the step of determining an uncertainty analysis result of the result obtained by evaluating the audio data with the evaluation model based on the uncertainty parameter, the analysis module 902 is further configured to perform any one of the following:

sorting the uncertainty parameters of all the audio data in descending order, determining the uncertainty analysis results of the audio data falling within the lowest-ranked preset percentage as uncertain, and determining the uncertainty analysis results of the other audio data as certain;

calculating the mean value and the standard deviation of the uncertainty parameters of all the audio data, determining a threshold value based on the mean value and the standard deviation, determining the uncertainty analysis results of the audio data whose uncertainty parameters are lower than or equal to the threshold value as uncertain, and determining the uncertainty analysis results of the other audio data as certain.

In an embodiment, the determining module 903 is configured to perform the following steps when determining, based on the uncertainty analysis result, an evaluation result obtained by evaluating the audio data by using an evaluation model or other evaluation methods as a final evaluation result:

when the uncertainty analysis result is certain, determining the evaluation result obtained by evaluating the audio data by using the evaluation model as the final evaluation result;

and when the uncertainty analysis result is uncertain, determining an evaluation result obtained by evaluating the audio data in other evaluation modes as a final evaluation result.

In one embodiment, when the determining module 903 is used for performing the step of evaluating the audio data by using an evaluation model, the determining module is further used for performing the following steps:

performing voice recognition based on the audio data and the text data, and determining voice characteristic information;

and determining an evaluation result of the audio data by adopting an evaluation model based on the voice characteristic information.

In an embodiment, the apparatus 900 further includes a feedback module, configured to feed back the final evaluation result to a corresponding user side, so as to display the final evaluation result at the user side.

The apparatus according to the embodiment of the present application may execute the method provided by the embodiment of the present application, and the implementation principle is similar, the actions executed by the modules in the apparatus according to the embodiments of the present application correspond to the steps in the method according to the embodiments of the present application, and for the detailed functional description of the modules in the apparatus, reference may be specifically made to the description in the corresponding method shown in the foregoing, and details are not repeated here.

An embodiment of the present application provides an electronic device, including a memory and a processor; at least one program is stored in the memory and, when executed by the processor, implements the following: uncertainty analysis is performed based on the acquired audio data and the text data corresponding to the audio data, the uncertainty analysis result of the result obtained by evaluating the audio data with the evaluation model is determined, and based on the uncertainty analysis result it can then be determined whether the evaluation result obtained by evaluating the audio data with the evaluation model or with another evaluation mode is used as the final evaluation result. According to this scheme, performing uncertainty analysis on the obtained audio data and determining the uncertainty of the result obtained by evaluating it with the evaluation model makes it possible to screen out audio data whose evaluation results may be inaccurate; the evaluation result obtained by evaluating the audio data with the evaluation model or with another evaluation mode can then be determined as the final evaluation result based on the uncertainty analysis result, which effectively reduces the cases in which the evaluation score is inaccurate because the audio data is evaluated only by the evaluation model, and improves the accuracy of the audio data evaluation.

In an alternative embodiment, an electronic device is provided. As shown in fig. 10, the electronic device 1000 includes a processor 1001 and a memory 1003, where the processor 1001 is coupled to the memory 1003, for example via a bus 1002. Optionally, the electronic device 1000 may further include a transceiver 1004, which may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that, in practical applications, the transceiver 1004 is not limited to one, and the structure of the electronic device 1000 does not constitute a limitation on the embodiments of the present application.

The processor 1001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 1001 may also be a combination implementing computing functions, for example a combination including one or more microprocessors, or a combination of a DSP and a microprocessor.

Bus 1002 may include a path that transfers information between the above components. The bus 1002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 1002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.

The memory 1003 may be a ROM (Read-Only Memory) or another type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.

The memory 1003 is used for storing application program codes (computer programs) for executing the present application, and the processor 1001 controls the execution. The processor 1001 is configured to execute application program codes stored in the memory 1003 to implement the contents shown in the foregoing method embodiments.

The electronic device includes but is not limited to: a smart phone, a tablet computer, a notebook computer, a smart speaker, a smart watch, a vehicle-mounted device, and the like.

According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method for evaluating audio data provided in the above-mentioned various optional implementations.

The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.

It should be understood that, although the steps in the flowcharts of the figures are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turns or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements shall also fall within the protection scope of the present invention.
