Method, device and equipment for evaluating spoken language pronunciation and storage medium

Document No.: 170883    Publication date: 2021-10-29

Note: This technology, "一种口语发音评测方法、装置、设备及存储介质" (Method, device and equipment for evaluating spoken language pronunciation and storage medium), was created by 徐晓烁, 康跃腾 and 林炳怀 on 2021-02-03. Its main content is as follows: the embodiment of the application discloses a method, a device, equipment and a storage medium for evaluating spoken language pronunciation in the field of artificial intelligence, wherein the method includes: acquiring a target audio to be evaluated; performing acoustic feature extraction processing on the target audio to obtain a target acoustic feature sequence; determining an acoustic likelihood probability vector according to the target acoustic feature sequence through an acoustic feature recognition model, the acoustic feature recognition model being a model that takes diphone states as modeling units; determining the posterior probability of a target phoneme in the target text based on the acoustic likelihood probability vector and the target text; and determining a target pronunciation evaluation result according to the posterior probability of the target phoneme. The method can improve the accuracy of spoken language pronunciation evaluation.

1. A method for evaluating spoken language pronunciation, the method comprising:

acquiring a target audio to be evaluated; the target audio corresponds to target text;

performing acoustic feature extraction processing on the target audio to obtain a target acoustic feature sequence;

determining an acoustic likelihood probability vector according to the target acoustic feature sequence through an acoustic feature recognition model; the acoustic feature recognition model is a model taking diphone states as modeling units;

determining a posterior probability of a target phoneme in the target text based on the acoustic likelihood probability vector and the target text;

and determining a target pronunciation evaluation result according to the posterior probability of the target phoneme.

2. The method of claim 1, wherein determining the posterior probability of the target phoneme in the target text based on the acoustic likelihood probability vector and the target text comprises:

determining a time interval to which an acoustic feature corresponding to a target diphone in the target text belongs in the target acoustic feature sequence as a target time interval based on the acoustic likelihood probability vector and the target text;

and determining the posterior probability of the target phoneme according to the likelihood probability corresponding to the acoustic features in the target time interval in the acoustic likelihood probability vector.

3. The method of claim 2, wherein the determining, in the target acoustic feature sequence, the time interval to which the acoustic features corresponding to the target diphone in the target text belong, based on the acoustic likelihood probability vector and the target text, comprises:

constructing candidate diphone state sequences corresponding to the target text according to the duration of the target audio and the diphone states corresponding to each diphone in the target text;

for each candidate diphone state sequence, determining a reference likelihood probability corresponding to the candidate diphone state sequence based on the acoustic likelihood probability vector;

selecting a target diphone state sequence from each candidate diphone state sequence according to the reference likelihood probability corresponding to each candidate diphone state sequence;

and determining, in the target acoustic feature sequence, the time intervals to which the acoustic features corresponding to the respective diphones in the target text belong, according to the target diphone state sequence.

4. The method of claim 2, wherein determining the posterior probability of the target phoneme according to the likelihood probability corresponding to the acoustic feature in the target time interval in the acoustic likelihood probability vector comprises:

determining a reference Hidden Markov Model (HMM) topology according to the length of the target time interval;

combining monophones pairwise to obtain a plurality of candidate diphones corresponding to the acoustic features in the target time interval;

determining a diphone state sequence corresponding to each candidate diphone according to the diphone state corresponding to each candidate diphone and the reference HMM topology;

and determining the posterior probability of the target phoneme according to the likelihood probability corresponding to the acoustic features in the target time interval in the acoustic likelihood probability vector based on the diphone state sequences.

5. The method of claim 4, wherein determining the posterior probability of the target phoneme based on the sequence of diphone states according to the likelihood probability corresponding to the acoustic feature in the target time interval in the acoustic likelihood probability vector comprises:

for each diphone state sequence, determining the reference likelihood probability of the candidate diphone corresponding to the diphone state sequence according to the likelihood probabilities, in the acoustic likelihood probability vector, corresponding to the acoustic features in the target time interval under the diphone states included in the diphone state sequence;

determining a sum of the reference likelihood probabilities of the candidate diphones as a total reference likelihood probability;

determining, for each of the candidate diphones, a posterior probability of the candidate diphone based on the reference likelihood probability of the candidate diphone and the total reference likelihood probability;

and constructing a diphone posterior probability distribution corresponding to the acoustic features in the target time interval as the posterior probability of the target phoneme based on the posterior probability of each candidate diphone.

6. The method of claim 4, wherein determining the posterior probability of the target phoneme based on the sequence of diphone states according to the likelihood probability corresponding to the acoustic feature in the target time interval in the acoustic likelihood probability vector comprises:

for each diphone state sequence, determining the reference likelihood probability of the candidate diphone corresponding to the diphone state sequence according to the likelihood probabilities, in the acoustic likelihood probability vector, corresponding to the acoustic features in the target time interval under the diphone states included in the diphone state sequence;

determining a sum of the reference likelihood probabilities of the candidate diphones as a total reference likelihood probability;

and for the target diphone, determining the posterior probability of the target diphone, as the posterior probability of the target phoneme, according to the reference likelihood probability of the target diphone and the total reference likelihood probability.

7. The method of claim 6 wherein determining the posterior probability of the target diphone based on the reference likelihood probability of the target diphone and the total reference likelihood probability comprises:

and determining the posterior probability of the target diphone according to the reference likelihood probability of the target diphone, the prior probability of the target diphone, the total reference likelihood probability and the respective prior probability of each candidate diphone.

8. The method of claim 4, wherein the former phoneme and the latter phoneme of the candidate diphone are referred to as the pre-phoneme and the post-phoneme, respectively; and determining the posterior probability of the target phoneme based on each diphone state sequence according to the likelihood probabilities corresponding to the acoustic features in the target time interval in the acoustic likelihood probability vector comprises:

determining, for the diphone state sequence corresponding to each candidate diphone that includes the same post-phoneme, the reference likelihood probability of the candidate diphone according to the likelihood probabilities, in the acoustic likelihood probability vector, corresponding to the acoustic features in the target time interval under the diphone states included in the diphone state sequence; and selecting the maximum reference likelihood probability from the reference likelihood probabilities of the candidate diphones including the post-phoneme as the reference likelihood probability of the post-phoneme;

determining the sum of the respective reference likelihood probabilities of the post-phonemes as a total reference likelihood probability;

for each post-phoneme, determining the posterior probability of the post-phoneme according to the reference likelihood probability of the post-phoneme and the total reference likelihood probability;

and constructing a monophone posterior probability distribution corresponding to the acoustic features in the target time interval based on the posterior probability of each post-phoneme, the monophone posterior probability distribution serving as the posterior probability of the target phoneme.

9. The method of claim 4, wherein the former phoneme and the latter phoneme of the candidate diphone are referred to as the pre-phoneme and the post-phoneme, respectively; and determining the posterior probability of the target phoneme based on each diphone state sequence according to the likelihood probabilities corresponding to the acoustic features in the target time interval in the acoustic likelihood probability vector comprises:

determining, for the diphone state sequence corresponding to each candidate diphone that includes the same post-phoneme, the reference likelihood probability of the candidate diphone according to the likelihood probabilities, in the acoustic likelihood probability vector, corresponding to the acoustic features in the target time interval under the diphone states included in the diphone state sequence; and selecting the maximum reference likelihood probability from the reference likelihood probabilities of the candidate diphones including the post-phoneme as the reference likelihood probability of the post-phoneme;

determining the sum of the respective reference likelihood probabilities of the post-phonemes as a total reference likelihood probability;

and for a target post-phoneme in the target diphone, determining the posterior probability of the target post-phoneme, as the posterior probability of the target phoneme, according to the reference likelihood probability of the target post-phoneme and the total reference likelihood probability.

10. The method according to any one of claims 1 to 8, wherein the determining a target pronunciation assessment result according to the posterior probability of the target phoneme comprises at least one of the following:

determining a phoneme pronunciation evaluation result according to the posterior probability of the target phoneme through a phoneme evaluation model;

determining a word pronunciation evaluation result according to a first posterior probability set through a word evaluation model; the first set of posterior probabilities includes: the posterior probability of each target phoneme included in the word to be evaluated in the target text;

determining a sentence pronunciation evaluation result according to a second posterior probability set through a sentence evaluation model; the second set of posterior probabilities includes: the posterior probability of each target phoneme included in the sentence to be evaluated in the target text.

11. A spoken language pronunciation evaluation apparatus, comprising:

the audio acquisition module is used for acquiring a target audio to be evaluated; the target audio corresponds to target text;

the acoustic feature extraction module is used for extracting acoustic features of the target audio to obtain a target acoustic feature sequence;

the likelihood probability determination module is used for determining an acoustic likelihood probability vector according to the target acoustic feature sequence through an acoustic feature recognition model; the acoustic feature recognition model is a model taking diphone states as modeling units;

a posterior probability determination module, configured to determine a posterior probability of a target phoneme in the target text based on the acoustic likelihood probability vector and the target text;

and the pronunciation evaluating module is used for determining a target pronunciation evaluating result according to the posterior probability of the target phoneme.

12. The apparatus of claim 11, wherein the posterior probability determination module comprises:

a forced alignment sub-module, configured to determine, based on the acoustic likelihood probability vector and the target text, the time interval in the target acoustic feature sequence to which the acoustic features corresponding to the target diphone in the target text belong, as the target time interval;

and the posterior probability determining submodule is used for determining the posterior probability of the target phoneme according to the likelihood probability corresponding to the acoustic features in the target time interval in the acoustic likelihood probability vector.

13. The apparatus of claim 12, wherein the forced alignment sub-module is specifically configured to:

constructing candidate diphone state sequences corresponding to the target text according to the duration of the target audio and the diphone states corresponding to each diphone in the target text;

for each candidate diphone state sequence, determining a reference likelihood probability corresponding to the candidate diphone state sequence based on the acoustic likelihood probability vector;

selecting a target diphone state sequence from each candidate diphone state sequence according to the reference likelihood probability corresponding to each candidate diphone state sequence;

and determining, in the target acoustic feature sequence, the time intervals to which the acoustic features corresponding to the respective diphones in the target text belong, according to the target diphone state sequence.

14. An apparatus, comprising a processor and a memory;

the memory is used for storing a computer program;

the processor is configured to execute, according to the computer program, the spoken language pronunciation evaluation method according to any one of claims 1 to 10.

15. A computer-readable storage medium for storing a computer program for executing the spoken language pronunciation evaluation method according to any one of claims 1 to 10.

Technical Field

The present application relates to the technical field of Artificial Intelligence (AI), and in particular, to a method, an apparatus, a device, and a storage medium for evaluating spoken language pronunciation.

Background

Nowadays, learning knowledge and skills through educational applications (APPs) is a common way for users to study. In a common application scenario, an APP that helps a user learn a foreign language may provide a spoken pronunciation practice function, which evaluates and scores the user's spoken pronunciation based on pronunciation audio uploaded by the user, thereby helping the user know whether their spoken pronunciation is standard.

In the related art, a Hidden Markov Model-Deep Neural Network (HMM-DNN) acoustic model is mainly used to evaluate the user's spoken pronunciation. The HMM-DNN acoustic model takes triphones as modeling units, and determines and outputs an acoustic posterior probability according to the acoustic features of the pronunciation audio uploaded by the user; a binary classifier then determines the user's spoken pronunciation evaluation result according to the acoustic posterior probability output by the HMM-DNN model.

However, in practical applications, the HMM-DNN acoustic model has weak acoustic modeling capability and poor speech recognition performance; when the user's spoken pronunciation is evaluated based on the acoustic posterior probability output by the HMM-DNN acoustic model, the accuracy of the obtained evaluation result is low, and the evaluation effect is often not ideal.

Disclosure of Invention

The embodiment of the application provides a method, a device, equipment and a storage medium for evaluating spoken language pronunciation, which can ensure that the determined spoken language pronunciation evaluation result has higher accuracy and effectively improve the spoken language pronunciation evaluation effect.

In view of this, the first aspect of the present application provides a method for evaluating spoken language pronunciation, where the method includes:

acquiring a target audio to be evaluated; the target audio corresponds to target text;

performing acoustic feature extraction processing on the target audio to obtain a target acoustic feature sequence;

determining an acoustic likelihood probability vector according to the target acoustic feature sequence through an acoustic feature recognition model; the acoustic feature recognition model is a model taking diphone states as modeling units;

determining a posterior probability of a target phoneme in the target text based on the acoustic likelihood probability vector and the target text;

and determining a target pronunciation evaluation result according to the posterior probability of the target phoneme.

The second aspect of the present application provides a spoken language pronunciation evaluation apparatus, the apparatus including:

the audio acquisition module is used for acquiring a target audio to be evaluated; the target audio corresponds to target text;

the acoustic feature extraction module is used for extracting acoustic features of the target audio to obtain a target acoustic feature sequence;

the likelihood probability determination module is used for determining an acoustic likelihood probability vector according to the target acoustic feature sequence through an acoustic feature recognition model; the acoustic feature recognition model is a model taking diphone states as modeling units;

a posterior probability determination module, configured to determine a posterior probability of a target phoneme in the target text based on the acoustic likelihood probability vector and the target text;

and the pronunciation evaluating module is used for determining a target pronunciation evaluating result according to the posterior probability of the target phoneme.

A third aspect of the application provides an apparatus comprising a processor and a memory:

the memory is used for storing a computer program;

the processor is configured to execute the steps of the spoken language pronunciation evaluation method according to the first aspect.

A fourth aspect of the present application provides a computer-readable storage medium for storing a computer program for executing the steps of the spoken language pronunciation evaluation method according to the first aspect.

A fifth aspect of the present application provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the steps of the spoken language pronunciation evaluation method according to the first aspect.

According to the technical scheme, the embodiment of the application has the following advantages:

the embodiment of the application provides a spoken language pronunciation evaluation method which innovatively uses an acoustic feature recognition model with diphone states as modeling units for spoken language pronunciation evaluation. The acoustic feature recognition model determines an acoustic likelihood probability vector according to the target acoustic feature sequence corresponding to the target audio to be evaluated; then, the posterior probability of the target phoneme in the target text is determined based on the acoustic likelihood probability vector and the target text; finally, the target pronunciation evaluation result is determined according to the posterior probability of the target phoneme. Considering that an acoustic feature recognition model taking diphone states as modeling units has better acoustic modeling capability and speech recognition capability than the HMM-DNN acoustic model in the related art, the embodiment of the application introduces this model into the spoken language pronunciation evaluation process; and in order to make the acoustic likelihood probability vector output by the model suitable for spoken language pronunciation evaluation, the embodiment of the application also provides an implementation for determining the acoustic posterior probability based on the acoustic likelihood probability. Using the acoustic feature recognition model with diphone states as modeling units for spoken language pronunciation evaluation thus ensures a more accurate pronunciation evaluation result and effectively improves the evaluation effect.

Drawings

Fig. 1 is a schematic view of an application scenario of a spoken language pronunciation evaluation method according to an embodiment of the present application;

fig. 2 is a schematic flow chart of a spoken language pronunciation evaluation method according to an embodiment of the present application;

fig. 3 is a schematic diagram of an HMM topology used by a Chain model according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an exemplary HMM topology provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of an exemplary pronunciation assessment result presentation interface provided by an embodiment of the present application;

FIG. 6 is a schematic flow chart illustrating another method for evaluating spoken language pronunciation according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of a spoken language pronunciation evaluation device according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of another spoken language pronunciation evaluation device according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of another spoken language pronunciation evaluation device according to an embodiment of the present application;

fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application;

fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Key technologies of Speech Technology include Automatic Speech Recognition (ASR), Text To Speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, among which speech is expected to become one of the most promising modes of human-computer interaction.

With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, for example, common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical treatment, smart customer service, and the like.

The scheme provided by the embodiment of the application relates to an artificial intelligence voice technology, and is specifically explained by the following embodiment:

in the related art, the HMM-DNN acoustic model using triphones as modeling units is generally used for spoken language pronunciation evaluation; both the acoustic modeling capability and the speech recognition capability of the HMM-DNN acoustic model are weak, so the pronunciation evaluation result determined based on the acoustic posterior probability output by the HMM-DNN acoustic model is often low in accuracy, and the resulting pronunciation evaluation effect is poor.

In view of the problems in the related art, the embodiment of the present application provides a method for evaluating spoken language pronunciation, which innovatively uses an acoustic feature recognition model using a diphone state as a modeling unit for spoken language pronunciation evaluation, and can ensure that the determined pronunciation evaluation result has higher accuracy and obtain better pronunciation evaluation effect.

Specifically, in the method for evaluating spoken language pronunciation provided by the embodiment of the present application, a target audio to be evaluated is obtained first, and the target audio corresponds to a target text. Then, acoustic feature extraction processing is performed on the target audio to obtain a target acoustic feature sequence. Next, an acoustic likelihood probability vector is determined according to the target acoustic feature sequence through an acoustic feature recognition model; the acoustic feature recognition model is a model that takes diphone states as modeling units. Further, the posterior probability of the target phoneme in the target text is determined based on the acoustic likelihood probability vector and the target text. Finally, the target pronunciation evaluation result is determined according to the posterior probability of the target phoneme.

Compared with the HMM-DNN model taking triphones as modeling units, the acoustic feature recognition model taking diphone states as modeling units has better acoustic modeling capability and speech recognition capability, so the method provided by the embodiment of the application introduces this model into the spoken pronunciation evaluation process. In addition, in order to make the acoustic likelihood probability vector output by the acoustic feature recognition model suitable for spoken language pronunciation evaluation, the embodiment of the application also provides an implementation for determining the acoustic posterior probability based on the acoustic likelihood probability. Therefore, using the acoustic feature recognition model with diphone states as modeling units for spoken pronunciation evaluation ensures that the determined pronunciation evaluation result has higher accuracy and effectively improves the spoken pronunciation evaluation effect.

It should be understood that the method for evaluating spoken language pronunciation provided by the embodiment of the present application may be applied to devices with speech processing capability, such as terminal devices, servers, and the like. The terminal device may be a smart phone, a computer, a tablet computer, a Personal Digital Assistant (PDA), a smart speaker, a smart robot, or the like. The server may specifically be an application server or a Web server, and in actual deployment, the server may be an independent server, or may also be a cluster server or a cloud server.

In order to facilitate understanding of the method for evaluating spoken language pronunciation provided in the embodiment of the present application, an application scenario of the method for evaluating spoken language pronunciation is described in the following.

Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a spoken language pronunciation evaluation method provided in an embodiment of the present application. As shown in fig. 1, the application scenario includes a terminal device 110 and a server 120, and the terminal device 110 and the server 120 may communicate with each other through a network. Wherein, a target application program is installed in the terminal device 110, and the target application program has a spoken language pronunciation practice function; the server 120 is configured to execute the spoken language pronunciation evaluating method provided in the embodiment of the present application.

In practical applications, the user may use the spoken language pronunciation practice function in the target application through the terminal device 110. For example, when the user uses the spoken language pronunciation practice function, terminal device 110 may display the reading-after text provided by the target application to the user, and may capture audio generated when the user reads the reading-after text in response to the user operating the audio input control; after detecting that the user confirms that the audio input is completed, the terminal device 110 may send the audio collected by itself to the server 120 through the network.

After receiving the audio sent by the terminal device 110, the server 120 may regard the audio as a target audio to be evaluated, and regard a reading text according to which the user enters the audio as a target text. The server 120 performs acoustic feature extraction processing on the target audio to obtain a corresponding target acoustic feature sequence, where the target acoustic feature sequence includes acoustic features in each time unit in the target audio.

The server 120 may then determine the acoustic likelihood probability vector from the target acoustic feature sequence via an acoustic feature recognition model, which is a model that takes diphone states as modeling units. Illustratively, the acoustic feature recognition model may be a Chain model, and the Chain model may output a corresponding acoustic likelihood probability vector according to an input acoustic feature sequence. The acoustic likelihood probability vector is essentially a $T \times N$ matrix, where $T$ represents the number of time units included in the acoustic feature sequence and $N$ is the total number of diphone states; the element $X_{ij}$ in the acoustic likelihood probability vector represents the likelihood probability corresponding to the acoustic feature in the $i$-th time unit under the $j$-th diphone state, and this likelihood probability characterizes the conditional probability that the acoustic feature in the $i$-th time unit is observed given the $j$-th diphone state.

Next, the server 120 may determine the posterior probability of the target phoneme in the target text based on the acoustic likelihood probability vector output by the acoustic feature recognition model and the target text. Specifically, the server 120 may perform forced alignment on the target acoustic feature sequence corresponding to the target audio and the target text, that is, determine the acoustic feature corresponding to the target diphone in the target text in the target acoustic feature sequence, where a time interval to which the acoustic feature belongs is the target time interval corresponding to the target diphone. The server 120 may then evaluate whether the user's pronunciation for the target diphone is standard based on the acoustic features within the target time interval.

Since the acoustic likelihood probability vector output by the acoustic feature recognition model is usually difficult to be directly used for evaluating the pronunciation, the server 120 needs to determine an acoustic posterior probability that can be used for evaluating the pronunciation according to the acoustic likelihood probability vector, where the posterior probability refers to a probability that a certain phoneme state is observed under a given acoustic feature condition. In a specific implementation, the server 120 may determine the posterior probability of the target phoneme according to the likelihood probability corresponding to the acoustic feature in the target time interval in the acoustic likelihood probability vector; the target phoneme may be the target diphone itself or a later phoneme in the target diphone.

Further, the server 120 may determine the target pronunciation evaluation result according to the posterior probability of the target phoneme. For example, the server 120 may evaluate whether the pronunciation of the target phoneme is accurate according to the posterior probability of the target phoneme. For another example, the server 120 may evaluate whether the pronunciation of the word by the user is accurate or not based on the posterior probabilities of the target phonemes belonging to the word. For another example, the server 120 may further evaluate whether the pronunciation of the sentence is accurate according to the posterior probabilities of the target phonemes belonging to the sentence; and so on.
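For illustration, the sketch below shows one hedged way such aggregation could look in code. The patent determines word- and sentence-level results with trained evaluation models (see claim 10); the plain averaging used here is only an illustrative stand-in for feeding phoneme posteriors upward, and all function names are hypothetical.

```python
# Illustrative stand-in only: the patent feeds phoneme posteriors into trained
# word/sentence evaluation models; simple averaging is NOT the patent's method.
def word_score(phoneme_posteriors: list[float]) -> float:
    """Aggregate the posteriors of the target phonemes belonging to one word."""
    return sum(phoneme_posteriors) / len(phoneme_posteriors)

def sentence_score(word_scores: list[float]) -> float:
    """Aggregate word-level scores into a sentence-level score."""
    return sum(word_scores) / len(word_scores)
```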

After determining the target pronunciation evaluation result through the above process, the server 120 may send the target pronunciation evaluation result to the terminal device 110 through the network, so that the terminal device 110 displays the target pronunciation evaluation result to the user, and the user can know whether the spoken pronunciation is standard or not.

It should be understood that the application scenario shown in fig. 1 is merely an example. In practical applications, the acoustic feature recognition model may also be deployed locally on the terminal device 110, and the terminal device 110 performs spoken language pronunciation evaluation independently based on the target audio input by the user. The application scenario of the spoken language pronunciation evaluation method provided by the embodiment of the present application is not limited at all.

The following describes the spoken language pronunciation evaluation method provided by the present application in detail by way of a method embodiment.

Referring to fig. 2, fig. 2 is a schematic flow chart of a spoken language pronunciation evaluation method provided in the embodiment of the present application. For convenience of description, the following embodiments take the execution subject of the spoken language pronunciation evaluation method as an example of a server. As shown in fig. 2, the method for evaluating spoken language pronunciation includes the following steps:

step 201: acquiring a target audio to be evaluated; the target audio corresponds to a target text.

In practical application, the server can acquire the audio of the spoken language pronunciation to be evaluated as a target audio, and take the text corresponding to the target audio as a target text.

In a possible implementation manner, the server may obtain audio sent by the terminal device as the target audio to be evaluated. Illustratively, a target application program with a spoken language pronunciation practice function is installed in the terminal device; the spoken language pronunciation practice function can provide a read-after text for the user and display it in the interface corresponding to the function. By touching the start-reading control, the user can trigger the terminal device to collect the audio generated while reading the text aloud, and by touching the end-reading control, trigger the terminal device to stop collecting audio. The terminal device sends the collected audio to the server through the network, so that the server takes the received audio as the target audio to be evaluated; the target text corresponding to the target audio is the user's read-after text.

In addition, the spoken language pronunciation practice function can also support free speaking by the user; that is, without a read-after text, the user can touch the start-reading control to trigger the terminal device to collect the audio generated by free speaking, and touch the end-reading control to trigger the terminal device to stop collecting audio. The terminal device sends the collected audio to the server through the network, so that the server takes the received audio as the target audio to be evaluated; in this case, the server can determine the target text corresponding to the target audio by performing speech recognition on the target audio.

It should be understood that the implementation manner of triggering the terminal device to acquire the audio by the user is only an example, in practical applications, the user may also trigger the terminal device to acquire the audio by other manners, for example, the user may trigger the terminal device to acquire the audio by pressing the audio entry control for a long time, and the implementation manner of triggering the terminal device to acquire the audio is not limited in this application.

In another possible implementation manner, the server may obtain a target audio to be evaluated from the database, and determine a target text corresponding to the target audio. For example, the audio uploaded by the user may be stored in a database, and when a pronunciation assessment is required for a certain audio, the server may call the audio from the database as a target audio and determine a target text corresponding to the target audio.

It should be understood that, in practical application, the server may also obtain the target audio to be evaluated and the target text corresponding to the target audio in other manners, and the obtaining manners of the target audio and the target text are not limited in this application. In addition, the target audio to be evaluated in the embodiment of the present application is not limited to the audio input by the user, and may also be other types of audio.

Step 202: and performing acoustic feature extraction processing on the target audio to obtain a target acoustic feature sequence.

After the server acquires the target audio, the server can perform acoustic feature extraction processing on the target audio, so as to obtain a target acoustic feature sequence corresponding to the target audio.

Illustratively, the server may obtain the target acoustic feature sequence corresponding to the target audio by performing a series of processing on the target audio, such as pre-emphasis, framing and windowing, decoding, discrete Fourier transform, Mel filtering, taking logarithms, discrete cosine transform, and differencing. Of course, in practical applications, the server may also perform the acoustic feature extraction processing on the target audio in other manners; the implementation manner of extracting the target acoustic feature sequence from the target audio is not limited in this application.

It should be noted that the target acoustic feature sequence includes acoustic features in a plurality of time units, and the plurality of time units correspond to the duration of the target audio. That is, the target audio is divided into a plurality of segments of sub-audio from a time dimension, the acoustic features corresponding to one segment of sub-audio are the acoustic features in one time unit, and the acoustic features corresponding to each segment of sub-audio form the target acoustic feature sequence. The length of the time unit may be set according to actual requirements, such as 1ms, 10ms, and the like, and the length of the time unit is not limited herein.
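As a concrete illustration of this step, the following is a minimal sketch of one common feature-extraction pipeline (MFCCs with delta features via librosa). The patent does not fix the feature type, sample rate, or frame shift, so every parameter below is an assumption.

```python
import librosa
import numpy as np

# A minimal sketch, assuming 16 kHz audio, 13 MFCCs plus delta and delta-delta
# features, a 25 ms window (400 samples) and a 10 ms shift (160 samples);
# these values are illustrative, not specified by the patent.
def extract_feature_sequence(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)             # decode and resample
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])       # pre-emphasis
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)
    d1 = librosa.feature.delta(mfcc)                 # first-order differences
    d2 = librosa.feature.delta(mfcc, order=2)        # second-order differences
    # One row per time unit: shape (T, 39).
    return np.concatenate([mfcc, d1, d2], axis=0).T
```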

Step 203: determining an acoustic likelihood probability vector according to the target acoustic feature sequence through an acoustic feature recognition model; the acoustic feature recognition model is a model in which a diphone state is a modeling unit.

The server extracts acoustic features of the target audio to obtain a target acoustic feature sequence, then calls an acoustic feature recognition model, inputs the target acoustic feature sequence into the acoustic feature recognition model, and the acoustic feature recognition model outputs acoustic likelihood probability vectors correspondingly by analyzing the input target acoustic feature sequence.

It should be noted that the acoustic feature recognition model is a neural network model using diphone states (hereinafter also referred to as senone states) as modeling units, and is used for determining, for an input acoustic feature sequence, likelihood probabilities corresponding to acoustic features in each time unit in the acoustic feature sequence in each senone state, where the likelihood probabilities specifically refer to conditional probabilities of observing the acoustic features in a given senone state.

For example, in the method provided in the embodiment of the present application, the acoustic feature recognition model may specifically be a Chain model. Unlike the HMM-DNN acoustic model using triphones as modeling units, the Chain model is trained based on a sequence discrimination training criterion and uses diphone states as modeling units. The Chain model generally uses a two-state HMM topology to represent a diphone; fig. 3 is a schematic diagram of the HMM topology used by the Chain model. In this topology, the first senone state a corresponding to a diphone $p_1p_2$ can occur only once, while the second senone state b corresponding to the diphone $p_1p_2$ can occur any number of times, i.e., zero, one, or more times; the number of occurrences of the second senone state b depends on the duration of the acoustic features corresponding to the diphone $p_1p_2$. In other words, once the length of the time interval of the acoustic features corresponding to a diphone is determined, the corresponding senone state sequence can be determined accordingly. For example, if the acoustic features corresponding to a diphone $p_1p_2$ span three time units, the senone state sequence corresponding to $p_1p_2$ should be [a b b].
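A minimal sketch of this two-state topology, assuming only that the first senone state occurs exactly once and the second state fills the remaining time units; the state names are illustrative:

```python
def diphone_state_sequence(first_state: str, second_state: str,
                           num_time_units: int) -> list[str]:
    """Expand a diphone into its senone state sequence for a given duration:
    the first state occurs once, the second fills the remaining time units."""
    assert num_time_units >= 1
    return [first_state] + [second_state] * (num_time_units - 1)

# A diphone whose acoustic features span three time units maps to [a, b, b].
assert diphone_state_sequence("a", "b", 3) == ["a", "b", "b"]
```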

Furthermore, the output of the Chain model also differs from that of the HMM-DNN acoustic model: the HMM-DNN acoustic model outputs an acoustic posterior probability, i.e., the conditional probability that a senone state is observed given an acoustic feature, whereas the Chain model outputs an acoustic likelihood probability, i.e., the conditional probability that an acoustic feature is observed given a senone state.

Suppose the time interval corresponding to a diphone $p_1p_2$ spans from the 1st time unit to the $T$-th time unit, and the acoustic feature sequence within this interval is $O = [o_1\ o_2\ \ldots\ o_T] \in \mathbb{R}^{F \times T}$, where $F$ denotes the dimension of the acoustic features; suppose the senone state sequence corresponding to this interval is $S = [s_1\ s_2\ \ldots\ s_T]$, where $s_t$ denotes the senone state at time $t$. The Chain model determines the conditional probability $P_\theta(o_t \mid s_t)$ of observing $o_t$ given the senone state $s_t$. Under an independence assumption, given the senone state sequence $S$ for a certain time interval, the Chain model can calculate the likelihood probability by Equation (1):

$$P_\theta(O \mid S) = \prod_{t=1}^{T} P_\theta(o_t \mid s_t) \qquad (1)$$

where $\theta$ represents the parameters of the Chain model; the right side of Equation (1) can be regarded as a likelihood function with respect to $\theta$.
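In code, Equation (1) is usually evaluated in the log domain to avoid numerical underflow. The sketch below assumes `log_lik` is the T × N matrix of frame-level log-likelihoods output by the model and `state_ids` maps each time unit to the column index of its senone state; both names are illustrative.

```python
import numpy as np

def sequence_log_likelihood(log_lik: np.ndarray, state_ids: list[int]) -> float:
    """log P(O | S) = sum over t of log P(o_t | s_t), per Equation (1)."""
    return float(sum(log_lik[t, state_ids[t]] for t in range(len(state_ids))))
```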

It should be understood that, in practical applications, besides the Chain model may be used as the acoustic feature recognition model, other neural network models which use diphone states as modeling units and are used for determining the acoustic likelihood probability may also be used as the acoustic feature recognition model in the embodiment of the present application, and the acoustic feature recognition model in the embodiment of the present application is not limited in any way.

It should be noted that the acoustic likelihood probability vector output by the acoustic feature recognition model can be understood as a $T \times N$ likelihood probability matrix, where $T$ represents the number of time units included in the input target acoustic feature sequence and $N$ represents the total number of senone states. In the acoustic likelihood probability vector, the element $X_{ij}$ represents the likelihood probability corresponding to the acoustic feature in the $i$-th time unit under the $j$-th senone state, that is, the conditional probability of observing the acoustic feature in the $i$-th time unit given the $j$-th senone state.

Step 204: and determining the posterior probability of the target phoneme in the target text based on the acoustic likelihood probability vector and the target text.

During specific implementation, the server can determine a time interval to which the acoustic features corresponding to the target diphones in the target text belong in the target acoustic feature sequence as a target time interval based on the acoustic likelihood probability vector and the target text; and then, determining the posterior probability of the target phoneme according to the likelihood probability corresponding to the acoustic features in the target time interval in the acoustic likelihood probability vector.

That is, after the server obtains the acoustic likelihood probability vector output by the acoustic feature recognition model, it may forcibly align the target acoustic feature sequence corresponding to the target audio with the target text corresponding to the target audio based on the acoustic likelihood probability vector; that is, for a target diphone in the target text (which may be any diphone in the target text), the server determines the time interval to which the corresponding acoustic features in the target acoustic feature sequence belong, as the target time interval.

In a possible implementation manner, the server may construct a candidate senone state sequence corresponding to the target text according to the duration of the target audio and the senone state corresponding to each diphone in the target text; then, aiming at each candidate senone state sequence, determining a reference likelihood probability corresponding to the candidate senone state sequence based on an acoustic likelihood probability vector output by an acoustic feature recognition model; further, selecting a target senone state sequence from the candidate senone state sequences according to the reference likelihood probability corresponding to each candidate senone state sequence; and finally, determining a time interval to which the acoustic features corresponding to the diphone in the target acoustic feature sequence belong according to the target senone state sequence aiming at each diphone in the target text.

Specifically, given the duration of the target audio, the server may allocate a corresponding time interval to each diphone in the target text, construct the senone state sequence corresponding to each diphone according to the length of its time interval and its senone states, and concatenate the senone state sequences of the diphones in their order of appearance in the target text to obtain a candidate senone state sequence corresponding to the target text. The server then adjusts the time intervals allocated to the diphones in the target text and repeats the above operation to obtain a plurality of candidate senone state sequences corresponding to the target text.

The server can then determine for each candidate senone state sequence its corresponding reference likelihood probability. Specifically, the server may determine, for each senone state in the candidate senone state sequence, a corresponding time unit, and search, in an acoustic likelihood probability vector output by the acoustic feature recognition model, a likelihood probability corresponding to the acoustic feature in the time unit in the senone state, as a likelihood probability corresponding to the senone state; then, based on the likelihood probability corresponding to each senone state in the candidate senone state sequence, the reference likelihood probability corresponding to the candidate senone state sequence is calculated, for example, the sum or product of the likelihood probabilities corresponding to each senone state in the candidate senone state sequence may be calculated as the reference likelihood probability corresponding to the candidate senone state sequence.

Furthermore, the server may select an optimal candidate senone state sequence from the candidate senone state sequences as the target senone state sequence according to the reference likelihood probability corresponding to each candidate senone state sequence, for example, the server may select a candidate senone state sequence with the highest reference likelihood probability as the target senone state sequence.

The senone states included in the target senone state sequence correspond to the time units in the target audio one by one, and the target senone state sequence is composed of senone state sequences corresponding to the diphones in the target text; based on this, the server may concatenate time units corresponding to the senone states in the senone state sequence corresponding to the diphone for each diphone in the target text to obtain a time interval corresponding to the diphone.
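The following sketch mirrors this description literally: it enumerates every way of allocating the time units to the diphones, scores each resulting candidate senone state sequence with the helpers sketched above, and keeps the best one. A production system would use Viterbi decoding rather than exhaustive enumeration; this brute-force version only illustrates the procedure described here, and all names are illustrative.

```python
import itertools
import numpy as np

def force_align(log_lik: np.ndarray, diphones: list[tuple[str, str]],
                state_index: dict[str, int]):
    """diphones: (first_state, second_state) names for each diphone in order.
    state_index: senone state name -> column index in log_lik (T x N).
    Returns (best_intervals, best_score), where best_intervals[k] is the
    (start, end) time-unit range allocated to the k-th diphone."""
    T, K = log_lik.shape[0], len(diphones)
    best_score, best_intervals = -np.inf, None
    # Every way of splitting T time units into K non-empty consecutive runs.
    for cuts in itertools.combinations(range(1, T), K - 1):
        bounds = (0,) + cuts + (T,)
        states = []
        for (a, b), k in zip(diphones, range(K)):
            states += diphone_state_sequence(a, b, bounds[k + 1] - bounds[k])
        score = sequence_log_likelihood(log_lik, [state_index[s] for s in states])
        if score > best_score:
            best_score = score
            best_intervals = [(bounds[k], bounds[k + 1]) for k in range(K)]
    return best_intervals, best_score
```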

To facilitate understanding of this implementation process, the following example takes the target text as hi, allows silence to occur before hi (the corresponding phoneme is labeled eps), and assumes that the duration of the target audio includes 5 time units; the example is illustrated with the HMM topology corresponding to hi shown in fig. 4.

In view of allowing the silence eps to appear before hi, the target text hi corresponds to the monophone sequence sil, h, i, sil and includes the following diphones: (eps, sil), (sil, h), (h, i), (i, sil). The HMM topology corresponding to the target text hi is shown in fig. 4, where the senone states corresponding to the diphone (eps, sil) include a1 and b1, the senone states corresponding to the diphone (sil, h) include a2 and b2, the senone states corresponding to the diphone (h, i) include a3 and b3, and the senone states corresponding to the diphone (i, sil) include a4 and b4.

In the case where the duration of the target audio includes 5 time units, the server may allocate the first to third time units to the diphones (eps, sil), (sil, h), and (h, i), respectively, and the fourth and fifth time units to the diphone (i, sil); in this case the server constructs the candidate senone state sequence a1a2a3a4b4. By adjusting the manner in which time intervals are allocated to the diphones, the server can also construct other candidate senone state sequences, such as a1a2a3b3a4, a1a2b2a3a4, and a1b1a2a3a4.

Then, the server can determine the corresponding reference likelihood probability for each candidate senone state sequence. Taking the candidate senone state sequence a1a2a3a4b4 as an example, the server can look up, in the acoustic likelihood probability vector output by the acoustic feature recognition model, the likelihood probability $P(o_1 \mid s_1 = a_1)$ corresponding to the acoustic feature in the first time unit under senone state a1, the likelihood probability $P(o_2 \mid s_2 = a_2)$ corresponding to the acoustic feature in the second time unit under senone state a2, the likelihood probability $P(o_3 \mid s_3 = a_3)$ corresponding to the acoustic feature in the third time unit under senone state a3, the likelihood probability $P(o_4 \mid s_4 = a_4)$ corresponding to the acoustic feature in the fourth time unit under senone state a4, and the likelihood probability $P(o_5 \mid s_5 = b_4)$ corresponding to the acoustic feature in the fifth time unit under senone state b4; the reference likelihood probability corresponding to the candidate senone state sequence a1a2a3a4b4 is then calculated as $P(o_1 \mid s_1 = a_1) \times P(o_2 \mid s_2 = a_2) \times P(o_3 \mid s_3 = a_3) \times P(o_4 \mid s_4 = a_4) \times P(o_5 \mid s_5 = b_4)$. In a similar manner, corresponding reference likelihood probabilities are calculated for the candidate senone state sequences a1a2a3b3a4, a1a2b2a3a4, and a1b1a2a3a4, respectively.

Furthermore, the server can take the candidate senone state sequence corresponding to the maximum reference likelihood probability as the target senone state sequence, and, according to the target senone state sequence, determine the time intervals corresponding to the diphones (eps, sil), (sil, h), (h, i), and (i, sil) in the target text hi, respectively. Assuming that the target senone state sequence is a1a2a3a4b4, it can be determined that the first time unit of the target audio corresponds to the diphone (eps, sil), the second time unit corresponds to the diphone (sil, h), the third time unit corresponds to the diphone (h, i), and the fourth and fifth time units correspond to the diphone (i, sil).
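Applying the force_align sketch above to this hi example, with four diphones, five time units, and senone states a1..b4 (the log-likelihood matrix below is random stand-in data; in practice it would come from the acoustic feature recognition model):

```python
rng = np.random.default_rng(0)
names = ["a1", "b1", "a2", "b2", "a3", "b3", "a4", "b4"]
state_index = {name: j for j, name in enumerate(names)}
diphones = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
log_lik = np.log(rng.dirichlet(np.ones(len(names)), size=5))  # 5 x 8 stand-in
intervals, score = force_align(log_lik, diphones, state_index)
# With real model output, intervals like [(0,1), (1,2), (2,3), (3,5)] would
# correspond to the target senone state sequence a1a2a3a4b4.
print(intervals, score)
```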

The server forcibly aligns the target acoustic feature sequence corresponding to the target audio with the target text corresponding to the target audio, and after the target time interval corresponding to the target diphone in the target text is determined, the posterior probability of the target phoneme can be further determined according to the likelihood probability corresponding to the acoustic feature in the target time interval in the acoustic likelihood probability vector.

The target phoneme may refer to the target diphone itself, or to the latter monophone in the target diphone (because the latter monophone in a diphone is the dominant phone). The posterior probability of the target phoneme may be a single posterior probability value or a posterior probability distribution composed of a plurality of posterior probability values. When the target phoneme is the target diphone, the posterior probability distribution of the target phoneme is an M×M-dimensional diphone posterior probability distribution, where M is the number of all monophones, and the element Yij in the distribution characterizes the posterior probability corresponding to the diphone composed of the ith monophone and the jth monophone; when the target phoneme is the latter monophone in the target diphone, the posterior probability distribution of the target phoneme is an M×1-dimensional monophone posterior probability distribution, where M is the number of all monophones, and the element Zi1 in the distribution characterizes the posterior probability corresponding to the ith monophone.

Specifically, when the posterior probability of the target phoneme is determined, the server may determine a reference HMM topology according to the length of the target time interval corresponding to the target diphone; then, combining the single phonemes pairwise to obtain a plurality of candidate diphones corresponding to the acoustic features in the target time interval; further, determining a senone state sequence corresponding to each candidate diphone according to the senone state corresponding to each candidate diphone and the reference HMM topology; and finally, based on each senone state sequence, determining the posterior probability of the target phoneme according to the likelihood probability corresponding to the acoustic features in the target time interval in the acoustic likelihood probability vector output by the acoustic feature recognition model.

After the server determines the target time interval, it determines, according to the length of the target time interval, the reference HMM topology applicable to a diphone occupying that interval. For example, in the case that the target time interval includes only one time unit, the reference HMM topology includes only the master senone state corresponding to the diphone; in the case that the target time interval includes two time units, the reference HMM topology includes both the master senone state and the slave senone state corresponding to the diphone, with the slave senone state appearing once; in the case that the target time interval includes three time units, the reference HMM topology includes both the master and slave senone states, with the slave senone state appearing twice through its self-loop; and so on.

In addition, the server needs to combine each monophone two by two to obtain a plurality of candidate diphones; for example, according to the rules of the CMU pronunciation dictionary, 39 monophones are involved in total without regard to position and stress, and combining the monophones two by two will result in 39 × 39 = 1521 candidate diphones. Furthermore, the server may determine, for each candidate diphone, a senone state sequence corresponding to the candidate diphone according to the senone state corresponding to the candidate diphone and the reference HMM topology; for example, suppose the senone states corresponding to a diphone p1p2 include a master senone state a and a slave senone state b, then in the case that the reference HMM topology corresponds to three time units, the senone state sequence constructed for p1p2 should be abb.
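The topology rule and the pairwise combination can be sketched in a few lines; the master/slave state names and the three-monophone inventory are assumptions of this sketch, not part of the method's interface.

```python
def senone_sequence(master, slave, n_units):
    """Senone state sequence for one candidate diphone under the reference
    HMM topology: the master state covers the first time unit and the slave
    state repeats for the remaining units (a, ab, abb, ...)."""
    if n_units < 1:
        raise ValueError("a diphone needs at least one time unit")
    return [master] + [slave] * (n_units - 1)

# Pairwise combination of monophones into candidate diphones
MONOPHONES = ["sil", "h", "i"]  # toy inventory; the CMU dictionary has 39
candidate_diphones = [(p1, p2) for p1 in MONOPHONES for p2 in MONOPHONES]

print(senone_sequence("a", "b", 3))  # ['a', 'b', 'b'], the abb example
print(len(candidate_diphones))       # 3 x 3 = 9 here; 39 x 39 = 1521 in full
```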

Furthermore, the server may determine the posterior probability of the target phoneme according to the likelihood probability corresponding to the acoustic feature in the target time interval in the acoustic likelihood probability vector output by the acoustic feature recognition model based on each senone state sequence. The embodiment of the present application provides four exemplary implementation manners for determining the posterior probability of the target phoneme, and the four implementation manners are respectively described below.

In a first possible implementation manner, the server may determine a diphone posterior probability distribution as the posterior probability of the target phoneme in the following way: for each senone state sequence, the reference likelihood probability of the candidate diphone corresponding to that sequence is determined according to the likelihood probabilities, in the acoustic likelihood probability vector, corresponding to the acoustic features in the target time interval under the senone states included in the sequence; the sum of the reference likelihood probabilities of all candidate diphones is determined as the total reference likelihood probability; the posterior probability of each candidate diphone is determined from its reference likelihood probability and the total reference likelihood probability; and finally, a diphone posterior probability distribution corresponding to the acoustic features in the target time interval is constructed from the posterior probabilities of the candidate diphones and taken as the posterior probability of the target phoneme.

Specifically, the server may look up, for each senone state in the senone state sequence, the likelihood probability in the acoustic likelihood probability vector corresponding to the acoustic feature in the time unit covered by that senone state, as the likelihood probability corresponding to that senone state; the reference likelihood probability of the candidate diphone corresponding to the senone state sequence is then determined from the likelihood probabilities of its senone states, for example as their sum or product. Next, the sum of the reference likelihood probabilities of all candidate diphones is calculated as the total reference likelihood probability, and for each candidate diphone, the ratio between its reference likelihood probability and the total reference likelihood probability is calculated as the posterior probability of that candidate diphone. Further, a diphone posterior probability distribution is constructed from the posterior probabilities of the candidate diphones as the posterior probability of the target phoneme; for example, assuming there are M monophones in total, combining them two by two yields M×M candidate diphones, and an M×M-dimensional diphone posterior probability distribution can be constructed from their posterior probabilities, where the element Yij is the posterior probability of the candidate diphone consisting of the ith monophone and the jth monophone.

Suppose a candidate diphone is p1p2 and its reference likelihood probability is P(O|p1p2); then the posterior probability P(p1p2|O) of the candidate diphone p1p2 can be calculated by formula (2):

$$P(p_1 p_2 \mid O) = \frac{P(O \mid p_1 p_2)}{\sum_{q_1 q_2} P(O \mid q_1 q_2)} \tag{2}$$

where q1q2 represents any one of the candidate diphones, and P(O|q1q2) represents the reference likelihood probability of the candidate diphone q1q2. When the senone state sequence corresponding to the candidate diphone p1p2 is S, P(O|p1p2) = P(O|S).

Thus, after the posterior probability of each candidate diphone is calculated by formula (2), the diphone posterior probability distribution corresponding to the target time interval is constructed using these posterior probabilities and taken as the posterior probability of the target phoneme.
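As a concrete reading of formula (2), the following sketch normalises toy reference likelihoods into a diphone posterior distribution; the log-sum-exp step is a numerical-stability choice of this sketch, not something the text prescribes.

```python
import math

def diphone_posteriors(ref_loglik):
    """Formula (2): each candidate diphone's posterior is its reference
    likelihood divided by the sum over all candidate diphones."""
    m = max(ref_loglik.values())
    log_total = m + math.log(sum(math.exp(v - m) for v in ref_loglik.values()))
    return {d: math.exp(v - log_total) for d, v in ref_loglik.items()}

# Toy reference log-likelihoods log P(O | p1 p2) for three candidate diphones
ref_loglik = {("h", "i"): -1.0, ("sil", "i"): -4.0, ("i", "sil"): -5.0}
dist = diphone_posteriors(ref_loglik)
print(sum(dist.values()))  # about 1.0: a valid posterior distribution
```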

In a second possible implementation manner, the server may determine the posterior probability value of the target diphone as the posterior probability of the target phoneme in the following way: for each senone state sequence, the reference likelihood probability of the candidate diphone corresponding to that sequence is determined according to the likelihood probabilities, in the acoustic likelihood probability vector, corresponding to the acoustic features in the target time interval under the senone states included in the sequence; the sum of the reference likelihood probabilities of all candidate diphones is determined as the total reference likelihood probability; and for the target diphone, its posterior probability is determined from its reference likelihood probability and the total reference likelihood probability and taken as the posterior probability of the target phoneme.

Unlike the first implementation described above, in the second implementation the server determines the posterior probability only for the target diphone in the target text and directly uses it as the posterior probability of the target phoneme. That is, in the second implementation the server still needs to determine a reference likelihood probability for each candidate diphone and derive the total reference likelihood probability from them; however, the server does not need to calculate the posterior probability of every candidate diphone, and only needs to calculate the ratio of the reference likelihood probability of the target diphone to the total reference likelihood probability, obtaining the posterior probability of the target diphone as the posterior probability of the target phoneme.

Illustratively, assume the target diphone is p1p2 and its reference likelihood probability is P(O|p1p2); then the posterior probability P(p1p2|O) of the target diphone can be calculated by formula (3):

$$P(p_1 p_2 \mid O) = \frac{P(O \mid p_1 p_2)}{\sum_{q_1 q_2} P(O \mid q_1 q_2)} \tag{3}$$

where q1q2 represents any one of the candidate diphones, and P(O|q1q2) represents the reference likelihood probability of the candidate diphone q1q2. When the senone state sequence corresponding to the target diphone p1p2 is S, P(O|p1p2) = P(O|S).

In this way, the posterior probability of the target diphone calculated by formula (3) can be used directly as the posterior probability of the target phoneme.
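Under the same toy numbers as the previous sketch, the second implementation reduces to a single ratio:

```python
import math

# Toy reference log-likelihoods log P(O | q1 q2) over all candidate diphones
ref_loglik = {("h", "i"): -1.0, ("sil", "i"): -4.0, ("i", "sil"): -5.0}
target = ("h", "i")

# Formula (3): only the target diphone's ratio is computed
log_total = math.log(sum(math.exp(v) for v in ref_loglik.values()))
p_target = math.exp(ref_loglik[target] - log_total)
print(p_target)  # posterior of the target diphone, used as the target phoneme's posterior
```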

It should be noted that, in the first implementation and the second implementation, when the server determines the posterior probability of the diphone (candidate diphone or target diphone), the prior probability of the diphone may also be considered comprehensively. Taking the calculation of the posterior probability of the target diphone as an example, the server may determine the posterior probability of the target diphone according to the reference likelihood probability of the target diphone, the prior probability of the target diphone, the total reference likelihood probability, and the respective prior probabilities of the candidate diphones.

Illustratively, assume the target diphone is p1p2 and its reference likelihood probability is P(O|p1p2); then the posterior probability P(p1p2|O) of the target diphone can be calculated by formula (4):

$$P(p_1 p_2 \mid O) = \frac{P(O \mid p_1 p_2)\, P(p_1 p_2)}{\sum_{q_1 q_2} P(O \mid q_1 q_2)\, P(q_1 q_2)} \tag{4}$$

where P(p1p2) represents the prior probability of the target diphone p1p2, which can be determined by counting the occurrences of p1p2 in historical text; P(q1q2) represents the prior probability of any candidate diphone q1q2, which can be determined by counting the occurrences of q1q2 in historical text.

The posterior probability calculation formula shown in formula (4) is derived from the Bayes formula; the formulas shown in formula (2) and formula (3) are obtained from formula (4) under the assumption of equal priors. Experimental studies show that the posterior probabilities calculated by formula (2) and formula (3) are more accurate.
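A sketch of formula (4), with hypothetical diphone counts standing in for the historical-text statistics:

```python
import math
from collections import Counter

# Hypothetical diphone counts from historical text; priors are relative frequencies
counts = Counter({("h", "i"): 120, ("sil", "i"): 30, ("i", "sil"): 150})
total_count = sum(counts.values())
prior = {d: c / total_count for d, c in counts.items()}

ref_loglik = {("h", "i"): -1.0, ("sil", "i"): -4.0, ("i", "sil"): -5.0}

# Formula (4): weight each reference likelihood by the diphone's prior
weighted = {d: math.exp(v) * prior[d] for d, v in ref_loglik.items()}
denom = sum(weighted.values())
posterior = {d: w / denom for d, w in weighted.items()}
print(posterior[("h", "i")])
```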

In a third possible implementation manner, the server may determine a monophone posterior probability distribution as the posterior probability of the target phoneme in the following way: the former and latter monophones in each candidate diphone are treated as the front phoneme and the post phoneme, respectively. For the senone state sequences corresponding to the candidate diphones that share the same post phoneme, the reference likelihood probability of each such candidate diphone is determined according to the likelihood probabilities, in the acoustic likelihood probability vector, corresponding to the acoustic features in the target time interval under the senone states included in the sequence; the maximum among these reference likelihood probabilities is then selected as the reference likelihood probability of that post phoneme. Next, the sum of the reference likelihood probabilities of all post phonemes is determined as the total reference likelihood probability. Further, for each post phoneme, its posterior probability is determined from its reference likelihood probability and the total reference likelihood probability. Finally, a monophone posterior probability distribution corresponding to the acoustic features in the target time interval is constructed from the posterior probabilities of the post phonemes and taken as the posterior probability of the target phoneme.

Specifically, the server may construct, for each monophone, the set of candidate diphones in which that monophone appears as the post phoneme. Then, for each post phoneme, its reference likelihood probability is determined from the senone state sequences corresponding to the candidate diphones in its set. In specific implementation, the server may look up, for each senone state in a senone state sequence, the likelihood probability in the acoustic likelihood probability vector corresponding to the acoustic feature in the time unit covered by that senone state, as the likelihood probability corresponding to that senone state; the reference likelihood probability of the candidate diphone corresponding to the senone state sequence is then determined from the likelihood probabilities of its senone states, for example as their sum or product. After determining the reference likelihood probability of every candidate diphone in the set corresponding to a given post phoneme, the server may select the maximum among them as the reference likelihood probability of that post phoneme.

Then, the sum of the reference likelihood probabilities of the post phonemes is calculated as the total reference likelihood probability, and for each post phoneme, the ratio of its reference likelihood probability to the total reference likelihood probability is calculated as the posterior probability of that post phoneme. Further, a monophone posterior probability distribution is constructed from the posterior probabilities of the post phonemes as the posterior probability of the target phoneme; for example, assuming there are M monophones in total, an M×1-dimensional monophone posterior probability distribution can be constructed from the posterior probabilities of the M monophones, where the element Zi1 is the posterior probability of the ith monophone.

Illustratively, according to the Bayesian formula, the posterior probability P(p2|O) of a certain monophone p2 can be calculated by formula (5):

$$P(p_2 \mid O) = \frac{\sum_{p_1} P(O \mid p_1 p_2)}{\sum_{q_1 q_2} P(O \mid q_1 q_2)} \tag{5}$$

where the numerator sums, with the monophone p2 fixed, the reference likelihood probabilities of the candidate diphones that include p2 as the post phoneme over every front phoneme p1, and the denominator is the sum of the reference likelihood probabilities of all candidate diphones q1q2.

However, experimental studies show that the monophone posterior probability calculated by formula (5) is often not accurate enough; the monophone posterior probability calculation formula shown in formula (5) is therefore adjusted to formula (6):

$$P(p_2 \mid O) = \frac{\max_{p_1} P(O \mid p_1 p_2)}{\sum_{q_2} \max_{q_1} P(O \mid q_1 q_2)} \tag{6}$$

where max over p1 of P(O|p1p2) represents the reference likelihood probability of the post phoneme p2: with the post phoneme p2 fixed, the front phoneme p1 is enumerated, the reference likelihood probability is calculated for each candidate diphone that includes p2 as the post phoneme, and the maximum among these reference likelihood probabilities is selected as the reference likelihood probability of p2. The denominator represents the sum of the reference likelihood probabilities of all post phonemes q2.

The reason for this adjustment is that, in practical applications, not all diphones obtained by combining the monophones two by two actually occur: some monophones correspond to many diphones while others correspond to few. Summing directly therefore yields larger posterior probabilities for monophones that correspond to more diphones and smaller posterior probabilities for monophones that correspond to fewer. In practice, it was found that selecting the maximum reference likelihood probability among the candidate diphones containing the same post phoneme to represent the reference likelihood probability of that post phoneme makes the subsequently calculated posterior probabilities more accurate.

In this way, after the posterior probability of each post phoneme is calculated by formula (6), the monophone posterior probability distribution corresponding to the target time interval is constructed using these posterior probabilities and taken as the posterior probability of the target phoneme.
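A minimal sketch of formula (6), grouping toy candidate-diphone likelihoods by post phoneme and keeping the maximum over front phonemes before normalising:

```python
import math
from collections import defaultdict

# Toy reference log-likelihoods log P(O | p1 p2) for every candidate diphone
ref_loglik = {
    ("sil", "i"): -4.0, ("h", "i"): -1.0,    # candidates whose post phoneme is "i"
    ("i", "sil"): -5.0, ("h", "sil"): -6.0,  # candidates whose post phoneme is "sil"
}

# For each post phoneme, keep the maximum over front phonemes
best = defaultdict(lambda: float("-inf"))
for (p1, p2), v in ref_loglik.items():
    best[p2] = max(best[p2], v)

# Normalise across post phonemes to obtain the monophone posteriors
denom = sum(math.exp(v) for v in best.values())
posterior = {p2: math.exp(v) / denom for p2, v in best.items()}
print(posterior)  # posteriors of "i" and "sil", summing to 1.0
```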

In a fourth possible implementation manner, the server may determine the posterior probability value of the target post phoneme in the target diphone as the posterior probability of the target phoneme in the following way: the former and latter monophones in each candidate diphone are treated as the front phoneme and the post phoneme, respectively. For the senone state sequences corresponding to the candidate diphones that share the same post phoneme, the reference likelihood probability of each such candidate diphone is determined according to the likelihood probabilities, in the acoustic likelihood probability vector, corresponding to the acoustic features in the target time interval under the senone states included in the sequence, and the maximum among them is selected as the reference likelihood probability of that post phoneme. Then, the sum of the reference likelihood probabilities of all post phonemes is determined as the total reference likelihood probability. Finally, for the target post phoneme in the target diphone, its posterior probability is determined from its reference likelihood probability and the total reference likelihood probability, and taken as the posterior probability of the target phoneme.

Unlike the third implementation manner described above, in the fourth implementation the server determines the posterior probability only for the post phoneme included in the target diphone (i.e., the target post phoneme) and directly uses it as the posterior probability of the target phoneme. That is, in the fourth implementation the server still needs to determine a reference likelihood probability for each post phoneme and derive the total reference likelihood probability from them; however, the server does not need to calculate the posterior probability of every post phoneme, and only needs to calculate the ratio of the reference likelihood probability of the target post phoneme to the total reference likelihood probability, obtaining the posterior probability of the target post phoneme as the posterior probability of the target phoneme.

Illustratively, assume the target diphone is p1p2 and the target post phoneme is p2; the server can calculate the posterior probability of the target post phoneme p2 by formula (7):

$$P(p_2 \mid O) = \frac{\max_{p_1} P(O \mid p_1 p_2)}{\sum_{q_2} \max_{q_1} P(O \mid q_1 q_2)} \tag{7}$$

where the numerator represents the reference likelihood probability of the target post phoneme p2; q1q2 represents any one of the candidate diphones, and max over q1 of P(O|q1q2) represents the reference likelihood probability of a post phoneme q2.

In this way, the posterior probability of the target post phoneme in the target diphone is calculated by formula (7) and can be used as the posterior probability of the target phoneme.

Step 205: and determining a target pronunciation evaluation result according to the posterior probability of the target phoneme.

After the server determines the posterior probability of the target phoneme, the target pronunciation evaluation result can be determined according to the posterior probability of the target phoneme.

In practical application, the server generally needs to use a pronunciation evaluation model to determine a target pronunciation evaluation result according to the posterior probability of the target phoneme. The pronunciation evaluation model is a neural network model obtained by training in advance in a supervised training mode, that is, the server can train the pronunciation evaluation model by using a large number of training samples including the phoneme posterior probability and the labeled pronunciation evaluation result until the pronunciation evaluation model meets the training end condition, for example, until the performance of the pronunciation evaluation model reaches a preset performance standard, or until the iterative training times of the pronunciation evaluation model reaches a preset training times, and so on.
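As a rough illustration of such supervised training (and of the three-layer network used in the experiments later in this document), the following PyTorch sketch trains a scorer on posterior features for one step; the layer sizes, optimiser, loss and toy batch are assumptions of this sketch, not details given in the text.

```python
import torch
import torch.nn as nn

M = 39  # assumed input dimension: an M-dimensional monophone posterior distribution
model = nn.Sequential(
    nn.Linear(M, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),  # 1 = good pronunciation, 0 = poor
)
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One supervised training step on a toy batch of (posterior, label) pairs
x = torch.rand(8, M)                     # stand-in posterior features
y = torch.randint(0, 2, (8, 1)).float()  # stand-in pronunciation labels
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```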

It should be understood that, in practical applications, the processing object of the pronunciation evaluation model trained by the server may be any one of diphone posterior probability distribution, diphone posterior probability value, monophonic posterior probability distribution and monophonic posterior probability value, and specifically, the processing object of the pronunciation evaluation model may be set according to practical requirements.

It should be noted that, in practical applications, the server may determine the target pronunciation evaluation result through at least one of the following manners: determining a phoneme pronunciation evaluation result according to the posterior probability of the target phoneme through a phoneme evaluation model; determining a word pronunciation evaluation result according to a first posterior probability set through a word evaluation model, wherein the first posterior probability set comprises: the posterior probability of each target phoneme included in the word to be evaluated in the target text; determining a statement pronunciation evaluation result according to a second posterior probability set through a statement evaluation model, wherein the second posterior probability set comprises: the sentence to be evaluated in the target text comprises the posterior probability of each target phoneme.

That is, after the server determines the posterior probability of the target phoneme, at least one of the phoneme pronunciation, the word pronunciation and the sentence pronunciation may be evaluated based on the posterior probability of the target phoneme. When the server evaluates the phoneme pronunciation, the posterior probability of the target phoneme determined in the preceding step may be directly input into the phoneme evaluation model, and the result output by the phoneme evaluation model is obtained as the phoneme pronunciation evaluation result. When the server evaluates the word pronunciation, the server can determine each target phoneme included in the word to be evaluated, input the posterior probability of each target phoneme in the word to be evaluated into the word evaluation model, and obtain the output result of the word evaluation model as the word pronunciation evaluation result. When the server evaluates the statement pronunciation, the server can determine each target phoneme included in the statement to be evaluated, input the posterior probability of each target phoneme in the statement to be evaluated into the statement evaluation model, and obtain the output result of the statement evaluation model as the statement pronunciation evaluation result. Of course, the server may further determine the pronunciation evaluation result of an article, using an article evaluation model, according to the posterior probability of each target phoneme included in each sentence of the article to be evaluated, and the like.
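The three granularities differ only in how the phoneme posteriors are gathered before being handed to the corresponding model; the sketch below uses hypothetical callables in place of the trained phoneme, word and sentence evaluation models.

```python
# sentence: list of words, each word a list of phoneme ids;
# posteriors: phoneme id -> posterior probability (or distribution)
def evaluate(sentence, posteriors, phoneme_eval, word_eval, sentence_eval):
    phone_scores = {p: phoneme_eval(posteriors[p])
                    for word in sentence for p in word}
    word_scores = [word_eval([posteriors[p] for p in word])
                   for word in sentence]
    sentence_score = sentence_eval([posteriors[p]
                                    for word in sentence for p in word])
    return phone_scores, word_scores, sentence_score

# Trivial stand-in "models" just to exercise the plumbing
avg = lambda xs: sum(xs) / len(xs)
print(evaluate([["h", "i"]], {"h": 0.9, "i": 0.7}, lambda p: p, avg, avg))
```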

It should be understood that in a scenario where the server performs pronunciation evaluation according to the target audio uploaded by the terminal device, after the server determines a target pronunciation evaluation result, the target pronunciation evaluation result may be further returned to the terminal device, so that the terminal device displays the pronunciation evaluation result to the user. Fig. 5 is a schematic diagram of an exemplary pronunciation evaluation result display interface, and as shown in fig. 5, the terminal device may display a statement pronunciation evaluation result in the interface, where the statement pronunciation evaluation result may be represented by a score or a star level; further, as shown in fig. 5 (a), the user can view the pronunciation evaluation result of a phoneme by clicking the phoneme, and as shown in fig. 5 (b), the user can view the pronunciation evaluation result of a word by pressing the word for a long time.

In the spoken language pronunciation evaluation method provided by the embodiment of the present application, the acoustic feature recognition model with the diphone state as the modeling unit has better acoustic modeling capability and speech recognition capability than the HMM-DNN model with the triphone as the modeling unit, so the acoustic feature recognition model is introduced into the spoken language pronunciation evaluation process; in addition, in order to make the acoustic likelihood probability vector output by the acoustic feature recognition model suitable for spoken language pronunciation evaluation, the embodiment of the present application also provides implementation manners for determining the acoustic posterior probability based on the acoustic likelihood probability. Performing spoken language pronunciation evaluation with an acoustic feature recognition model that takes the diphone state as the modeling unit therefore yields a more accurate pronunciation evaluation result and effectively improves the evaluation effect.

In order to further understand the method for evaluating spoken language pronunciation provided by the embodiment of the present application, the following takes the determination of the pronunciation evaluation result based on the posterior probability distribution of the monophonic element as an example with reference to the flowchart shown in fig. 6, and a whole exemplary description is provided for the method for evaluating spoken language pronunciation provided by the embodiment of the present application.

As shown in fig. 6, the terminal device may send the audio recorded when the user follows the target text to the server through the network, so that the server takes the audio as the target audio to be evaluated. After the server obtains the target audio, the server may first perform acoustic feature extraction processing on the target audio through step 601 to obtain a target acoustic feature sequence corresponding to the target audio.

The server may then determine an acoustic likelihood probability vector based on the target acoustic feature sequence using the Chain model, via step 602. The acoustic likelihood probability vector includes likelihood probabilities corresponding to the acoustic features in the time units in the target acoustic feature sequence in the senone state.

Next, the server may perform forced alignment on the target acoustic feature sequence and the target text according to the acoustic likelihood probability vector output by the Chain model in step 603, and determine, for each diphone included in the target text, a time interval to which the acoustic feature corresponding to the target acoustic feature sequence belongs.

Furthermore, the server may determine, through steps 604 to 606, a monophonic posterior probability distribution corresponding to the acoustic features in each time interval based on the likelihood probability corresponding to the acoustic features in the time interval in the acoustic likelihood probability vector, with the time interval corresponding to each diphone in the target text as the processing unit.

When the method is specifically implemented, for a monophone p2, each monophone p1 can be enumerated as the front phoneme to form, together with p2, a plurality of candidate diphones; then, for each candidate diphone, its reference likelihood probability is determined according to its corresponding senone state sequence and the acoustic likelihood probability vector; further, the maximum reference likelihood probability, max over p1 of P(O|p1p2), is selected from the reference likelihood probabilities of the candidate diphones that include p2 as the post phoneme, as the reference likelihood probability of p2. In this manner, a reference likelihood probability is determined for each monophone. Then, the server may calculate the sum of the reference likelihood probabilities of the individual monophones as the total reference likelihood probability; further, for each monophone, the ratio of its reference likelihood probability to the total reference likelihood probability is calculated as the posterior probability of that monophone. Finally, the monophone posterior probability distribution corresponding to the acoustic features in the time interval is constructed using the posterior probabilities of the monophones.

Finally, the server may score the spoken language pronunciation evaluation of the audio uploaded by the terminal device according to the posterior probability distribution of the single phone corresponding to each acoustic feature in each time interval in the target acoustic feature sequence by using the pre-trained pronunciation evaluation model in step 607.
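Read end to end, steps 601 to 607 reduce to the following orchestration sketch; every helper here is a hypothetical placeholder for the component described above, not a real API.

```python
def extract_features(audio): ...            # step 601: acoustic feature sequence
def chain_model(features): ...              # step 602: acoustic likelihood vector
def force_align(loglik, text): ...          # step 603: per-diphone time intervals
def monophone_posteriors(loglik, iv): ...   # steps 604-606: formula (6) per interval
def scoring_model(posteriors): ...          # step 607: trained evaluation model

def evaluate_audio(audio, target_text):
    features = extract_features(audio)
    loglik = chain_model(features)
    intervals = force_align(loglik, target_text)
    posteriors = [monophone_posteriors(loglik, iv) for iv in intervals]
    return scoring_model(posteriors)
```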

The inventor of the present application compares, through experiments, a model recognition effect of an HMM-DNN acoustic model using triphones as modeling units in the related art with a model recognition effect of a Chain model using diphone states as modeling units in the embodiment of the present application, and obtains a comparison result of the model recognition effects shown in table 1. In order to ensure the fairness of comparison, the HMM-DNN model and the Chain model are trained by using 380-hour spoken English recordings of primary school students in China and are tested by using 10-hour voice data.

TABLE 1

                           Chain model    HMM-DNN model
Word Error Rate (WER)           11.22            13.51

As can be seen from table 1, a higher recognition accuracy can be achieved by using the Chain model, and theoretically, the decoding graph of the Chain model is smaller than that of the HMM-DNN model using triphones as modeling units, and the decoding time of the Chain model is shorter.

In addition, the inventor also scored phoneme pronunciation with the spoken language pronunciation evaluation method based on the likelihood probabilities output by the Chain model and with the HMM-DNN-based method in the related art, respectively, so as to compare the evaluation accuracy of the two methods. To ensure fairness of the comparison, after the phoneme posterior probabilities are computed, a three-layer neural network with the same structure is used to predict phoneme pronunciation quality in both cases; the input of the neural network is the phoneme posterior probability calculated by each method, and the output is 1 or 0 (indicating whether the pronunciation of the evaluated phoneme is good or not). The neural network is trained using about 3000 sentences of annotated spoken English data from elementary school students in China and tested using about 1000 annotated sentences. The evaluation indexes include precision (Precision), recall (Recall) and the F value (F-measure), and the comparison results are shown in table 2.

TABLE 2

                      Precision    Recall    F value
Prior art                  0.49      0.54       0.51
This application           0.46      0.61       0.53

From table 2, it can be found that, in terms of both recall rate and F value, the evaluation effect of the method provided by the embodiment of the present application is superior to that based on the HMM-DNN model in the related art.

The method provided by the embodiment of the present application builds on the Chain model widely used in the speech recognition industry: the posterior probability of a phoneme is determined based on the likelihood probabilities output by the Chain model, and pronunciation evaluation is carried out based on that posterior probability. On the one hand, the pronunciation evaluation effect obtained with the Chain model is superior to that of the HMM-DNN-based method in the related art. On the other hand, existing pronunciation scoring software generally needs to maintain two models, an HMM-DNN model for pronunciation evaluation and a Chain model for speech recognition, which increases system maintenance cost and consumes considerable manpower and material resources; with the spoken language pronunciation evaluation method based on the Chain model provided by the embodiment of the present application, pronunciation scoring software only needs to maintain one Chain model, which greatly reduces the maintenance cost of software products.

Aiming at the spoken language pronunciation evaluating method described above, the application also provides a corresponding spoken language pronunciation evaluating device, so that the spoken language pronunciation evaluating method is applied and realized in practice.

Referring to fig. 7, fig. 7 is a schematic structural diagram of a spoken language pronunciation evaluation device 700 corresponding to the spoken language pronunciation evaluation method shown in fig. 2. As shown in fig. 7, the spoken language pronunciation evaluation device 700 includes:

the audio acquisition module 701 is used for acquiring a target audio to be evaluated; the target audio corresponds to target text;

an acoustic feature extraction module 702, configured to perform acoustic feature extraction processing on the target audio to obtain a target acoustic feature sequence;

a likelihood probability determining module 703, configured to determine, by using an acoustic feature recognition model, an acoustic likelihood probability vector according to the target acoustic feature sequence; the acoustic feature recognition model is a model taking diphone states as modeling units;

a posterior probability determination module 704, configured to determine a posterior probability of a target phoneme in the target text based on the acoustic likelihood probability vector and the target text;

and the pronunciation evaluating module 705 is used for determining a target pronunciation evaluating result according to the posterior probability of the target phoneme.

Optionally, on the basis of the spoken language pronunciation evaluating device shown in fig. 7, referring to fig. 8, fig. 8 is a schematic structural diagram of another spoken language pronunciation evaluating device 800 provided in the embodiment of the present application. As shown in fig. 8, the posterior probability determination module 704 includes:

a forced alignment sub-module 801, configured to determine, based on the acoustic likelihood probability vector and the target text, a time interval to which an acoustic feature corresponding to a target diphone in the target text in the target acoustic feature sequence belongs, as a target time interval;

a posterior probability determining submodule 802, configured to determine a posterior probability of the target phoneme according to a likelihood probability corresponding to the acoustic feature in the target time interval in the acoustic likelihood probability vector.

Optionally, on the basis of the spoken language pronunciation evaluating apparatus shown in fig. 8, the forced alignment sub-module 801 is specifically configured to:

constructing a candidate diphone state sequence corresponding to the target text according to the duration of the target audio and the diphone state corresponding to each diphone in the target text;

for each candidate diphone state sequence, determining a reference likelihood probability corresponding to the candidate diphone state sequence based on the acoustic likelihood probability vector;

selecting a target diphone state sequence from each candidate diphone state sequence according to the reference likelihood probability corresponding to each candidate diphone state sequence;

and determining time intervals to which the acoustic features corresponding to the diphones in the target text in the target acoustic feature sequence respectively belong according to the target diphone state sequence.

Optionally, on the basis of the spoken language pronunciation evaluating device shown in fig. 8, referring to fig. 9, fig. 9 is a schematic structural diagram of another spoken language pronunciation evaluating device 900 provided in the embodiment of the present application. As shown in fig. 9, the posterior probability determination sub-module 802 includes:

an HMM topology determining unit 901, configured to determine a reference hidden markov model HMM topology according to the length of the target time interval;

a candidate diphone construction unit 902, configured to combine each diphone pairwise to obtain a plurality of candidate diphones corresponding to the acoustic features in the target time interval;

a diphone state sequence constructing unit 903, configured to determine a diphone state sequence corresponding to each candidate diphone according to the diphone state corresponding to each candidate diphone and the reference HMM topology;

a posterior probability determining unit 904, configured to determine a posterior probability of the target phoneme according to a likelihood probability corresponding to the acoustic feature in the target time interval in the acoustic likelihood probability vector, based on each diphone state sequence.

Optionally, on the basis of the spoken language pronunciation evaluating device shown in fig. 9, the posterior probability determining unit 904 is specifically configured to:

for each diphone state sequence, determining the reference likelihood probability of the candidate diphone corresponding to the diphone state sequence according to the likelihood probability corresponding to the acoustic feature in the target time interval in the diphone state included in the diphone state sequence in the acoustic likelihood probability vector;

determining a sum of the reference likelihood probabilities of the candidate diphones as a total reference likelihood probability;

determining, for each of the candidate diphones, a posterior probability of the candidate diphone based on the reference likelihood probability of the candidate diphone and the total reference likelihood probability;

and constructing a diphone posterior probability distribution corresponding to the acoustic features in the target time interval as the posterior probability of the target phoneme based on the posterior probability of each candidate diphone.

Optionally, on the basis of the spoken language pronunciation evaluating device shown in fig. 9, the posterior probability determining unit 904 is specifically configured to:

for each diphone state sequence, determining the reference likelihood probability of the candidate diphone corresponding to the diphone state sequence according to the likelihood probability corresponding to the acoustic feature in the target time interval in the diphone state included in the diphone state sequence in the acoustic likelihood probability vector;

determining a sum of the reference likelihood probabilities of the candidate diphones as a total reference likelihood probability;

and aiming at the target diphone, determining the posterior probability of the target diphone as the posterior probability of the target diphone according to the reference likelihood probability of the target diphone and the total reference likelihood probability.

Optionally, on the basis of the spoken language pronunciation evaluating device shown in fig. 9, the posterior probability determining unit 904 is specifically configured to:

and determining the posterior probability of the target diphone according to the reference likelihood probability of the target diphone, the prior probability of the target diphone, the total reference likelihood probability and the respective prior probability of each candidate diphone.

Optionally, on the basis of the spoken language pronunciation evaluating apparatus shown in fig. 9, the front monophonic element and the rear monophonic element in the candidate diphones are respectively used as a front phoneme and a rear phoneme; the posterior probability determination unit 904 is specifically configured to:

determining, for the diphone state sequence corresponding to each of the candidate diphones including the same post-phoneme, a reference likelihood probability of the candidate diphone according to a likelihood probability corresponding to an acoustic feature in the target time interval in the diphone state included in the diphone state sequence in the acoustic likelihood probability vector; selecting the maximum reference likelihood probability from the reference likelihood probabilities of the candidate diphones including the later phoneme as the reference likelihood probability of the later phoneme;

determining the sum value of the respective reference likelihood probabilities of the rear phonemes as a total reference likelihood probability;

for each said post phoneme, determining a posterior probability of said post phoneme according to the reference likelihood probability of said post phoneme and said total reference likelihood probability;

and constructing single-phone posterior probability distribution corresponding to the acoustic features in the target time interval based on the posterior probability of each posterior phone, wherein the single-phone posterior probability distribution is used as the posterior probability of the target phone.

Optionally, on the basis of the spoken language pronunciation evaluating apparatus shown in fig. 9, the front monophonic element and the rear monophonic element in the candidate diphones are respectively used as a front phoneme and a rear phoneme; the posterior probability determination unit 904 is specifically configured to:

determining, for the diphone state sequence corresponding to each of the candidate diphones including the same post-phoneme, a reference likelihood probability of the candidate diphone according to a likelihood probability corresponding to an acoustic feature in the target time interval in the diphone state included in the diphone state sequence in the acoustic likelihood probability vector; selecting the maximum reference likelihood probability from the reference likelihood probabilities of the candidate diphones including the later phoneme as the reference likelihood probability of the later phoneme;

determining the sum value of the respective reference likelihood probabilities of the rear phonemes as a total reference likelihood probability;

and aiming at a target post-phoneme in the target diphone, determining the posterior probability of the target post-phoneme as the posterior probability of the target phoneme according to the reference likelihood probability and the total reference likelihood probability of the target post-phoneme.

Optionally, on the basis of the spoken language pronunciation evaluating apparatus shown in fig. 7, the pronunciation evaluating module 705 is specifically configured to perform at least one of the following operations:

determining a phoneme pronunciation evaluation result according to the posterior probability of the target phoneme through a phoneme evaluation model;

determining a word pronunciation evaluation result according to the first posterior probability set through a word evaluation model; the first set of posterior probabilities includes: the posterior probability of each target phoneme included in the word to be evaluated in the target text;

determining a statement pronunciation evaluation result according to the second posterior probability set through a statement evaluation model; the second set of posterior probabilities comprises: and the posterior probability of each target phoneme included in the sentence to be evaluated in the target text.

The spoken language pronunciation evaluating device provided by the embodiment of the present application considers that an acoustic feature recognition model taking the diphone state as the modeling unit has better acoustic modeling capability and speech recognition capability than an HMM-DNN model taking the triphone as the modeling unit, so the acoustic feature recognition model is introduced into the spoken language pronunciation evaluation process; in addition, in order to make the acoustic likelihood probability vector output by the acoustic feature recognition model suitable for spoken language pronunciation evaluation, the embodiment of the present application also provides implementation manners for determining the acoustic posterior probability based on the acoustic likelihood probability. Performing spoken language pronunciation evaluation with such a model therefore yields a more accurate pronunciation evaluation result and effectively improves the evaluation effect.

The embodiment of the present application further provides a device for evaluating spoken language pronunciation, where the device may specifically be a terminal device or a server, and the terminal device and the server provided in the embodiment of the present application will be introduced from the perspective of hardware materialization.

Referring to fig. 10, fig. 10 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in fig. 10, for convenience of explanation, only the parts related to the embodiments of the present application are shown, and details of the technology are not disclosed, please refer to the method part of the embodiments of the present application. The terminal can be any terminal equipment including a mobile phone, a tablet computer, a Personal Digital Assistant (Personal Digital Assistant, abbreviated as "PDA"), a Sales terminal (Point of Sales, abbreviated as "POS"), a vehicle-mounted computer, and the like, taking the terminal as a smart phone as an example:

fig. 10 is a block diagram illustrating a partial structure of a smart phone related to a terminal provided in an embodiment of the present application. Referring to fig. 10, the smart phone includes: radio Frequency (RF) circuit 1010, memory 1020, input unit 1030, display unit 1040, sensor 1050, audio circuit 1060, wireless fidelity (WiFi) module 1070, processor 1080, and power source 1090. Those skilled in the art will appreciate that the smartphone configuration shown in fig. 10 is not intended to be limiting and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.

The memory 1020 may be used to store software programs and modules, and the processor 1080 executes various functional applications and data processing of the smart phone by operating the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the smartphone, and the like. Further, the memory 1020 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

The processor 1080 is a control center of the smartphone, connects various parts of the entire smartphone through various interfaces and lines, and executes various functions and processes data of the smartphone by running or executing software programs and/or modules stored in the memory 1020 and calling data stored in the memory 1020, thereby integrally monitoring the smartphone. Optionally, processor 1080 may include one or more processing units; preferably, the processor 1080 may integrate an application processor, which handles primarily the operating system, user interfaces, applications, etc., and a modem processor, which handles primarily the wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 1080.

In the embodiment of the present application, the processor 1080 included in the terminal further has the following functions:

acquiring a target audio to be evaluated; the target audio corresponds to target text;

performing acoustic feature extraction processing on the target audio to obtain a target acoustic feature sequence;

determining an acoustic likelihood probability vector according to the target acoustic feature sequence through an acoustic feature recognition model; the acoustic feature recognition model is a model taking diphone states as modeling units;

determining a posterior probability of a target phoneme in the target text based on the acoustic likelihood probability vector and the target text;

and determining a target pronunciation evaluation result according to the posterior probability of the target phoneme.

Optionally, the processor 1080 is further configured to execute steps of any implementation manner of the spoken language pronunciation evaluation method provided in the embodiment of the present application.

Referring to fig. 11, fig. 11 is a schematic structural diagram of a server 1100 according to an embodiment of the present disclosure. The server 1100 may vary widely in configuration or performance and may include one or more Central Processing Units (CPUs) 1122 (e.g., one or more processors) and memory 1132, one or more storage media 1130 (e.g., one or more mass storage devices) storing applications 1142 or data 1144. Memory 1132 and storage media 1130 may be, among other things, transient storage or persistent storage. The program stored on the storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 1122 may be provided in communication with the storage medium 1130 to execute a series of instruction operations in the storage medium 1130 on the server 1100.

The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input-output interfaces 1158, and/or one or more operating systems such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 11.

The CPU 1122 is configured to execute the following steps:

acquiring a target audio to be evaluated; the target audio corresponds to target text;

performing acoustic feature extraction processing on the target audio to obtain a target acoustic feature sequence;

determining an acoustic likelihood probability vector according to the target acoustic feature sequence through an acoustic feature recognition model; the acoustic feature recognition model is a model taking diphone states as modeling units;

determining a posterior probability of a target phoneme in the target text based on the acoustic likelihood probability vector and the target text;

and determining a target pronunciation evaluation result according to the posterior probability of the target phoneme.

Optionally, the CPU 1122 may also be configured to execute the steps of any implementation manner of the spoken language pronunciation evaluation method provided in the embodiment of the present application.

The embodiment of the present application further provides a computer-readable storage medium, configured to store a computer program, where the computer program is configured to execute any one implementation manner of the spoken language pronunciation evaluation method described in the foregoing embodiments.

Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes any one of the implementation manners of the spoken utterance evaluation method according to the foregoing embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing computer programs.

It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
