Speech function automatic evaluation system and method based on voice recognition

Publication No.: 96701 — wait, no em-dash — Publication No. 96701, published 2021-10-12.

Description: This invention, "Speech function automatic evaluation system and method based on voice recognition" (一种基于语音识别的言语功能自动评估系统和方法), was created by 莫贵明, 苏荣锋, 王岚 and 燕楠 on 2020-04-03. Abstract: The invention discloses a speech function automatic evaluation system and method based on voice recognition. The system comprises a voice endpoint detection module, a voice recognition module and an evaluation module, the voice recognition module being in communication connection with the voice endpoint detection module and the evaluation module respectively. The voice endpoint detection module is configured to detect the start point and end point of a voice signal in a continuous voice stream so as to extract the voice segment to be evaluated; the voice recognition module is configured to extract features from the voice segment to be evaluated based on the trained acoustic model and to input them into the deep neural network model so as to recognize the corresponding word sequence; the evaluation module is configured to evaluate speech function for the recognized word sequence by combining evaluation indexes of respiratory function, vocal function and articulatory function. The invention can evaluate speech function more comprehensively and accurately, and is particularly suitable for analyzing children's speech function.

1. An automatic speech function evaluation system based on voice recognition, comprising a voice endpoint detection module, a voice recognition module and an evaluation module, the voice recognition module being in communication connection with the voice endpoint detection module and the evaluation module respectively, wherein:

the voice endpoint detection module is configured to detect a starting point and an end point of a voice signal from a continuous voice stream to extract a voice segment to be evaluated;

the voice recognition module is configured to extract features from the voice segment to be evaluated based on the trained acoustic model and to input them into the deep neural network model so as to recognize the corresponding word sequence;

the evaluation module is configured to evaluate speech function for the recognized word sequence by combining an evaluation index of respiratory function, an evaluation index of vocal function and an evaluation index of articulatory function.

2. The speech recognition-based speech function automatic assessment system according to claim 1, wherein the speech recognition module comprises a front-end processing unit, an acoustic model, a language model, a pronunciation dictionary, and a decoder in communication connection with the front-end processing unit, the acoustic model, the language model and the pronunciation dictionary respectively, wherein: the front-end processing unit is used for extracting acoustic features of the voice signal, the acoustic model is used for obtaining observation probabilities conditioned on given phoneme states, and the decoder is used for obtaining the word sequence corresponding to the voice signal in a search space formed by the pronunciation dictionary and the knowledge sources of the language model, based on the phoneme observation probabilities output by the acoustic model.

3. The speech recognition-based speech function automatic assessment system according to claim 1, wherein the speech features input to the acoustic model are obtained as follows:

after the voice signal is divided into frames with a frame length of 25 ms and a frame shift of 10 ms, 13-dimensional perceptual linear prediction features and 3-dimensional fundamental frequency features are extracted;

the extracted 13-dimensional perceptual linear prediction features and 3-dimensional fundamental frequency features are spliced into 16-dimensional features, to which cepstral mean normalization, linear discriminant analysis, maximum likelihood linear transformation and constrained maximum likelihood linear regression are applied to obtain 40-dimensional features;

the 440-dimensional features obtained by splicing 11 consecutive frames are used as the speech features input to the acoustic model.

4. The speech recognition-based speech function automatic assessment system according to claim 1, wherein said acoustic model is a deep neural network-hidden markov hybrid model, the hidden markov model being used for modeling timing properties of the speech signal, the deep neural network being used for modeling observation probabilities of the speech signal, the training process of said acoustic model comprising: performing forced alignment on the speech data of the training set based on the trained Gaussian mixture-hidden Markov model to obtain supervised data with aligned frame levels; executing deep belief neural network pre-training to obtain an initialization model; and training the initialization model based on a cross entropy criterion and a back propagation algorithm by using the frame level aligned supervised data to obtain a deep neural network-hidden Markov mixed model.

5. The speech recognition-based speech function automatic assessment system according to claim 1, wherein the training set of the acoustic model is speech training data of preschool children, the audio consists of short sentences, and the content includes daily vocabulary, nursery rhymes, storybooks and command-interaction sentences.

6. The speech recognition-based speech function automatic assessment system according to claim 1, wherein the evaluation index of respiratory function is sustained phonation duration, the evaluation indexes of vocal function are loudness and pitch, and the evaluation indexes of articulatory function comprise the oral alternating movement rate and the articulatory speech function.

7. The speech recognition-based speech function automatic assessment system according to claim 6, wherein the articulatory speech function assessment comprises:

selecting a standard text from an evaluation corpus;

based on the standard text read aloud by the child to be tested, generating the tested speech and extracting acoustic features, inputting them together with the standard text into the acoustic model, and obtaining the articulation clarity and the clinical interpretation of easily mispronounced initial consonants by calculating the GOP score of each phoneme's pronunciation.

8. The speech recognition-based speech function automatic assessment system according to claim 7, wherein the evaluation corpus is updated according to the following steps:

constructing a mapping table from initial consonants to corresponding corpus items according to the evaluation corpus, wherein each initial consonant corresponds to a list of words or sentences containing that initial consonant;

in the first evaluation, the probability that the linguistic data corresponding to each initial consonant is selected is set to be equal;

in subsequent evaluations, updating the probability weight with which each initial consonant's corpus items are selected, based on the tested child's pronunciation error rate for that initial consonant in historical evaluations.

9. A speech function automatic evaluation method based on voice recognition, comprising the following steps:

detecting a starting point and an end point of a voice signal from a continuous voice stream to extract a voice section to be evaluated;

extracting features from the voice segment to be evaluated based on the trained acoustic model, and inputting them into the deep neural network model to recognize the corresponding word sequence;

and evaluating speech function for the recognized word sequence by combining the evaluation index of respiratory function, the evaluation index of vocal function and the evaluation index of articulatory function.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech function automatic assessment method based on speech recognition according to claim 9.

Technical Field

The invention relates to the technical field of information, in particular to a speech function automatic evaluation system and method based on voice recognition.

Background

Speech disorders are mainly classified into four categories: dysarthria, stuttering, dysphonia and hearing impairment, which manifest clinically as abnormalities in respiratory, phonatory, resonant, articulatory and speech functions. The incidence of speech disorders among children in China is high. Research shows that students in China's special-education schools mainly have moderate-to-severe intellectual disabilities, of whom more than 70% have speech disorders, with dysarthria accounting for about 75%-80%. A survey of 2316 children in Shenyang found an incidence of speech impairment of 4.01%, with functional dysarticulation the largest category at 51.08%. Furthermore, studies show that 72.3%-89.2% of speech-impaired children improve considerably, and are partially or even fully rehabilitated, after appropriate treatment and intervention. A comprehensive assessment of speech function is therefore crucial, as it makes it possible to discover potential speech disorder symptoms.

In the past, children's speech function was generally assessed subjectively by professional speech therapists with relevant linguistic and cultural backgrounds, using auditory judgment and established rating scales; such assessment depends heavily on the therapist's experience and knowledge. At present there is a serious shortage of professional Chinese speech therapists, and the demand from young patients far exceeds the supply of professionals. In this context, it is desirable to rely on automatic speech recognition (ASR) and speech signal processing techniques to achieve reliable, convenient, automated assessment of speech function suitable for preschool children.

In the past two decades, many scholars have studied ASR-based automated speech assessment, such as computer-assisted pronunciation training (CAPT) systems and computer-assisted language learning (CALL) systems. Automated speech evaluation systems use different methods for feature extraction, e.g., signal processing, prosodic analysis and natural language processing. The extracted features are then input into a statistical model for automatic scoring to obtain the tester's spoken-language ability score. The ASR module plays an important role in such systems: from the ASR output and the prosodic analysis results, a widely used set of features can be extracted for evaluating fluency, pronunciation, tone, grammar, vocabulary usage and the like.

Over the decades, ASR technology has evolved greatly and undergone many changes, and ASR-based automated speech assessment techniques have evolved with it. Early on, the Gaussian mixture model-hidden Markov model (GMM-HMM) was the dominant framework for building acoustic models for speech recognition, and on this basis scholars proposed the GOP (Goodness of Pronunciation) algorithm for automatic pronunciation evaluation. With the rapid development of artificial intelligence, deep neural networks (DNNs) have been applied successfully in many fields, and the traditional GMM-HMM acoustic model of ASR systems has gradually been replaced by the DNN-HMM; automatic speech evaluation technology has developed accordingly. However, most current research addresses the speech assessment of non-native speakers in computer-assisted language teaching systems, with speakers older than 10 years, and there is little research on automatic speech function assessment for preschool children aged 3-6.

Statistical analysis shows that existing speech function assessment technology mainly has the following problems:

1) Most current research targets the speech evaluation of non-native speakers in computer-assisted language teaching systems, with speakers older than 10 years, which compromises the accuracy of speech disorder assessment for preschool children aged 3-6.

2) Existing automatic speech evaluation systems mainly target pronunciation assessment for adult second-language learning; their evaluation standards are one-dimensional and can hardly reflect speech function comprehensively. For preschool children in particular, a more comprehensive evaluation scheme is needed to analyze speech function development.

3) Current speech function assessment content is relatively fixed and cannot be adjusted according to the subject's results. Treatment of speech-impaired patients is a continuous cyclic process requiring long-term intervention and evaluation, so the content of each evaluation should be adjusted according to the previous evaluation result.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a speech function automatic evaluation system and method based on voice recognition. It provides a technical scheme for comprehensive speech function evaluation that combines deep-learning-based speech recognition with speech signal processing, and can in particular improve the accuracy of evaluating children's speech function.

According to a first aspect of the present invention, there is provided a speech function automatic evaluation system based on voice recognition. The system comprises a voice endpoint detection module, a voice recognition module and an evaluation module, the voice recognition module being in communication connection with the voice endpoint detection module and the evaluation module respectively, wherein: the voice endpoint detection module is configured to detect the start point and end point of a voice signal in a continuous voice stream so as to extract a voice segment to be evaluated; the voice recognition module is configured to extract features from the voice segment to be evaluated based on the trained acoustic model and to input them into the deep neural network model so as to recognize the corresponding word sequence; the evaluation module is configured to evaluate speech function for the recognized word sequence by combining an evaluation index of respiratory function, an evaluation index of vocal function and an evaluation index of articulatory function.

According to a second aspect of the invention, a speech function automatic evaluation method based on voice recognition is provided. The method comprises the following steps: detecting the start point and end point of a voice signal in a continuous voice stream so as to extract a voice segment to be evaluated; extracting features from the voice segment to be evaluated based on the trained acoustic model, and inputting them into the deep neural network model to recognize the corresponding word sequence; and evaluating speech function for the recognized word sequence by using the evaluation index of respiratory function, the evaluation index of vocal function and the evaluation index of articulatory function.

Compared with the prior art, the invention has the following advantages: given the large differences in speech and cognitive ability between children and adults, a dedicated speech recognition system is built from preschool children's speech data, and more scientific evaluation corpora are provided for children's articulatory speech assessment; starting from the speech physiological systems, the respiratory function, vocal function and articulatory function of the child are each assessed, so that the child's speech function can be grasped more comprehensively and accurately; in addition, the content of each articulatory speech function assessment can be adaptively adjusted according to historical results, making the assessment content more flexible and targeted.

Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a schematic diagram of an automated speech function assessment system based on speech recognition according to one embodiment of the present invention;

FIG. 2 is a flow diagram of a dual threshold endpoint detection method according to one embodiment of the invention;

FIG. 3 is a speech recognition framework according to one embodiment of the present invention;

FIG. 4 is a flow diagram of GMM-HMM model training according to one embodiment of the invention;

FIG. 5 is a flowchart illustrating DNN-HMM model training according to one embodiment of the present invention;

FIG. 6 is a schematic diagram of a DNN network architecture according to one embodiment of the present invention;

FIG. 7 is a schematic diagram of a speech physiological system according to one embodiment of the invention;

FIG. 8 is a flow chart of a method of autocorrelation pitch frequency detection according to one embodiment of the present invention;

FIG. 9 is a block diagram of an assessment of an articulatory speech function according to one embodiment of the invention;

FIG. 10 is a flow chart of adaptively adjusting an assessment corpus according to an embodiment of the present invention.

Detailed Description

Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

Research and analysis show that children's speech characteristics differ markedly from adults': children have shorter vocal tracts, and vocal-tract development varies greatly between individuals, so children's speech differs substantially in formant positions and spectral distribution. In addition, since preschool children have not received systematic language learning and pronunciation training, their choice of vocabulary and sentence construction differs greatly from adults', resulting in irregular pronunciation. On this basis, the invention provides a technical scheme particularly suitable for the automatic speech function evaluation of preschool children. Based on speech recognition and speech signal processing technology, it enables reliable, convenient and low-cost comprehensive evaluation of children's respiratory, vocal and articulatory functions, allows parents to understand their child's speech condition comprehensively and in real time, and can reveal potential speech disorder symptoms at the preclinical or early clinical stage, creating an opportunity for timely diagnosis so that effective intervention and rehabilitation strategies can be adopted.

According to one embodiment of the invention, the speech recognition-based automatic speech function evaluation system comprises an endpoint detection module, a voice recognition module and an evaluation module. The endpoint detection module performs endpoint detection on the input voice (e.g., with a dual-threshold method) to eliminate interference from silence and noise, improve the robustness of the evaluation, and help the evaluation module compute the sustained phonation duration. The voice recognition module trains an acoustic model on children's speech data, extracts features from the speech to be evaluated based on the trained acoustic model, and inputs them into the deep neural network, obtaining the log posterior probability of the phonemes for each speech frame at the output layer. The evaluation module comprehensively analyzes and evaluates speech function from multiple angles: respiratory function, vocal function and articulatory function.

In an application example, referring to fig. 1, the user interacts with the evaluation system through a client. During evaluation, the user's voice is uploaded to the server's MySQL database and a status flag is written. A monitoring thread in the automatic evaluation module checks whether speech to be evaluated has been uploaded; if the status indicates so, the speech is evaluated automatically. The automatic evaluation module comprises the endpoint detection module, the voice recognition module and the evaluation module. After the evaluation is finished, the score is written back to the MySQL database and the status flag is updated. The client likewise uses a monitoring thread to check the status; when the evaluation is finished, the result is fetched and returned to the client. The endpoint detection module, the voice recognition module and the evaluation module are described in detail below; details of the client and the MySQL database are not repeated.
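The patent does not disclose the server implementation, but the monitoring-thread behavior described above can be sketched as a simple polling worker. The following is a minimal illustration assuming a hypothetical `task` table with `status`, `audio_path` and `score` columns (names invented here); the connection parameters and the 1-second poll interval are likewise assumptions.

```python
import time
import pymysql  # any MySQL client library would do

def polling_worker(evaluate):
    """Poll the MySQL database for uploaded audio and evaluate it.

    `evaluate` runs endpoint detection, speech recognition and scoring
    on the audio file and returns the evaluation score.
    """
    conn = pymysql.connect(host="localhost", user="eval", password="secret",
                           database="speech_eval", autocommit=True)
    while True:
        with conn.cursor() as cur:
            # hypothetical convention: status 0 = uploaded, 1 = evaluated
            cur.execute("SELECT id, audio_path FROM task WHERE status = 0 LIMIT 1")
            row = cur.fetchone()
            if row is not None:
                task_id, audio_path = row
                score = evaluate(audio_path)
                cur.execute("UPDATE task SET score = %s, status = 1 WHERE id = %s",
                            (score, task_id))
        time.sleep(1.0)  # poll interval
```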

First, about the endpoint detection module

The endpoint detection module performs voice endpoint detection (also known as voice activity detection, VAD): distinguishing speech from non-speech periods in the signal so as to accurately determine the start and end points of the voice signal, i.e., detecting the effective speech segment in a continuous voice stream. In the present system, voice endpoint detection serves three purposes: the evaluation module includes sustained-duration evaluation, which requires the duration of the speech segment; removing redundant non-speech signal speeds up processing; and it reduces the interference caused by non-speech signal entering the back-end analysis.

In one embodiment, speech endpoint detection is achieved using a dual-threshold method, for example combining the short-time energy and the short-time zero-crossing rate. The short-time energy is formulated as:

E = Σ_{i=1}^{N} x(i)²

where N is the number of sample points in the frame window, x(i) is the amplitude of the i-th sample point, and E is the energy sum of one frame of speech.

The short-time average zero-crossing rate is expressed as:

Z = (1/2) Σ_{n=1}^{N} |sgn[x(n)] − sgn[x(n−1)]| · w(n)

where sgn[·] is the sign function:

sgn[x] = 1 for x ≥ 0, and sgn[x] = −1 for x < 0,

and w(n) is a window function, e.g. a rectangular window:

w(n) = 1 for 1 ≤ n ≤ N, and w(n) = 0 otherwise.

referring to fig. 2, the process of determining valid speech segments by using the dual-threshold method includes: calculating the short-time capability and the short-time zero-crossing rate according to frames; locating voiced sounds, for example, setting a higher energy threshold Mh, and determining end points A1 and A2; expanding the search, for example, setting a lower energy threshold Ml and expanding the lower energy threshold Ml to both sides by the endpoints a1 and a2 to determine the endpoints B1 and B2; an expanded search, for example, setting a zero-crossing rate threshold and expanding on both sides by B1 and B2, determines the final endpoints C1 and C2. The thresholds involved therein may be set to suitable values depending on the noise interference situation and the required processing speed, accuracy, etc.

The invention combines short-time energy with the short-time zero-crossing rate, using the two thresholds to determine the start and end points of the voice signal, which effectively removes redundant information and improves the recognition result. Those skilled in the art may also use other prior techniques to detect the speech segment.
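As one concrete illustration of this procedure, the following sketch implements the three-stage search of fig. 2 in Python on a NumPy sample array. The relative threshold values (fractions of the peak energy, a multiple of the mean zero-crossing rate) are assumptions for demonstration; as noted above, they should be tuned to the actual noise conditions.

```python
import numpy as np

def dual_threshold_vad(x, fs=16000, frame_len=0.025, frame_shift=0.010):
    """Dual-threshold endpoint detection following the fig. 2 procedure.

    Returns (start_sample, end_sample) of the detected speech segment,
    or None if no voiced region is found.
    """
    fl, hp = int(frame_len * fs), int(frame_shift * fs)
    n = 1 + max(0, (len(x) - fl) // hp)
    frames = np.stack([x[i * hp : i * hp + fl] for i in range(n)])
    energy = np.sum(frames.astype(np.float64) ** 2, axis=1)              # short-time energy E
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1) # zero-crossing rate Z
    mh, ml = 0.5 * energy.max(), 0.1 * energy.max()                      # thresholds Mh > Ml (assumed)
    zt = 2.0 * zcr.mean()                                                # ZCR threshold (assumed)
    above = np.where(energy > mh)[0]
    if len(above) == 0:
        return None
    a1, a2 = above[0], above[-1]            # endpoints A1, A2
    b1, b2 = a1, a2                         # expand with Ml to B1, B2
    while b1 > 0 and energy[b1 - 1] > ml:
        b1 -= 1
    while b2 < n - 1 and energy[b2 + 1] > ml:
        b2 += 1
    c1, c2 = b1, b2                         # expand with ZCR threshold to C1, C2
    while c1 > 0 and zcr[c1 - 1] > zt:
        c1 -= 1
    while c2 < n - 1 and zcr[c2 + 1] > zt:
        c2 += 1
    return c1 * hp, c2 * hp + fl            # sample range of the speech segment
```

The returned sample range also yields the speech-segment duration t needed by the sustained-duration evaluation described later.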

Second, about the speech recognition module

The framework of the speech recognition module is shown in fig. 3. Given an acoustic observation sequence X = x₁x₂…x_n, the speech recognition process is the process of finding the corresponding word sequence W* that maximizes the posterior probability:

W* = argmax_W P(W|X) = argmax_W P(X|W)·P(W) / P(X)

Since P(X), the distribution probability of the acoustic observation sequence, can be assumed constant, the above formula can be written as:

W* = argmax_W P(X|W)·P(W)

where P(X|W) represents the probability of the acoustic observation sequence given the word sequence W, corresponding to the acoustic model of an ASR system, and P(W) is the prior probability of the word sequence, corresponding to the language model of the ASR system. In the present invention, only the acoustic model portion of the speech recognition system is used; the neural network is a feed-forward network, trained, for example, on the Kaldi deep learning platform.

In fig. 3, the speech recognition framework includes front-end processing, an acoustic model, a language model, a pronunciation dictionary and a decoder. The front-end processing unit extracts acoustic features that represent the voice information; the acoustic model performs acoustic modeling of phonemes and yields the observation probability conditioned on a given phoneme state; the decoder obtains the word sequence corresponding to the voice in a search space composed of the pronunciation dictionary and the knowledge sources of the language model, based on the phoneme observation probabilities output by the acoustic model.

The training data, acoustic features, model training process, model parameters, etc. will be described separately below.

1) Training data

The voice recognition system of this embodiment of the invention is trained on 48 hours of speech data from preschool children aged 3-5, recorded in a quiet indoor environment with a balanced gender ratio; the children come mainly from northern China. The audio consists of short sentences, and the content includes daily expressions, nursery rhymes, storybooks, command-interaction sentences and the like.

2) Acoustic features

The original acoustic features used in this embodiment are a concatenation of perceptual linear prediction (PLP) features and fundamental frequency (pitch) features. After the speech signal is framed with a frame length of 25 ms and a frame shift of 10 ms, 13-dimensional PLP features and 3-dimensional pitch features are extracted; on the basis of these 16-dimensional features, cepstral mean normalization (CMN), linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT) and feature-space maximum likelihood linear regression (fMLLR) are applied to obtain 40-dimensional features. Finally, to account for the temporal context of the signal, the 440-dimensional features obtained by splicing 11 consecutive frames are used as the final acoustic-model input features.
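The 40-to-440-dimensional context splicing can be illustrated in a few lines of NumPy. This is a minimal sketch; the patent does not specify how edge frames are handled, so repeating the first and last frame (a common convention) is assumed here.

```python
import numpy as np

def splice_frames(feats, context=5):
    """Splice each 40-dim frame with its 5 left and 5 right neighbours,
    yielding the 11-frame, 440-dim input described above."""
    t = feats.shape[0]                       # feats: (num_frames, 40)
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)])
    return np.stack([padded[i:i + 2 * context + 1].reshape(-1)
                     for i in range(t)])     # (num_frames, 440)
```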

3) Training process

In one embodiment, the acoustic model is implemented as a deep neural network-hidden Markov model hybrid (DNN-HMM): the hidden Markov model models the temporal properties of the speech signal, while the deep neural network models its observation probabilities. The deep neural network makes no assumptions about the distribution the acoustic features obey, and with spliced consecutive-frame features it can better exploit contextual information.

The training process of the acoustic model comprises two stages: first, a GMM-HMM model is trained to obtain frame-aligned training data; then the DNN-HMM model is trained.

Referring to the GMM-HMM training flow illustrated in fig. 4, training is based on, for example, expectation-maximization (EM). Specifically, starting from monophones, 118 monophones are modeled to obtain a monophone model. Then, to incorporate the co-articulation phenomenon into the modeling, context-dependent modeling units are considered and triphone training is performed: a decision tree is constructed from the relevant statistics and a question set, the triphone states are tied, and the tied triphones are trained on the training data to obtain a triphone model. Next, several refinements are applied to the triphone model. Linear discriminant analysis (LDA) is a supervised dimensionality-reduction technique whose core idea is to minimize the within-class variance and maximize the between-class variance after projection. Maximum likelihood linear transformation (MLLT) applies a linear transform under the maximum-likelihood criterion to decorrelate the feature vectors, so that the model matches the training set with higher likelihood in the new space. Finally, to compensate for the mismatch between actual data and the acoustic conditions of the trained triphone model, speaker-adaptive training is performed: for example, a transformation matrix is estimated for each speaker, the transformed features are constructed, and iterative training yields new acoustic model parameters.

The training process of the DNN-HMM model is shown in fig. 5. First, the speech data of the training set are force-aligned with the trained GMM-HMM model to obtain frame-level-aligned supervised data. Then, deep belief network (DBN) pre-training is performed to obtain a good initialization model. Finally, the DNN acoustic model is trained from the initialization model on the aligned supervised data, based on, for example, the cross-entropy criterion and the back-propagation algorithm.

4) Training parameters

For example, the GMM-HMM model has 39995 Gaussians in total, and the number of tied triphone states is 3392. The DNN-HMM acoustic model is trained with a learning rate of 0.008; the input layer has 440 nodes, there are 5 hidden layers with 2048 nodes each, and the output layer has 3392 nodes. The deep neural network structure is shown in fig. 6.
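The stated topology can be written down directly; the following PyTorch sketch matches the numbers above (440 inputs, five 2048-node hidden layers, 3392 outputs, cross-entropy criterion, learning rate 0.008). The sigmoid activation is an assumption typical of DBN-pretrained networks, and the DBN pre-training step itself is omitted; the patent does not name the activation function.

```python
import torch
import torch.nn as nn

dnn = nn.Sequential(
    nn.Linear(440, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 3392),                 # logits over the 3392 tied states
)
optimizer = torch.optim.SGD(dnn.parameters(), lr=0.008)
criterion = nn.CrossEntropyLoss()          # cross-entropy training criterion

def train_step(features, state_labels):
    """One back-propagation step on frame-aligned supervised data:
    features are (batch, 440) spliced frames, state_labels are the
    tied-state indices from GMM-HMM forced alignment."""
    optimizer.zero_grad()
    loss = criterion(dnn(features), state_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```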

In this embodiment, the fact that children's speech differs from adults' is taken into account, and a dedicated acoustic model is built from preschool children's speech data to achieve accurate speech function evaluation.

Third, about the assessment module

Speech production is achieved through the coordinated movement of three systems: the respiratory system, the vocal (phonatory) system and the articulatory system; see fig. 7. The air stored in the lungs, trachea and bronchi is discharged regularly with the expiratory movement to form an airflow; when the airflow reaches the glottis, it is converted into a series of pulse signals (glottal waves); then, through the resonance of the vocal tract, sound waves of appropriate form are shaped, and finally the speech signal (sound wave) is emitted from the mouth and nose. In view of these characteristics, the evaluation module of this embodiment analyzes speech function by comprehensively evaluating respiratory function, vocal function and articulatory function.

1) Evaluation of respiratory function

The respiratory system is the source and basis of speech. During speech, one must instantaneously inhale a large amount of air, maintain a smooth exhalation, and sustain sufficient subglottal pressure with a small airflow. The present invention evaluates respiratory function mainly by measuring the child's sustained phonation duration: the longest time a person can sustain the single vowel /a/ after a deep breath, one of the best indexes of speech breathing capacity. Sustained duration is affected by gender, age, health condition, height, weight, vital capacity and breathing pattern; any disease of the respiratory system or phonatory system, or incoordination between the two, may reduce it. Reference values for children of the same age and gender are shown in Table 1 below.

Table 1: chinese preschool children continuous duration reference standard

The test-environment noise is controlled below 40 dB, the mouth is kept about 10 cm from the microphone, and the total recording time is 10 s. Endpoint detection is applied to the audio before evaluation; after the speech-segment duration t is obtained, the sustained-duration score is calculated by a piecewise function according to the reference standard for the corresponding age and gender, where m and σ denote the mean and standard deviation of that reference distribution.

If the sustained duration t < m − σ, several abnormalities may exist:

abnormal breathing pattern (e.g., thoracic breathing);

impaired respiratory function (e.g., decreased vital capacity);

abnormal vocal function (e.g., weakened glottal closure control);

incoordination of respiratory and vocal movements (e.g., phonation during inspiration).

2) Evaluation of vocal function

Dysphonia refers to abnormalities in loudness, pitch and voice quality. Loudness abnormality mainly results from the combined action of respiratory airflow, vocal fold resistance, vocal fold vibration pattern, subglottal pressure and other factors. Pitch abnormality is mainly affected by vocal fold length, mass, tension and subglottal pressure. Abnormal voice quality is generally caused by functional abnormalities or organic lesions of the vocal folds. The present invention focuses primarily on the assessment of loudness and pitch.

(1) Loudness assessment

The essence of loudness assessment is to assess the intensity of the speaker's speech, i.e., the sound intensity. Sound intensity is the objective physical strength of a sound, defined as the acoustic power passing through a unit area, in W/cm²; it is determined by the vibration amplitude of the sounding body (the larger the amplitude, the stronger the sound intensity) and is commonly measured with the dual-microphone method or the discrete-point method. Loudness is the auditory, psychological perception of sound intensity: the subjective perception of sound strength when sound waves of a certain intensity act on the human ear and brain. Since loudness and sound intensity are closely related, the assessment of sound intensity is conventionally referred to as loudness assessment. In the present invention, loudness is objectively assessed by calculating the decibel value of the user's audio. The audio is quantized with 16-bit precision, and the amplitude of each sampling point lies in the range 0-65535. The decibel value is calculated as follows:

L_p = 10·log₁₀(P_rms²) dB = 20·log₁₀(P_rms) dB

where L_p is the audio decibel value and P_rms is the amplitude of the current sampling point. In the sound-intensity evaluation, the user is prompted to sustain the syllable /ba/; the decibel value is computed over the middle third of the sampling points of the endpoint-detected audio, and the average is taken as the final result.
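A minimal sketch of this computation follows, assuming `samples` holds the 16-bit amplitude values of the endpoint-detected segment; skipping zero-valued samples to keep the logarithm defined is an implementation assumption.

```python
import numpy as np

def loudness_db(samples):
    """Average decibel value over the middle third of the detected
    /ba/ segment, per Lp = 20*log10(P_rms)."""
    n = len(samples)
    mid = np.abs(np.asarray(samples, dtype=np.float64))[n // 3 : 2 * n // 3]
    mid = mid[mid > 0]                  # avoid log10(0) on silent samples
    return float(np.mean(20.0 * np.log10(mid)))
```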

Statistical analysis shows that the decibel values of audio recorded by normal children in a quiet environment at a fixed distance from the microphone fit a Gaussian distribution with mean 72.5 dB and standard deviation 7.5 dB. The loudness score ranges from 0 to 10 points and is calculated by a piecewise function based on this distribution.

If the measured L_p ≥ 80 dB, the patient may have abnormally high loudness; if L_p ≤ 65 dB, the patient may have abnormally low loudness.

(2) Pitch assessment

The essence of pitch assessment is to assess the fundamental frequency of the speech. The fundamental frequency is a physical quantity: the number of vocal fold vibrations per second, in hertz (Hz). Pitch is the auditory, psychological perception of the fundamental frequency, i.e., the individual's subjective perception of how high or low a sound is. Within the natural voice range, the faster the vocal folds vibrate, the higher the pitch; the slower they vibrate, the lower the pitch. Pitch is a key factor reflecting vocal function and varies from person to person. The speech fundamental frequencies of children of the same age and gender fit approximately a Gaussian distribution, as shown in Table 2 below.

Table 2: chinese preschool children average speech fundamental frequency reference standard (unit: Hertz)

In the present invention, the existing short-time autocorrelation method with three-level center clipping can be used to detect the fundamental frequency of the user's utterance /ba/; the algorithm flow is shown in fig. 8.
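A minimal sketch of such a detector is given below. The clipping level (0.6 times the peak amplitude) and the 100-500 Hz search range are assumptions chosen for illustration, not values taken from the patent.

```python
import numpy as np

def pitch_autocorr(frame, fs=16000, f_lo=100.0, f_hi=500.0):
    """Fundamental frequency by short-time autocorrelation with
    three-level center clipping."""
    cl = 0.6 * np.max(np.abs(frame))                  # clipping level (assumed 0.6x peak)
    # three-level center clipping: +1 above CL, -1 below -CL, 0 otherwise
    y = np.where(frame > cl, 1.0, np.where(frame < -cl, -1.0, 0.0))
    r = np.correlate(y, y, mode="full")[len(y) - 1:]  # autocorrelation, lags >= 0
    lo, hi = int(fs / f_hi), int(fs / f_lo)           # plausible pitch-period lags
    if hi >= len(r) or r[0] == 0:
        return 0.0                                    # unvoiced or frame too short
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag                                   # fundamental frequency in Hz
```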

After the fundamental frequency (pitch) is calculated, the pitch evaluation score, ranging from 0 to 10 points, is computed by a piecewise function according to the reference standard for the average speech fundamental frequency of Chinese children,

where m and σ are the mean and standard deviation of the fundamental-frequency distribution for the corresponding age and gender. If the measured pitch ≤ m − 2σ, the patient may have abnormally low pitch; if pitch ≥ m + 2σ, the patient may have abnormally high pitch.

3) Evaluation of articulatory function

Speech production is achieved by the coordinated movement of the respiratory, vocal and articulatory systems. The articulatory system is composed of the oral cavity, nasal cavity, pharyngeal cavity and their accessory organs; the most important articulators are the mandible, lips, tongue and soft palate. Their flexible and coordinated movements are a necessary condition for producing clear, meaningful speech.

Dysarthria refers to unclear articulation and abnormal production of initials, finals and tones during meaningful speech, caused by abnormal movement or coordination of the articulators; it reduces speech intelligibility and is the main cause of reduced speech clarity. In the present invention, the evaluation of articulatory function comprises the evaluation of articulatory motor function and the evaluation of articulatory speech function.

(1) Assessment of articulatory motor function

In articulatory movement, the mandible, lips and tongue are the most important articulators, and whether they move normally is a key factor affecting articulation clarity. The present invention uses the oral alternating movement rate to assess articulatory motor function: the number of times a specific syllable can be produced in 4 seconds. It reflects the state of tongue movement and the coordination of the oral muscle groups, and is an important index of speech clarity. Here, /pataka/ is chosen as the specific syllable: it consists of three syllables and mainly examines the flexibility of the alternating movements of the lips, tongue and mandible during pronunciation. The reference standard for the oral alternating movement rate of Chinese preschool children is shown in Table 3 below.

Table 3: chinese preschool children oral rotation movement rate reference standard (4 second)

During the test, the subject is asked to inhale deeply and then, within 10 seconds, produce the designated syllable continuously in one breath as fast as possible, with moderate pitch and loudness; each syllable must be complete. Endpoint detection and speech recognition are then performed on the tested speech, the number of /pataka/ syllables is counted and divided by 4 to obtain the oral alternating movement rate s, and the oral-alternation score is calculated by a piecewise function,

where m and σ are the mean and standard deviation of the oral alternating movement rate for Chinese children of the corresponding age and gender. If the oral alternating movement rate s < m − σ, the alternating-movement flexibility of the mandible, tongue, lips and soft palate is poor.
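The rate computation itself is a small step on top of the recognizer output; a sketch, assuming the recognized syllables are available as a list of pinyin strings:

```python
def oral_alternating_rate(recognized_syllables):
    """Oral alternating movement rate: number of /pa/, /ta/, /ka/
    syllables recognized in the recording, divided by 4."""
    count = sum(1 for s in recognized_syllables if s in ("pa", "ta", "ka"))
    return count / 4.0
```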

(2) Assessment of sound-forming speech function

The purpose of articulatory speech function assessment is mainly to examine the patient's acquisition of articulatory phonemes such as initials, finals and tones. Such assessment generally requires an evaluation corpus, whose formulation principle is to combine the initials, finals and tones. In the Chinese speech system, the parts that exert the main effort during pronunciation are called places of articulation; there are seven: bilabial, labiodental, apical-front, apical-middle, apical-back, dorsal and tongue-root. To comprehensively evaluate the articulatory speech function of preschool children while taking their cognitive ability into account, the evaluation corpus in this invention is selected scientifically, so that the corpus syllables cover all initials and finals of Chinese pinyin, all places of articulation are evaluated, and the corpus reflects the tested child's phoneme acquisition and articulation clarity. The evaluation corpus includes 40 words and 40 short sentences, as shown in Tables 4 and 5 below.

Table 4: evaluating words

Table 5: evaluating sentences

As shown in fig. 9, during evaluation the tested child first reads the automatically selected standard text according to the prompt; the tested speech is generated and acoustic features are extracted, and the features and the standard text are input into the acoustic model for forced alignment to obtain the label corresponding to each frame. Then the features are fed into the acoustic model, the output-layer posterior probabilities are computed with a feed-forward pass, and the GOP score of each phoneme's pronunciation is calculated in combination with the alignment result. Finally, a suitable GOP threshold is set to judge whether each phoneme is pronounced correctly, and phoneme statistics are collected. From these statistics, three outputs are produced: the articulation clarity, the clinical interpretation of easily mispronounced initials, and the adaptive adjustment of the evaluation corpus.

(1) GOP algorithm

The GOP algorithm is an evaluation algorithm for phoneme-level pronunciation quality. It is defined as the ratio of the canonical phoneme's posterior probability to the maximum posterior probability output by the acoustic model over the current frames, i.e., a difference in the log domain:

GOP(p) = LPP(p) − max_{q∈Q} LPP(q)

where Q is the phone set, p is the phoneme corresponding to the current segment, and LPP is the log phoneme posterior probability, defined as:

LPP(p) = (1 / (t_e − t_s + 1)) · Σ_{t=t_s}^{t_e} log p(s | o_t)

where o_t is the input feature at frame t, t_s and t_e are the start and end frames of phoneme p in the audio, and s is the tied state corresponding to the current phoneme p, i.e., the label at the output layer of the acoustic model's neural network.

(2) Articulation clarity

Here, articulation clarity is used to evaluate the articulatory speech function of preschool children, i.e., the percentage of phonemes that the subject pronounces correctly. The articulation clarity score is calculated as follows:

Score = (C / N) × 100%

where C is the number of correctly pronounced phonemes in the tested speech and N is the total number of phonemes in the word or sentence. As shown in Table 6 below, an articulation clarity score < m − σ indicates the presence of dysarticulation.

Table 6: reference standard for integral sound construction definition of normal children

(3) Clinical interpretation of easily mispronounced initials

As shown in Table 7 below, according to the Mandarin initial-consonant articulation table, the clinical interpretations of the n most frequently mispronounced initials in the phoneme statistics, including their places and manners of articulation, are given, making the tested child's pronunciation problems clearer.

Table 7: mandarin consonant structure sound watch

In summary, an automatic speech function evaluation system dedicated to preschool children is constructed, evaluating the three systems of speech production: respiratory function, vocal function and articulatory function. For respiratory function, the sustained phonation duration of the tested child is evaluated; for vocal function, loudness and pitch are evaluated; for articulatory function, two aspects are considered: articulatory motor function, evaluated via the oral alternating movement rate, and articulatory speech function, evaluated via articulation clarity. Both evaluations of the articulatory system are based on speech recognition.

(4) Adaptive adjustment of the evaluation corpus

Treatment of a speech-impaired patient is a continuous cyclic process requiring long-term intervention and evaluation. On this consideration, the content of each evaluation should be adjusted according to the previous result. In the invention, the probability that each initial consonant's corpus items appear in the next evaluation is updated from the proportion of pronunciation errors for each initial in the current evaluation, and adaptive adjustment of the evaluation corpus is achieved in combination with a random-number-generation algorithm with specified probabilities; the flow is shown in fig. 10.

First, a mapping table from initial consonants to corpus items is constructed from the evaluation corpus; each initial consonant corresponds to a list of words or sentences containing it. In the first evaluation, the corpus items of every initial consonant have an equal probability of being selected. In each subsequent evaluation, the probability weight w_i with which the corpus items of the i-th initial consonant are selected is updated according to the following formula:

w_i = e_i / Σ_j e_j

where the numerator e_i is the pronunciation error rate of the i-th initial consonant in the current evaluation, and the denominator is the sum of the pronunciation error rates of all initial consonants. After the probability weights are updated, the next evaluation uses a random-number-generation algorithm with the specified probabilities to select n non-repeating words or sentences from the lists corresponding to the 23 initial consonants as the evaluation corpus. In this way, the proportion of corpus items corresponding to the phonemes the child pronounced poorly can be increased in the next evaluation.
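One possible realization of this update and of the weighted selection is sketched below. The fallback to uniform weights when no errors were made, and the retry loop for exhausted lists, are assumptions the patent does not specify.

```python
import random

def update_weights(error_rates):
    """w_i = e_i / sum_j e_j over the 23 initials; uniform if error-free."""
    total = sum(error_rates.values())
    if total == 0:
        return {k: 1.0 / len(error_rates) for k in error_rates}
    return {k: e / total for k, e in error_rates.items()}

def select_corpus(corpus_map, weights, n):
    """Draw n non-repeating items: pick an initial by its weight, then an
    unused word or sentence from that initial's list."""
    chosen, used = [], set()
    initials = list(weights)
    probs = [weights[i] for i in initials]
    while len(chosen) < n:
        ini = random.choices(initials, weights=probs, k=1)[0]
        candidates = [w for w in corpus_map[ini] if w not in used]
        if candidates:                     # skip initials whose list is used up
            item = random.choice(candidates)
            used.add(item)
            chosen.append(item)
    return chosen
```

Note that, under this formula, initials with a zero error rate receive zero weight; a smoothing term could be added if items for well-pronounced initials should still occasionally appear.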

To further verify the effect of the invention, the proposed speech function automatic evaluation system based on voice recognition was tested in a children's rehabilitation center: automatic speech function evaluation was performed on preschool children, and correlation analysis was carried out between the system's automatic evaluation results and the results of expert evaluation.

In summary, the speech function automatic evaluation system based on voice recognition provided by the invention at least achieves the following technical effects:

1) An automatic speech function evaluation system for preschool children aged 3-6

Existing automatic speech evaluation systems mainly target adult second-language learning; there is no Chinese automatic speech function evaluation system for preschool children aged 3-6. The serious shortage of professional Chinese speech therapists makes research on a reliable and convenient automatic speech function evaluation system for preschool children urgent. The invention constructs a dedicated ASR system from children's speech data and, by jointly considering preschool children's cognitive ability and the places of articulation of the phonemes in each word, provides more targeted evaluation corpora.

2) More comprehensive automatic evaluation of children's speech function

Existing schemes use a single evaluation standard and can hardly reflect the tested subject's speech function comprehensively. Starting from the three speech physiological systems, the invention performs respiratory function assessment, vocal function assessment and articulatory function assessment on the tested children, grasping their speech function more comprehensively and accurately.

3) Adaptive adjustment of the articulatory speech assessment corpus

Treatment of a speech-impaired patient is a continuous cyclic process requiring long-term intervention and evaluation. According to the invention, the articulatory speech assessment corpus can be adjusted according to the previous assessment result, making the assessment content more targeted.

4) Automatic assessment of articulatory motor function

In the prior art, articulatory movement is assessed by professional speech therapists who quantitatively measure the motor ability of the articulators and their coordination through acoustic analysis, which cannot meet real-time requirements. The invention instead derives the oral alternating movement rate automatically from speech recognition, enabling real-time assessment.

The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), with state information of the computer-readable program instructions, which electronic circuit can execute the computer-readable program instructions.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.
