Method for implementing a mobile phone pronunciation visualization system

Document No.: 1244098    Publication date: 2020-08-18

Reading note: This technology, "Method for implementing a mobile phone pronunciation visualization system" (一种手机端发音可视化系统的实现方法), was designed and created by 徐天一, 赵隆轩, 王建荣, 于瑞国, 于建, 高洁 and 严丽珺 on 2020-04-27. Its main content is as follows: the invention discloses a method for implementing a mobile phone pronunciation visualization system, comprising the steps of establishing a pronunciation motion database and extracting features from the voice data so that the voice data can be converted into feature vectors that a machine can process and analyze, these features then forming a data set together with the original data; training a selected GMM model on this data set to obtain a pronunciation visualization model; and connecting the pronunciation visualization model to a pre-created sound recording module and a visualization module and deploying them on a mobile platform to form the mobile phone pronunciation visualization system. The resulting system runs on mobile platforms such as mobile phones, displays articulatory movements directly on the phone screen, and makes it easy to understand the articulatory movements behind a user's speech through an intuitive visual display.

1. A method for implementing a mobile phone pronunciation visualization system, characterized by comprising the following steps:

S1, establishing a pronunciation motion database, and extracting features from the voice data so that the voice data can be converted into feature vectors that a machine can process and analyze, these features then forming a data set together with the original data;

S2, training a selected GMM model with the data set to obtain a pronunciation visualization model;

and S3, connecting the pronunciation visualization model and a pre-created sound recording module to a visualization module, and deploying them on a mobile platform to form the mobile phone pronunciation visualization system.

2. The method for implementing a mobile phone pronunciation visualization system according to claim 1, wherein each pair of data files in the data set comprises an MFCC data file and a corresponding EMA data file; the MFCC data file describes the characteristics of the voice data, and the EMA data file describes the articulatory movements made when the corresponding voice data was produced.

3. The method for implementing a mobile phone pronunciation visualization system according to claim 1, wherein the audio data collected by the sound recording module is raw PCM data that is converted into WAV format, so that the voice signal obtained from the mobile phone can be converted into voice feature vectors that a machine can recognize and process.

4. The method for implementing a mobile phone pronunciation visualization system according to claim 1, wherein the feature extracted from the voice data is the MFCC feature.

5. The method for implementing a mobile phone pronunciation visualization system according to claim 1, wherein the pronunciation visualization model is a GMM mapping model based on max_comp, whose output is 14-dimensional data corresponding to the horizontal and vertical coordinates of seven different oral organs recorded in the EMA data.

6. The method for implementing a mobile phone pronunciation visualization system according to claim 1, wherein the processing performed by the visualization module comprises the following steps:

establishing a coordinate system, preprocessing the data predicted in real time by the pronunciation visualization model, converting the coordinates, plotting the points in the established coordinate system in real time, and thereby realizing a real-time dynamic display.

Technical Field

The invention relates to the technical field of speech recognition and visualization, in particular to a method for implementing a mobile phone pronunciation visualization system.

Background

Traditional pronunciation teaching uses a listen-and-repeat imitation mode: students must identify their own problems through auditory feedback and then try repeatedly until they find the correct way to pronounce, which makes learning inefficient. Visualization of the pronunciation process shows the movements of the articulatory organs during a person's speech in the form of images, providing visual feedback during pronunciation so that learners can find problems faster and reach the correct pronunciation sooner. Visualizing the pronunciation process can also help hearing-impaired people correct their pronunciation.

However, existing pronunciation visualization systems cannot be used on terminals such as mobile phones, so developing a pronunciation visualization system suitable for mobile phones is of considerable significance.

Disclosure of Invention

In view of the technical shortcomings of the prior art, the invention aims to provide a method for implementing a mobile phone pronunciation visualization system that visualizes the physiological information of the articulatory organs while a user speaks.

The technical solution adopted to achieve this purpose is as follows:

A method for implementing a mobile phone pronunciation visualization system comprises the following steps:

S1, establishing a pronunciation motion database, and extracting features from the voice data so that the voice data can be converted into feature vectors that a machine can process and analyze, these features then forming a data set together with the original data;

S2, training a selected GMM model with the data set to obtain a pronunciation visualization model;

and S3, connecting the pronunciation visualization model and a pre-created sound recording module to a visualization module, and deploying them on a mobile platform to form the mobile phone pronunciation visualization system.

Each pair of data files in the data set comprises an MFCC data file and a corresponding EMA data file; the MFCC data file describes the characteristics of the voice data, and the EMA data file describes the articulatory movements made when the corresponding voice data was produced.

The audio data collected by the sound recording module is raw PCM data that is converted into WAV format, so that the voice signal obtained from the mobile phone can be converted into voice feature vectors that a machine can recognize and process.

The feature extracted from the voice data is the MFCC feature.

The pronunciation visualization model is a GMM mapping model based on max_comp; its output is 14-dimensional data corresponding to the horizontal and vertical coordinates of seven different oral organs recorded in the EMA data.

The processing performed by the visualization module comprises the following steps:

establishing a coordinate system, preprocessing the data predicted in real time by the pronunciation visualization model, converting the coordinates, plotting the points in the established coordinate system in real time, and thereby realizing a real-time dynamic display.

The mobile phone pronunciation visualization system formed by the invention runs on mobile platforms such as mobile phones, displays articulatory movements visually on the phone screen, and makes it easy to understand the articulatory movements behind a user's speech through an intuitive display.

Drawings

FIG. 1 is a schematic block diagram of the mobile phone pronunciation visualization system;

FIG. 2 is a schematic diagram of the horizontal and vertical coordinates of the seven different oral organs;

FIG. 3 is a schematic diagram of 7 coordinate nodes shown in a view coordinate system;

FIG. 4 is a flow chart for implementing the dynamic display function;

FIGS. 5-7 show the trajectories of the estimated and true abscissa of the first articulator, the soft palate, when MMSE, MLE, and max_comp are used, respectively.

Detailed Description

The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in FIG. 1, the mobile phone pronunciation visualization system of the present invention is implemented by the following method:

step S0101: pronunciation movement database establishment

The database of the required articulatory movements was constructed from the EMA data and WAV sound data of the MultiCHannel Articulatory (MOCHA) database developed at Queen Margaret University College, Edinburgh. Data acquisition for the MOCHA database records the acoustic waveform (16 kHz, 16 bit) and the EMA data synchronously to a computer.

Step S0201: WAV sound data feature extraction

Feature values are extracted from the sound so that the voice data in the database can be converted into feature vectors that a machine can process and analyze. The specific steps for extracting MFCCs with python_speech_features are:

(1) Install the python_speech_features library: pip install python_speech_features;

(2) Import the required libraries: the speech feature library python_speech_features; the scipy.io.wavfile library, which loads WAV audio (WAV files are used as the speech input data during speech processing); and the numpy library for handling numpy arrays;

(3) Obtain the sampling rate of the recorded WAV file and the voice signal data array; the function scipy.io.wavfile.read() returns the result pair (sampling rate: samplerate, voice signal array: signal);

(4) Invoke the python_speech_features.mfcc() function to extract the MFCC feature values, called as follows:

python_speech_features.mfcc(signal, samplerate, winlen=0.025, winstep=0.01, numcep=12, nfilt=26, nfft=512, lowfreq=0, highfreq=None, preemph=0.97, winfunc=numpy.hamming);

(5) Save the resulting feature value numpy array to a specified path as a standard .npy file, using np.save(file_path, feature_mfcc);

(6) To inspect the feature value vectors in the resulting .npy file, first load it with np.load().
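Taken together, steps (3) to (6) amount to a few library calls. The following is a minimal sketch of this extraction, with placeholder file paths:

    import numpy as np
    from scipy.io import wavfile
    from python_speech_features import mfcc

    wav_path = "example.wav"        # hypothetical input WAV file
    npy_path = "example_mfcc.npy"   # hypothetical output path

    # (3) load the recorded WAV file: sampling rate + raw signal array
    samplerate, signal = wavfile.read(wav_path)

    # (4) extract the MFCC feature matrix with the parameters listed above
    feature_mfcc = mfcc(signal, samplerate,
                        winlen=0.025, winstep=0.01,
                        numcep=12, nfilt=26, nfft=512,
                        lowfreq=0, highfreq=None,
                        preemph=0.97, winfunc=np.hamming)

    # (5) save the feature matrix as an .npy file
    np.save(npy_path, feature_mfcc)

    # (6) reload and inspect the saved feature vectors
    print(np.load(npy_path).shape)  # (number of frames, 12)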

Step S0301: Dividing the data set.

The data set consists of 460 pairs of data files, where each pair includes one MFCC data file and one corresponding EMA data file. The MFCC data describe the characteristics of the voice data, and the EMA data describe the articulatory movements made when the corresponding voice data was produced.

The data set is divided into three parts: one part, used as the training set, contains 80% of the whole data set; the other two parts, used as the test set and the validation set respectively, each account for 10% of the whole data set.
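As an illustration, a sketch of this split, assuming the data set is held as a list of (MFCC file, EMA file) path pairs (the function name and seed are hypothetical):

    import random

    def split_dataset(pairs, seed=0):
        # shuffle the 460 (MFCC file, EMA file) pairs reproducibly
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)
        n = len(pairs)
        n_train = int(0.8 * n)   # 80% training
        n_test = int(0.1 * n)    # 10% test
        train = pairs[:n_train]
        test = pairs[n_train:n_train + n_test]
        valid = pairs[n_train + n_test:]  # remaining ~10% validation
        return train, test, valid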

Step S0401: Constructing the GMM mapping model.

Execute the command python ./inverse_GMM.py -t -s -n 64 -C 'full' to train the model; the final trained GMM model is stored locally with joblib as gmm_combined_64components_11frames_trajectory.
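The script inverse_GMM.py itself is not reproduced here; the sketch below only illustrates what the command is assumed to do under these flags: fit a 64-component, full-covariance GMM on joint (acoustic, articulatory) vectors and persist it with joblib. The helper name, the use of sklearn's GaussianMixture, and the .pkl extension are assumptions.

    import numpy as np
    import joblib
    from sklearn.mixture import GaussianMixture

    def train_joint_gmm(X, Y, n_components=64):
        # X: stacked MFCC context vectors (the file name suggests 11-frame windows),
        # Y: matching 14-dimensional EMA frames
        Z = np.hstack([X, Y])                      # joint feature space
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="full",   # the -C 'full' flag
                              max_iter=200, random_state=0)
        gmm.fit(Z)
        # file name follows the one mentioned above; the extension is assumed
        joblib.dump(gmm, "gmm_combined_64components_11frames_trajectory.pkl")
        return gmm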

Step S0501: Implementing the pronunciation visualization system.

(1) Implementation of the sound recording module

Audio is collected using AudioRecord, a recording class provided by Android.

First, an AudioRecord object is constructed and the basic parameters of the class are set.

Second, a buffer associated with the created object is initialized to store the sound data; newly recorded sound data is written into it.

Third, collection and reading of the audio data begins.

Fourth, collection of the audio data stops, yielding the raw PCM audio data.

Fifth, the raw PCM data is converted into WAV format, so that the voice signal obtained from the mobile phone can be converted into voice feature vectors that a machine can recognize and process.
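The recording module itself is Android code built on AudioRecord; the Python sketch below only illustrates the logic of this fifth step, wrapping raw 16 kHz, 16-bit mono PCM samples in a WAV header (the file paths and helper name are placeholders):

    import wave

    def pcm_to_wav(pcm_path, wav_path, samplerate=16000, channels=1, sampwidth=2):
        # read the raw PCM samples produced by the recording step
        with open(pcm_path, "rb") as f:
            pcm_data = f.read()
        # write them out with a WAV header describing the format
        with wave.open(wav_path, "wb") as wf:
            wf.setnchannels(channels)   # mono
            wf.setsampwidth(sampwidth)  # 16-bit samples -> 2 bytes
            wf.setframerate(samplerate) # 16 kHz
            wf.writeframes(pcm_data)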

(2) Implementation of a prediction module

First, MFCC sound features are extracted from the user's currently captured input sound by calling the mfcc function of the python_speech_features library, yielding the corresponding feature file.

Second, the extracted MFCCs are used as the X input for model prediction, and the GMM mapping model based on max_comp predicts the Y output, which is 14-dimensional data corresponding to the horizontal and vertical coordinates of the seven different oral organs (soft palate, tongue root, tongue body, tongue tip, lower incisor, lower lip, upper lip) recorded in the predicted EMA data, as shown in FIG. 2.
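The prediction code is not shown in the document; the following is a hedged sketch of one common formulation of a max_comp GMM mapping, offered as an assumption about what the model computes: for each acoustic vector x, select the mixture component with the highest responsibility given x alone, and output that component's conditional mean of the articulatory part. The function name and the dim_x parameter are illustrative; gmm is a GaussianMixture fitted on joint [acoustic, EMA] vectors.

    import numpy as np
    from scipy.stats import multivariate_normal

    def predict_max_comp(gmm, X, dim_x):
        means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
        Y_hat = np.zeros((len(X), means.shape[1] - dim_x))
        for i, x in enumerate(X):
            # responsibility of each component given the acoustic part only
            log_resp = np.array([
                np.log(weights[k]) + multivariate_normal.logpdf(
                    x, means[k, :dim_x], covs[k][:dim_x, :dim_x])
                for k in range(len(weights))])
            k = int(np.argmax(log_resp))
            mu_x, mu_y = means[k, :dim_x], means[k, dim_x:]
            S_xx = covs[k][:dim_x, :dim_x]
            S_yx = covs[k][dim_x:, :dim_x]
            # conditional mean of the articulatory part under component k
            Y_hat[i] = mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x)
        return Y_hat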

(3) Implementation of the visualization module

In the first step, a coordinate system is established.

1. Establish a custom view coordinate system: determine the origin of the view coordinate system by calling getHeight() and getWidth() to obtain the height and width of the current view, then compute the origin coordinates.

2. Draw the horizontal and vertical axes and their scales using the Canvas drawing functions.

In the second step, the static point-drawing function is implemented, as shown in FIG. 2.

1. Preprocess the data: read the data, call the replace() function to remove special characters, and store the processed data in a DataPointList.

2. Convert the coordinate system: convert the coordinates of the data stored in the DataPointList into the view coordinate system established in the first step, and store the converted coordinates in a PointList (a sketch of this conversion follows the list below).

3. Call the drawPoint() function provided by the Canvas class to mark the seven organs in the PointList in the custom view coordinate system according to their coordinates. FIG. 3 shows the 7 coordinate nodes displayed in the view coordinate system.
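As an illustration of the conversion in step 2 (the per-organ (x, y) layout of the 14-dimensional output and the view parameters are assumptions; the app itself performs this inside the Android view code):

    # one frame of model output -> seven named points in view coordinates
    ORGANS = ["soft palate", "tongue root", "tongue body", "tongue tip",
              "lower incisor", "lower lip", "upper lip"]

    def frame_to_view_points(frame14, origin_x, origin_y, scale):
        points = []
        for i, name in enumerate(ORGANS):
            ema_x, ema_y = frame14[2 * i], frame14[2 * i + 1]  # assumed (x, y) interleaving
            view_x = origin_x + scale * ema_x
            view_y = origin_y - scale * ema_y  # screen y axis points downward
            points.append((name, view_x, view_y))
        return points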

In the third step, the dynamic display function is implemented.

When displaying the previously predicted articulatory data, not only are the point coordinates of the 7 organs at each moment displayed, but the states at all moments are shown in succession to realize the dynamic display function.

In the concrete implementation, each moment is therefore treated as a group containing 7 point coordinates, and a control that automatically switches to and reads the next group is designed to produce the simulated dynamic demonstration. The 14-dimensional data is restored to the horizontal and vertical coordinates of the 7 articulatory organs, these coordinates are mapped into the custom View coordinate system by coordinate transformation, and the movement of oral articulation is represented by the 7 oral organ point coordinates.

Furthermore, by constructing a loop, the program continuously and sequentially displays the 7 point coordinates, realizing a visual dynamic display of pronunciation. The implementation flow is shown in FIG. 4.

1. Set a data change listener. The data is grouped, each group containing 14-dimensional data, i.e. 7 coordinate pairs. When the group is switched, the callback onDataChanged(int group) of the data change listener DataChangeListener is invoked to obtain the group number groupId to be displayed in the next frame and update the coordinate display;

2. Repeat this until all data groups are exhausted, then call the DataChangeListener callback onDataEnd() to notify the current Activity that the data has run out;

3. After the Activity receives the data-completion notification, the start/pause label of the automatic group-switching control is changed back to 'start', and the dynamic display task ends.

A mapping model can be further built on the conventional GMM obtained by training: a mapping function from X to Y is obtained, and a suitable loss function is chosen to describe how close the estimate Ŷ is to the true value Y. Three GMM mapping models based on different estimation modes (loss functions) were therefore investigated: GMM mapping based on MMSE, GMM mapping based on MLE, and GMM mapping based on max_comp, and their impact on the results was analyzed separately.

In the present invention, the above three models are evaluated using two indexes: the estimation error RMSE, which measures the error between the true and predicted values at each point, and R^2, which describes the overall degree of agreement between the true and predicted values.
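For reference, a sketch of how the two indexes can be computed per channel (Y_true and Y_pred are assumed to be arrays of shape (frames, 14)):

    import numpy as np

    def rmse_per_channel(Y_true, Y_pred):
        # root-mean-square error between true and estimated trajectories
        return np.sqrt(np.mean((Y_true - Y_pred) ** 2, axis=0))

    def r2_per_channel(Y_true, Y_pred):
        # coefficient of determination: 1 - residual sum / total sum of squares
        ss_res = np.sum((Y_true - Y_pred) ** 2, axis=0)
        ss_tot = np.sum((Y_true - Y_true.mean(axis=0)) ** 2, axis=0)
        return 1.0 - ss_res / ss_tot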

The experimental data set consists of 460 pairs of data files, each pair comprising one MFCC data file and one corresponding EMA data file. The MFCC data describe the characteristics of the voice data, and the EMA data describe the articulatory movements made when the corresponding voice data was produced.

A GMM mapping model is constructed, the loss function is computed with the MMSE, MLE, and max_comp estimation modes respectively, and the model is trained.

First, the RMSE between the true and estimated values of each channel and their goodness of fit R^2 are calculated under the three estimation modes MMSE, MLE, and max_comp. Tables 1 and 2 show the RMSE and R^2 values of each channel under the three estimation modes, respectively.

TABLE 1. RMSE of each channel under the three estimation modes

TABLE 2. R^2 of each channel under the three estimation modes

Analysis of the results: RMSE is an error value; the smaller the RMSE, the smaller the deviation between the true and estimated values. Comparing the RMSE values of the three estimation modes over the 14 channels shows that the values predicted by the GMM mapping based on max_comp deviate least from the true values, so under the RMSE index the max_comp mapping model has the highest estimation accuracy and performs best. The RMSE values of the three estimation modes are approximately ordered as max_comp < MLE < MMSE, corresponding to model performance under this index of max_comp > MLE > MMSE.

R^2 measures the agreement between the predicted and true values; it describes the overall fit between the true trajectory curve and the estimated curve. The larger R^2 is and the closer it is to 1, the better the two curves fit. Comparing the R^2 values of the three estimation modes over the 14 channels shows that, with the max_comp estimation mode, R^2 is larger and close to 1 in more cases, so the fit between the true and predicted values is better than with the other two estimation modes. Therefore, under the R^2 index, the GMM mapping model based on max_comp again shows the best performance.

Second, for the same MFCC sound features, the trajectory curves of the estimated and true values of the 14 channels are plotted. FIGS. 5, 6, and 7 show the trajectories of the estimated and true abscissa of the first articulator, the soft palate, when MMSE, MLE, and max_comp are used, respectively.

Analysis of the results: comparing FIGS. 5, 6, and 7 shows that the trajectory curve of the predicted values obtained with the GMM mapping model based on max_comp is the smoothest, followed by the mapping based on MLE, and finally MMSE.

In conclusion, according to the RMSE and R^2 performance indexes and the smoothness of the estimated trajectory curves, the GMM mapping model based on max_comp performs better than the other two models, and it is therefore selected for the subsequent implementation of the visualization system.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.
