Method, apparatus, device and storage medium for silent communication

Document No.: 989492    Publication date: 2020-11-06

Reading note: This document, "Method, apparatus, device and storage medium for silent communication" (一种缄默通讯方法、装置、设备及存储介质), was designed and created on 2020-07-20 by 闫野, 赵涛, 印二威, 邓宝松, 霍晓凯, 徐梦菲, 范晓丽 and 谢良. Its main content is as follows: the invention discloses a silent communication method, apparatus, device and storage medium, the method comprising: synchronously acquiring facial electromyographic (EMG) signals, lip optical images and oral ultrasound images; preprocessing the facial EMG signals, lip optical images and oral ultrasound images and extracting features to obtain processed feature data; inputting the processed feature data into a pre-trained silent speech recognition model to obtain the silent speech uttered when speaking silently; and transmitting the silent speech to a voice receiving device. The silent communication method of the invention fuses three characteristic signals commonly used in silent speech recognition technology, namely facial EMG signals, lip images and oral ultrasound images, and can obtain recognition results with higher accuracy and faster recognition speed.

1. A silent communication method, comprising:

synchronously acquiring facial electromyographic signals, lip optical images and oral ultrasound images;

performing preprocessing and feature extraction on the facial electromyographic signals, the lip optical images and the oral ultrasound images to obtain processed feature data;

inputting the processed feature data into a pre-trained silent speech recognition model to obtain the silent speech uttered when speaking silently;

and transmitting the silent speech to a voice receiving device.

2. The method of claim 1, wherein the preprocessing of the facial electromyographic signals, the lip optical images and the oral ultrasound images comprises:

performing filtering, noise reduction, active-segment extraction, data normalization and baseline removal on the facial electromyographic signals to obtain preprocessed facial electromyographic signals;

performing grayscale conversion, target-region cropping and image compression on the lip optical images to obtain preprocessed lip optical images;

and performing data-smoothing noise reduction and target-region cropping on the oral ultrasound images to obtain preprocessed oral ultrasound images.

3. The method of claim 2, wherein performing feature extraction on the preprocessed facial electromyographic signals, the preprocessed lip optical images and the preprocessed oral ultrasound images to obtain the processed feature data comprises:

extracting Mel-frequency cepstral coefficients from the preprocessed facial electromyographic signals to obtain dynamic feature data of facial muscle movement;

extracting lip motion feature data from the preprocessed lip optical images by principal component analysis;

and extracting motion feature data of the oral cavity and tongue from the preprocessed oral ultrasound images by discrete cosine transform.

4. The method of claim 1, further comprising, before inputting the processed feature data into the pre-trained silent speech recognition model:

training the silent speech recognition model by a deep learning method according to the processed feature data.

5. The method of claim 4, wherein training the silent speech recognition model by a deep learning method comprises:

training the silent speech recognition model with a convolutional neural network algorithm in deep learning; or

training the silent speech recognition model with a long short-term memory neural network algorithm in deep learning.

6. The method of claim 1, wherein transmitting the silent speech to a voice receiving device comprises:

transmitting the silent speech to the voice receiving device by wireless communication.

7. A silent communication apparatus, comprising:

a data acquisition module for synchronously acquiring facial electromyographic signals, lip optical images and oral ultrasound images;

a data processing module for preprocessing the facial electromyographic signals, the lip optical images and the oral ultrasound images and extracting features to obtain processed feature data;

a recognition module for inputting the processed feature data into a pre-trained silent speech recognition model to obtain the silent speech uttered when speaking silently;

and a communication module for transmitting the silent speech to a voice receiving device.

8. The apparatus of claim 7, further comprising:

a model training module for training the silent speech recognition model by a deep learning method according to the processed feature data.

9. A silent communication device comprising a processor and a memory storing program instructions, wherein the processor is configured to perform the silent communication method of any one of claims 1 to 6 when executing the program instructions.

10. A computer-readable medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor to implement the silent communication method of any one of claims 1 to 6.

Technical Field

The present invention relates to the field of communication technologies, and in particular, to a silent communication method, apparatus, device and storage medium.

Background

Silent communication methods that do not rely on acoustic signals have many uses: in the medical field they can help patients with speech disorders communicate; they can be used for communication at disaster relief sites such as fires and chemical accidents; and they serve covert communication in military command and operations.

Silent speech recognition techniques for silent communication have developed along several lines; the non-acoustic signals and methods used mainly fall into the following categories: converting oral ultrasound images and lip optical image data into speech signals; using surface electromyography (sEMG) sensors to collect the electrical signals of facial and throat muscle movement during articulation and reconstruct the articulation process; using an electromagnetic articulograph to record the movement of each articulator during speech; and analyzing electroencephalogram (EEG) signals, simulating the speech production process from the recorded EEG of a speaker. In recent years, silent speech techniques based on lip optical images, oral ultrasound imaging and surface EMG have been used increasingly in silent communication; all are non-invasive and clinically safe.

However, the prior art studies each of these non-acoustic acquisition and silent speech recognition technologies in isolation; owing to their low robustness and generalization capability, the resulting speech recognition accuracy is low and struggles to meet the requirements of efficient silent communication.

Disclosure of Invention

The embodiments of the present disclosure provide a silent communication method, apparatus, device and storage medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended neither to identify key or critical elements nor to delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

In a first aspect, an embodiment of the present disclosure provides a silent communication method, including:

synchronously acquiring facial electromyographic (EMG) signals, lip optical images and oral ultrasound images;

preprocessing the facial EMG signals, lip optical images and oral ultrasound images and extracting features to obtain processed feature data;

inputting the processed feature data into a pre-trained silent speech recognition model to obtain the silent speech uttered when speaking silently;

and transmitting the silent speech to a voice receiving device.

Further, the preprocessing of the facial EMG signals, lip optical images and oral ultrasound images includes:

performing filtering, noise reduction, active-segment extraction, data normalization and baseline removal on the facial EMG signals to obtain preprocessed facial EMG signals;

performing grayscale conversion, target-region cropping and image compression on the lip optical images to obtain preprocessed lip optical images;

and performing data-smoothing noise reduction and target-region cropping on the oral ultrasound images to obtain preprocessed oral ultrasound images.

Further, performing feature extraction on the preprocessed facial EMG signals, lip optical images and oral ultrasound images to obtain the processed feature data includes:

extracting Mel-frequency cepstral coefficients from the preprocessed facial EMG signals to obtain dynamic feature data of facial muscle movement;

extracting lip motion feature data from the preprocessed lip optical images by principal component analysis;

and extracting motion feature data of the oral cavity and tongue from the preprocessed oral ultrasound images by discrete cosine transform.

Further, before the processed feature data are input into the pre-trained silent speech recognition model, the method further includes:

training the silent speech recognition model by a deep learning method according to the processed feature data.

Further, training the silent speech recognition model by a deep learning method includes:

training the silent speech recognition model with a convolutional neural network algorithm in deep learning; or

training the silent speech recognition model with a long short-term memory neural network algorithm in deep learning.

Further, transmitting the silent speech to the voice receiving device includes:

transmitting the silent speech to the voice receiving device by wireless communication.

In a second aspect, an embodiment of the present disclosure provides a silent communication apparatus, including:

a data acquisition module for synchronously acquiring facial EMG signals, lip optical images and oral ultrasound images;

a data processing module for preprocessing the facial EMG signals, lip optical images and oral ultrasound images and extracting features to obtain processed feature data;

a recognition module for inputting the processed feature data into a pre-trained silent speech recognition model to obtain the silent speech uttered when speaking silently;

and a communication module for transmitting the silent speech to the voice receiving device.

Further, the apparatus also includes:

a model training module for training the silent speech recognition model by a deep learning method according to the processed feature data.

In a third aspect, an embodiment of the present disclosure provides a silent communication device, including a processor and a memory storing program instructions, wherein the processor is configured to execute the silent communication method provided in the above embodiments when executing the program instructions.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement the silent communication method provided by the above embodiments.

The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:

the silent communication method of the disclosed embodiments fuses three feature signals commonly used in silent speech recognition technology, namely facial EMG signals, lip optical images and oral ultrasound images; by preprocessing the three signals, extracting their features and training a recognition model on them, silent speech recognition results with higher accuracy and faster recognition speed can be obtained.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

Fig. 1 is a flow diagram illustrating a method of silencing communications according to an exemplary embodiment;

fig. 2 is a flow diagram illustrating a method of silencing communications according to an example embodiment;

fig. 3 is a schematic diagram illustrating a method of silencing communications according to an exemplary embodiment;

FIG. 4 is a schematic diagram illustrating a facial electromyographic signal acquisition, according to an exemplary embodiment;

FIG. 5 is a schematic illustration of a lip optical image and oral ultrasound image acquisition according to an exemplary embodiment;

fig. 6 is a schematic structural diagram illustrating a silencing communication device according to an exemplary embodiment;

fig. 7 is a schematic diagram illustrating a structure of a silencing communication device according to an exemplary embodiment;

fig. 8 is a schematic diagram illustrating a structure of a silencing communication device according to an exemplary embodiment;

FIG. 9 is a schematic diagram illustrating a computer storage medium in accordance with an exemplary embodiment.

Detailed Description

So that the manner in which the features and elements of the disclosed embodiments can be understood in detail, a more particular description of the disclosed embodiments, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. In the following description of the technology, for purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the disclosed embodiments. However, one or more embodiments may be practiced without these details. In other instances, well-known structures and devices may be shown in simplified form in order to simplify the drawing.

The silent communication method of the disclosed embodiments synchronously acquires facial EMG signals, lip optical images and oral ultrasound images; preprocesses the three signals and extracts their features to obtain processed feature data; trains a silent speech recognition model according to the processed feature data; and then recognizes the silent speech uttered in silent mode.

The silent communication method, apparatus, device and storage medium provided in the embodiments of the present application are described in detail below with reference to Figs. 1 to 9.

Referring to Fig. 1, the method specifically includes the following steps.

Step S101: synchronously acquire the facial EMG signals, lip optical images and oral ultrasound images.

Specifically, when a person speaks, the movement of facial muscles corresponds to distinct neuroelectrical activity, so the silent speech uttered in silent mode can be analyzed by collecting facial EMG signals. Fig. 4 is a schematic diagram of facial EMG signal acquisition according to an exemplary embodiment. As shown in Fig. 4, surface electrodes serve as lead electrodes: they are placed on the skin of the face and throat around the mouth and, in close contact with the skin over the active muscles, measure the summed potential of muscular electrical activity at each detection electrode to obtain the facial EMG signal. In one possible implementation, the sampling rate is 1000 Hz and the raw EMG recording is a 6-channel one-dimensional signal.

Lip optical images and oral ultrasound images are acquired at the same time as the facial EMG signals. Fig. 5 is a schematic diagram of lip image and oral ultrasound image acquisition according to an exemplary embodiment. Both can be captured by a head-mounted device: an ultrasound probe attached under the jaw by the head-mounted device provides a view of the sagittal plane of the tongue and clearly shows the motion of the tongue surface during speech. The sampling rate is set to 60 Hz, and the raw oral ultrasound images are grayscale images with a resolution of 320 × 240 pixels. Ultrasound imaging is non-invasive and clinically safe, and allows real-time visualization of the tongue, the most important articulatory organ of the human body.

To obtain more non-acoustic feature information, the head-mounted device places a small camera directly in front of the speaker's lips to capture real-time optical images of the lips during speech. The sampling rate is set to 60 Hz, and the raw lip image data are RGB color images with a resolution of 480 × 320 pixels.
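Collecting the acquisition parameters stated above in one place, a minimal configuration sketch follows; the class and field names are illustrative, not part of the patent.

```python
# Acquisition parameters as stated in the description above; names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class AcquisitionConfig:
    emg_sample_rate_hz: int = 1000         # 6-channel one-dimensional surface EMG
    emg_channels: int = 6
    ultrasound_fps: int = 60               # grayscale sagittal tongue view
    ultrasound_resolution: tuple = (320, 240)
    lip_camera_fps: int = 60               # RGB lip camera on the head-mounted device
    lip_resolution: tuple = (480, 320)

config = AcquisitionConfig()
```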

Through this step, the facial EMG signals, lip optical images and oral ultrasound images are acquired synchronously.

Step S102: preprocess the facial EMG signals, lip optical images and oral ultrasound images and perform feature extraction to obtain processed feature data.

First, the collected facial EMG signals, lip optical images and oral ultrasound images are preprocessed.

Specifically, the collected facial EMG signals are preprocessed by filtering, noise reduction, active-segment extraction, data normalization and baseline removal to obtain the preprocessed facial EMG signals. In one possible implementation, a 50 Hz Chebyshev type-I IIR notch filter first removes power-frequency interference, then a 10-400 Hz Butterworth type-I IIR band-pass filter performs filtering and noise reduction, and finally the data of the valid active segments are extracted and normalized, yielding facial EMG data with a high signal-to-noise ratio.
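A minimal SciPy sketch of this filtering chain is given below; the filter orders, the Chebyshev ripple, the notch bandwidth and the z-score normalization are assumptions not fixed by the text, and the active-segment extraction step is omitted.

```python
# A minimal sketch of the EMG preprocessing chain described above, using SciPy.
import numpy as np
from scipy import signal

FS = 1000  # EMG sampling rate in Hz, as stated above

def preprocess_emg(emg: np.ndarray) -> np.ndarray:
    """emg: array of shape (channels, samples)."""
    # 50 Hz Chebyshev type-I IIR band-stop trap to remove power-line interference
    b_notch, a_notch = signal.iirfilter(
        N=4, Wn=[48, 52], rp=0.5, btype="bandstop", ftype="cheby1", fs=FS)
    x = signal.filtfilt(b_notch, a_notch, emg, axis=-1)
    # 10-400 Hz Butterworth IIR band-pass for denoising
    b_bp, a_bp = signal.butter(N=4, Wn=[10, 400], btype="bandpass", fs=FS)
    x = signal.filtfilt(b_bp, a_bp, x, axis=-1)
    # Baseline removal and per-channel normalization (z-score assumed)
    x = x - x.mean(axis=-1, keepdims=True)
    return x / (x.std(axis=-1, keepdims=True) + 1e-8)
```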

Specifically, the collected lip optical images are preprocessed by grayscale conversion, target-region cropping and image compression to obtain the preprocessed lip optical images. In one possible implementation, each lip image is first converted to grayscale, and the data are then averaged over consecutive points to smooth them and suppress abrupt noise. Because training directly on whole images involves a large data volume and hurts processing efficiency, only the required part of the image, the target region, is processed: the target region is an area selected from the original image as the focus of image analysis, the aim being to discard interfering information. The target region of the moving lip image is cropped with a rectangle of fixed size 360 × 240, and the image is then compressed to a resolution of 200 × 150 pixels by linear interpolation.

Specifically, the collected oral ultrasound images are preprocessed by data-smoothing noise reduction and target-region cropping to obtain the preprocessed oral ultrasound images. In one possible implementation, the data are averaged over consecutive points to smooth them and suppress abrupt noise; and since training directly on whole images involves a large data volume and hurts processing efficiency, the target region of the oral ultrasound image is cropped with a rectangle of fixed size 360 × 240.
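The following OpenCV sketch illustrates this image preprocessing for both the optical and the ultrasound frames. The crop origin (x0, y0) and the temporal smoothing window are assumptions; the text fixes only the 360 × 240 crop size and the 200 × 150 compressed lip resolution.

```python
# A minimal sketch of the lip-image and ultrasound preprocessing described above.
import cv2
import numpy as np

CROP_W, CROP_H = 360, 240  # fixed target-region size stated in the text

def preprocess_lip_frame(frame_bgr: np.ndarray, x0: int, y0: int) -> np.ndarray:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)           # grayscale conversion
    roi = gray[y0:y0 + CROP_H, x0:x0 + CROP_W]                   # target-region crop
    return cv2.resize(roi, (200, 150), interpolation=cv2.INTER_LINEAR)  # compression

def preprocess_ultrasound_frame(frame_gray: np.ndarray, x0: int, y0: int) -> np.ndarray:
    return frame_gray[y0:y0 + CROP_H, x0:x0 + CROP_W]            # target-region crop

def smooth_frames(frames: np.ndarray, k: int = 3) -> np.ndarray:
    """Mean over k consecutive frames per pixel to suppress abrupt noise."""
    kernel = np.ones(k) / k
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), 0, frames)
```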

Further, feature extraction is performed on the preprocessed facial EMG signals, lip optical images and oral ultrasound images to obtain the processed feature data, as follows.

and extracting a Mel frequency cepstrum coefficient from the preprocessed facial electromyographic signals, and a first-order difference coefficient and a second-order difference coefficient containing dynamic characteristic information to obtain dynamic characteristic data of facial muscle movement, wherein the dynamic characteristic data can be directly used for neural network training and recognition.

Lip motion feature data are extracted from the preprocessed lip optical images by principal component analysis.

Principal component analysis is applied to the preprocessed lip optical images mainly to obtain a feature space of the moving lip images: it reduces the dimensionality of the data as far as possible while preserving the basic motion information of the lips, extracts the image features, and removes redundant information. The specific steps are as follows:

First, each image is vectorized: a lip image of size m × n is unrolled row by row into a vector χ of size mn × 1.

The mean of the vectors of the N training images is

$$\bar{\chi} = \frac{1}{N}\sum_{i=1}^{N}\chi_i$$

and the covariance matrix of the N images is

$$C = \frac{1}{N}\sum_{i=1}^{N}(\chi_i - \bar{\chi})(\chi_i - \bar{\chi})^{\mathsf{T}}.$$

The eigenvectors and eigenvalues of C are then computed. The magnitude of an eigenvalue is the variance of the lip images projected onto the corresponding direction of the new space, and the directions with large projected variance are chosen as the directions onto which the original vectors are projected. The eigenvectors are sorted by decreasing eigenvalue; the images corresponding to the eigenvectors are the feature sub-images, and the more sub-images the features contain, the more feature information of the data they carry and the better the reconstruction.
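A minimal NumPy sketch of these PCA steps follows; the number of retained components is an assumption, and for full-resolution frames an SVD (or the N × N Gram-matrix trick) would replace the explicit covariance eigendecomposition in practice.

```python
# A minimal sketch of the PCA feature extraction described above.
import numpy as np

def pca_lip_features(images: np.ndarray, n_components: int = 32):
    """images: (N, m, n) preprocessed lip frames; returns projections and basis."""
    N = images.shape[0]
    X = images.reshape(N, -1).astype(np.float64)   # vectorize: each row is chi_i
    mean = X.mean(axis=0)                          # mean vector chi_bar
    Xc = X - mean
    C = (Xc.T @ Xc) / N                            # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues/eigenvectors of C
    order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
    W = eigvecs[:, order[:n_components]]           # leading feature sub-images
    return Xc @ W, (mean, W)                       # projected features, PCA basis
```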

Motion feature data of the oral cavity and tongue are extracted from the preprocessed oral ultrasound images by discrete cosine transform.

The preprocessed oral ultrasound image is processed by the discrete cosine transform, an image transformation technique that uses the real part of the Fourier transform to convert a spatial signal to the frequency domain. The DCT converts the gray values of the original image into frequency-domain DCT coefficients; most of the information of the transformed image is concentrated in the upper-left corner, so the features of the original ultrasound image are obtained by selecting some of the upper-left coefficients, while high-frequency noise can be ignored. Moreover, each tongue ultrasound image can be processed independently, and the resulting feature vector representing the original image is stored for later recognition; when a new tongue ultrasound image needs feature processing, no other images have to be considered, so the computation is small and fast. The specific steps are as follows:

for ultrasound image A of size M NmnThe two-dimensional discrete cosine transform coefficient can be calculated by the following formula

Figure BDA0002592632880000071

0≤i≤N-1, 0≤j≤N-1

Wherein

Figure BDA0002592632880000073

The larger DCT coefficients are concentrated in the low-frequency part at the upper-left corner of the matrix, which means that part carries the most information; after the transform, the high-frequency coefficients lie in the lower-right part of the matrix and represent the noise in the tongue ultrasound image. Therefore, when selecting DCT features, the values are taken in zig-zag order starting from the upper-left corner of the coefficient matrix.
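A minimal SciPy sketch of this DCT feature extraction with zig-zag coefficient selection follows; the number of retained coefficients is an assumption.

```python
# A minimal sketch of DCT feature extraction with zig-zag selection.
import numpy as np
from scipy.fft import dctn

def zigzag_indices(rows: int, cols: int):
    """Matrix indices in zig-zag order starting from the top-left corner."""
    return sorted(((i, j) for i in range(rows) for j in range(cols)),
                  key=lambda p: (p[0] + p[1],
                                 p[1] if (p[0] + p[1]) % 2 == 0 else -p[1]))

def dct_tongue_features(frame: np.ndarray, n_coeffs: int = 64) -> np.ndarray:
    """frame: one preprocessed ultrasound image; returns low-frequency DCT features."""
    B = dctn(frame.astype(np.float64), norm="ortho")   # 2-D type-II DCT
    return np.array([B[i, j] for i, j in zigzag_indices(*frame.shape)[:n_coeffs]])
```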

Through this step, the processed feature data of the facial EMG signals, the lip optical images and the oral ultrasound images are obtained.

Step S103: input the processed feature data into the pre-trained silent speech recognition model to obtain the silent speech uttered when speaking silently.

Before the processed feature data are input into the pre-trained silent speech recognition model, the method further comprises training the silent speech recognition model by a deep learning method according to the processed feature data. The silent speech recognition model may be trained with a convolutional neural network algorithm in deep learning, or with an LSTM (Long Short-Term Memory) algorithm in deep learning.

In one possible implementation, the silent speech recognition model is trained with the LSTM algorithm. An LSTM is a temporal recurrent network suited to processing and predicting important events with relatively long intervals and delays in a time series. It was proposed to solve the "vanishing gradient" problem of recurrent neural networks and is a special kind of recurrent neural network whose design explicitly avoids the long-term dependence problem. Its carefully designed "gate" structures, comprising an input gate, a forget gate and an output gate, are a mechanism for letting information through selectively: each consists of a sigmoid neural network layer and a pointwise multiplication, and can remove information from or add information to the cell state, which is what allows an LSTM to remember long-term information. The specific procedure is as follows:

In an LSTM, the first stage is the forget gate, which decides what information should be discarded from the cell state; the next stage is the input gate, which decides what new information will be stored in the cell state; and the last stage is the output gate, which decides what value to output.

(1) Forget gate: the forget gate takes the previous step's output $h_{t-1}$ and this step's input sequence data $x_t$ as input, and applies the sigmoid activation function to obtain the output $f_t$. The value of $f_t$ lies in the interval $[0, 1]$ and indicates the probability that the previous cell state is forgotten, where 1 means "completely retain" and 0 means "completely discard":

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

(2) Input gate: the input gate has two parts. The first part uses a sigmoid activation function and outputs $i_t$; the second part uses a tanh activation function and outputs the candidate state $\tilde{C}_t$:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

So far, $f_t$, the output of the forget gate, controls the degree to which the previous cell state $C_{t-1}$ is forgotten, and the product of the two input-gate outputs, $i_t * \tilde{C}_t$, indicates how much new information is retained. On this basis, the new information can be written into this step's cell state $C_t$:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

(3) Output gate: the output gate controls how much of this step's cell state is filtered. First, a sigmoid activation function produces $o_t$ with values in the interval $[0, 1]$; then $C_t$ is passed through a tanh activation function and multiplied by $o_t$, which gives this step's output $h_t$:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t * \tanh(C_t)$$
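A minimal NumPy sketch of one LSTM time step implementing the three gate equations above (without the peephole connections introduced below) may make the update concrete; the weight shapes and names are placeholders.

```python
# One LSTM step over the gate equations above; W maps [h_prev, x_t] to gate pre-activations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate, sigmoid part
    C_hat = np.tanh(W["c"] @ z + b["c"])    # candidate cell state, tanh part
    C_t = f_t * C_prev + i_t * C_hat        # cell-state update
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(C_t)                # this step's output
    return h_t, C_t
```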

Forward propagation in an LSTM is computed once in time order, and backward propagation accumulates the residuals passed back from the final time step. In the formulas below, $w_{ij}$ denotes the connection weight from neuron $i$ to neuron $j$; the input of a neuron is denoted $a$ and its output $b$; the subscripts $\iota$, $\phi$ and $\omega$ denote the input gate, forget gate and output gate, respectively; the subscript $c$ denotes a cell, $C$ is the number of memory cells, and the peephole weights from the cell state to the input gate, forget gate and output gate are written $w_{c\iota}$, $w_{c\phi}$ and $w_{c\omega}$; $s_c$ denotes the state of cell $c$; the activation function of the control gates is denoted $f$, while $g$ and $h$ denote the input and output activation functions of the cell state; $I$ is the number of input-layer neurons, $K$ the number of output-layer neurons and $H$ the number of hidden cell states.

Calculation of forward propagation:

Input gate:

$$a_\iota^t = \sum_{i=1}^{I} w_{i\iota} x_i^t + \sum_{h=1}^{H} w_{h\iota} b_h^{t-1} + \sum_{c=1}^{C} w_{c\iota} s_c^{t-1},\qquad b_\iota^t = f(a_\iota^t)$$

Forget gate:

$$a_\phi^t = \sum_{i=1}^{I} w_{i\phi} x_i^t + \sum_{h=1}^{H} w_{h\phi} b_h^{t-1} + \sum_{c=1}^{C} w_{c\phi} s_c^{t-1},\qquad b_\phi^t = f(a_\phi^t)$$

Cell:

$$a_c^t = \sum_{i=1}^{I} w_{ic} x_i^t + \sum_{h=1}^{H} w_{hc} b_h^{t-1},\qquad s_c^t = b_\phi^t s_c^{t-1} + b_\iota^t\, g(a_c^t)$$

Output gate:

$$a_\omega^t = \sum_{i=1}^{I} w_{i\omega} x_i^t + \sum_{h=1}^{H} w_{h\omega} b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega} s_c^{t},\qquad b_\omega^t = f(a_\omega^t)$$

Cell output:

$$b_c^t = b_\omega^t\, h(s_c^t)$$

Backward propagation of the error updates, with $\epsilon_c^t = \partial L/\partial b_c^t$ and $\epsilon_s^t = \partial L/\partial s_c^t$, and $\delta$ denoting the derivative of the loss with respect to a unit's input $a$:

Cell outputs:

$$\epsilon_c^t = \sum_{k=1}^{K} w_{ck}\,\delta_k^t + \sum_{g=1}^{G} w_{cg}\,\delta_g^{t+1}$$

(the second sum runs over the $G$ hidden-layer units that receive input from the cell outputs).

Output gate:

$$\delta_\omega^t = f'(a_\omega^t)\sum_{c=1}^{C} h(s_c^t)\,\epsilon_c^t$$

State:

$$\epsilon_s^t = b_\omega^t\, h'(s_c^t)\,\epsilon_c^t + b_\phi^{t+1}\epsilon_s^{t+1} + w_{c\iota}\,\delta_\iota^{t+1} + w_{c\phi}\,\delta_\phi^{t+1} + w_{c\omega}\,\delta_\omega^t$$

Cell:

$$\delta_c^t = b_\iota^t\, g'(a_c^t)\,\epsilon_s^t$$

Forget gate:

$$\delta_\phi^t = f'(a_\phi^t)\sum_{c=1}^{C} s_c^{t-1}\,\epsilon_s^t$$

Input gate:

$$\delta_\iota^t = f'(a_\iota^t)\sum_{c=1}^{C} g(a_c^t)\,\epsilon_s^t$$

the method comprises the steps of training a silent speech recognition model through an LSTM algorithm to obtain a trained silent speech recognition model, inputting processed feature data into the trained silent speech recognition model, and obtaining silent speech when speaking in a silence mode.

Step S104: transmit the silent speech to the voice receiving device.

Specifically, once the silent speech has been recognized, it can be sent to the voice receiving device through a wireless communication unit, realizing silent communication. In one possible implementation, the recognized silent speech is sent to the voice receiving device over WiFi; in another, it is sent over Bluetooth.

Optionally, the voice receiving device can instead be connected by wire, and the recognized silent speech is sent to the voice receiving device over that connection.
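As a concrete illustration of the transmission step, a minimal sketch of pushing a recognized utterance to a voice receiving device over a TCP socket follows; the host, port and UTF-8 text payload are assumptions, since the text does not specify a transport protocol or message format.

```python
# A minimal sketch of sending a recognized utterance over Wi-Fi via TCP.
import socket

def send_recognized_text(text: str, host: str = "192.168.0.10", port: int = 5005):
    # Host and port are illustrative placeholders for the voice receiving device.
    with socket.create_connection((host, port), timeout=5.0) as sock:
        sock.sendall(text.encode("utf-8"))
```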

Through steps S101 to S104, three feature signals commonly used in silent speech recognition technology are fused, namely facial EMG signals, lip optical images and oral ultrasound images; the three signals are preprocessed, their features are extracted, and a recognition model is trained on them, so silent speech recognition results with higher accuracy and faster recognition speed can be obtained.

To make the silent communication method provided in the embodiments of the present application easier to understand, it is described below with reference to Fig. 2. As shown in Fig. 2, the silent communication method includes:

step S201, detecting whether each module is normally started, when each module is normally started, executing step S203, detecting whether the communication device is normal, and when each module is not normally started, executing step S202, and prompting the user to start the corresponding module.

Step S202, prompting the user to start the corresponding module, then returning to step S201, and detecting whether each module is normally started until the equipment is normally started.

Step S203, detecting whether the communication equipment is normal, when the communication equipment is normal, executing step S205, detecting whether the user starts speaking, and when the communication equipment is abnormal, executing step S204, and prompting the user that the communication environment is abnormal.

Step S204, prompting the user that the communication environment is abnormal, then returning to step S203, and detecting whether the communication equipment is normal or not until the equipment communication is normal.

Step S205, detecting whether the user starts speaking, and executing step S206 when the user starts speaking, and synchronously acquiring facial myoelectric signals, lip optical images and oral ultrasonic images; when the user does not start speaking, step S212 is executed, in a standby state.

Step S206: synchronously acquire the facial EMG signals, lip optical images and oral ultrasound images.

Step S207: preprocess the facial EMG signals, lip optical images and oral ultrasound images and perform feature extraction to obtain processed feature data.

Step S208: train the silent speech recognition model by a deep learning method according to the processed data.

Step S209: input the processed feature data into the pre-trained silent speech recognition model to obtain the silent speech uttered in silent mode.

Step S210: transmit the silent speech to the voice receiving device.

Step S211: detect whether each module has been closed. If every module has been closed, end the communication; if not, execute step S212 and stand by.

Step S212: standby state; then return to step S205 and detect whether the user has started speaking.

Through steps S201 to S212, three feature signals commonly used in silent speech recognition technology are fused, namely facial EMG signals, lip optical images and oral ultrasound images; the three signals are preprocessed, their features are extracted, and a recognition model is trained on them, so silent speech recognition results with higher accuracy and faster recognition speed can be obtained.

To make the silent communication method provided in the embodiments of the present application easier to understand, it is described below with reference to Fig. 3. As shown in Fig. 3, the silent communication method has a training mode and an application mode. In the training mode, the facial EMG signals, lip optical images and oral ultrasound images are first acquired synchronously and then preprocessed; MFCCs (Mel-Frequency Cepstral Coefficients) are extracted from the preprocessed facial EMG signals, PCA (Principal Component Analysis) is applied to the preprocessed lip optical images, and the DCT (Discrete Cosine Transform) is applied to the preprocessed oral ultrasound images, yielding the motion feature data of the three modalities; the silent speech recognition model is trained on these motion feature data, and the trained model is placed into the recognition network used in the application mode.

In the application mode, the facial EMG signals, lip optical images and oral ultrasound images are likewise acquired synchronously and preprocessed; MFCCs are extracted from the preprocessed facial EMG signals, PCA is applied to the preprocessed lip optical images, and the DCT is applied to the preprocessed oral ultrasound images to obtain the motion feature data of the three modalities; these data are input into the recognition network model, i.e. the silent speech recognition model trained in the training mode, speech recognition is performed, and the recognition result is output and sent to the interactive communication module.
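Tying the application mode together, the following sketch reuses the illustrative helpers from the earlier sketches in this description (none of them prescribed by the patent) to run one utterance through the pipeline: preprocess each modality, extract MFCC/PCA/DCT features, fuse by per-frame concatenation and recognize.

```python
# End-to-end application-mode sketch; reuses preprocess_emg, emg_mfcc_features,
# dct_tongue_features and the SilentSpeechLSTM model from the earlier sketches.
import numpy as np
import torch

def recognize_utterance(emg, lip_frames, us_frames, pca_mean, pca_basis, model):
    """emg: (6, samples); lip_frames: (T, 150, 200); us_frames: (T, H, W)."""
    emg_f = emg_mfcc_features(preprocess_emg(emg)[0])             # (39, frames)
    lip_f = np.stack([(f.ravel() - pca_mean) @ pca_basis          # (T, 32)
                      for f in lip_frames])
    us_f = np.stack([dct_tongue_features(f) for f in us_frames])  # (T, 64)
    T = min(emg_f.shape[1], len(lip_f), len(us_f))                # align frame counts
    fused = np.concatenate([emg_f[:, :T].T, lip_f[:T], us_f[:T]], axis=1)
    with torch.no_grad():                                         # inference only
        logits = model(torch.tensor(fused, dtype=torch.float32).unsqueeze(0))
    return logits.argmax(dim=-1).item()                           # recognized class index
```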

In a second aspect, an embodiment of the present disclosure provides a silent communication apparatus which, as shown in Fig. 6, includes:

a data acquisition module 601 for synchronously acquiring facial EMG signals, lip optical images and oral ultrasound images;

a data processing module 602 for preprocessing the facial EMG signals, lip optical images and oral ultrasound images and extracting features to obtain processed feature data;

a recognition module 603 for inputting the processed feature data into a pre-trained silent speech recognition model to obtain the silent speech uttered in silent mode;

and a communication module 604 for transmitting the silent speech to the voice receiving device.

Further, the apparatus also includes:

a model training module for training the silent speech recognition model by a deep learning method according to the processed feature data.

To make the silent communication device provided in the embodiments of the present application easier to understand, it is described below with reference to Fig. 7. As shown in Fig. 7, the silent communication device includes:

the data acquisition module comprises an electromyographic signal acquisition unit for acquiring facial electromyographic signals of a speaker, a lip optical image acquisition unit for acquiring lip optical images of the speaker, and an oral cavity ultrasonic image acquisition unit for acquiring oral cavity ultrasonic images.

a signal processing module, comprising: a preprocessing unit for preprocessing the facial EMG signals, lip optical images and oral ultrasound images, namely filtering, denoising, active-segment extraction, data normalization and baseline removal of the facial EMG signals to obtain preprocessed facial EMG signals, grayscale conversion, target-region cropping and image compression of the lip optical images to obtain preprocessed lip optical images, and data-smoothing denoising and target-region cropping of the oral ultrasound images to obtain preprocessed oral ultrasound images; a feature extraction and fusion unit for extracting Mel-frequency cepstral coefficients from the preprocessed facial EMG signals to obtain dynamic feature data of facial muscle movement, extracting lip motion feature data from the preprocessed lip optical images by principal component analysis, and extracting motion feature data of the oral cavity and tongue from the preprocessed oral ultrasound images by discrete cosine transform; a training unit for training the silent speech recognition model by a deep learning method according to the processed feature data; and a recognition unit for inputting the processed feature data into the pre-trained silent speech recognition model to obtain the silent speech uttered in silent mode;

and a communication interaction module for transmitting the silent speech to the voice receiving device.

It should be noted that when the silent communication apparatus provided in the foregoing embodiment performs the silent communication method, the division into the above functional modules is only an example; in practical applications, the functions may be assigned to different functional modules as needed, i.e. the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the silent communication apparatus and the silent communication method provided by the above embodiments belong to the same concept; the detailed implementation process is given in the method embodiments and is not repeated here.

In a third aspect, an embodiment of the present disclosure further provides an electronic device corresponding to the silent communication method provided in the foregoing embodiments, configured to execute the silent communication method.

Referring to Fig. 8, a schematic diagram of an electronic device provided in some embodiments of the present application is shown. As shown in Fig. 8, the electronic device includes: a processor 800, a memory 801, a bus 802 and a communication interface 803, with the processor 800, the communication interface 803 and the memory 801 connected by the bus 802; the memory 801 stores a computer program executable on the processor 800, and when the processor 800 runs this computer program it performs the silent communication method provided in any of the embodiments of the present application.

The memory 801 may include high-speed Random Access Memory (RAM) and may also include non-volatile memory, such as at least one disk memory. The communication connection between this system's network element and at least one other network element is realized through at least one communication interface 803 (wired or wireless), using the Internet, a wide area network, a local area network, a metropolitan area network, etc.

The bus 802 can be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 801 is used to store a program, and the processor 800 executes the program after receiving an execution instruction; the silent communication method disclosed in any embodiment of the present application may be applied to the processor 800 or implemented by the processor 800.

The processor 800 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 800 or by instructions in the form of software. The processor 800 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or executed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on. The steps of the method disclosed in the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory 801; the processor 800 reads the information in the memory 801 and completes the steps of the method in combination with its hardware.

The electronic device provided by the embodiments of the present application and the silent communication method provided by the embodiments of the present application arise from the same inventive concept, and the device has the same beneficial effects as the method it adopts, runs or implements.

In a fourth aspect, the present application further provides a computer-readable storage medium corresponding to the silent communication method provided in any of the foregoing embodiments. Referring to Fig. 9, the computer-readable storage medium is illustrated as an optical disc 900 on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the silent communication method provided in any of the foregoing embodiments.

It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.

The computer-readable storage medium provided by the above embodiments of the present application and the silent communication method provided by the embodiments of the present application arise from the same inventive concept, and the medium has the same beneficial effects as the method adopted, run or implemented by the application program stored on it.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
