Prosodic phrase identification method and device and electronic equipment

Document No.: 1230198    Publication date: 2020-09-08

Note: This technique, a prosodic phrase recognition method, apparatus and electronic device (韵律短语识别方法、装置及电子设备), was designed and created by 高岩, 贾晓丰, 张晰, 王大亮, 赵聃 and 齐红威 on 2020-05-29. The main content is as follows: the application discloses a prosodic phrase recognition method, apparatus and electronic device. Target data to be recognized are obtained, where the target data contain at least text data and audio data corresponding to the text data, and the text data contain at least one sentence; text feature encodings corresponding to the text data and acoustic feature encodings corresponding to the audio data are obtained; the text feature encodings and the acoustic feature encodings are processed to obtain multi-modal features in which text and audio are aligned; the multi-modal features are input into a prosody recognition model trained in advance to obtain a prosodic phrase sequence output by the prosody recognition model, where the prosodic phrase sequence contains a plurality of prosodic phrases separated at least by prosodic symbols; the prosody recognition model is obtained by training with at least two sentence samples carrying prosodic phrase labels and the audio samples corresponding to those sentence samples.

1. A prosodic phrase identification method, the method comprising:

obtaining target data to be identified, wherein the target data at least comprises text data and audio data corresponding to the text data, and the text data comprises at least one sentence;

acquiring text feature codes corresponding to the text data and acoustic feature codes corresponding to the audio data;

processing the text feature encoding and the acoustic feature encoding to obtain multi-modal features regarding text and audio alignment;

inputting the multi-modal characteristics to a prosody recognition model which is trained in advance to obtain a prosody phrase sequence output by the prosody recognition model, wherein the prosody phrase sequence comprises a plurality of prosody phrases, and the prosody phrases are divided at least by prosody symbols;

the prosody recognition model is obtained by training at least two sentence samples with prosody phrase labels and audio samples corresponding to the sentence samples.

2. The method of claim 1, wherein the prosodic recognition model is trained by:

obtaining multi-modal feature samples of the sentence samples and the corresponding audio samples;

inputting the multi-modal characteristic samples into a prosody recognition model which is initially established to obtain an output result of the prosody recognition model;

comparing the prosodic phrase sequence in the output result with the prosodic phrase tag of the sentence sample to obtain a comparison result;

and adjusting the model parameters of the prosody recognition model according to the comparison result.

3. The method of claim 2, obtaining multimodal feature samples of the sentence samples and their corresponding audio samples, comprising:

obtaining a text feature coding sample corresponding to the sentence sample and an acoustic feature coding sample corresponding to the audio sample;

and processing the text feature coding samples and the acoustic feature coding samples to obtain multi-modal feature samples which are aligned with respect to text and audio.

4. The method of claim 3, wherein obtaining the text feature coding samples corresponding to the sentence samples and the acoustic feature coding samples corresponding to the audio samples comprises:

respectively converting the sentence sample and the audio sample to obtain a text vector sample corresponding to the sentence sample and an acoustic vector sample corresponding to the audio sample;

and respectively carrying out feature coding on the text vector sample and the acoustic vector sample to obtain a text feature coding sample corresponding to the text vector sample and an acoustic feature coding sample corresponding to the acoustic vector sample.

5. The method of claim 3, processing the text feature encoding samples and the acoustic feature encoding samples to obtain multi-modal feature samples with respect to text and audio alignment, comprising:

aligning the text feature coding samples and the acoustic feature coding samples with respect to text and audio using an attention mechanism to obtain aligned feature samples;

and converting the vector characteristics of the alignment characteristic sample to obtain a multi-modal characteristic sample.

6. The method of claim 1, wherein processing the text feature encodings and the acoustic feature encodings to derive multimodal features with respect to text and audio alignment comprises:

aligning the text feature encoding and the acoustic feature encoding with respect to text and audio using an attention mechanism to obtain an alignment feature;

and converting the vector characteristics of the alignment characteristics to obtain multi-modal characteristics.

7. The method of claim 1, wherein obtaining the text feature code corresponding to the text data and the acoustic feature code corresponding to the audio data comprises:

respectively converting the text data and the audio data to obtain a text vector corresponding to the text data and an acoustic vector corresponding to the audio data;

and respectively carrying out feature coding on the text vector and the acoustic vector to obtain a text feature code corresponding to the text vector and an acoustic feature code corresponding to the acoustic vector.

8. A prosodic phrase recognition apparatus, the apparatus comprising:

the data acquisition unit is used for acquiring target data to be identified, wherein the target data at least comprises text data and audio data corresponding to the text data, and the text data comprises at least one sentence;

a feature code obtaining unit, configured to obtain a text feature code corresponding to the text data and an acoustic feature code corresponding to the audio data;

a multi-modal feature obtaining unit, configured to process the text feature encoding and the acoustic feature encoding to obtain multi-modal features regarding alignment of text and audio;

the model operation unit is used for inputting the multi-modal characteristics to a prosody recognition model which is trained in advance so as to obtain a prosodic phrase sequence output by the prosody recognition model, wherein the prosodic phrase sequence comprises a plurality of prosodic phrases, and the prosodic phrases are divided at least by prosodic symbols;

the prosody recognition model is obtained by training at least two sentence samples with prosody phrase labels and audio samples corresponding to the sentence samples.

9. The apparatus of claim 8, further comprising:

the model training unit is used for obtaining multi-modal feature samples of the sentence samples and the corresponding audio samples; inputting the multi-modal feature samples corresponding to the sentence samples into an initially created prosody recognition model to obtain an output result of the prosody recognition model; comparing the prosodic phrase sequence in the output result with the prosodic phrase tag of the sentence sample to obtain a comparison result; and adjusting the model parameters of the prosody recognition model according to the comparison result.

10. An electronic device, comprising:

a memory for storing an application program and data generated by the application program running;

a processor for executing the application to implement: obtaining target data to be identified, wherein the target data at least comprises text data and audio data corresponding to the text data, and the text data comprises at least one sentence; acquiring text feature codes corresponding to the text data and acoustic feature codes corresponding to the audio data; processing the text feature encoding and the acoustic feature encoding to obtain multi-modal features regarding text and audio alignment; inputting the multi-modal characteristics to a prosody recognition model which is trained in advance to obtain a prosody phrase sequence output by the prosody recognition model, wherein the prosody phrase sequence comprises a plurality of prosody phrases, and the prosody phrases are divided at least by prosody symbols; the prosody recognition model is obtained by training at least two sentence samples with prosody phrase labels and audio samples corresponding to the sentence samples.

Technical Field

The present application relates to the field of text recognition technologies, and in particular, to a prosodic phrase recognition method, an apparatus, and an electronic device.

Background

Prosody is an important element of language interaction, and is a concept that combines hearing and perception. A prosodic phrase reflects the fact that certain words in spoken language are naturally grouped together, while other words are clearly spaced or separated from each other. Prosodic phrase recognition refers to determining whether a prosodic boundary exists after a given word. For example, after performing prosodic phrase recognition on "little pool spring water immerses the light clouds", we obtain "little pool #1 spring water #1 immerses the light clouds #4", where "little pool", "spring water" and "immerses the light clouds" are the recognized prosodic phrases, separated by the symbol "#", and a number indicating the pause level is added after "#".

In the existing scheme for recognizing prosodic phrases, a prosody recognition model constructed in advance is usually trained with sentences carrying manually labeled prosodic tags, and the trained prosody recognition model can then recognize prosodic phrases in sentences whose prosody is unknown.

However, the above implementations rely on text-only prosody labeling, so the model training samples come from a single modality, and the finally recognized prosodic phrases may therefore be inaccurate.

Disclosure of Invention

In view of the above, the present application provides a prosodic phrase recognition method, apparatus and electronic device, including:

a prosodic phrase recognition method, the method comprising:

obtaining target data to be identified, wherein the target data at least comprises text data and audio data corresponding to the text data, and the text data comprises at least one sentence;

acquiring text feature codes corresponding to the text data and acoustic feature codes corresponding to the audio data;

processing the text feature encoding and the acoustic feature encoding to obtain multi-modal features regarding text and audio alignment;

inputting the multi-modal characteristics to a prosody recognition model which is trained in advance to obtain a prosody phrase sequence output by the prosody recognition model, wherein the prosody phrase sequence comprises a plurality of prosody phrases, and the prosody phrases are divided at least by prosody symbols;

the prosody recognition model is obtained by training at least two sentence samples with prosody phrase labels and audio samples corresponding to the sentence samples.

In the above method, preferably, the prosody recognition model is obtained by training in the following manner:

obtaining multi-modal feature samples of the sentence samples and the corresponding audio samples;

inputting the multi-modal characteristic samples into a prosody recognition model which is initially established to obtain an output result of the prosody recognition model;

comparing the prosodic phrase sequence in the output result with the prosodic phrase tag of the sentence sample to obtain a comparison result;

and adjusting the model parameters of the prosody recognition model according to the comparison result.

The above method, preferably, obtaining a multi-modal feature sample of the sentence sample and its corresponding audio sample, includes:

obtaining a text feature coding sample corresponding to the sentence sample and an acoustic feature coding sample corresponding to the audio sample;

and processing the text feature coding samples and the acoustic feature coding samples to obtain multi-modal feature samples which are aligned with respect to text and audio.

Preferably, the obtaining of the text feature coding sample corresponding to the sentence sample and the acoustic feature coding sample corresponding to the audio sample includes:

respectively converting the sentence sample and the audio sample to obtain a text vector sample corresponding to the sentence sample and an acoustic vector sample corresponding to the audio sample;

and respectively carrying out feature coding on the text vector sample and the acoustic vector sample to obtain a text feature coding sample corresponding to the text vector sample and an acoustic feature coding sample corresponding to the acoustic vector sample.

The method preferably processes the text feature coding samples and the acoustic feature coding samples to obtain multi-modal feature samples with respect to text and audio alignment, and includes:

aligning the text feature coding samples and the acoustic feature coding samples with respect to text and audio using an attention mechanism to obtain aligned feature samples;

and converting the vector characteristics of the alignment characteristic sample to obtain a multi-modal characteristic sample.

The method preferably processes the text feature encoding and the acoustic feature encoding to obtain multi-modal features regarding text and audio alignment, and includes:

aligning the text feature encoding and the acoustic feature encoding with respect to text and audio using an attention mechanism to obtain an alignment feature;

and converting the vector characteristics of the alignment characteristics to obtain multi-modal characteristics.

Preferably, the method for obtaining the text feature code corresponding to the text data and the acoustic feature code corresponding to the audio data includes:

respectively converting the text data and the audio data to obtain a text vector corresponding to the text data and an acoustic vector corresponding to the audio data;

and respectively carrying out feature coding on the text vector and the acoustic vector to obtain a text feature code corresponding to the text vector and an acoustic feature code corresponding to the acoustic vector.

A prosodic phrase recognition apparatus, the apparatus comprising:

the data acquisition unit is used for acquiring target data to be identified, wherein the target data at least comprises text data and audio data corresponding to the text data, and the text data comprises at least one sentence;

a feature code obtaining unit, configured to obtain a text feature code corresponding to the text data and an acoustic feature code corresponding to the audio data;

a multi-modal feature obtaining unit, configured to process the text feature encoding and the acoustic feature encoding to obtain multi-modal features regarding alignment of text and audio;

the model operation unit is used for inputting the multi-modal characteristics to a prosody recognition model which is trained in advance so as to obtain a prosodic phrase sequence output by the prosody recognition model, wherein the prosodic phrase sequence comprises a plurality of prosodic phrases, and the prosodic phrases are divided at least by prosodic symbols;

the prosody recognition model is obtained by training at least two sentence samples with prosody phrase labels and audio samples corresponding to the sentence samples.

The above apparatus, preferably, further comprises:

the model training unit is used for obtaining multi-modal feature samples of the sentence samples and the corresponding audio samples; inputting the multi-modal feature samples corresponding to the sentence samples into an initially created prosody recognition model to obtain an output result of the prosody recognition model; comparing the prosodic phrase sequence in the output result with the prosodic phrase tag of the sentence sample to obtain a comparison result; and adjusting the model parameters of the prosody recognition model according to the comparison result.

An electronic device, the electronic device comprising:

a memory for storing an application program and data generated by the application program running;

a processor for executing the application to implement: obtaining target data to be identified, wherein the target data at least comprises text data and audio data corresponding to the text data, and the text data comprises at least one sentence; acquiring text feature codes corresponding to the text data and acoustic feature codes corresponding to the audio data; processing the text feature encoding and the acoustic feature encoding to obtain multi-modal features regarding text and audio alignment; inputting the multi-modal characteristics to a prosody recognition model which is trained in advance to obtain a prosody phrase sequence output by the prosody recognition model, wherein the prosody phrase sequence comprises a plurality of prosody phrases, and the prosody phrases are divided at least by prosody symbols; the prosody recognition model is obtained by training at least two sentence samples with prosody phrase labels and audio samples corresponding to the sentence samples.

According to the above technical scheme, in the prosodic phrase recognition method, apparatus and electronic device disclosed in the present application, when prosodic phrase recognition is required, not only the text data but also the audio data corresponding to the text data are obtained. Multi-modal features in which text and audio are aligned are then obtained from the text feature encodings corresponding to the text data and the acoustic feature encodings corresponding to the audio data, and these multi-modal features, rather than the text features alone, are used as the input of a prosody recognition model trained jointly on sentence samples and audio samples, so that the prosody recognition model processes the multi-modal features and outputs a corresponding prosodic phrase sequence that contains a plurality of prosodic phrases separated by prosodic symbols. Because the audio samples corresponding to the sentence samples are added to the training of the prosody recognition model together with the sentence samples, the training samples of the prosody recognition model are enriched, and the audio corresponding to the text is closer to prosodic pronunciation in a real environment. The trained prosody recognition model can therefore process the multi-modal features corresponding to the text data and the corresponding audio data and output a more accurate prosodic phrase sequence, avoiding the low recognition accuracy caused by performing prosody recognition on text alone and achieving the purpose of improving the accuracy of prosodic phrase recognition.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

Fig. 1 is a flowchart of a prosodic phrase recognition method according to an embodiment of the present application;

FIG. 2 is a partial flowchart of a first embodiment of the present application;

FIG. 3 is a partial flowchart of a first embodiment of the present application;

fig. 4 is a schematic structural diagram of a prosodic phrase recognition apparatus according to a second embodiment of the present application;

FIG. 5 is a schematic structural diagram of another prosodic phrase recognition apparatus according to the second embodiment of the present application;

FIG. 6 is a schematic diagram illustrating a partial structure of another prosodic phrase recognition apparatus according to the second embodiment of the present application;

fig. 7 is a schematic structural diagram of an electronic device according to a third embodiment of the present application;

FIG. 8 is a block diagram of a prosodic phrase recognition scheme provided in an embodiment of the present application;

fig. 9 is a schematic diagram of an acoustic feature extraction provided in an embodiment of the present application;

FIG. 10 is a schematic diagram illustrating conditional random field tag probability prediction according to an embodiment of the present application;

FIG. 11 is a diagram of an example model inference provided by an embodiment of the application;

fig. 12 is an example of the embodiment of the present application.

Detailed Description

At present, there is an implementation scheme for prosody label prediction based on artificial intelligence, which obtains the text features and pronunciation duration of each word in a text sequence from a sample audio file and the corresponding text sequence, and labels the text sequence with a pre-trained prosodic phrase recognition model. Prosodic phrases are the intermediate rhythmic units between prosodic words and intonation phrases.

The inventors of the present application find, through research, that in the above scheme the boundary points of prosodic phrases are mainly predicted through machine learning and deep learning, or prosodic phrase recognition is realized through model fusion; however, because only texts are used as training samples, the prosodic phrase recognition model recognizes prosodic phrases from text features alone, and the recognition may be inaccurate.

In view of the above, through further research, the inventors of the present application propose a technical solution capable of performing prosodic phrase recognition in combination with a text and an audio corresponding to the text, which is specifically as follows:

firstly, target data to be identified is obtained, wherein the target data at least comprises text data and audio data corresponding to the text data, and the text data comprises at least one sentence; then, obtaining a text feature code corresponding to the text data and an acoustic feature code corresponding to the audio data; based on the above, after the text feature coding and the acoustic feature coding are processed to obtain multi-modal features related to alignment of text and audio, inputting the multi-modal features to a prosody recognition model trained in advance to obtain a prosody phrase sequence output by the prosody recognition model, wherein the prosody phrase sequence comprises a plurality of prosody phrases, and the prosody phrases are at least divided by prosody symbols; the prosody recognition model is obtained by training at least two sentence samples with prosody phrase labels and audio samples corresponding to the sentence samples.

In this way, the audio samples corresponding to the sentence samples are added to the training of the prosody recognition model together with the sentence samples, which enriches the training samples of the prosody recognition model, and the audio corresponding to the text is closer to prosodic pronunciation in a real environment. The trained prosody recognition model can therefore process the multi-modal features corresponding to the text data and the corresponding audio data and output a more accurate prosodic phrase sequence, avoiding the low recognition accuracy caused by performing prosody recognition on text alone and achieving the purpose of improving the accuracy of prosodic phrase recognition.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Referring to fig. 1, a flowchart of an implementation of a prosodic phrase recognition method provided in an embodiment of the present application is shown, where the method may be applied to an electronic device, such as a computer or a server, capable of performing data processing, particularly feature fusion processing. The technical scheme in the embodiment is mainly used for processing multi-modal characteristics based on texts and audios during prosodic phrase recognition to obtain corresponding prosodic phrase sequences which comprise a plurality of prosodic phrases and are divided by prosodic symbols, so that the problem of low recognition accuracy caused by prosodic recognition by only depending on texts is avoided.

In a specific implementation, the method in this embodiment may include the following steps:

step 101: target data to be identified is obtained.

The target data at least include text data and audio data corresponding to the text data. For example, the target data include the text data "The weather is really nice today, shall we all go out and play badminton? Where shall we play?" together with the audio data corresponding to this text data.

In one implementation, in this embodiment a segment of audio/video data requiring prosody recognition may be obtained first, the audio data is then extracted from it, and finally speech recognition is performed on the audio data based on automatic speech recognition (ASR) technology to obtain the corresponding text data, so as to obtain target data containing the text data and the audio data corresponding to the text data;

or, in this embodiment, text data, such as an article composed of a plurality of sentences, may be obtained first, and an audio generation tool is then used to generate the audio data corresponding to the text data or to the sentences in the text data, so as to obtain target data containing the text data and the audio data corresponding to the text data;

that is, in this embodiment, before prosodic phrase recognition is performed, if the acquired target data simultaneously contain audio data, video data and text data, the subsequent prosodic phrase recognition process can be carried out directly on the audio data and the text data; if the obtained target data only contain audio/video data, the audio data is extracted from the audio/video data, the audio data is converted into text data through an ASR system, and the subsequent prosodic phrase recognition process is then carried out on the audio data and the text data; if the obtained target data only contain text data, the audio data corresponding to the text data can be obtained through manual reading or an audio generation tool, and the subsequent prosodic phrase recognition process is then carried out on the audio data and the text data.

The text data include at least one sentence. For example, the text data "The weather is really nice today, shall we all go out and play badminton? Where shall we play?" contain the sentences "The weather is really nice today", "Shall we all go out and play badminton?" and "Where shall we play?". Correspondingly, the audio data include an audio segment corresponding to each sentence.

In a specific implementation, in this embodiment the text data may be acquired by the text data input unit, and the audio file or video file may be acquired by the audio and video input unit. For example, the text data "The weather is really nice today" is acquired by the text data input unit, and the audio and video input unit acquires the audio data file or video file corresponding to the text data "The weather is really nice today".
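A minimal sketch of how the target data described above might be assembled is given below; `recognize_speech` and `synthesize_audio` are hypothetical stand-ins for an ASR system and an audio generation tool and are not named in this application.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TargetData:
    text: str          # text data containing at least one sentence
    audio_path: str    # audio data corresponding to the text data

def recognize_speech(audio_path: str) -> str:   # hypothetical ASR stand-in
    raise NotImplementedError

def synthesize_audio(text: str) -> str:         # hypothetical audio generation stand-in
    raise NotImplementedError

def build_target_data(text: Optional[str], audio_path: Optional[str]) -> TargetData:
    # audio/video only: convert the audio to text through ASR
    if text is None and audio_path is not None:
        text = recognize_speech(audio_path)
    # text only: generate the corresponding audio
    if audio_path is None and text is not None:
        audio_path = synthesize_audio(text)
    if text is None or audio_path is None:
        raise ValueError("need at least text data or audio data")
    return TargetData(text=text, audio_path=audio_path)
```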

Step 102: and acquiring text feature codes corresponding to the text data and acoustic feature codes corresponding to the audio data.

In one implementation, step 102 may encode the text data and the audio data by using a feature encoding algorithm, such as a neural network, to obtain a text feature code corresponding to the text data and an acoustic feature code corresponding to the audio data.

Specifically, step 102 may be implemented by:

firstly, vector conversion is performed on the text data and the audio data respectively to obtain the text vector corresponding to the text data and the acoustic vector corresponding to the audio data, where the text vector contains the sentence vector of each sentence and each sentence vector contains one or more word vectors; similarly, the acoustic vector may contain the segment vector of the acoustic segment corresponding to each sentence, and each segment vector consists of the vectors corresponding to one or more voiced or pronounced segments;

specifically, in this embodiment, the conversion of the sentence samples may be performed through a pre-trained word vector matrix, so as to convert the sentence samples into a vectorization representation that can be understood by a computer. For example, first, a word vector pre-training algorithm is used to perform word vector training on all labeled data of the segmented words, wherein a word vector refers to a real number vector representing each word as K latitude, and similar word groups are mapped to different parts of a vector space. The calculation formula of word vector training is as formula (1):

in the above formula, EWordRepresenting word-embedding matrices, xiRepresenting a one-hot representation of the ith character index,

Figure BDA0002514859430000092

a word vector representing the ith character.
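As an illustration of formula (1), the sketch below looks up a K-dimensional word vector for each character index with a trainable embedding matrix; the vocabulary size, embedding dimension and indices are assumed values, not taken from this application.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 5000, 128                 # assumed values for K and the vocabulary
embedding = nn.Embedding(vocab_size, embed_dim)   # its weight rows play the role of E_Word

x = torch.tensor([12, 407, 33, 980])              # hypothetical character indices of one sentence
e = embedding(x)                                  # e[i] is the word vector of the i-th character
print(e.shape)                                    # torch.Size([4, 128])
```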

Wherein, the calculation formula used for converting the audio data is as formula (2):

v_i = W_FB · e_FB + W_MFCC · e_MFCC + b    formula (2)

In the above formula, W_FB, W_MFCC and b are the parameters to be trained, e_FB is the acoustic vector extracted by the Filter bank feature extraction algorithm, and e_MFCC is the acoustic vector extracted by the MFCC feature extraction algorithm (in this application, to improve accuracy, the acoustic vectors extracted by the different methods are fused). The acoustic vector corresponding to the audio is obtained by multiplying each extracted acoustic feature vector by its trainable weight and summing the results.

In a specific implementation, in this embodiment the text embedding and representation unit converts the text data by a word-vector pre-training algorithm to obtain the corresponding text vector. The acoustic feature extraction unit performs pre-emphasis, framing, windowing and other operations on the audio data to smooth the acoustic signal, then performs a fast Fourier transform on each segment of the signal to obtain a spectrogram, filters the signal with triangular window filters linearly distributed on the Mel frequency scale, and finally takes the logarithm of the triangular filter outputs to generate a Filter bank vector, which is generally 40-dimensional. MFCC feature vectors can additionally be obtained by applying a discrete cosine transform; both can be used as acoustic features, and the acoustic vector is obtained after fusing them.
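The following hedged sketch mirrors the acoustic front end described above under assumed frame parameters: 40-dimensional log-Mel Filter bank features and MFCC features are extracted per frame with librosa and fused with trainable weights in the spirit of formula (2).

```python
import librosa
import torch
import torch.nn as nn

y, sr = librosa.load("sample.wav", sr=16000)       # assumed 16 kHz audio file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
fbank = librosa.power_to_db(mel).T                 # (frames, 40) log-Mel Filter bank features
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=400, hop_length=160, n_mfcc=13).T  # (frames, 13)

w_fb = nn.Linear(40, 64, bias=False)               # plays the role of W_FB
w_mfcc = nn.Linear(13, 64, bias=True)              # plays the role of W_MFCC and b
v = w_fb(torch.from_numpy(fbank).float()) + w_mfcc(torch.from_numpy(mfcc).float())
print(v.shape)                                     # (frames, 64) fused acoustic vectors
```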

And then, respectively carrying out feature coding on the text vector and the acoustic vector to obtain text feature codes corresponding to the text vector and acoustic feature codes corresponding to the acoustic vector.

Text feature encoding refers to the feature encoding obtained after encoding a text vector. Specifically, the text vector generated from the text data may be encoded by a bidirectional long short-term memory neural network; the text vector can be encoded in a number of ways, for example by the following formulas (3) to (5).

h_i^fw = LSTM_fw(e_i, h_(i-1)^fw)    formula (3)

h_i^bw = LSTM_bw(e_i, h_(i+1)^bw)    formula (4)

h_i = [h_i^fw ; h_i^bw]    formula (5)

In the above formulas, e_i represents an individual word vector in the text vector; h_i^fw is the forward hidden-layer state, obtained by encoding with the forward long short-term memory network; h_i^bw is the backward hidden-layer state, obtained by encoding with the backward long short-term memory network; and h_i represents the text feature encoding produced by the neural network, i.e. the concatenation of the two vectors. The value range of i is 1, 2, ..., N, where N is the number of words in the sentence.

Specifically, the acoustic feature encoding may be obtained by encoding the acoustic vector generated from the audio data with a bidirectional long short-term memory neural network; the acoustic vector can be encoded in a number of ways, for example the Mel-frequency cepstral coefficient (MFCC) vector may be encoded by the following formulas (6) to (8).

s_i^fw = LSTM_fw(v_i, s_(i-1)^fw)    formula (6)

s_i^bw = LSTM_bw(v_i, s_(i+1)^bw)    formula (7)

s_i = [s_i^fw ; s_i^bw]    formula (8)

In the above formulas, v_i denotes the acoustic vector generated from the audio data; s_i^fw is the forward hidden-layer state, obtained by encoding with the forward long short-term memory network; s_i^bw is the backward hidden-layer state, obtained by encoding with the backward long short-term memory network; and s_i represents the acoustic feature encoding produced by the neural network, i.e. the concatenation of the two vectors. The value range of i is 1, 2, ....

In a specific implementation, in this embodiment the text feature encoding unit performs an encoding operation on the text vector (also called the text feature) generated from the text data by using a bidirectional long short-term memory neural network, converting the text feature into a vectorized representation to obtain the text feature encoding h_i; the acoustic feature encoding unit uses a bidirectional long short-term memory neural network to perform an encoding operation on the acoustic vector (also called the acoustic feature) generated from the audio data, converting the acoustic feature into a vectorized representation to obtain the acoustic feature encoding s_i.
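A minimal sketch of the two bidirectional LSTM encoders corresponding to formulas (3)-(8) is given below; the dimensions and the toy inputs are assumptions.

```python
import torch
import torch.nn as nn

text_encoder = nn.LSTM(input_size=128, hidden_size=64, bidirectional=True, batch_first=True)
audio_encoder = nn.LSTM(input_size=64, hidden_size=64, bidirectional=True, batch_first=True)

e = torch.randn(1, 6, 128)     # text vectors e_i of one 6-token sentence (assumed sizes)
v = torch.randn(1, 200, 64)    # acoustic vectors v_i of 200 frames (assumed sizes)

h, _ = text_encoder(e)         # h[:, i, :] = [h_i^fw ; h_i^bw], shape (1, 6, 128)
s, _ = audio_encoder(v)        # s[:, i, :] = [s_i^fw ; s_i^bw], shape (1, 200, 128)
```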

Step 103: the text feature encodings and the acoustic feature encodings are processed to derive multimodal features with respect to text and audio alignment.

In step 103, the two feature codes can be aligned and fused to obtain a multi-modal feature, where the multi-modal feature is a feature that two feature codes, namely a text feature code and an acoustic feature code, are fused and the two feature codes are aligned with respect to text and audio.

Specifically, step 103 may be implemented by:

first, using an attention mechanism, text feature encoding and acoustic feature encoding are aligned with respect to text and audio to obtain an alignment feature in which the characters of the text and the segments of the audio are aligned accordingly, e.g., the text character "today" and the audio segment "today" are aligned or otherwise have a mapping relationship.

Here, the attention mechanism is a mechanism for rapidly extracting the important features of sparse data. In a specific implementation, in this embodiment the attention-mechanism alignment unit uses an attention mechanism to calculate attention weights between the acoustic feature encoding and the text feature encoding, with the aim of enabling the attention-based learning model to learn alignment weights between the acoustic features and the text features, so that the model can learn the alignment of the two kinds of features at the phonetic level. In this embodiment, the text feature encoding and the acoustic feature encoding are aligned with respect to text and audio through the following formulas (9)-(11) to obtain the alignment feature.

a_(i,j) = tanh(u^T · s_i + v^T · h_j + b)    formula (9)

α_(i,j) = exp(a_(i,j)) / Σ_i exp(a_(i,j))    formula (10)

h̃_j = Σ_i α_(i,j) · s_i    formula (11)

In the above formulas, tanh is the hyperbolic tangent function, t represents the number of iterations of each round of training or processing, e represents a word vector, N represents the number of words in the sentence, s_i represents the acoustic encoding feature, h_j represents the text encoding feature, u^T, v^T and b represent the parameters to be learned, h̃_j represents the alignment vector generated after the attention-mechanism alignment, and α_(i,j) indicates an attention weight with a value in [0, 1], representing the degree of similarity between each word and the corresponding audio.

And finally, converting the vector characteristics of the alignment characteristics to obtain multi-modal characteristics.

In this embodiment, the alignment features may be encoded by using a feature encoding algorithm, such as a neural network, to obtain multi-modal features represented by alignment feature vectorization.

In a specific implementation, in this embodiment feature alignment may be performed on the text feature encodings and the acoustic feature encodings by the attention-mechanism feature alignment unit, so that text characters and acoustic segments are aligned between the two encodings; the alignment feature encoding unit then performs feature encoding on the aligned features or vectors through a bidirectional long short-term memory neural network to obtain a vectorized representation that fuses the acoustic features and the text features, i.e., the multi-modal features. In this embodiment, the text feature encodings and the acoustic feature encodings may be processed by the following formulas (12) to (14) to obtain multi-modal features in which text and audio are aligned.

c_i^fw = LSTM_fw(h̃_i, c_(i-1)^fw)    formula (12)

c_i^bw = LSTM_bw(h̃_i, c_(i+1)^bw)    formula (13)

c_i = [c_i^fw ; c_i^bw]    formula (14)

In the above formulas, c_i is the composite vector of the text feature and the acoustic feature after alignment, i.e., the multi-modal feature, whose effective length is the length of the text after word segmentation; h̃_i represents the alignment-feature vector generated after the attention-mechanism alignment; the value range of i is 1, 2, ..., N.
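Assuming the reconstruction of formulas (9)-(14) given above, the following sketch computes an additive attention score between each text token and each audio frame, normalizes it over the frames, sums the frames into one aligned vector per token, and re-encodes the aligned vectors with a second bidirectional LSTM to obtain the multi-modal features; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

d = 128                                      # encoder output size (2 x 64 from the BiLSTMs)
u = nn.Linear(d, 1, bias=False)              # u^T applied to the acoustic encodings
w = nn.Linear(d, 1, bias=True)               # v^T and bias b applied to the text encodings
fusion_encoder = nn.LSTM(d, 64, bidirectional=True, batch_first=True)

h = torch.randn(1, 6, d)                     # text feature encodings, 6 tokens
s = torch.randn(1, 200, d)                   # acoustic feature encodings, 200 frames

scores = torch.tanh(u(s).transpose(1, 2) + w(h))  # (1, 6, 200): one score per (token, frame)
alpha = torch.softmax(scores, dim=-1)             # attention weights over the audio frames
aligned = alpha @ s                                # (1, 6, d): one aligned audio vector per token
c, _ = fusion_encoder(aligned)                     # (1, 6, 128): multi-modal features c_i
```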

Step 104: and inputting the multi-modal characteristics into a prosody recognition model which is trained in advance to obtain a prosody phrase sequence output by the prosody recognition model.

The prosodic phrases are separated at least by prosodic symbols. For example, "today #1 weather #1 really nice #4" contains the prosodic phrases "today", "weather" and "really nice", separated by "#1" and "#4" respectively, where "#" represents a prosodic pause and "1" and "4" represent the pause level.
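As a small illustration (not part of this application), the helper below splits such a prosody-annotated string into (prosodic phrase, pause level) pairs.

```python
import re

def split_prosody(sequence: str):
    """Return [(phrase, pause_level), ...] for a '#'-annotated prosodic phrase sequence."""
    return [(phrase.strip(), int(level))
            for phrase, level in re.findall(r"([^#]+)#\s*(\d)", sequence)]

print(split_prosody("today #1 weather #1 really nice #4"))
# [('today', 1), ('weather', 1), ('really nice', 4)]
```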

In a specific implementation, the prosody recognition model in this embodiment is constructed in advance based on a sequence label prediction method, such as conditional random fields, hidden Markov models, Viterbi decoding and the like, i.e., it is constructed based on state-transition probability methods. After the prosody recognition model is initially constructed, it may be trained with at least two sentence samples carrying prosodic phrase tags and the audio samples corresponding to those sentence samples.

For example, in this embodiment a number of sentence samples, such as the sentence sample "little pool #1 spring water #1 immerses the light clouds #4", together with the corresponding audio samples and the corresponding multi-modal feature samples obtained by preprocessing, are input into the prosody recognition model in turn, where the sentence samples carry prosodic phrase tags, such as the "#1" after "little pool" and the "#4" at the end of the sentence. On this basis, the prosody recognition model learns from the multi-modal feature samples of the sentence samples and the corresponding audio samples using the initialized model parameters and outputs a corresponding prosody recognition result, which contains a prosodic phrase sequence for "little pool spring water immerses the light clouds". At this time, the prosodic phrase sequence in the prosody recognition result is compared with the prosodic phrase tags in the sentence sample to check whether the prosody recognition result output by the prosody recognition model under the current model parameters is accurate, and the model parameters of the prosody recognition model are then adjusted according to the comparison result. After training on a number of sentence samples and corresponding audio samples, the model parameters are adjusted multiple times until the comparison results of several consecutive iterations all indicate that the prosody recognition result is accurate; model training is then complete, and the trained prosody recognition model can accurately recognize the prosodic phrases of text data and the corresponding audio data.

It can be seen from the foregoing solution that, in the prosodic phrase recognition method provided in this embodiment of the present application, when prosodic phrase recognition is required, not only the text data but also the audio data corresponding to the text data are obtained. Multi-modal features in which text and audio are aligned are then obtained from the text feature encodings corresponding to the text data and the acoustic feature encodings corresponding to the audio data, and these multi-modal features, rather than the text features alone, are used as the input of a prosody recognition model trained jointly on sentence samples and audio samples, so that the prosody recognition model processes the multi-modal features and outputs a corresponding prosodic phrase sequence that contains a plurality of prosodic phrases separated by prosodic symbols. Because the audio samples corresponding to the sentence samples are added to the training of the prosody recognition model together with the sentence samples, the training samples of the prosody recognition model are enriched, and the audio corresponding to the text is closer to prosodic pronunciation in a real environment. The trained prosody recognition model can therefore process the multi-modal features corresponding to the text data and the corresponding audio data and output a more accurate prosodic phrase sequence, avoiding the low recognition accuracy caused by performing prosody recognition on text alone and achieving the purpose of improving the accuracy of prosodic phrase recognition.

In one implementation, the prosody recognition model in this embodiment may be specifically obtained by training as shown in fig. 2:

step 201: and obtaining multi-modal feature samples of the sentence samples and the corresponding audio samples.

The multi-modal feature samples can be understood as the multi-modal features obtained after the text vector samples generated from the sentence samples and the acoustic vector samples generated from the audio samples are aligned and converted into a vectorized representation.

In a specific implementation, in this embodiment a sentence sample is obtained through the text data input unit, and the audio sample corresponding to the sentence sample is obtained through the audio and video input unit. For example, the sentence sample "wait a moment #1 then call you #3, I am not going home now #4" is obtained through the text data input unit, where the sentence sample contains prosodic phrase tags, and the audio sample corresponding to this sentence sample is obtained through the audio and video input unit.

It should be noted that, when obtaining the sentence samples, the sentence samples need to be converted into a machine-readable encoding format, for example a UTF-8 encoding format; when obtaining the audio samples, the audio files need to be uniformly processed into an input format of the model, for example PCM, WAV or MP3.

Step 202: and inputting the multi-modal characteristic samples into the initially created prosody recognition model to obtain an output result of the prosody recognition model.

Specifically, after the multi-modal feature samples are input into the initially created prosody recognition model, the prosody recognition model learns from the multi-modal feature samples using the initialized model parameters and outputs a corresponding prosody recognition result, i.e., the output result, which contains a prosodic phrase sequence, such as "little pool #3 spring water #1 immerses the light clouds #4".

In a specific implementation, in this embodiment the multi-modal feature samples corresponding to the sentence samples are input into the initially created prosody recognition model: the text feature encodings of the sentence samples are acquired through the text feature encoding unit, the acoustic feature encodings of the corresponding audio samples are acquired through the acoustic feature encoding unit, the text feature encodings and the acoustic feature encodings are aligned by the attention-mechanism alignment unit, the aligned features are encoded by the alignment feature encoding unit to obtain a composite vector representation fusing the text and acoustic features, and finally the model decision unit uses a sequence-labeling scoring algorithm that takes the preceding and following tags into account to calculate the scores of all possible tag sequences and selects the sequence with the largest score as the output sequence of the model, thereby obtaining the output result of the prosody recognition model.
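The "select the highest-scoring tag sequence" step described above can be illustrated with the compact Viterbi decoder below, which scores tag sequences from per-token emission scores and tag-transition scores as in a linear-chain conditional random field; the tag set and the random scores are illustrative only.

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray):
    """emissions: (T, K) score of tag k at position t; transitions: (K, K) score of k_prev -> k."""
    T, K = emissions.shape
    score = emissions[0].copy()                 # best score of each tag at position 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]   # (K_prev, K_curr)
        backptr[t] = cand.argmax(axis=0)        # best previous tag for each current tag
        score = cand.max(axis=0)
    best = [int(score.argmax())]                # best final tag, then backtrack
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]                           # highest-scoring tag index per token

tags = ["O", "#1", "#3", "#4"]                  # hypothetical pause-level tag set
emissions = np.random.randn(6, len(tags))       # per-token scores, e.g. from the fused features c_i
transitions = np.random.randn(len(tags), len(tags))
print([tags[i] for i in viterbi_decode(emissions, transitions)])
```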

Step 203: and comparing the prosodic phrase sequence in the output result with the prosodic phrase tag of the sentence sample to obtain a comparison result.

In this embodiment, the prosodic phrase sequence in the output result, such as "little pool #3 spring water #1 immerses the light clouds #4", may be compared with the prosodic phrase tag in the sentence sample, such as "little pool #1 spring water #1 immerses the light clouds #4", to check whether the output result of the prosody recognition model under the current model parameters is accurate. For example, the comparison result may be the cross entropy calculated from the prosodic phrase sequence in the output result and the prosodic phrase tags of the sentence sample; the cross entropy represents the degree of similarity between the two, for example, the smaller the cross entropy, the more similar the prosodic phrase sequence in the output result is to the prosodic phrase tags of the sentence sample.

Step 204: and adjusting the model parameters of the rhythm identification model according to the comparison result.

In a specific implementation, in this embodiment it is determined whether the comparison result satisfies a preset adjustment condition, which decides whether to adjust the model parameters of the prosody recognition model and how to adjust them, for example by increasing or decreasing the values of one or more model parameters by a certain amount. On this basis, if the comparison result meets the preset adjustment condition and the model parameters are adjusted, the process may return to step 201 to obtain a new set of multi-modal feature samples and a new comparison result, and so on, until the obtained comparison result no longer meets the adjustment condition, for example until the cross entropy in the comparison result indicates that the similarity between the prosodic phrase sequence in the output result and the prosodic phrase tags of the sentence sample is greater than a certain threshold, at which point training is complete.
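A hedged sketch of one such training iteration is shown below: the model output is compared with the prosodic phrase tags through a cross-entropy style loss and the model parameters are adjusted from the comparison result; `model` is a stand-in for the whole prosody recognition model and is not defined in this application.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, multimodal_features, tag_ids):
    """multimodal_features: (1, T, d) feature sample; tag_ids: (1, T) gold prosodic tag indices."""
    logits = model(multimodal_features)                   # (1, T, num_tags) per-token scores
    loss = nn.functional.cross_entropy(                   # the "comparison result"
        logits.view(-1, logits.size(-1)), tag_ids.view(-1))
    optimizer.zero_grad()
    loss.backward()                                       # adjust the model parameters
    optimizer.step()
    return loss.item()
```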

In one implementation, in this embodiment a large number of sentence samples and corresponding audio samples are obtained before training the prosody recognition model. The original sentence samples may contain characters that are meaningless for the prosodic phrase recognition task, such as encoding artifacts, network tags and emoticons. In this case the data may first be denoised, for example by the data preprocessing unit: removing illegal network tags, converting between simplified and traditional characters, performing half-width/full-width conversion, removing markup symbols, counting the phonemes in the data and checking the phoneme balance; word segmentation is then performed on the data according to the manual labeling results to ensure that each word corresponds to a prosodic tag.
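An illustrative pre-processing helper along these lines is sketched below; the exact cleaning rules of this application are not specified, so the tag, emoji and full-width/half-width rules here are assumptions.

```python
import re

def clean_sentence(text: str) -> str:
    text = re.sub(r"<[^>]+>", "", text)                       # remove network/HTML-style tags
    text = re.sub("[\U0001F300-\U0001FAFF]", "", text)        # remove common emoji ranges
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                                    # full-width space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:                        # other full-width forms to half-width
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out).strip()

print(clean_sentence("<b>今天天气真好！</b>"))                  # -> "今天天气真好!"
```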

In specific implementation, in this embodiment, when obtaining the text feature coding sample corresponding to the sentence sample and the acoustic feature coding sample corresponding to the audio sample, step 201 may be implemented in the following manner, as shown in fig. 3:

step 301: and obtaining a text feature coding sample corresponding to the statement sample and an acoustic feature coding sample corresponding to the audio sample.

In one implementation, in step 301, the sentence samples and the audio samples may be encoded in a feature coding algorithm, such as a neural network, to obtain text feature coding samples corresponding to the sentence samples and acoustic feature coding samples corresponding to the audio samples.

Specifically, step 301 may be implemented by:

firstly, vector conversion is performed on the sentence sample and the audio sample respectively to obtain the text vector sample corresponding to the sentence sample and the acoustic vector sample corresponding to the audio sample, where the text vector samples contain the sentence vector sample of each sentence sample and each sentence vector sample contains one or more word vector samples; similarly, the acoustic vector sample may contain the segment vector samples of the acoustic segments corresponding to each sentence sample, and each segment vector sample consists of the vector samples corresponding to one or more voiced or pronounced segments;

specifically, in this embodiment, the conversion of the sentence samples may be performed through a pre-trained word vector matrix, so as to convert the sentence samples into a vectorization representation that can be understood by a computer. For example, first, a word vector pre-training algorithm is used to perform word vector training on all labeled data of the segmented words, wherein a word vector refers to a real number vector representing each word as K latitude, and similar word groups are mapped to different parts of a vector space. The calculation formula of word vector training is shown as formula (1), and the calculation formula used for converting the audio samples is shown as formula (2), wherein in order to improve accuracy in the application, the acoustic vector samples extracted by different methods are fused by using formula (2), and the acoustic vector samples corresponding to the audio samples are obtained by multiplying the parameters to be trained and the vectors extracted by the acoustic features by weight and then adding the vectors.

In a specific implementation, in this embodiment the text embedding and representation unit converts the sentence samples by a word-vector pre-training algorithm to obtain the corresponding text vector samples. The acoustic feature extraction unit performs pre-emphasis, framing, windowing and other operations on the audio samples to smooth the acoustic signal, then performs a fast Fourier transform on each segment of the signal to obtain a spectrogram, filters the signal with triangular window filters linearly distributed on the Mel frequency scale, and finally takes the logarithm of the triangular filter outputs to generate a Filter bank vector, which is generally 40-dimensional. MFCC feature vectors can additionally be obtained by applying a discrete cosine transform; both can be used as acoustic feature samples, and the acoustic vector samples are obtained after fusing them.

Then, feature encoding is performed on the text vector sample and the acoustic vector sample respectively to obtain the text feature coding sample corresponding to the text vector sample and the acoustic feature coding sample corresponding to the acoustic vector sample.

A text feature coding sample refers to the feature encoding obtained after encoding a text vector sample. Specifically, the text vector sample may be encoded by a bidirectional long short-term memory neural network; the text vector sample can be encoded in a number of ways, for example by the above formulas (3) to (5).

An acoustic feature coding sample refers to the feature encoding obtained after encoding an acoustic vector sample. Specifically, the acoustic vector sample may be encoded by a bidirectional long short-term memory neural network; the acoustic vector sample can be encoded in a number of ways, for example the MFCC vector may be encoded by the above formulas (6) to (8).

In a specific implementation, in this embodiment the text feature encoding unit uses the bidirectional long short-term memory neural network to encode the text vector samples and converts them into a vectorized representation to obtain the text feature coding samples h_i; the acoustic feature encoding unit uses a bidirectional long short-term memory neural network to encode the acoustic vector samples and converts them into a vectorized representation to obtain the acoustic feature coding samples s_i.

Step 302: and processing the text feature coding samples and the acoustic feature coding samples to obtain multi-modal feature samples which are aligned with respect to text and audio.

In this embodiment, the two feature coding samples may be aligned and fused to obtain a multi-modal feature sample, where the multi-modal feature sample is a feature sample in which two feature coding samples, namely a text feature coding sample and an acoustic feature coding sample, are fused and the two feature coding samples are aligned with respect to text and audio.

In one implementation, step 302 may be implemented by:

firstly, aligning the text feature coding samples and the acoustic feature coding samples with respect to text and audio by using an attention mechanism to obtain aligned feature samples;

among them, the attention mechanism is a mechanism for rapidly extracting important features of sparse data. In a specific implementation, in this embodiment, an attention mechanism aligning unit uses an attention mechanism to calculate an attention weight for an acoustic feature coding sample and a text feature coding sample, which aims to enable a learning model based on the attention mechanism to learn alignment weights of the acoustic feature sample and the text feature sample, so that the learning model can learn alignment of the two features at a phonetic-word level, in this embodiment, the text feature coding sample and the acoustic feature coding sample are aligned with respect to text and audio through equations (9) - (11) to obtain an alignment feature sample.

And then, converting the vector characteristics of the alignment characteristic sample to obtain a multi-modal characteristic sample.

In this embodiment, the alignment feature samples may be encoded by using a feature encoding algorithm, such as a neural network, to obtain multi-modal feature samples represented by alignment feature vectorization.

In a specific implementation of this embodiment, the attention-mechanism feature alignment unit may align the text feature coding samples and the acoustic feature coding samples so that text characters and acoustic segments correspond to each other, and the aligned feature encoding unit then encodes the aligned feature samples through a BiLSTM network to obtain a vectorized representation fusing the acoustic and text features, that is, the multi-modal feature sample. In this embodiment, the text feature coding samples and the acoustic feature coding samples may be processed by equations (12) to (14) to obtain the multi-modal feature samples aligned with respect to text and audio.
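The following is a minimal sketch of the attention-based alignment and the aligned-feature encoding; since equations (9)-(14) are not reproduced here, a simple dot-product attention between word encodings and projected frame encodings is assumed, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAligner(nn.Module):
    """Aligns acoustic frame encodings to text token encodings with dot-product
    attention, then re-encodes the fused sequence with a BiLSTM."""
    def __init__(self, text_dim: int, acoustic_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(acoustic_dim, text_dim)          # map acoustics into the text space
        self.fused_encoder = nn.LSTM(text_dim * 2, hidden_dim,
                                     batch_first=True, bidirectional=True)

    def forward(self, h: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # h: (batch, n_words, text_dim); s: (batch, n_frames, acoustic_dim)
        s_proj = self.proj(s)                                   # (batch, n_frames, text_dim)
        scores = torch.bmm(h, s_proj.transpose(1, 2))           # word-to-frame attention scores
        weights = F.softmax(scores, dim=-1)                     # attention weights per word
        aligned_acoustic = torch.bmm(weights, s_proj)           # acoustic context for each word
        fused = torch.cat([h, aligned_acoustic], dim=-1)        # aligned feature samples
        multimodal, _ = self.fused_encoder(fused)               # multi-modal feature samples
        return multimodal

# Usage with the encoder outputs h and s from the previous sketch (dimensions assumed):
aligner = AttentionAligner(text_dim=256, acoustic_dim=256, hidden_dim=128)
multimodal_features = aligner(torch.randn(2, 20, 256), torch.randn(2, 200, 256))
```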

It should be emphasized that this embodiment is applicable to prosodic phrase recognition when both text data and audio data are available. When one modality is missing, the technical solution can still be applied. For example, if the text data is missing, speech recognition may be performed on the audio data to obtain text data, after which prosodic phrase recognition is performed by integrating the text data and the audio data, or recognition may be performed based on the audio data alone. Alternatively, if the audio data is missing, corresponding audio may be generated for the text data manually or by an audio generation tool, after which the text data and the generated audio are integrated for recognition, or recognition may be performed based on the text data alone.

Referring to fig. 4, a schematic structural diagram of a prosodic phrase recognition device provided in the second embodiment of the present application is shown. The device may be configured in an electronic device capable of data processing. The technical solution mainly adds sentence samples and their corresponding audio samples to the training of the prosody recognition model, thereby enriching the training samples; the trained prosody recognition model can then process the multi-modal features corresponding to text data and its audio data and output a more accurate prosodic phrase sequence, avoiding the low recognition accuracy caused by performing prosody recognition from text alone and improving the accuracy of prosodic phrase recognition.

Specifically, the apparatus may include the following units:

a data obtaining unit 401, configured to obtain target data to be identified, where the target data at least includes text data and audio data corresponding to the text data, and the text data includes at least one sentence;

a feature code obtaining unit 402, configured to obtain a text feature code corresponding to the text data and an acoustic feature code corresponding to the audio data;

a multi-modal feature obtaining unit 403, configured to process the text feature codes and the acoustic feature codes to obtain multi-modal features regarding alignment of text and audio;

a model operation unit 404, configured to input the multi-modal features into a prosody recognition model that is trained in advance, so as to obtain a prosodic phrase sequence output by the prosody recognition model, where the prosodic phrase sequence includes a plurality of prosodic phrases, and the prosodic phrases are divided by using at least prosodic symbols;

the prosody recognition model is obtained by training at least two sentence samples with prosody phrase labels and audio samples corresponding to the sentence samples.

It can be seen from the foregoing solution that, when prosodic phrase recognition is required, the prosodic phrase recognition device provided in the second embodiment of the present application obtains not only the text data but also the audio data corresponding to the text data. Multi-modal features aligned with respect to text and audio are then obtained from the text feature codes of the text data and the acoustic feature codes of the audio data, and these multi-modal features, rather than text features alone, are used as the input of a prosody recognition model trained jointly on sentence samples and audio samples, so that the model processes the multi-modal features and outputs a prosodic phrase sequence that contains a plurality of prosodic phrases segmented by prosodic symbols. Because the sentence samples and their corresponding audio samples are both added to the training of the prosody recognition model, the training samples are enriched, and the audio corresponding to the text better reflects prosodic pronunciation in a real environment. The trained model can therefore process the multi-modal features of text data and its corresponding audio data and output a more accurate prosodic phrase sequence, avoiding the low recognition accuracy caused by prosody recognition from text alone and improving the accuracy of prosodic phrase recognition.

Referring to fig. 5, the apparatus in the second embodiment of the present application may further include the following structure:

a model training unit 405, configured to obtain multi-modal feature samples of the sentence samples and their corresponding audio samples; input the multi-modal feature samples corresponding to the sentence samples into an initially created prosody recognition model to obtain an output result of the prosody recognition model; compare the prosodic phrase sequence in the output result with the prosodic phrase tags of the sentence samples to obtain a comparison result; and adjust the model parameters of the prosody recognition model according to the comparison result.

The model training unit 405 may be specifically implemented by the following modules, as shown in fig. 6:

a data obtaining module 601, configured to obtain a multi-modal feature sample of the sentence sample and an audio sample corresponding to the sentence sample;

the data obtaining module 601 is specifically configured to: obtaining a text feature coding sample corresponding to the sentence sample and an acoustic feature coding sample corresponding to the audio sample, for example, first respectively converting the sentence sample and the audio sample to obtain a text vector sample corresponding to the sentence sample and an acoustic vector sample corresponding to the audio sample, and then respectively performing feature coding on the text vector sample and the acoustic vector sample to obtain a text feature coding sample corresponding to the text vector sample and an acoustic feature coding sample corresponding to the acoustic vector sample; the text feature coding samples and the acoustic feature coding samples are processed to obtain multi-modal feature samples related to alignment of text and audio, for example, firstly, the text feature coding samples and the acoustic feature coding samples are aligned related to text and audio by using an attention mechanism to obtain aligned feature samples, and then, the aligned feature samples are subjected to vector feature conversion to obtain multi-modal feature samples.

A data input module 602, configured to input a multi-modal feature sample corresponding to the statement sample into an initially created prosody recognition model to obtain an output result of the prosody recognition model;

a data comparison module 603, configured to compare the prosodic phrase sequence in the output result with the prosodic phrase tag of the sentence sample to obtain a comparison result;

a data adjusting module 604, configured to adjust the model parameters of the prosody recognition model according to the comparison result.

In one implementation, the feature code obtaining unit 402 is specifically configured to: convert the text data and the audio data respectively to obtain a text vector corresponding to the text data and an acoustic vector corresponding to the audio data; and perform feature coding on the text vector and the acoustic vector respectively to obtain the text feature code and the acoustic feature code.

In one implementation, the multi-modal feature obtaining unit 403 is specifically configured to: align the text feature code and the acoustic feature code with respect to text and audio by using an attention mechanism to obtain an alignment feature; and perform vector feature conversion on the alignment feature to obtain the multi-modal features.

It should be noted that, for the specific implementation of each unit in the present embodiment, reference may be made to the corresponding content in the foregoing, and details are not described here.

Referring to fig. 7, a schematic structural diagram of an electronic device according to a third embodiment of the present disclosure is provided. The electronic device may be a device capable of data processing, such as a computer or a server, and is mainly provided with the prosody recognition model in this embodiment.

Specifically, the electronic device in this embodiment may include the following structure:

a memory 701 for storing an application program and data generated by the operation of the application program;

a processor 702 for executing the application to implement: obtaining target data to be identified, wherein the target data at least comprises text data and audio data corresponding to the text data, and the text data comprises at least one sentence; acquiring text feature codes corresponding to the text data and acoustic feature codes corresponding to the audio data; processing the text feature encoding and the acoustic feature encoding to obtain multi-modal features regarding text and audio alignment; inputting the multi-modal characteristics to a prosody recognition model which is trained in advance to obtain a prosody phrase sequence output by the prosody recognition model, wherein the prosody phrase sequence comprises a plurality of prosody phrases, and the prosody phrases are divided at least by prosody symbols; the prosody recognition model is obtained by training at least two sentence samples with prosody phrase labels and audio samples corresponding to the sentence samples.

According to the above scheme, when prosodic phrase recognition is required, the electronic device provided in the third embodiment of the present application obtains not only the text data but also the audio data corresponding to the text data, then obtains multi-modal features aligned with respect to text and audio from the text feature codes of the text data and the acoustic feature codes of the audio data, and uses these multi-modal features, rather than text features alone, as the input of a prosody recognition model trained jointly on sentence samples and audio samples, so that the model processes the multi-modal features and outputs a prosodic phrase sequence that contains a plurality of prosodic phrases segmented by prosodic symbols. Because the sentence samples and their corresponding audio samples are both added to the training of the prosody recognition model, the training samples are enriched, and the audio corresponding to the text better reflects prosodic pronunciation in a real environment. The trained model can therefore process the multi-modal features of text data and its corresponding audio data and output a more accurate prosodic phrase sequence, avoiding the low recognition accuracy caused by prosody recognition from text alone and improving the accuracy of prosodic phrase recognition.

It should be noted that, the specific implementation of the processor in the present embodiment may refer to the corresponding content in the foregoing, and is not described in detail here.

Fig. 8 is a block diagram of a prosodic phrase recognition scheme according to an embodiment of the present application; the following units are used both for training the prosody recognition model in the earlier stage and for prosodic phrase recognition in practical applications:

1. audio and video input unit: the unit is used for acquiring audio or video files, wherein the audio files are uniformly processed into an input format of a model, and the video files are subjected to audio extraction.

2. An acoustic feature extraction unit: this unit extracts acoustic features by performing operations such as segmentation and Fourier transform on the audio files to obtain acoustic feature vectors (namely the acoustic vectors or acoustic vector samples).

3. An acoustic feature encoding unit: this unit encodes the acoustic features with a bidirectional long short-term memory (BiLSTM) network and converts them into a vectorized representation (namely the acoustic feature code or acoustic feature coding samples).

4. A text data input unit: this unit reads the manually labeled prosodic phrase data and converts it into computer-readable codes and formats.

5. A text data preprocessing unit: this unit performs denoising preprocessing on the data, such as removing tags and converting traditional Chinese characters to simplified characters, as well as word segmentation.

6. Text embedding representation unit: text embedding means that an input word is converted into a word vector (i.e., a text vector or a text vector sample) by mapping.

7. A text feature encoding unit: this unit encodes the text information with a BiLSTM network and converts the text features into a vectorized representation (namely the text feature code or text feature coding samples).

8. Attention mechanism alignment unit: using an attention mechanism, an attention weight is calculated from the acoustic feature vector and the text feature vector for alignment of the two (i.e., alignment features or alignment feature samples).

9. An alignment feature encoding unit: this unit encodes the aligned feature vectors with a BiLSTM network to obtain a vectorized representation fusing the acoustic features and the text features (namely the multi-modal features or multi-modal feature samples).

10. A model decision unit: this unit calculates the scores of all possible label sequences with a sequence-labeling scoring algorithm and selects the sequence with the highest score as the output sequence of the model (namely the output result, which contains the prosodic phrase sequence; in the training stage, the model parameters are adjusted according to the output sequence to realize model training).

11. A result output unit: this unit converts the maximum-probability sequence predicted by the conditional random field into prosodic phrase recognition result labels, which serve as the final output of the whole model.

Specifically, the technical scheme of the application realizes training of a prosody recognition model and recognition of prosody phrases through the following processes:

1. data acquisition

In the present application, a large number of sentence samples are collected and labeled; according to the sentence samples, audio is recorded manually or generated through an audio generation tool, and the prosodic phrase labeling texts are mapped to the audio files, thereby constructing a multi-modal prosodic phrase recognition data set, namely the training samples.

Similarly, the data acquisition is also used for obtaining target data to be prosody-recognized.

2. Data pre-processing

Data preprocessing performs phoneme statistics and denoising on the collected training samples and the target data to be recognized. The original training data may contain, for example, encoding-format artifacts, network tags and emoticons; for characters that are meaningless to the prosodic phrase recognition task, the following processing is required: counting data phonemes and checking phoneme balance, removing illegal network tags, removing emoticons, converting traditional characters to simplified characters, converting full-width characters to half-width characters, and so on.
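A minimal sketch of this denoising step is shown below; the regular expressions for network tags and emoticons are illustrative assumptions, NFKC normalization stands in for the full-width to half-width conversion, and traditional-to-simplified conversion (which typically relies on an external library such as OpenCC) is omitted.

```python
import re
import unicodedata

TAG_PATTERN = re.compile(r"<[^>]+>|@\S+")                      # illustrative "network tag" patterns
EMOJI_PATTERN = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoticon ranges

def clean_text(raw: str) -> str:
    """Denoise a raw training sentence before word segmentation."""
    text = TAG_PATTERN.sub("", raw)                 # remove HTML/network tags and mentions
    text = EMOJI_PATTERN.sub("", text)              # remove emoticons
    text = unicodedata.normalize("NFKC", text)      # full-width -> half-width conversion
    return text.strip()
```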

In addition, the training samples are segmented into words according to the manual labeling results, ensuring that each word corresponds to one prosodic label.

Similarly, the text data in the target data to be recognized is subjected to word segmentation processing.

3. Text embedding vector generation

A pre-trained word vector matrix is used in this application to convert the words in a sample into a vectorized representation that a computer can process. Specifically, word vector training is first performed on all segmented and labeled data through a word vector pre-training algorithm. The basic idea of word vectors is to represent each word as a K-dimensional real vector, so that similar words are mapped to nearby regions of the vector space. The relations between words are learned during word vector training, so the vocabulary can be well expressed in the form of word vectors; for example, the vector conversion is realized by using calculation formula (1).
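A minimal sketch of this lookup step, assuming the pre-trained word vectors have already been loaded into a dictionary; the zero-vector fallback for unseen words and the 300-dimensional size are illustrative choices, not the content of formula (1).

```python
import numpy as np

def embed_sentence(words, vector_table, dim=300):
    """Map each segmented word to its pre-trained K-dimensional vector;
    unseen words fall back to a zero vector (an illustrative choice)."""
    return np.stack([vector_table.get(w, np.zeros(dim, dtype=np.float32))
                     for w in words])

# vector_table would normally be loaded from a pre-trained word-vector file (assumed here)
vector_table = {"warm": np.random.rand(300).astype(np.float32),
                "day": np.random.rand(300).astype(np.float32)}
text_vectors = embed_sentence(["warm", "day", "window"], vector_table)  # shape (3, 300)
```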

4. Acoustic feature extraction

Acoustic features in the audio can be extracted by various methods, such as MFCC or Filter bank. Taking the Filter bank as an example, the feature extraction process is shown in fig. 9. First, pre-emphasis, framing and windowing are performed on the audio file in order to smooth the acoustic signal before the fast Fourier transform; a fast Fourier transform is then applied to each segment of the signal to obtain a spectrogram; the signal is filtered by triangular window filters linearly distributed on the Mel frequency scale; finally, the logarithm of the filter outputs is taken to produce the Filter bank vector, which is typically 40-dimensional. If a discrete cosine transform is further applied, MFCC feature vectors are obtained and can also be used as the acoustic features.
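A minimal sketch of this extraction pipeline using the librosa library (an assumed tool, not one named in this application); the sampling rate, frame length and hop length are illustrative.

```python
import numpy as np
import librosa

def filterbank_features(wav_path: str, n_mels: int = 40) -> np.ndarray:
    """Pre-emphasis, framing/windowing, FFT and Mel filtering -> log Filter bank vectors."""
    y, sr = librosa.load(wav_path, sr=16000)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])          # pre-emphasis
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    fbank = np.log(mel + 1e-6).T                        # (n_frames, 40) log Filter bank vectors
    return fbank

# Applying a discrete cosine transform on top of the log Filter bank yields MFCCs,
# e.g. librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
```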

5. Acoustic feature fusion

In the present application, acoustic feature vectors extracted by different methods are fused through a fully connected network. Taking the MFCC features and Filter bank features as an example, the fusion of multiple acoustic features is realized as shown in formula (2).
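A minimal sketch of this fusion step; since formula (2) is not reproduced here, concatenating the per-frame MFCC and Filter bank vectors and passing them through a fully connected layer with a tanh activation is an assumed form.

```python
import torch
import torch.nn as nn

class AcousticFusion(nn.Module):
    """Fuses per-frame MFCC and Filter bank vectors with a fully connected layer."""
    def __init__(self, mfcc_dim: int = 13, fbank_dim: int = 40, out_dim: int = 64):
        super().__init__()
        self.fc = nn.Linear(mfcc_dim + fbank_dim, out_dim)

    def forward(self, mfcc: torch.Tensor, fbank: torch.Tensor) -> torch.Tensor:
        # mfcc: (frames, 13), fbank: (frames, 40) -> fused acoustic vectors (frames, 64)
        return torch.tanh(self.fc(torch.cat([mfcc, fbank], dim=-1)))
```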

6. Acoustic feature coding

In the present application, the MFCC vectors can be encoded through a BiLSTM network; the network learns from the acoustic feature vectors to obtain deep feature representation vectors of the acoustic features, as shown in formulas (6)-(8).

7. Text feature coding

In the present application, the text vectors can be encoded through a BiLSTM network whose parameters are not shared with those of the acoustic feature encoding network; this independent text encoding network produces deep feature representation vectors of the text vectors, as shown in formulas (3)-(5).

8. Acoustic feature, text feature alignment

Since the acoustic features are obtained by segmenting the audio file into fixed-duration frames and applying several transforms, while the text features are obtained by word segmentation, the acoustic features and text features need to be aligned. This is achieved with an attention mechanism, so that the model learns the alignment weights of the acoustic and text features and can align the two modalities at the phone-character level, as shown in formulas (9)-(11).

9. Alignment feature coding

The aligned features still need to be encoded through a BiLSTM network to obtain a comprehensive vector representation fusing the text and acoustic features; the calculation formulas are shown in (12)-(14).

10. Sequence tag prediction

Label prediction is performed on each hidden-layer state of the aligned encoding features by a sequence label prediction method. In general, the labeling (prosodic phrase segmentation) stage can use a softmax function and take the label with the highest probability for each word as its prosodic label, but this approach is limited when the output labels are strongly interdependent. In prosodic phrase recognition the prosodic labels of neighboring words interact, so label prediction requires methods based on state transition probabilities, such as conditional random fields, hidden Markov models or Viterbi decoding. Taking the conditional random field as an example, the algorithm scores entire label paths rather than considering each word independently, so the output is the best label sequence. As shown in fig. 10, the prosodic phrase sequence 2-5-2-4-4 is the best label sequence.
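A minimal sketch of path-level decoding is shown below; it implements plain Viterbi decoding over per-word emission scores and label transition scores, standing in for the conditional random field decision described above (the actual scoring functions belong to the model and fig. 10 and are not reproduced here).

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Select the highest-scoring prosodic-label path for one sentence.

    emissions:   (seq_len, n_labels) per-word label scores from the encoder
    transitions: (n_labels, n_labels) score of moving from label i to label j
    """
    seq_len, n_labels = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((seq_len, n_labels), dtype=int)
    for t in range(1, seq_len):
        # total[i, j] = best score ending in label i at t-1, then moving to label j at t
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t][best[-1]]))
    return best[::-1]   # best prosodic label sequence, scored path by path
```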

11. Result output

In the method, the prosodic units predicted by the conditional random field are converted into the corresponding actual labels according to the label probability values, and the model loss is calculated against the actual labels to optimize model training.

12. Model inference

As shown in fig. 11, the inference data (data to be recognized) can be divided into three categories. If the test data contains audio/video and text files at the same time, the prosodic phrases can be inferred directly; if the inference data contains only audio or video, it can be converted into text through an ASR system before the prosodic phrases are inferred; and if the inference data contains only text, prediction is performed using the text alone. With this scheme, the model can be applied to more scenarios, which improves its adaptability.
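A minimal sketch of this routing logic; the three recognizer functions are placeholders standing in for the model calls described in this section, not APIs defined in this application.

```python
def asr_transcribe(audio):
    """Placeholder for the ASR system that converts audio/video into text."""
    raise NotImplementedError

def recognize_multimodal(text, audio):
    """Placeholder for multi-modal prosodic phrase recognition (text + audio)."""
    raise NotImplementedError

def recognize_text_only(text):
    """Placeholder for text-only prosodic phrase prediction."""
    raise NotImplementedError

def infer_prosody(text=None, audio=None):
    """Route inference data to the appropriate prosodic-phrase prediction path."""
    if text is not None and audio is not None:
        return recognize_multimodal(text, audio)    # text + audio/video available
    if audio is not None:
        text = asr_transcribe(audio)                # recover text via an ASR system first
        return recognize_multimodal(text, audio)
    if text is not None:
        return recognize_text_only(text)            # text-only prediction
    raise ValueError("either text or audio must be provided")
```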

Therefore, the technical solution of the present application realizes prosodic phrase recognition based on multi-modal feature fusion: acoustic features and text features are fused, the fused features are used for prosodic phrase recognition, and sequence label prediction is performed on this basis, so that the sequence prediction stage takes contextual dependencies into account more fully.

Taking prosodic phrase recognition of ancient poetry as an example, the technical scheme of the application is exemplified as follows:

As shown in fig. 12, in addition to the punctuation marks used as cut points, spoken language also contains smaller prosodic phrase boundaries, for example: "warm day #1 window # maps to the blue #1 yarn #4. Cisterna #1 spring water #1 soaked minxia #4." Here "#" marks the segmentation and the following number indicates the pause level. In such cases, automatic prosody recognition techniques need to be introduced to segment the text into prosodic phrase fragments. In the conventional prosodic phrase collection method, annotators listen to the voice data manually and judge whether a prosodic boundary exists after each word, so prosodic boundaries are easily mispredicted under the influence of word segmentation, as shown in fig. 12(a); this is particularly prominent in poems, classical texts, novels and similar genres. With the technical solution of the present application, as shown in fig. 12(b), the acoustic features are added and fused with the text features, which can effectively improve the accuracy of prosodic phrase recognition.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
