Dialect voice data segmentation and labeling method and system

Document No.: 1955160 · Published: 2021-12-10

Summary: 一种方言语音数据切分及标注方法、系统 (Dialect voice data segmentation and labeling method and system) was designed by 谢萦, 钱跃良 and 王向东 on 2021-09-13. The invention provides a dialect voice data segmentation and labeling method and system. A Mandarin speech recognition model is used to segment and label first dialect speech, yielding a Chinese character recognition result for the first dialect speech, and a dialect vocabulary library is constructed from the first dialect speech, its corresponding Chinese character text and the Chinese character recognition result. Voice data to be segmented and labeled is acquired as second dialect speech; a phoneme sequence corresponding to the second dialect speech is obtained based on the dialect vocabulary library and the Chinese character text corresponding to the second dialect speech, and a speech segmentation and labeling result for the second dialect speech is obtained using a phoneme alignment algorithm and the Mandarin speech recognition model. A dialect speech acoustic model is trained on the manually proofread segmentation and labeling results and their corresponding voice data, and dialect voice data is input to the dialect speech acoustic model to obtain its segmentation and labeling result.

1. A dialect voice data segmentation and labeling method is characterized by comprising the following steps:

step 1, performing voice data segmentation and labeling on first dialect speech by using a Mandarin speech recognition model to obtain a Chinese character recognition result of the first dialect speech, and constructing a dialect vocabulary library based on the first dialect speech, its corresponding Chinese character text and the Chinese character recognition result;

step 2, obtaining voice data to be segmented and labeled as second dialect speech, obtaining a phoneme sequence corresponding to the second dialect speech based on the dialect vocabulary library and the Chinese character text corresponding to the second dialect speech, and obtaining a speech segmentation and labeling result of the second dialect speech by using a phoneme alignment algorithm and the Mandarin speech recognition model;

step 3, training a dialect speech acoustic model based on the manually proofread speech segmentation and labeling result and the corresponding voice data, and inputting the dialect voice data to be segmented and labeled into the dialect speech acoustic model to obtain the segmentation and labeling result of that dialect voice data.

2. The method for segmenting and labeling dialect speech data according to claim 1, wherein the step 1 comprises:

step 11, recognizing the first dialect speech by using the Mandarin speech recognition model to obtain a corresponding recognition result, and obtaining the phoneme sequence of the Chinese character text corresponding to the first dialect speech and the phoneme sequence of the recognition result by using correspondence tables from Chinese words and Chinese characters to phoneme sequences;

step 12, automatically aligning, based on edit distance, the phoneme sequence of the Chinese character text of the first dialect speech with the phoneme sequence of the corresponding recognition result, wherein after alignment each phoneme in the phoneme sequence of the Chinese character text of the first dialect speech either corresponds to one phoneme in the phoneme sequence of the recognition result or corresponds to no phoneme;

step 13, performing word segmentation on the Chinese character text of the first dialect speech to obtain the phoneme sequence of the Chinese character text corresponding to each word, and obtaining the phoneme sequence of the recognition result corresponding to each word according to the correspondence between the phoneme sequence of the Chinese character text and the phoneme sequence of the recognition result established in step 12; if the phoneme sequence of the Chinese character text corresponding to a word is inconsistent with the phoneme sequence of the recognition result corresponding to the word, adding the word, the phoneme sequence of the Chinese character text corresponding to the word and the phoneme sequence of the recognition result corresponding to the word to a candidate word list;

step 14, retaining, through de-duplication and merging, only one entry for each word in the candidate word list, the entry comprising the word's phoneme sequence in the Chinese character text and its phoneme sequences in all corresponding recognition results, and manually proofreading and labeling the candidate word list to obtain the dialect vocabulary library.

3. The method for segmenting and labeling dialect speech data of claim 1, wherein the step 2 comprises:

step 21, for the second dialect speech, first performing word segmentation on the corresponding Chinese character text and judging whether each segmented word exists in the dialect vocabulary library; if so, acquiring the phoneme sequence of the word from the dialect vocabulary library; otherwise, acquiring the phoneme sequence of the word by using the correspondence tables from Chinese words and Chinese characters to phoneme sequences; and concatenating the phoneme sequences of all words of the second dialect speech as the phoneme sequence corresponding to the second dialect speech;

step 22, aligning the second dialect speech with the corresponding phoneme sequence based on the speech recognition acoustic model, and obtaining, after alignment, the start time and end time corresponding to each phoneme in the phoneme sequence, thereby realizing speech segmentation and labeling.

4. The method for segmenting and labeling dialect speech data according to claim 1, wherein the step 3 comprises:

mixing the manually proofread speech segmentation and labeling result and the corresponding voice data with Mandarin voice data, and jointly training an acoustic model based on a DNN + HMM model to serve as the dialect speech acoustic model; or

performing acoustic model adaptation based on maximum likelihood linear transformation on an existing Mandarin-oriented acoustic model based on a GMM + HMM model, using the manually proofread speech segmentation and labeling result and the corresponding voice data, to obtain the dialect speech acoustic model; or

fine-tuning an existing transducer-based deep neural network acoustic model, using the manually proofread speech segmentation and labeling result and the corresponding voice data, to obtain the dialect speech acoustic model.

5. The dialect speech data segmenting and labeling method of claim 1, further comprising step 4: training a dialect speech recognition model and/or a dialect speech synthesis model based on the segmentation and labeling result output by the dialect speech acoustic model, converting dialect audio into Chinese character information through the dialect speech recognition model, and/or converting Chinese character information into dialect audio through the dialect speech synthesis model.

6. A dialect voice data segmentation and labeling system is characterized by comprising:

module 1, configured to perform voice data segmentation and labeling on first dialect speech by using a Mandarin speech recognition model to obtain a Chinese character recognition result of the first dialect speech, and to construct a dialect vocabulary library based on the first dialect speech, its corresponding Chinese character text and the Chinese character recognition result;

module 2, configured to acquire voice data to be segmented and labeled as second dialect speech, obtain a phoneme sequence corresponding to the second dialect speech based on the dialect vocabulary library and the Chinese character text corresponding to the second dialect speech, and obtain a speech segmentation and labeling result of the second dialect speech by using a phoneme alignment algorithm and the Mandarin speech recognition model;

module 3, configured to train a dialect speech acoustic model based on the manually proofread speech segmentation and labeling result and the corresponding voice data, and to input the dialect voice data to be segmented and labeled into the dialect speech acoustic model to obtain the segmentation and labeling result of that dialect voice data.

7. The dialect speech data segmentation and labeling system of claim 6, wherein the module 1 comprises:

module 11, configured to recognize the first dialect speech by using the Mandarin speech recognition model to obtain a corresponding recognition result, and to obtain the phoneme sequence of the Chinese character text corresponding to the first dialect speech and the phoneme sequence of the recognition result by using correspondence tables from Chinese words and Chinese characters to phoneme sequences;

module 12, configured to automatically align, based on edit distance, the phoneme sequence of the Chinese character text of the first dialect speech with the phoneme sequence of the corresponding recognition result, wherein after alignment each phoneme in the phoneme sequence of the Chinese character text of the first dialect speech either corresponds to one phoneme in the phoneme sequence of the recognition result or corresponds to no phoneme;

module 13, configured to perform word segmentation on the Chinese character text of the first dialect speech to obtain the phoneme sequence of the Chinese character text corresponding to each word, and to obtain the phoneme sequence of the recognition result corresponding to each word according to the correspondence between the phoneme sequence of the Chinese character text and the phoneme sequence of the recognition result established in module 12; if the phoneme sequence of the Chinese character text corresponding to a word is inconsistent with the phoneme sequence of the recognition result corresponding to the word, the word, the phoneme sequence of the Chinese character text corresponding to the word and the phoneme sequence of the recognition result corresponding to the word are added to a candidate word list;

module 14, configured to retain, through de-duplication and merging, only one entry for each word in the candidate word list, the entry comprising the word's phoneme sequence in the Chinese character text and its phoneme sequences in all corresponding recognition results, the candidate word list being manually proofread and labeled to obtain the dialect vocabulary library.

8. The dialect speech data segmentation and labeling system of claim 6, wherein the module 2 comprises:

module 21, configured to, for the second dialect speech, first perform word segmentation on the corresponding Chinese character text and judge whether each segmented word exists in the dialect vocabulary library; if so, obtain the phoneme sequence of the word from the dialect vocabulary library; otherwise, obtain the phoneme sequence of the word by using the correspondence tables from Chinese words and Chinese characters to phoneme sequences; and concatenate the phoneme sequences of all words of the second dialect speech as the phoneme sequence corresponding to the second dialect speech;

module 22, configured to align the second dialect speech with the corresponding phoneme sequence based on the speech recognition acoustic model, and to obtain, after alignment, the start time and end time corresponding to each phoneme in the phoneme sequence, thereby realizing speech segmentation and labeling.

9. The dialect speech data segmentation and labeling system of claim 6, wherein the module 3 comprises:

10. The dialect speech data segmentation and labeling system of claim 6, further comprising module 4: training a dialect voice recognition model and/or a dialect voice synthesis model based on the segmentation and labeling result output by the dialect voice acoustic model, converting dialect audio into Chinese character information through the dialect voice recognition model, and/or converting the Chinese character information into the dialect audio through the dialect voice synthesis model.

Technical Field

The invention relates to the technical field of voice processing and voice corpus construction, in particular to the field of dialect voice data segmentation and annotation.

Background

Segmentation and labeling of speech data means segmenting speech into phonemes (in Chinese, phonemes are similar to the initials and finals of pinyin), that is, labeling the start time and end time of each phoneme in the speech. For example, for the Chinese word "中国" (zhongguo, "China"), the corresponding phonemes are "zh", "ong", "g" and "uo". For voice data whose content is this word, accurately labeling the start time and end time of each phoneme accomplishes the segmentation and labeling of the voice data. The result can be stored, for example, as a label file whose first two columns hold the start time and end time in seconds and whose third column holds the corresponding phoneme.
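To make the three-column label file concrete, the following minimal Python sketch parses such a file into structured records. The function name `parse_labels`, the whitespace-separated layout and the sample times are assumptions made for illustration; the source only states that the first two columns are start and end times in seconds and the third is the phoneme.

```python
from typing import List, NamedTuple


class PhoneLabel(NamedTuple):
    start: float   # start time in seconds
    end: float     # end time in seconds
    phoneme: str


def parse_labels(text: str) -> List[PhoneLabel]:
    """Parse lines of the form '<start> <end> <phoneme>' into labels."""
    labels = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, phoneme = float(parts[0]), float(parts[1]), parts[2]
        if end < start:
            raise ValueError(f"end time before start time: {line!r}")
        labels.append(PhoneLabel(start, end, phoneme))
    return labels


# Hypothetical label file for the four phonemes of one word.
SAMPLE = """\
0.00 0.12 zh
0.12 0.30 ong
0.30 0.38 g
0.38 0.61 uo
"""

labels = parse_labels(SAMPLE)
```

Each record keeps the phoneme together with its time span, which is the information the alignment steps described later need to produce.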

Accurate segmentation and labeling of voice data is the basis for training a high-quality speech recognition model and thus for building a high-quality speech recognition system. With the development of speech recognition technology and the expansion of its application fields, the demand for dialect speech recognition systems is increasing. Building a high-quality dialect speech recognition system requires a dialect speech database that is precisely aligned and labeled. Because such databases are generally very large, producing them manually requires a great deal of manpower and material resources and cannot meet the need for large-volume databases.

Methods and tools already exist for building speech databases for standard Mandarin Chinese. The existing approach trains a speech recognition acoustic model for Mandarin and then uses that model to perform forced alignment between the speech and a phoneme sequence (obtained by converting the Chinese character text corresponding to the speech), yielding the start time and end time of each phoneme. This approach is difficult to apply to dialect voice data, however, because a high-accuracy speech recognition system and model are usually not available for the specific dialect to be segmented and labeled. If a low-accuracy model is used instead, including a model built for Mandarin, the alignment drifts, and the resulting segmentation and labeling contain so many errors that they are hard to use. The biggest difficulty when using a Mandarin model is that many words are pronounced very differently in the dialect than in Mandarin, so aligning according to the Mandarin pronunciation causes the alignment to fail.

The phoneme sequence of the dialect pronunciation therefore has to be obtained before alignment, and the alignment has to be performed against that dialect pronunciation. For a specific dialect, this generally requires an expert fluent in the dialect to label the phoneme sequences, which is so time-consuming and laborious that it is hardly practical.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: a dialect voice data segmentation and labeling method is provided in which a dialect vocabulary library is constructed through automatic extraction combined with a small amount of manual proofreading and labeling. Based on the dialect vocabulary library, the dialect phoneme sequence corresponding to the speech can be obtained, so that when a Mandarin speech recognition acoustic model is used to align the speech with the phonemes, alignment deviations and errors caused by differing phonemes are avoided and the accuracy of voice data segmentation and labeling is improved.

Specifically, the invention provides a dialect voice data segmentation and labeling method aiming at the requirements of dialect voice recognition basic database construction and the problems existing at present, wherein the method comprises the following steps:

In the dialect voice data segmentation and labeling method, step 1 comprises:

In the dialect voice data segmentation and labeling method, step 2 comprises:

In the dialect voice data segmentation and labeling method, step 3 comprises:

The dialect voice data segmentation and labeling method further comprises the following steps: training a dialect voice recognition model and/or a dialect voice synthesis model based on the segmentation and labeling result output by the dialect voice acoustic model, converting dialect audio into Chinese character information through the dialect voice recognition model, and/or converting the Chinese character information into the dialect audio through the dialect voice synthesis model.

The invention also provides a dialect voice data segmentation and labeling system, which comprises:

module 1, configured to perform voice data segmentation and labeling on first dialect speech by using a Mandarin speech recognition model to obtain a Chinese character recognition result of the first dialect speech, and to construct a dialect vocabulary library based on the first dialect speech, its corresponding Chinese character text and the Chinese character recognition result;

In the dialect voice data segmentation and labeling system, module 1 comprises:

module 11, configured to recognize the first dialect speech by using the Mandarin speech recognition model to obtain a corresponding recognition result, and to obtain the phoneme sequence of the Chinese character text corresponding to the first dialect speech and the phoneme sequence of the recognition result by using correspondence tables from Chinese words and Chinese characters to phoneme sequences;

module 12, configured to automatically align, based on edit distance, the phoneme sequence of the Chinese character text of the first dialect speech with the phoneme sequence of the corresponding recognition result, wherein after alignment each phoneme in the phoneme sequence of the Chinese character text of the first dialect speech either corresponds to one phoneme in the phoneme sequence of the recognition result or corresponds to no phoneme;

module 13, configured to perform word segmentation on the Chinese character text of the first dialect speech to obtain the phoneme sequence of the Chinese character text corresponding to each word, and to obtain the phoneme sequence of the recognition result corresponding to each word according to the correspondence between the phoneme sequence of the Chinese character text and the phoneme sequence of the recognition result established in module 12; if the phoneme sequence of the Chinese character text corresponding to a word is inconsistent with the phoneme sequence of the recognition result corresponding to the word, the word, the phoneme sequence of the Chinese character text corresponding to the word and the phoneme sequence of the recognition result corresponding to the word are added to a candidate word list;

module 14, configured to retain, through de-duplication and merging, only one entry for each word in the candidate word list, the entry comprising the word's phoneme sequence in the Chinese character text and its phoneme sequences in all corresponding recognition results, and to have the candidate word list manually proofread and labeled to obtain the dialect vocabulary library.

In the dialect voice data segmentation and labeling system, module 2 comprises:

module 21, configured to, for the second dialect speech, first perform word segmentation on the corresponding Chinese character text and determine whether each segmented word exists in the dialect vocabulary library; if so, obtain the phoneme sequence of the word from the dialect vocabulary library; otherwise, obtain the phoneme sequence of the word by using the correspondence tables from Chinese words and Chinese characters to phoneme sequences; and concatenate the phoneme sequences of all words of the second dialect speech as the phoneme sequence corresponding to the second dialect speech;

module 22, configured to align the second dialect speech with the corresponding phoneme sequence based on the speech recognition acoustic model, and to obtain, after alignment, the start time and end time corresponding to each phoneme in the phoneme sequence, thereby realizing speech segmentation and labeling.

In the dialect voice data segmentation and labeling system, module 3 comprises:

The dialect voice data segmentation and labeling system further comprises a module 4, wherein the module 4 is used for training a dialect voice recognition model and/or a dialect voice synthesis model based on segmentation and labeling results output by the dialect voice acoustic model, converting dialect audio into Chinese character information through the dialect voice recognition model, and/or converting the Chinese character information into dialect audio through the dialect voice synthesis model.

The dialect voice data segmentation and labeling method provided by the invention addresses the large number of segmentation and labeling errors caused by the mismatch between a Mandarin acoustic model and dialect pronunciation. The method constructs a dialect vocabulary library through automatic extraction combined with a small amount of manual proofreading and labeling, and obtains the dialect phoneme sequence corresponding to the speech based on that library. This avoids the alignment deviations and errors caused by differing phonemes when a Mandarin speech recognition acoustic model is used, and improves the accuracy of voice data segmentation and labeling.

Drawings

FIG. 1 is a flow chart of dialect speech data segmentation and labeling according to the present invention.

Detailed Description

The dialect voice data segmentation and labeling method provided by the invention comprises the following steps:

1. Construct a dialect vocabulary library from the collected dialect voice data and its corresponding Chinese character text, together with the recognition results of a Mandarin speech recognition system on that dialect voice data, combined with manual proofreading and labeling.

Step 1.1, recognize each dialect voice data item using an existing Mandarin speech recognition system to obtain the corresponding recognition result. Note that because dialect speech is being recognized with a Mandarin speech recognition system, the recognition results inevitably contain errors, especially for Chinese characters whose dialect pronunciation differs from the Mandarin pronunciation.

Step 1.3, using an edit distance method, automatically align, for each dialect voice data item, the phoneme sequence of its Chinese character text with the phoneme sequence of its recognition result. After alignment, each phoneme in the phoneme sequence of the Chinese character text either corresponds to one phoneme in the phoneme sequence of the recognition result or corresponds to no phoneme.
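The edit-distance alignment of step 1.3 can be sketched as follows. This is a minimal illustration that assumes unit costs for substitution, insertion and deletion, which the source does not specify; the function name `align_phonemes` and the sample sequences are hypothetical.

```python
from typing import List, Optional, Tuple


def align_phonemes(ref: List[str], hyp: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Align text phonemes (ref) to recognized phonemes (hyp) by edit distance.

    Returns one (ref_phoneme, hyp_phoneme_or_None) pair per text phoneme.
    Recognized phonemes with no text counterpart (insertions) are dropped.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the pairing.
    pairs: List[Tuple[str, Optional[str]]] = []
    i, j = n, m
    while i > 0:
        if j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))        # text phoneme left unaligned
            i -= 1
        else:
            j -= 1                                  # extra recognized phoneme
    pairs.reverse()
    return pairs


# Hypothetical example: the recognizer heard "z" where the text has "zh".
aligned = align_phonemes(["zh", "ong", "g", "uo"], ["z", "ong", "g", "uo"])
```

Returning exactly one pair per text phoneme mirrors the property stated above: every text phoneme maps to one recognized phoneme or to none.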

Step 1.4, for each voice data item, perform word segmentation on its Chinese character text to obtain the phoneme sequence of the Chinese character text corresponding to each word, and obtain the phoneme sequence of the recognition result corresponding to each word from the correspondence, established in step 1.3, between the phoneme sequence of the Chinese character text and the phoneme sequence of the recognition result. If the phoneme sequence of the Chinese character text corresponding to a word is inconsistent with the phoneme sequence of the recognition result corresponding to that word, add the word, together with both phoneme sequences, to a candidate word list.
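One way to realize step 1.4 is sketched below, assuming the alignment is available as one (text phoneme, recognized phoneme or None) pair per text phoneme. The function name `extract_candidates`, the data layout and the sample pronunciations are illustrative assumptions, not taken from the source.

```python
from typing import List, Optional, Tuple

Aligned = List[Tuple[str, Optional[str]]]
Candidate = Tuple[str, Tuple[str, ...], Tuple[str, ...]]


def extract_candidates(words: List[Tuple[str, List[str]]], aligned: Aligned) -> List[Candidate]:
    """Return (word, text_phonemes, recognized_phonemes) for mismatching words.

    `words` lists each segmented word with its text phoneme sequence, in order.
    `aligned` holds one pair per text phoneme, in the same order as the
    concatenated word phoneme sequences.
    """
    candidates: List[Candidate] = []
    pos = 0
    for word, text_phones in words:
        span = aligned[pos:pos + len(text_phones)]
        pos += len(text_phones)
        # Collect the recognized phonemes that were aligned to this word.
        rec_phones = tuple(h for _, h in span if h is not None)
        if rec_phones != tuple(text_phones):
            candidates.append((word, tuple(text_phones), rec_phones))
    return candidates


# Hypothetical utterance of two words; only the second is recognized differently.
words = [("我", ["w", "o"]), ("吃饭", ["ch", "i", "f", "an"])]
aligned = [("w", "w"), ("o", "o"), ("ch", "c"), ("i", "i"), ("f", "f"), ("an", "an")]
candidates = extract_candidates(words, aligned)
```

Words whose recognized phonemes match the text phonemes are skipped, so the candidate list only contains words that may be pronounced differently in the dialect.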

Step 1.5, optionally, automatically filter the words in the candidate word list. The filtering may be based on how often a word occurs in the candidate word list and on how consistent its phoneme sequences are across recognition results. For example, a word is retained only if it appears in the candidate word list more often than a preset value, and is otherwise deleted from the list. As another example, a word is retained only if it appears more often than a preset value and the phoneme sequences of its recognition results are identical across all occurrences, and is otherwise deleted from the list.
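The two filtering rules of step 1.5 can be sketched as follows. The function name `filter_candidates`, the parameter names and the tuple layout (word, text phonemes, recognized phonemes) are assumptions for illustration, and the threshold value is arbitrary.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Candidate = Tuple[str, Tuple[str, ...], Tuple[str, ...]]


def filter_candidates(
    candidates: List[Candidate],
    threshold: int = 2,
    require_consistent: bool = False,
) -> List[Candidate]:
    """Keep occurrences of words that appear more than `threshold` times.

    If `require_consistent` is set, the recognized phoneme sequence must also
    be identical across all occurrences of the word.
    """
    by_word: Dict[str, List[Candidate]] = defaultdict(list)
    for cand in candidates:
        by_word[cand[0]].append(cand)

    kept: List[Candidate] = []
    for occurrences in by_word.values():
        if len(occurrences) <= threshold:
            continue  # too rare to trust
        recognized = {rec for _, _, rec in occurrences}
        if require_consistent and len(recognized) > 1:
            continue  # recognizer disagreed with itself on this word
        kept.extend(occurrences)
    return kept


# Hypothetical candidate list: "w1" occurs three times, "w2" once.
cands = [
    ("w1", ("a",), ("b",)),
    ("w1", ("a",), ("b",)),
    ("w1", ("a",), ("c",)),
    ("w2", ("x",), ("y",)),
]
frequent = filter_candidates(cands, threshold=2)
```

The stricter rule trades recall for precision: it discards words whose recognized pronunciations vary, leaving fewer but cleaner entries for the later manual proofreading.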

Step 1.6, perform de-duplication and merging on the candidate word list, that is, retain only one entry per word, with that entry listing the word's phoneme sequence in the Chinese character text and its phoneme sequences in all corresponding recognition results.
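A minimal sketch of the de-duplication and merging in step 1.6 follows. The dictionary layout and the function name `merge_candidates` are assumptions chosen for illustration; the source only requires one entry per word that lists the text phoneme sequence and all recognized phoneme sequences.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, Tuple[str, ...], Tuple[str, ...]]


def merge_candidates(candidates: List[Candidate]) -> Dict[str, dict]:
    """Collapse repeated words into one entry each.

    Each entry keeps the text phoneme sequence and every distinct recognized
    phoneme sequence, in first-seen order.
    """
    merged: Dict[str, dict] = {}
    for word, text_phones, rec_phones in candidates:
        entry = merged.setdefault(word, {"text": text_phones, "recognized": []})
        if rec_phones not in entry["recognized"]:
            entry["recognized"].append(rec_phones)
    return merged


# Hypothetical candidate list with a repeated word.
merged = merge_candidates([
    ("w1", ("a",), ("b",)),
    ("w1", ("a",), ("c",)),
    ("w1", ("a",), ("b",)),
    ("w2", ("x",), ("y",)),
])
```

Keeping all distinct recognized sequences in one entry gives the proofreader the full set of variants to choose from when labeling the dialect pronunciation.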

Step 1.7, use a manual proofreading module to proofread and label the candidate word list manually, obtaining the dialect vocabulary library. The manual proofreading module presents every entry and its related information to proofreaders familiar with the dialect, including each Chinese word, its phoneme sequence in the Chinese character text, and its phoneme sequences in all corresponding recognition results. It then provides an interactive interface through which the proofreaders proofread and label the entries. This mainly involves deciding whether to retain the entry, labeling the Mandarin phoneme sequence of the Chinese word with reference to its phoneme sequence in the Chinese character text, and labeling the dialect phoneme sequence of the Chinese word with reference to its phoneme sequences in all corresponding recognition results. The retained entries, each containing the Chinese word, its Mandarin phoneme sequence, its dialect phoneme sequence and related information, form the dialect vocabulary library.

Step 1.8, optionally, use a manual editing module to edit the dialect vocabulary library manually and add entries that do not appear in the candidate word list. The manual editing module provides an interface for the user to enter entry information, including the Chinese word, its Mandarin phoneme sequence, its dialect phoneme sequence and related information.

2. Obtain the dialect phoneme sequence corresponding to the voice data to be segmented and labeled, based on the constructed dialect vocabulary library and the Chinese character text corresponding to that voice data, and then obtain the speech segmentation and labeling result using a Mandarin-oriented speech recognition acoustic model and a phoneme alignment algorithm.

Step 2.1, obtain the dialect phoneme sequence corresponding to the voice data to be segmented and labeled, based on the constructed dialect vocabulary library and the Chinese character text corresponding to that voice data.

Step 2.2, based on the dialect phoneme sequence obtained for the voice data, obtain the speech segmentation and labeling result using a Mandarin-oriented speech recognition acoustic model and a phoneme alignment algorithm.
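Following the per-word lookup described in claim 3 (use the dialect vocabulary library when the word is present, otherwise fall back to the Mandarin tables), step 2.1 might be sketched as below. The function name, the two plain dictionaries and the sample pronunciations are hypothetical; word segmentation is assumed to have been done already.

```python
from typing import Dict, List


def dialect_phoneme_sequence(
    words: List[str],
    dialect_lexicon: Dict[str, List[str]],
    mandarin_lexicon: Dict[str, List[str]],
) -> List[str]:
    """Concatenate per-word phonemes, preferring the dialect pronunciation."""
    phones: List[str] = []
    for word in words:
        if word in dialect_lexicon:
            phones.extend(dialect_lexicon[word])   # dialect-specific pronunciation
        else:
            phones.extend(mandarin_lexicon[word])  # raises KeyError if unknown
    return phones


# Hypothetical lexicons: only "吃饭" has a dialect-specific entry.
DIALECT = {"吃饭": ["c", "i", "f", "an"]}
MANDARIN = {"我": ["w", "o"], "吃饭": ["ch", "i", "f", "an"]}

sequence = dialect_phoneme_sequence(["我", "吃饭"], DIALECT, MANDARIN)
```

Words absent from the dialect vocabulary library are assumed to be pronounced as in Mandarin, which is why the fallback table is consulted only after the dialect lookup fails.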

3. Provide a corresponding module and interactive interface for the user to proofread and modify the speech segmentation and labeling results obtained in step 2.

4. Train a speech recognition acoustic model using the manually proofread and modified segmentation and labeling information and the corresponding voice data to obtain an updated dialect speech acoustic model, and segment and label new dialect voice data using the updated dialect speech acoustic model.

Step 4.1, train a speech recognition acoustic model using the manually proofread and modified segmentation and labeling information and the corresponding voice data to obtain an updated dialect speech acoustic model.

Various training modes are possible. For example, the manually proofread and modified segmentation and labeling information and the corresponding voice data may be used directly for acoustic model training; they may be mixed with existing Mandarin data for acoustic model training; or they may be used for incremental training, such as model adaptation or fine-tuning, on top of an existing Mandarin-oriented acoustic model.

Step 4.2, segment and label new dialect voice data using the updated dialect speech acoustic model.

For new voice data that needs to be segmented and labeled, the dialect phoneme sequence corresponding to the voice data is first obtained based on the dialect vocabulary library and the Chinese character text corresponding to the voice data. The updated dialect speech acoustic model and an alignment algorithm are then used to force-align the speech signal with that dialect phoneme sequence. After alignment, the start time and end time of each phoneme in the phoneme sequence are available, which realizes the speech segmentation and labeling.
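The forced alignment described above can be illustrated with a simplified Viterbi-style dynamic program. This sketch assumes one state per phoneme, no transition probabilities, and a precomputed matrix of per-frame log-likelihoods; a real acoustic model (for example an HMM with several states per phoneme) is more elaborate. The function name `forced_align`, the 10 ms frame shift and the toy scores are all assumptions.

```python
from typing import List, Tuple


def forced_align(
    loglik: List[List[float]],
    phonemes: List[str],
    frame_shift: float = 0.01,
) -> List[Tuple[float, float, str]]:
    """Return (start_sec, end_sec, phoneme) for each phoneme, in order.

    loglik[t][k] is the log-likelihood of frame t under phonemes[k].
    Each phoneme must cover at least one frame, and order is preserved.
    """
    T, K = len(loglik), len(phonemes)
    if K == 0 or T < K:
        raise ValueError("need at least one phoneme and one frame per phoneme")
    neg_inf = float("-inf")
    score = [[neg_inf] * K for _ in range(T)]
    stay = [[True] * K for _ in range(T)]  # True: frame t-1 was in the same phoneme
    score[0][0] = loglik[0][0]
    for t in range(1, T):
        for k in range(K):
            same = score[t - 1][k]
            prev = score[t - 1][k - 1] if k > 0 else neg_inf
            best = max(same, prev)
            if best == neg_inf:
                continue  # phoneme k cannot be reached by frame t
            stay[t][k] = same >= prev
            score[t][k] = best + loglik[t][k]

    # Trace back from the last frame, which must lie in the last phoneme.
    starts, ends = [0] * K, [0] * K
    k = K - 1
    ends[k] = T - 1
    for t in range(T - 1, 0, -1):
        if not stay[t][k]:
            starts[k] = t       # phoneme k begins at frame t
            k -= 1
            ends[k] = t - 1     # previous phoneme ends one frame earlier
    return [
        (round(starts[i] * frame_shift, 6), round((ends[i] + 1) * frame_shift, 6), phonemes[i])
        for i in range(K)
    ]


# Toy scores: frames 0-1 favor the first phoneme, frames 2-3 the second.
toy = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
segments = forced_align(toy, ["a", "b"])
```

The output has the same shape as the label file discussed in the Background: a start time, an end time and a phoneme per row.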

In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

The invention provides a dialect voice data segmentation and labeling method that achieves accurate segmentation and labeling of dialect speech. An accurately labeled dialect corpus is important for training dialect speech recognition and speech synthesis models and for building dialect speech recognition and speech synthesis systems. Each dialect voice data item is accompanied by its corresponding Chinese character text. This text is obtained in one of two ways: first, the speaker reads aloud from a pre-selected Chinese character text, so the text exists before the speech; second, the voice data is transcribed manually to obtain the corresponding Chinese character text. In both cases the Chinese character text is assumed to be correct, that is, free of wrong characters.

The dialect voice data segmentation and labeling method flow provided by the invention is shown in fig. 1. The concrete steps are as follows:

Step 1.1: recognize each piece of dialect voice data using an existing Mandarin speech recognition system (model) to obtain the corresponding recognition result. Note that because dialect speech is recognized with a Mandarin speech recognition system, the recognition results inevitably contain errors, especially for Chinese characters whose dialect pronunciation differs from the Mandarin pronunciation.

Step 1.2: obtain the phoneme sequence of the Chinese character text and the phoneme sequence of the recognition result for each piece of dialect voice data, using a correspondence table from Chinese words to phoneme sequences and a correspondence table from Chinese characters to phoneme sequences. Working at the word level has the advantage of handling polyphonic characters better. For example, after "multiple persons participate" is segmented, the word "participate" can be processed directly to obtain the phoneme sequence "c an j ia", which avoids the ambiguity of whether its first character, taken alone, should be read "c an" or "sh en".

The Chinese character text and the recognition result of each piece of dialect voice data are first segmented into words with a word segmentation tool. For each word, the Chinese-word-to-phoneme-sequence table is used to obtain the phoneme sequence of the word. If the word is not in that table, it is decomposed into its Chinese characters, the phoneme sequence of each character is obtained from the Chinese-character-to-phoneme-sequence table, and these are concatenated to form the phoneme sequence of the word. The phoneme sequences of all words are then concatenated to give the phoneme sequence of the Chinese character text and the phoneme sequence of the recognition result for each piece of dialect voice data.
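The lookup with fallback described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table contents, the Chinese characters used to stand for "multiple persons participate", the function names, and the choice of "sh en" as the single-character default reading are all assumptions made for the example, and the input is taken to be already segmented into words.

```python
# Hypothetical word-level and character-level tables (illustrative entries only).
WORD_TABLE = {"参加": ["c", "an", "j", "ia"]}
CHAR_TABLE = {
    "多": ["d", "uo"],
    "人": ["r", "en"],
    "参": ["sh", "en"],  # assumed default reading of a polyphonic character
    "加": ["j", "ia"],
}


def word_to_phonemes(word, word_table, char_table):
    """Look the word up as a whole; fall back to character-by-character lookup."""
    if word in word_table:
        return list(word_table[word])
    phonemes = []
    for ch in word:
        # Raises KeyError for a character missing from the table.
        phonemes.extend(char_table[ch])
    return phonemes


def text_to_phonemes(words, word_table, char_table):
    """Concatenate the phoneme sequences of all words of a segmented text."""
    phonemes = []
    for word in words:
        phonemes.extend(word_to_phonemes(word, word_table, char_table))
    return phonemes


if __name__ == "__main__":
    print(" ".join(text_to_phonemes(["多人", "参加"], WORD_TABLE, CHAR_TABLE)))
```

With the word table present, the polyphonic first character of "participate" is resolved by the word entry; with an empty word table the same call would fall back to the character default, which is the ambiguity the word-level lookup is meant to avoid.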

The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to transform one symbol string into another. The allowed edit operations are replacing one character with another, inserting one character, and deleting one character. A dynamic programming algorithm may be used to calculate the edit distance between two symbol strings. From the edit distance computation, the correspondence between the symbols of the two strings can also be obtained. For example, for the symbol strings "abdef" and "bcdeg", the following correspondence may be established:

a→

b→b

→c

d→d

e→e

f→g

This means that transforming "abdef" into "bcdeg" requires one deletion (delete a), one insertion (insert c) and one replacement (replace f by g), so the edit distance between the two is 3. In the above correspondence, because of the insertion and the deletion, "a" in "abdef" and "c" in "bcdeg" have no corresponding symbol in the other string.

In the implementation of the present invention, each phoneme is treated as a "symbol" in the sense of the above example, in order to calculate the edit distance and obtain the correspondence between the phonemes of the two phoneme sequences.
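The dynamic programming computation and the recovery of the correspondence can be sketched as below. The function name `align` and the tie-breaking order in the backtrace (exact match first, then deletion, then insertion, then replacement) are assumptions made for this illustration; other tie-breaking orders give different but equally minimal correspondences.

```python
def align(src, tgt):
    """Return (edit distance, list of (src symbol or None, tgt symbol or None))."""
    m, n = len(src), len(tgt)
    # d[i][j] = edit distance between src[:i] and tgt[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match or replacement
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion

    # Backtrace from the bottom-right corner to recover one minimal alignment.
    pairs = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] and d[i][j] == d[i - 1][j - 1]:
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((src[i - 1], None))      # symbol deleted from src
            i -= 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            pairs.append((None, tgt[j - 1]))      # symbol inserted from tgt
            j -= 1
        else:
            pairs.append((src[i - 1], tgt[j - 1]))  # replacement
            i, j = i - 1, j - 1
    pairs.reverse()
    return d[m][n], pairs


if __name__ == "__main__":
    distance, pairs = align(list("abdef"), list("bcdeg"))
    print(distance)
    for a, b in pairs:
        print(f"{a or ''}→{b or ''}")
```

For phoneme sequences, the same function can be called on lists of phonemes, for example `align("j iao".split(), "j ue".split())`.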

Step 1.5 (optional): automatically filter the words in the candidate word list. The automatic filtering may be based on how often a word appears in the candidate word list and on how consistent its phoneme sequences are across the recognition results. For example, under a first rule, a word is retained only when the number of times it appears in the candidate word list is greater than a preset value; otherwise it is deleted from the candidate word list. Under a second rule, a word is retained only when the number of times it appears in the candidate word list is greater than a preset value and the phoneme sequences in the corresponding recognition results are consistent across those occurrences; otherwise it is deleted from the candidate word list.
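The two filtering rules can be sketched as follows. The entry layout (word, text phonemes, recognized phonemes), the function name, and the sample entries other than "foot" are assumptions made for this illustration; English glosses stand in for the Chinese words.

```python
from collections import defaultdict


def filter_candidates(entries, min_count=3, require_consistent=True):
    """Keep words seen more than min_count times; optionally require that
    every occurrence has the same recognized phoneme sequence.

    entries: iterable of (word, text_phonemes, recognized_phonemes).
    Returns {word: sorted list of distinct recognized phoneme sequences}.
    """
    recognized = defaultdict(list)
    for word, _text_ph, rec_ph in entries:
        recognized[word].append(tuple(rec_ph))

    kept = {}
    for word, seqs in recognized.items():
        if len(seqs) <= min_count:
            continue  # not frequent enough
        if require_consistent and len(set(seqs)) != 1:
            continue  # recognized pronunciations disagree
        kept[word] = sorted(set(seqs))  # repeated occurrences merged
    return kept


if __name__ == "__main__":
    entries = (
        [("foot", ["j", "iao"], ["j", "ue"])] * 15
        + [("ridge", ["j", "i", "l", "iang"], ["j", "i", "n", "iang"])] * 2
        + [("eat", ["ch", "i"], ["c", "i"])] * 2
        + [("eat", ["ch", "i"], ["q", "i"])] * 2
    )
    print(filter_candidates(entries))                            # second rule
    print(filter_candidates(entries, require_consistent=False))  # first rule
```

Returning the distinct recognized sequences per word also merges repeated occurrences of the same word into a single entry.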

Step 1.7: use a manual proofreading module to proofread and label the candidate word list manually, obtaining the dialect vocabulary library. The manual proofreading module presents all entries and their related information to proofreaders familiar with the dialect, including each Chinese word, the phoneme sequence of the word in the Chinese character text, and the phoneme sequences of the word in all corresponding recognition results. It then provides an interactive interface through which the proofreaders proofread and label the entries. This mainly comprises: deciding whether the entry is retained; labeling the Mandarin phoneme sequence of the Chinese word with reference to its phoneme sequence in the Chinese character text; and labeling the dialect phoneme sequence of the Chinese word with reference to its phoneme sequences in all corresponding recognition results. The retained entries, each including the Chinese word, its Mandarin phoneme sequence, its dialect phoneme sequence and other information, form the dialect vocabulary library.

Step 2: then, using a Mandarin-oriented speech recognition acoustic model and a phoneme alignment algorithm, align each piece of voice data with its corresponding dialect phoneme sequence to obtain the start time and end time of each phoneme, thereby obtaining the voice segmentation and labeling result. Note that the segmentation and labeling result differs from the phoneme sequence: the phoneme sequence contains only the phonemes, without start and end times.

The model in step 1 is used for speech recognition: together with a speech recognition algorithm, it gives a recognition result (Chinese character text) for input speech. The speech recognition acoustic model in step 2 is used for phoneme alignment: together with a phoneme alignment algorithm, it gives the start and end time of each phoneme for input speech and its phoneme sequence. In practice the same model can serve both purposes, with only the accompanying algorithm differing. The models can also be different, because for speech recognition an additional language model can be used to make the recognized Chinese characters more accurate, whereas alignment concerns only phonemes, not Chinese characters, and therefore needs no language model.

For each piece of voice data, the corresponding Chinese character text is first segmented into words. If a word exists in the dialect vocabulary library, its dialect phoneme sequence is taken from the library. If the word does not exist in the library, its phoneme sequence is obtained from the Chinese-word-to-phoneme-sequence and Chinese-character-to-phoneme-sequence tables, in the same way as in step 1.2. The dialect phoneme sequences or phoneme sequences of all words are then concatenated to form the dialect phoneme sequence of the utterance.
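A minimal sketch of this lookup, assuming the text has already been segmented into words. English glosses stand in for the Chinese words, and the lexicon contents, the segmentation of the sample sentence, and the function names are hypothetical; `mandarin_lookup` represents whatever step 1.2 procedure is available.

```python
# Hypothetical dialect vocabulary library: word -> dialect phoneme sequence.
DIALECT_LEXICON = {"foot": ["j", "ue"]}

# Hypothetical stand-in for the Mandarin tables of step 1.2.
MANDARIN_TABLE = {
    "foot": ["j", "iao"],
    "hurt": ["t", "eng"],
    "not": ["b", "u"],
}


def dialect_phoneme_sequence(words, dialect_lexicon, mandarin_lookup):
    """Prefer the dialect pronunciation; otherwise use the Mandarin lookup."""
    sequence = []
    for word in words:
        if word in dialect_lexicon:
            sequence.extend(dialect_lexicon[word])
        else:
            sequence.extend(mandarin_lookup(word))
    return sequence


if __name__ == "__main__":
    words = ["foot", "hurt", "not", "hurt"]  # assumed segmentation
    seq = dialect_phoneme_sequence(words, DIALECT_LEXICON, MANDARIN_TABLE.__getitem__)
    print(" ".join(seq))
```

Passing the fallback as a function keeps the dialect lookup independent of how the Mandarin phoneme sequences are produced.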

Step 2.2: align each piece of voice data with its dialect phoneme sequence using a Mandarin-oriented speech recognition acoustic model and a phoneme alignment algorithm to obtain the start time and end time of each phoneme, thereby obtaining the voice segmentation and labeling result.

The phoneme alignment algorithm uses the speech recognition acoustic model to force-align the voice signal with its phoneme sequence. The alignment gives the start time and end time of each phoneme in the sequence, which realizes the voice segmentation and labeling.

Various existing speech recognition acoustic models and alignment algorithms can be used. For example, the acoustic model may be based on GMM (Gaussian mixture model) + HMM (hidden Markov model), on DNN (deep neural network) + HMM, on an RNN (recurrent neural network), on a Transformer, and so on; the alignment algorithm may be based on Viterbi decoding or the like.

With this method, although a Mandarin speech recognition acoustic model is used, the phoneme sequence is the dialect phoneme sequence, so the phoneme sequence matches the actual pronunciation in the speech more closely during alignment and the alignment is more accurate. The errors caused by mismatched pronunciation when the Mandarin phoneme sequence is used are avoided.
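Viterbi-based forced alignment can be illustrated with a deliberately simplified sketch. It assumes one state per phoneme, no transition probabilities, and a precomputed table `scores[t][k]` holding the acoustic log-likelihood of frame `t` under the `k`-th phoneme of the sequence; in a real system these scores would come from the acoustic model. The function name, the 10 ms frame shift, and the toy scores are assumptions made for this example.

```python
def forced_align(scores, phonemes, frame_shift=0.01):
    """Return [(start_time, end_time, phoneme)] for the best monotonic path.

    scores[t][k]: log-likelihood of frame t under phonemes[k].
    Each frame is assigned to one phoneme; the path starts at the first
    phoneme, ends at the last, and never moves backwards or skips.
    """
    T, K = len(scores), len(phonemes)
    if T < K:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    best = [[neg_inf] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    best[0][0] = scores[0][0]
    for t in range(1, T):
        for k in range(K):
            stay = best[t - 1][k]
            move = best[t - 1][k - 1] if k > 0 else neg_inf
            if move > stay:
                best[t][k], back[t][k] = move + scores[t][k], k - 1
            else:
                best[t][k], back[t][k] = stay + scores[t][k], k

    # Trace back from the last phoneme at the last frame.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()

    # Convert the frame-level path into (start, end, phoneme) segments.
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[t - 1]:
            segments.append((round(start * frame_shift, 3),
                             round(t * frame_shift, 3),
                             phonemes[path[t - 1]]))
            start = t
    return segments


if __name__ == "__main__":
    # Toy scores: the first two frames favour "j", the last three favour "ue".
    toy = [[0, -5], [0, -5], [-5, 0], [-5, 0], [-5, 0]]
    for seg in forced_align(toy, ["j", "ue"]):
        print(*seg)
```

Production aligners additionally model several HMM states per phoneme, transition probabilities, and optional silence, none of which is shown here.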

Step 3: provide a corresponding module and interactive interface for the user to proofread and modify the voice segmentation and labeling results obtained in step 2.

The module and interactive interface allow the user to listen to the voice data, view its waveform or spectrogram, view the voice segmentation and labeling results obtained in step 2, and proofread and modify those results.

Step 4.2: segment and label new dialect voice data using the updated dialect voice acoustic model.

The dialect speech data segmentation and labeling method proposed by the present invention is further described below with reference to specific embodiments.

In one embodiment, voice data of the Linyi dialect of Shandong is automatically segmented and labeled. Before implementation, a large amount of dialect speech with text labels has been collected. Each sentence is stored as a wav file and is accompanied by a label file that stores the Chinese character text of that sentence (that wav file).

According to the method provided by the invention, the following steps are carried out:

Step 1: construct a dialect vocabulary library from the collected dialect voice data, its corresponding Chinese character texts, and the recognition results of a Mandarin speech recognition system on that data, combined with manual proofreading and labeling.

Step 1.1: recognize each piece of dialect voice data using an existing Mandarin speech recognition system to obtain the corresponding recognition result. In one embodiment, a speech recognition system running locally on a PC (personal computer) is used; in another embodiment, a cloud speech recognition service is used, with the wav file uploaded to the cloud speech recognition platform and the recognition result returned. The recognition result is the Chinese character string corresponding to the speech.

In one embodiment, a piece of speech data corresponds to the Chinese character text "my feet are very painful". In Linyi, Shandong, the pronunciation of "foot" is closer to "jue", which may cause a recognition error in a Mandarin speech recognition system. In one embodiment, the recognition result of this piece of speech data is "my jazz is very painful".

In the above-described embodiment, the text and the recognition result corresponding to this piece of speech data are, respectively:

Text corresponding to the voice: my feet are very painful

Recognition result corresponding to the voice: my jazz is very painful

A Chinese word segmentation tool is applied to each of them, giving:

the word segmentation result of the text corresponding to the voice is as follows: my feet are very painful

The word segmentation result of the recognition result corresponding to the voice is as follows: my jazz is very painful

Using the correspondence table from Chinese words to phoneme sequences and the correspondence table from Chinese characters to phoneme sequences, the phoneme sequences of the text and of the recognition result are obtained as follows:

phoneme sequence of text corresponding to speech: w o d e j iao h en t eng

Phoneme sequence of recognition result corresponding to the speech: w o d e j ue ie h en t eng

Step 1.3: for each piece of dialect voice data, automatically align the phoneme sequence of the corresponding text with the phoneme sequence of the corresponding recognition result using the edit distance method.

In the above embodiment, the phoneme sequence of the text is aligned with the phoneme sequence of the recognition result based on the edit distance; the edit distance between the two is calculated to be 3, and the correspondence between the phonemes is:

After alignment, each phoneme in the phoneme sequence of the text may correspond to a phoneme in the phoneme sequence of the recognition result, or may correspond to no phoneme at all, as the above example shows.

In the above embodiment, the Chinese character text of the speech data is "my feet are very painful". After word segmentation, each word, its phoneme sequence in the Chinese character text, and its phoneme sequence in the recognition result are obtained according to the above steps, as follows:

Word	Phoneme sequence of Chinese character text	Phoneme sequence of recognition result
I	w o	w o
(possessive particle)	d e	d e
Foot	j iao	j ue
Very	h en	h en
Pain	t eng	t eng

Note that for "foot" the phoneme sequence of the Chinese character text differs from the phoneme sequence of the recognition result, so the following entry (a word, its phoneme sequence in the Chinese character text, and its phoneme sequence in the recognition result) is added to the candidate word list.

Foot	j iao	j ue
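Collecting such entries from an aligned utterance can be sketched as follows. The input layout, the function names, and the rule that an inserted recognition phoneme is attached to the word of the preceding text phoneme are assumptions made for this illustration; English glosses stand in for the Chinese words.

```python
def recognized_per_word(words, pairs):
    """Distribute aligned recognition phonemes over the words of the text.

    words: [(word, [text phonemes])]
    pairs: alignment as [(text phoneme or None, recognized phoneme or None)]
    Returns one list of recognized phonemes per word.
    """
    # owner[p] = index of the word that the p-th text phoneme belongs to
    owner = []
    for idx, (_word, phonemes) in enumerate(words):
        owner.extend([idx] * len(phonemes))

    recognized = [[] for _ in words]
    consumed = 0  # number of text phonemes seen so far
    for text_ph, rec_ph in pairs:
        if text_ph is not None:
            word_idx = owner[consumed]
            consumed += 1
        else:
            # Insertion: attach to the word of the previous text phoneme (assumption).
            word_idx = owner[consumed - 1] if consumed > 0 else 0
        if rec_ph is not None:
            recognized[word_idx].append(rec_ph)
    return recognized


def candidate_entries(words, pairs):
    """Return (word, text phonemes, recognized phonemes) where the two differ."""
    entries = []
    for (word, text_ph), rec_ph in zip(words, recognized_per_word(words, pairs)):
        if text_ph != rec_ph:
            entries.append((word, text_ph, rec_ph))
    return entries


if __name__ == "__main__":
    words = [("I", ["w", "o"]), ("(possessive)", ["d", "e"]), ("foot", ["j", "iao"]),
             ("very", ["h", "en"]), ("pain", ["t", "eng"])]
    pairs = [("w", "w"), ("o", "o"), ("d", "d"), ("e", "e"), ("j", "j"),
             ("iao", "ue"), ("h", "h"), ("en", "en"), ("t", "t"), ("eng", "eng")]
    print(candidate_entries(words, pairs))
```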

In the above-described embodiment, the second rule described above is used: a word is retained only when the number of times it appears in the candidate word list is greater than a preset value and the phoneme sequences in the corresponding recognition results are consistent across those occurrences; otherwise the word is deleted from the candidate word list. Counting the occurrences of "foot" in the corpus gives 15, which is greater than the preset value (3). Checking these 15 occurrences shows that the phoneme sequence of the recognition result for the word is "j ue" in every case, so the entry is retained in the candidate word list.

In the embodiment described above, the 15 entries resulting from the 15 occurrences of "foot" are merged into one entry:

Foot	j iao	j ue

In the above-described embodiment, the entry is retained after manual proofreading:

Foot	j iao	j ue

In another embodiment, manual proofreading finds a phoneme sequence error caused by a speech recognition error, and thus the entry

Foot	j iao	zh uo

is modified into

Foot	j iao	j ue

In one embodiment, an entry is added manually:

Ridge	j i l iang	j i n iang

In one embodiment, the Chinese character text of the voice data to be segmented and labeled is "does the foot hurt or not". The text is first segmented into words with Chinese word segmentation software; the resulting words include "foot". Each word is then looked up in turn in the dialect vocabulary library. In this embodiment "foot" exists in the dialect vocabulary library, so its dialect phoneme sequence "j ue" is taken from the library. The other words do not exist in the dialect vocabulary library, so their phoneme sequences are obtained from the Chinese-word-to-phoneme-sequence and Chinese-character-to-phoneme-sequence tables. The dialect phoneme sequences or phoneme sequences of all words are concatenated to form the dialect phoneme sequence of the utterance, which in this embodiment is:

j ue t eng b u t eng

Step 2.2: based on the dialect phoneme sequence obtained for the voice data, obtain the voice segmentation and labeling result using a Mandarin-oriented speech recognition acoustic model and a phoneme alignment algorithm.

In one embodiment, the speech and the phonemes are force-aligned using a DNN + HMM acoustic model and an alignment algorithm based on Viterbi decoding, and the start time and end time of each phoneme are obtained as the voice segmentation and labeling result.

In one embodiment, the Chinese character text of a piece of voice data is "does the foot hurt or not", and its dialect phoneme sequence is "j ue t eng b u t eng". After alignment, the voice segmentation and labeling result is as follows (start time, end time, phoneme):

0.709 0.790 j

0.790 0.963 ue

1.055 1.110 t

1.110 1.316 eng

1.316 1.346 b

1.346 1.459 u

1.459 1.538 t

1.538 1.739 eng

Therefore, in the segmentation and labeling result, the dialect pronunciation is represented by the phonemes actually pronounced rather than by the Mandarin phonemes of the Chinese characters, which helps to train a more accurate dialect speech recognition model. At the same time, because a phoneme sequence closer to the true pronunciation is used, fewer alignment deviations and errors arise during alignment. Conversely, if "j iao", which differs markedly from the true pronunciation, were used to align the speech of the word "foot", the alignment could go wrong and produce an incorrect segmentation and labeling result.

Step 3: provide a corresponding module and interactive interface for the user to proofread and modify the voice segmentation and labeling results obtained in step 2.

In one embodiment, tool software running on a personal computer (PC) is provided. With it the user can listen to the voice data, view its waveform or spectrogram, view the voice segmentation and labeling results obtained in step 2, and proofread and modify those results, for example by changing the start time or end time of a phoneme.
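When boundaries are edited by hand, a simple consistency check on the result can be useful. The following is only a sketch of such a check and is not described in the source: the function name and the two conditions (each start before its end, no overlap with the previous phoneme) are assumptions, and gaps between phonemes are accepted on the assumption that they represent silence. The sample values are the alignment result listed above.

```python
SEGMENTS = [
    (0.709, 0.790, "j"), (0.790, 0.963, "ue"),
    (1.055, 1.110, "t"), (1.110, 1.316, "eng"),
    (1.316, 1.346, "b"), (1.346, 1.459, "u"),
    (1.459, 1.538, "t"), (1.538, 1.739, "eng"),
]


def check_segments(segments):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    previous_end = 0.0
    for start, end, phoneme in segments:
        if start >= end:
            problems.append(f"{phoneme}: start {start} is not before end {end}")
        if start < previous_end:
            problems.append(f"{phoneme}: start {start} overlaps previous end {previous_end}")
        previous_end = end
    return problems


if __name__ == "__main__":
    print(check_segments(SEGMENTS))
    # Simulate a manual edit that pushes the end of "ue" past the start of "t".
    edited = list(SEGMENTS)
    edited[1] = (0.790, 1.100, "ue")
    print(check_segments(edited))
```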

In one embodiment, the manually corrected segmentation and labeling information and its corresponding voice data are mixed with the original Mandarin data, and a DNN + HMM acoustic model is trained on the combined data to serve as the updated dialect voice acoustic model. In another embodiment, the manually corrected segmentation and labeling information and its corresponding voice data are used to adapt an existing Mandarin-oriented GMM + HMM acoustic model by maximum likelihood linear regression (MLLR), so that the resulting acoustic model better fits the characteristics of dialect speech. In yet another embodiment, the manually corrected segmentation and labeling information and its corresponding voice data are used to fine-tune an existing Transformer-based deep neural network acoustic model, so that the resulting acoustic model better fits the characteristics of dialect speech.

Step 4.2: segment and label new dialect voice data using the updated dialect voice acoustic model.

For new voice data to be segmented and labeled, the corresponding dialect phoneme sequence is first obtained from the dialect vocabulary library and the Chinese character text of that voice data. The updated dialect voice acoustic model and an alignment algorithm are then used to force-align the voice signal with its phoneme sequence. The alignment gives the start time and end time of each phoneme in the sequence, which realizes the voice segmentation and labeling.

The following are system embodiments corresponding to the above method embodiments, and they can be implemented in cooperation with the above embodiments. The technical details mentioned in the above embodiments remain valid here and are not repeated. Likewise, the technical details mentioned in the system embodiments can also be applied to the above embodiments.

The invention also provides a dialect voice data segmentation and labeling system, comprising:

The dialect voice data segmentation and labeling system, wherein module 1 comprises:

The dialect voice data segmentation and labeling system, wherein module 2 comprises:

The dialect voice data segmentation and labeling system, wherein module 3 comprises:
