Voice data labeling method and device

Document No.: 193305    Publication date: 2021-11-02

Reading note: This invention, "Voice data labeling method and device" (一种语音数据标注方法和装置), was designed and created by 李睿端 and 武卫东 on 2021-06-28. Its main content is as follows: The invention provides a voice data labeling method and device, relating to the field of natural language technology. The method and device obtain text data to be labeled and audio data to be labeled from voice information to be labeled; convert the text data to be labeled into pinyin sequence data; input the text data to be labeled into a prosody annotation model to obtain the output prosody identifiers of the text data; input the pinyin sequence data and the audio data to be labeled into a forced alignment model to obtain the output start-stop time identifiers of the pinyin sequence data; and merge the pinyin sequence data, the prosody identifiers, and the start-stop time identifiers of the pinyin sequence data to generate a voice identifier pinyin sequence. From the two aspects of prosody annotation and phoneme segmentation, the embodiment of the invention labels phoneme start and end times based on sequence prosody labeling and a forced alignment model, thereby automatically labeling the voice data.

1. A method for annotating voice data, the method comprising:

acquiring text data to be labeled and audio data to be labeled of voice information to be labeled;

converting the text data to be labeled into pinyin sequence data;

inputting the text data to be labeled into a prosody annotation model to obtain the output prosody identifiers of the text data to be labeled;

inputting the pinyin sequence data and the audio data to be labeled into a forced alignment model to obtain the output start-stop time identifiers of the pinyin sequence data;

and merging the pinyin sequence data, the prosody identifiers, and the start-stop time identifiers of the pinyin sequence data to generate a voice identifier pinyin sequence.

2. The method of claim 1, wherein the pinyin sequence data comprises pinyin phonemes, and the start-stop time identifiers comprise a start-stop time identifier for each pinyin phoneme in the pinyin sequence data.

3. The method of claim 1, wherein the prosodic annotation model is trained by the steps comprising:

acquiring text data to be trained of voice information to be trained;

performing prosodic information labeling on the text data to be trained to obtain training text data; the prosodic information comprises at least one of prosodic words, primary prosodic phrases, and secondary prosodic phrases;

and performing neural network model training on the training text data to generate the prosody annotation model.

4. The method of claim 1, wherein the forced alignment model is trained by the steps comprising:

acquiring text data to be trained and audio data to be trained of voice information to be trained;

segmenting the audio data to be trained to obtain audio frame data;

acquiring acoustic features of the audio frame data; the acoustic features comprise at least mel-frequency cepstral coefficients and unvoiced/voiced features;

and training a probability model by taking the acoustic features and the corresponding text data to be trained as training data to obtain a forced alignment model.

5. An apparatus for annotating voice data, the apparatus comprising:

an information acquisition module, configured to acquire text data to be labeled and audio data to be labeled of voice information to be labeled;

a pinyin sequence conversion module, configured to convert the text data to be labeled into pinyin sequence data;

a prosody identifier labeling module, configured to input the text data to be labeled into a prosody annotation model to obtain the output prosody identifiers of the text data to be labeled;

a time identifier module, configured to input the pinyin sequence data and the audio data to be labeled into a forced alignment model to obtain the output start-stop time identifiers of the pinyin sequence data;

and a merging module, configured to merge the pinyin sequence data, the prosody identifiers, and the start-stop time identifiers of the pinyin sequence data to generate a voice identifier pinyin sequence.

6. The apparatus of claim 5,

the pinyin sequence data comprises pinyin phonemes, and the start-stop time identifiers comprise a start-stop time identifier for each pinyin phoneme in the pinyin sequence data.

7. The apparatus of claim 5, further comprising:

a prosody annotation model training module, configured to acquire text data to be trained of the voice information to be trained; perform prosodic information labeling on the text data to be trained to obtain training text data, the prosodic information comprising at least one of prosodic words, primary prosodic phrases, and secondary prosodic phrases; and perform neural network model training on the training text data to generate the prosody annotation model.

8. The apparatus of claim 5, further comprising:

a forced alignment model training module, configured to acquire text data to be trained and audio data to be trained of the voice information to be trained; segment the audio data to be trained to obtain audio frame data; acquire acoustic features of the audio frame data, the acoustic features comprising at least mel-frequency cepstral coefficients and unvoiced/voiced features; and train a probability model by taking the acoustic features and the corresponding text data to be trained as training data, to obtain the forced alignment model.

9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the method of any one of claims 1 to 4 when executing a program stored in the memory.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-4.

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for labeling voice data, an electronic device, and a computer-readable medium.

Background

TTS (text-to-speech) technology performs the task of generating audio from text. Speech synthesis has gone through three major development stages: concatenative (splicing) synthesis, parametric synthesis, and end-to-end synthesis. Speech synthesis data typically include text together with its corresponding pinyin annotations, phoneme segmentation information, prosody annotation data, and the like. Taking end-to-end synthesis as an example, TTS produced with this technology largely avoids a mechanical feel, has high naturalness, and places lower demands on the amount of recorded data. Even for end-to-end models with lower data requirements, however, the input usually needs to be accompanied by prosodic information. Such data are often labeled manually, and manual labeling is time-consuming, labor-intensive, and highly subjective owing to the subjective intervention of the annotator.

Therefore, in order to accelerate the completion of TTS tasks and speed up the construction of sound libraries, the automatic labeling of training data is an urgent problem to be solved.

Disclosure of Invention

In view of the above, the present invention has been made to provide a method, an apparatus, an electronic device and a computer readable medium for annotating voice data that overcome or at least partially solve the above problems.

According to a first aspect of the present invention, there is provided a method for annotating voice data, the method comprising:

acquiring text data to be labeled and audio data to be labeled of voice information to be labeled;

converting the text data to be labeled into pinyin sequence data;

inputting the text data to be labeled into a prosody annotation model to obtain the output prosody identifiers of the text data to be labeled;

inputting the pinyin sequence data and the audio data to be labeled into a forced alignment model to obtain the output start-stop time identifiers of the pinyin sequence data;

and merging the pinyin sequence data, the prosody identifiers, and the start-stop time identifiers of the pinyin sequence data to generate a voice identifier pinyin sequence.

According to a second aspect of the present invention, there is provided a speech data annotation apparatus, comprising:

an information acquisition module, configured to acquire text data to be labeled and audio data to be labeled of voice information to be labeled;

a pinyin sequence conversion module, configured to convert the text data to be labeled into pinyin sequence data;

a prosody identifier labeling module, configured to input the text data to be labeled into a prosody annotation model to obtain the output prosody identifiers of the text data to be labeled;

a time identifier module, configured to input the pinyin sequence data and the audio data to be labeled into a forced alignment model to obtain the output start-stop time identifiers of the pinyin sequence data;

and a merging module, configured to merge the pinyin sequence data, the prosody identifiers, and the start-stop time identifiers of the pinyin sequence data to generate a voice identifier pinyin sequence.

In a third aspect implemented by the present invention, there is also provided a computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform any of the methods described above.

In a fourth aspect of the present invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the methods described above.

In the embodiment of the invention, text data to be labeled and audio data to be labeled of voice information to be labeled are obtained; the text data to be labeled are converted into pinyin sequence data; the text data to be labeled are input into a prosody annotation model to obtain the output prosody identifiers of the text data; the pinyin sequence data and the audio data to be labeled are input into a forced alignment model to obtain the output start-stop time identifiers of the pinyin sequence data; and the pinyin sequence data, the prosody identifiers, and the start-stop time identifiers of the pinyin sequence data are merged to generate a voice identifier pinyin sequence. The invention avoids the long time consumption and strong subjectivity of manually labeling a large amount of voice data. From the two angles of prosody annotation and phoneme segmentation, it obtains voice data with prosody annotation information by labeling phoneme start and end times based on sequence prosody labeling and a forced alignment model, thereby accelerating the construction of the sound library.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

fig. 1 is a flowchart illustrating steps of a method for annotating voice data according to an embodiment of the present invention;

FIG. 1A is a schematic diagram of a manual audio annotation process provided by an embodiment of the present invention;

FIG. 1B is a schematic diagram of an automatic speech data annotation process according to an embodiment of the present invention;

fig. 2 is a block diagram of a voice data annotation device according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

For a text-to-speech task, a sound library needs to be built in advance, and building the sound library requires the preparatory work of speech prosody labeling. Generally, speech synthesis mainly includes concatenative (splicing) synthesis, parametric synthesis, and end-to-end synthesis. As shown in fig. 1A, in the manual process the recording text and the recording segment are combined and the text is proofread to determine the accurate text; then prosody annotation is performed by listening to the audio, i.e., prosody symbols are annotated into the text. Common prosody labels, namely the primary prosodic phrase (#4) and the secondary prosodic phrase (#3), are marked. Next, the audio and the text must be combined to write out the phoneme sequence corresponding to the audio, because some voice talents pronounce certain words with tone changes, which must be marked accurately. Finally, the prosody information is transcribed into the phoneme sequence to obtain the input of the TTS (text-to-speech) model.

In contrast to this traditional manual prosody labeling approach, the embodiment of the invention automates the prosody labeling process.

Fig. 1 is a flowchart illustrating steps of a method for annotating voice data according to an embodiment of the present invention, as shown in fig. 1, the method may include:

Step 101, acquiring text data to be labeled and audio data to be labeled of voice information to be labeled;

in the embodiment of the present invention, an end-to-end synthesis example is used, and a prosody identifier is attached to text information when synthesizing speech.

In practical application, the text whose prosody is to be identified and the corresponding audio data first need to be extracted from the voice information to be labeled; the voice information can be understood as consisting of two parts, namely text information and the corresponding audio data. For example, from the voice information to be labeled, the text information "the match still adopts the win-by-majority format" and the corresponding audio data are extracted.

Step 102, converting the text data to be labeled into pinyin sequence data;

in the embodiment of the present invention, as shown in fig. 1B, the obtained text data to be annotated is converted into pinyin sequences, for example, if the text data to be annotated is "the game still uses the multi-win and multi-win system", the corresponding pinyin sequence is "bi 3 sai4 reg 2 cai3 yong4 yi3 duo1 sheng4 shao3 zhi 4", wherein each pinyin of the chinese characters is the phoneme data in the pinyin sequence data, and the numbers in the pinyin sequence are marked as pinyin tones.

Step 103, inputting the text data to be labeled into a prosody annotation model to obtain the output prosody identifiers of the text data to be labeled;

In the embodiment of the invention, the obtained text data to be labeled, e.g. "the match still adopts the win-by-majority format", are input into the trained prosody annotation model, which automatically labels each character (position) in the text sequence with prosody information and outputs a text sequence with prosody identifiers.

Specifically, continuing with the same example, the sentence is input into the trained prosody annotation model and the output result is "0100300004", where the prosody labels express pauses between words, pitch information, and the like: 0 marks character-level information (no boundary), 1 marks word-level information (a prosodic word, #1, boundary), 3 marks a #3 boundary, and 4 marks a #4 boundary.

For example, the "1" in the output marks the end of a word whose prosody label is #1, meaning that when the sentence is pronounced, that word is produced as a single prosodic word; this conforms to prosodic habit and makes the pronunciation more natural and fluent. A small sketch of turning such a digit sequence back into in-line prosody marks follows.
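A hypothetical post-processing sketch (not taken from the patent) showing how a per-character digit sequence such as "0100300004" could be turned back into in-line #-marks; placeholder Latin characters stand in for the Chinese characters of the example sentence:

```python
# Hypothetical helper: insert #-marks after the characters whose label is
# non-zero (0 = no boundary, 1 = #1, 3 = #3, 4 = #4).
def insert_prosody_marks(chars: str, labels: str) -> str:
    assert len(chars) == len(labels)
    out = []
    for ch, lab in zip(chars, labels):
        out.append(ch)
        if lab != "0":
            out.append(f"#{lab}")
    return "".join(out)

# ten placeholder characters with the label sequence from the example above
print(insert_prosody_marks("ABCDEFGHIJ", "0100300004"))   # -> AB#1CDE#3FGHIJ#4
```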

Step 104, inputting the pinyin sequence data and the audio data to be labeled into a forced alignment model to obtain the output start-stop time identifiers of the pinyin sequence data;

Preferably, the pinyin sequence data contains pinyin phonemes, and the start-stop time identifiers include a start-stop time identifier for each pinyin phoneme in the pinyin sequence data.

In the embodiment of the present invention, the pinyin sequence data obtained in step 102 and the audio data to be labeled obtained in step 101 are input together into the forced alignment model, and the model labels the start-stop time, i.e., the start-stop time identifier, of each pinyin phoneme in the pinyin sequence corresponding to the text data.

Specifically, as shown in FIG. 1B, the forced alignment model determines the start and stop position of each phoneme for a given audio and text. Most commonly, Viterbi decoding is used: the audio is cut into many frames, each typically 10 ms long, on the assumption that within such a short time the various features of the audio remain stable. Feature extraction is performed within each frame, and similarity is computed between the main features, including MFCC (mel-frequency cepstral coefficients) and unvoiced/voiced features, and the features of the standard phoneme models. The similarity between the t-th sample (frame) and the i-th phoneme model is denoted b_i(o_t). δ_t(i) denotes the maximum probability that the current audio has reached phoneme i at sample t, and the result δ_{t+1}(i) at time t+1 can be derived from the values at time t using the recursion below. In this process, t increases from 0 until the audio ends, finally yielding δ_N(i) for each phoneme i.
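The recursion itself is not written out in the patent text. Read as a standard hidden Markov model Viterbi pass, it takes the following form, where a_{ji} is the transition probability from phoneme j to phoneme i and π_i is the initial-state probability (both symbols are supplied here as assumptions; they are not defined in the patent):

```latex
\delta_1(i) = \pi_i \, b_i(o_1), \qquad
\delta_{t+1}(i) = \max_j \bigl[\delta_t(j)\, a_{ji}\bigr]\, b_i(o_{t+1})
```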

It can be understood that, since the pronunciation characteristics of each speaker are different, the forced alignment model may be trained with a small portion of data, i.e., a model is trained according to the pronunciation (phoneme) characteristics of the speaker, and then the trained model is used to predict a large amount of data.

Step 105, merging the pinyin sequence data, the prosody identifiers, and the start-stop time identifiers of the pinyin sequence data to generate a voice identifier pinyin sequence.

In the embodiment of the present invention, if a phoneme corresponds to a silence segment (silence marks sp, sil) and this coincides with a pause position predicted by the prosody model, the result is retained in the final sequence. In this way, a phoneme sequence with prosodic information is obtained.

As shown in fig. 1B, the pinyin sequence data corresponding to the text data, the pinyin sequence data with the prosody identifiers output by the prosody annotation model, and the pinyin sequence data with the start-stop time identifiers output by the forced alignment model are merged to obtain pinyin sequence data carrying both the prosody identifiers and the start-stop time identifiers, i.e., the voice identifier pinyin sequence.
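A minimal merging sketch under assumed data layouts (per-syllable alignment tuples, a dictionary of prosody marks keyed by syllable index, and made-up times); it also applies the rule above of keeping a silence segment only when the prosody model predicted a pause there:

```python
# Illustrative merge of alignment output and prosody marks; the data layout
# and the example times are assumptions for the sketch, not the patent's format.
def merge(aligned, prosody_after):
    """aligned: [(pinyin_or_silence, start, end)]; prosody_after: {syllable_index: '#N'}."""
    merged, idx = [], 0
    for unit, start, end in aligned:
        if unit in ("sp", "sil"):
            # keep silence only if the prosody model predicted a pause after the
            # previous syllable
            if (idx - 1) in prosody_after:
                merged.append((unit, start, end))
            continue
        merged.append((unit, start, end))
        if idx in prosody_after:
            merged.append((prosody_after[idx], end, end))
        idx += 1
    return merged

aligned = [("bi3", 0.00, 0.18), ("sai4", 0.18, 0.42), ("sp", 0.42, 0.55), ("reng2", 0.55, 0.80)]
print(merge(aligned, {1: "#1"}))
```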

In practical application, prosody prediction can be performed with a prosody annotation model based on a recurrent neural network (RNN), and a Kaldi-based automatic labeling tool working in the time dimension on the audio file and text can be used, so as to generate voice data with prosody labels.

Kaldi is a commonly used toolkit in speech recognition that can extract and model the acoustic features of individual frames. The text and the corresponding audio are fed into the Kaldi tool, and the alignment model produced in this process can be extracted and used as the automatic labeling tool.

It is to be understood that the prosody labeling model is not limited to the RNN model in practical use, and the forced alignment model is not limited to the kaldi automatic labeling tool, depending on the specific application, and the embodiment of the present invention is not particularly limited thereto.

Specifically, a prosody labeling model is constructed in the following way:

S11, acquiring text data to be trained of the voice information to be trained;

In the embodiment of the invention, the prosody annotation model to be trained predicts, for an input text sequence, an output label sequence of equal length.

The training samples used for training the prosody annotation model are texts manually labeled with prosodic information: an annotator listens to a section of audio and labels prosodic information on its text according to the pauses in the audio. Depending on the pronunciation scene (the speaker or the text background), text data to be trained are acquired for a specific pronunciation scene, or for the same pronunciation scene as the voice information to be labeled. For example, in a navigation map, after the user selects the speaker "Li Er", the terminal device determines from this user operation that the pronunciation scene has "Li Er" as the speaker; that is, the prosody prediction model performs prosody prediction on text according to the prosodic habits of "Li Er", and the target scene is the pronunciation scene in which "Li Er" serves as the speaker in the navigation system.

After the voice data of the specific pronunciation scene is determined as described above, the text data corresponding to the voice data is extracted as the text data to be trained.

S12, performing prosodic information labeling on the text data to be trained to obtain training text data; the prosodic information comprises at least one of prosodic words, primary prosodic phrases, and secondary prosodic phrases;

specifically, the obtained text data to be trained is marked with prosodic information, where the prosodic information at least includes one of prosodic words, primary prosodic phrases and secondary prosodic phrases, where the prosodic words are generally phrases or phrases commonly pronounced as a whole in chinese, for example, the phrase "go movie", and if the phrase is classified as "go", "see", "movie", and "do go movie" according to common syntax, the phrase "go movie" in spoken pronunciation can be regarded as a whole. Is it prosody labeled "go to watch #2 movie # 2? #4 ", where #1 is a prosodic word #2 and #3 are both secondary prosodic phrases, but in different ranks, #4 is a primary prosodic phrase.

S13, performing neural network model training on the training text data to generate the prosody annotation model.

Specifically, the text data labeled with prosodic information are used as training data to train a neural network model; the neural network model that has converged after training serves as the prosody annotation model and is used to perform prosody labeling on other sample data. A minimal training sketch is given below.
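A minimal sequence-labeling sketch, assuming a bidirectional RNN (LSTM) character tagger of the kind the detailed description mentions; the vocabulary size, dimensions, five-class label set, and fake training batch are illustrative assumptions, not values from the patent:

```python
# Minimal BiLSTM prosody-tagger sketch in PyTorch (illustrative assumptions only).
import torch
import torch.nn as nn

class ProsodyTagger(nn.Module):
    def __init__(self, vocab_size, num_labels=5, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, char_ids):                   # (batch, seq_len)
        h, _ = self.rnn(self.emb(char_ids))        # (batch, seq_len, 2*hidden)
        return self.out(h)                         # per-character label logits

model = ProsodyTagger(vocab_size=6000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# one illustrative step on fake data: 8 sentences of 10 characters each
chars = torch.randint(1, 6000, (8, 10))
labels = torch.randint(0, 5, (8, 10))              # per-character prosody labels
loss = loss_fn(model(chars).view(-1, 5), labels.view(-1))
loss.backward()
opt.step()
```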

Specifically, the forced alignment model is constructed in the following manner:

s21, acquiring text data to be trained and audio data to be trained of the voice information to be trained;

specifically, for training the forced alignment model, firstly, extracting to-be-trained voice information of a preset scene, and splitting the voice information into text data and audio data which are respectively used as the to-be-trained text data and the to-be-trained audio data. The audio data may be obtained by reading the third text sample by the target speaker, or obtaining the audio sample from the sound source library of the specific scene, for example, obtaining the common voice from the sound source library of the promotion scene, for example: the content is a voice such as "recommend a few products for you".

S22, segmenting the audio data to be trained to obtain audio frame data;

specifically, the audio data to be trained is segmented into audio frame data, one frame is usually 10ms, and we generally consider that in this short time, various features of the audio remain stable, for example, segmenting "hello" in the text into the 5 th to 10 th frames of the corresponding audio of 'i' in the "n i h ao" sequence.

S23, acquiring acoustic features of the audio frame data; the acoustic features comprise at least mel-frequency cepstral coefficients and unvoiced/voiced features;

and S24, taking the acoustic features and the corresponding text data to be trained as training data, training a probability model, and obtaining a forced alignment model.

Specifically, the acoustic features of the audio frame data, generally MFCC and unvoiced/voiced features, are obtained; the audio data carrying these features and the corresponding text data to be trained are used as training data to train a probability model, and the converged probability model serves as the forced alignment model for predicting the start-stop time identifiers of phonemes in sample data. A feature-extraction sketch is given below.
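A feature-extraction sketch for the MFCC and voiced/unvoiced features named above, using librosa as one possible open-source toolkit (the patent does not name one); the file path, sample rate, and pitch range are placeholder assumptions:

```python
# MFCC plus a per-frame voiced/unvoiced flag with librosa (assumed toolkit).
import librosa
import numpy as np

wave, sr = librosa.load("sample.wav", sr=16000)     # placeholder file path
hop = int(0.010 * sr)                               # 10 ms hop, as in the text

mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=13, hop_length=hop)

# voiced/unvoiced decision per frame via pYIN pitch tracking
f0, voiced_flag, voiced_prob = librosa.pyin(wave, fmin=65.0, fmax=500.0,
                                            sr=sr, hop_length=hop)

n = min(mfcc.shape[1], voiced_flag.shape[0])        # guard against off-by-one frame counts
features = np.vstack([mfcc[:, :n], voiced_flag[np.newaxis, :n].astype(float)])
print(features.shape)                               # (14, n_frames)
```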

In the probability model training, the similarity between the phoneme data in the sample data and the standard phoneme models is mainly calculated: δ_t(i) denotes the maximum probability that the current audio has reached phoneme i at sample t, and the result δ_{t+1}(i) at time t+1 can be computed from the values at time t using the recursion given above. In this process, t increases from 0 until the audio ends, finally yielding δ_N(i) for each phoneme i. Because the pronunciation characteristics of each speaker are different, the forced alignment tool can be trained with a small portion of data, i.e., trained on the speaker's pronunciation (phoneme) characteristics, and then used to predict a large amount of data.
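A toy forced-alignment pass illustrating the recursion above on a left-to-right model in which each state is one phoneme and a frame may either stay in the current phoneme or advance to the next; the uniform (zero-cost) transitions, the random scores, and the frame count are assumptions made to keep the sketch short:

```python
# Toy left-to-right Viterbi alignment: log_b[t, i] stands for log b_i(o_t).
import numpy as np

def force_align(log_b: np.ndarray):
    """log_b: (T frames, N phonemes); returns {phoneme index: [first frame, last frame]}."""
    T, N = log_b.shape
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    delta[0, 0] = log_b[0, 0]                       # alignment must start in the first phoneme
    for t in range(1, T):
        for i in range(N):
            stay = delta[t - 1, i]
            advance = delta[t - 1, i - 1] if i > 0 else -np.inf
            back[t, i] = i if stay >= advance else i - 1
            delta[t, i] = max(stay, advance) + log_b[t, i]
    # backtrace from the last phoneme at the last frame
    path, i = [N - 1], N - 1
    for t in range(T - 1, 0, -1):
        i = back[t, i]
        path.append(i)
    path.reverse()
    spans = {}
    for t, i in enumerate(path):
        spans.setdefault(i, [t, t])[1] = t
    return spans

print(force_align(np.log(np.random.rand(20, 3))))   # 20 frames, 3 phonemes
```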

If a phoneme corresponds to a silence segment (sp, sil) and this coincides with the pause position predicted by the prosody model, the result is retained in the final sequence.

In practical applications, the forced alignment tool may also use other HMM-based forced alignment models. Of course, for different sample data and application scenarios the training tool differs and is not limited to HMM-based forced alignment, and the embodiment of the present invention does not particularly limit this.

In summary, in the embodiment of the present invention, text data to be labeled and audio data to be labeled of voice information to be labeled are obtained; the text data to be labeled are converted into pinyin sequence data; the text data to be labeled are input into a prosody annotation model to obtain the output prosody identifiers of the text data; the pinyin sequence data and the audio data to be labeled are input into a forced alignment model to obtain the output start-stop time identifiers of the pinyin sequence data; and the pinyin sequence data, the prosody identifiers, and the start-stop time identifiers of the pinyin sequence data are merged to generate a voice identifier pinyin sequence. From the two aspects of prosody annotation and phoneme segmentation, the embodiment of the invention labels phoneme start and end times based on sequence prosody labeling and a forced alignment model, thereby automatically labeling the voice data.

Fig. 2 is a block diagram of a voice data annotation device according to an embodiment of the present invention, and as shown in fig. 2, the device 200 may include:

the information acquisition module 201 is configured to acquire text data to be labeled and audio data to be labeled of voice information to be labeled;

a pinyin sequence conversion module 202, configured to convert the text data to be annotated into pinyin sequence data;

the prosody identifier labeling module 203 is configured to input the text data to be labeled into a prosody labeling model, and obtain a prosody identifier of the output text data to be labeled;

a time identifier module 204, configured to input the pinyin sequence data and the audio data to be labeled into a forced alignment model, and obtain start-stop time identifiers of the output pinyin sequence data;

a merging module 205, configured to merge the pinyin sequence data, the prosody identifier, and the start-stop time identifier of the pinyin sequence data to generate a voice identifier pinyin sequence.

Optionally, the pinyin sequence data contains pinyin phonemes, and the start-stop time identifiers include a start-stop time identifier for each pinyin phoneme in the pinyin sequence data.

Optionally, the method further comprises:

a prosody annotation model training module, configured to acquire text data to be trained of the voice information to be trained; perform prosodic information labeling on the text data to be trained to obtain training text data, the prosodic information comprising at least one of prosodic words, primary prosodic phrases, and secondary prosodic phrases; and perform neural network model training on the training text data to generate the prosody annotation model.

Optionally, the method further comprises:

a forced alignment model training module, configured to acquire text data to be trained and audio data to be trained of the voice information to be trained; segment the audio data to be trained to obtain audio frame data; acquire acoustic features of the audio frame data, the acoustic features comprising at least mel-frequency cepstral coefficients and unvoiced/voiced features; and train a probability model by taking the acoustic features and the corresponding text data to be trained as training data, to obtain the forced alignment model.

In summary, in the embodiment of the present invention, text data to be labeled and audio data to be labeled of voice information to be labeled are obtained; the text data to be labeled are converted into pinyin sequence data; the text data to be labeled are input into a prosody annotation model to obtain the output prosody identifiers of the text data; the pinyin sequence data and the audio data to be labeled are input into a forced alignment model to obtain the output start-stop time identifiers of the pinyin sequence data; and the pinyin sequence data, the prosody identifiers, and the start-stop time identifiers of the pinyin sequence data are merged to generate a voice identifier pinyin sequence. From the two aspects of prosody annotation and phoneme segmentation, the embodiment of the invention labels phoneme start and end times based on sequence prosody labeling and a forced alignment model, thereby automatically labeling the voice data.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

In another embodiment of the present invention, there is also provided a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to execute the voice data annotation method described in any one of the above embodiments.

In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method for annotating speech data as described in any of the above embodiments.

As is readily imaginable to the person skilled in the art: any combination of the above embodiments is possible, and thus any combination between the above embodiments is an embodiment of the present invention, but the present disclosure is not necessarily detailed herein for reasons of space.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
