Voice conversion method, device, equipment and storage medium


Note: This technology, "Voice conversion method, device, equipment and storage medium" (一种语音转换方法、装置、设备及存储介质), was designed and created by Huang Shuo and Sun Minggang on 2021-08-30. Its main content is as follows: the invention discloses a voice conversion method, apparatus and device, the method comprising: determining the current audio and video needing voice conversion as the audio and video to be converted, and extracting the voice contained in the audio and video to be converted as the voice to be converted; slicing the voice to be converted to obtain a plurality of corresponding voice slices, and respectively inputting the obtained voice slices into a voice conversion model to obtain the text data converted from each voice slice by the model; and integrating the text data corresponding to the voice slices to obtain a corresponding text file. The application can thus fully automatically convert the voice in audio and video into corresponding text data, effectively improving voice conversion efficiency.

1. A method of speech conversion, comprising:

determining the current audio and video needing voice conversion as the audio and video to be converted, and extracting the voice contained in the audio and video to be converted as the voice to be converted;

slicing the voice to be converted to obtain a plurality of corresponding voice slices, and respectively inputting the obtained voice slices into a voice conversion model to obtain text data obtained by converting each voice slice by the voice conversion model;

and integrating the text data corresponding to the voice slices to obtain a corresponding text file.

2. The method of claim 1, wherein before slicing the voice to be converted to obtain a plurality of corresponding voice slices, the method further comprises:

performing noise reduction processing on the voice to be converted by using a spectral subtraction noise reduction method or a Fourier-based noise reduction method.

3. The method according to claim 2, wherein extracting the voice contained in the audio and video to be converted as the voice to be converted comprises:

extracting the voice contained in the audio and video to be converted as the voice to be converted by using the librosa library or the scipy.io.wavfile method of the scipy library.

4. The method of claim 3, wherein slicing the voice to be converted to obtain a plurality of corresponding voice slices comprises:

slicing the voice to be converted according to time length or occupied space size to obtain a plurality of corresponding voice slices.

5. The method of claim 4, wherein respectively inputting the obtained plurality of voice slices into the voice conversion model comprises:

respectively inputting the obtained plurality of voice slices into a local deep learning model or a remote transfer learning model.

6. The method of claim 5, wherein integrating the text data corresponding to each of the voice slices to obtain a corresponding text file comprises:

processing the text data corresponding to each voice slice into text data in a uniform format, and splicing the processed text data in the uniform format according to the positions of the corresponding voice slices in the voice to be converted to obtain a corresponding text file.

7. The method of claim 6, further comprising:

acquiring a test data set containing multiple segments of test audio and video and corresponding text files;

taking each segment of the test audio and video as the audio and video to be converted and performing the step of extracting the voice to be converted, until a text file obtained by voice conversion of each segment of the test audio and video is obtained;

obtaining the accuracy of voice conversion by comparing the text file obtained by converting each segment of the test audio and video with the corresponding text file in the test data set;

if the accuracy does not reach an accuracy threshold, modifying the currently used noise reduction method and/or voice extraction method and/or slicing basis and/or voice conversion model, and returning to the step of taking each segment of the test audio and video as the audio and video to be converted, until the accuracy reaches the accuracy threshold; wherein the slicing basis comprises slicing by time length and slicing by occupied space size.

8. A speech conversion apparatus, comprising:

an extraction module to: determine the current audio and video needing voice conversion as the audio and video to be converted, and extract the voice contained in the audio and video to be converted as the voice to be converted;

a conversion module to: slice the voice to be converted to obtain a plurality of corresponding voice slices, and respectively input the obtained voice slices into a voice conversion model to obtain the text data converted from each voice slice by the voice conversion model;

an integration module to: integrate the text data corresponding to each of the voice slices to obtain a corresponding text file.

9. A speech conversion apparatus, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the speech conversion method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech conversion method according to any one of claims 1 to 7.

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for voice conversion.

Background

In the era of rapid Internet development, communication among enterprises has become increasingly close, and activities such as purchasing, communication, learning, meetings, training and service handover generate a large amount of material, among which audio and video materials, as the most intuitive and efficient kind, are widely adopted. At present, staff usually convert the voice in audio and video materials into corresponding text data by hand before the text can be put to use, but the conversion efficiency of this approach is low.

Disclosure of Invention

The invention aims to provide a voice conversion method, apparatus, device and storage medium that can fully automatically convert the voice in audio and video into corresponding text data and effectively improve voice conversion efficiency.

In order to achieve the above purpose, the invention provides the following technical solutions:

a method of speech conversion, comprising:

determining the current audio and video needing voice conversion as the audio and video to be converted, and extracting the voice contained in the audio and video to be converted as the voice to be converted;

slicing the voice to be converted to obtain a plurality of corresponding voice slices, and respectively inputting the obtained voice slices into a voice conversion model to obtain text data obtained by converting each voice slice by the voice conversion model;

and integrating the text data corresponding to the voice slices to obtain a corresponding text file.

Preferably, before slicing the voice to be converted to obtain a plurality of corresponding voice slices, the method further includes:

performing noise reduction processing on the voice to be converted by using a spectral subtraction noise reduction method or a Fourier-based noise reduction method.

Preferably, extracting the voice contained in the audio and video to be converted as the voice to be converted includes:

extracting the voice contained in the audio and video to be converted as the voice to be converted by using the librosa library or the scipy.io.wavfile method of the scipy library.

Preferably, slicing the voice to be converted to obtain a plurality of corresponding voice slices includes:

slicing the voice to be converted according to time length or occupied space size to obtain a plurality of corresponding voice slices.

Preferably, respectively inputting the obtained plurality of voice slices into the voice conversion model includes:

respectively inputting the obtained plurality of voice slices into a local deep learning model or a remote transfer learning model.

Preferably, integrating the text data corresponding to each of the voice slices to obtain a corresponding text file includes:

processing the text data corresponding to each voice slice into text data in a uniform format, and splicing the processed text data in the uniform format according to the positions of the corresponding voice slices in the voice to be converted to obtain a corresponding text file.

Preferably, the method further comprises the following steps:

acquiring a test data set containing multiple segments of test audio and video and corresponding text files;

taking each segment of the test audio and video as the audio and video to be converted and performing the step of extracting the voice to be converted, until a text file obtained by voice conversion of each segment of the test audio and video is obtained;

obtaining the accuracy of voice conversion by comparing the text file obtained by converting each segment of the test audio and video with the corresponding text file in the test data set;

if the accuracy does not reach an accuracy threshold, modifying the currently used noise reduction method and/or voice extraction method and/or slicing basis and/or voice conversion model, and returning to the step of taking each segment of the test audio and video as the audio and video to be converted, until the accuracy reaches the accuracy threshold; wherein the slicing basis comprises slicing by time length and slicing by occupied space size.

A speech conversion apparatus comprising:

an extraction module to: determine the current audio and video needing voice conversion as the audio and video to be converted, and extract the voice contained in the audio and video to be converted as the voice to be converted;

a conversion module to: slice the voice to be converted to obtain a plurality of corresponding voice slices, and respectively input the obtained voice slices into a voice conversion model to obtain the text data converted from each voice slice by the voice conversion model;

an integration module to: integrate the text data corresponding to each of the voice slices to obtain a corresponding text file.

A speech conversion device comprising:

a memory for storing a computer program;

a processor for implementing the steps of the speech conversion method as described in any one of the above when executing the computer program.

A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the speech conversion method as set forth in any one of the preceding claims.

The invention provides a voice conversion method, apparatus and device, wherein the method includes: determining the current audio and video needing voice conversion as the audio and video to be converted, and extracting the voice contained in the audio and video to be converted as the voice to be converted; slicing the voice to be converted to obtain a plurality of corresponding voice slices, and respectively inputting the obtained voice slices into a voice conversion model to obtain the text data converted from each voice slice by the model; and integrating the text data corresponding to the voice slices to obtain a corresponding text file. That is, after the audio and video needing voice conversion are determined, the voice in the audio and video is extracted and sliced into a plurality of voice slices, the voice slices are converted by the voice conversion model to obtain corresponding text data, and finally all the obtained text data are integrated into a corresponding text file. Therefore, the method and the device can fully automatically convert the voice in audio and video into corresponding text data, effectively improving voice conversion efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a flowchart of a voice conversion method according to an embodiment of the present invention;

fig. 2 is an architecture diagram of an implementation of a voice conversion method according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a speech conversion apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, a flowchart of a voice conversion method according to an embodiment of the present invention is shown, where the method includes:

s11: determining the current audio and video needing voice conversion as the audio and video to be converted, and extracting the voice contained in the audio and video to be converted as the voice to be converted.

The voice contained in the audio and video (i.e., an audio/video file) to be subjected to voice conversion in the present application may be Chinese voice, English voice, or voice in another language set according to actual needs, all of which are within the protection scope of the present invention. It should be noted that the audio and video may be, for example, a recording of a conference; any audio and video that needs voice conversion may be referred to as the audio and video to be converted, so that the voice in the audio and video to be converted can be extracted for the subsequent voice conversion.

Extracting the voice contained in the audio and video to be converted as the voice to be converted may include: extracting the voice contained in the audio and video to be converted as the voice to be converted by using the librosa library or the scipy.io.wavfile method of the scipy library. Specifically, the embodiment of the application can use Python to extract the voice (namely the audio data) in the audio and video, so as to integrate seamlessly with the other operations; either the librosa library or the scipy.io.wavfile method of the scipy library can be selected, whichever achieves the best voice extraction effect in the current scenario. Of course, other voice extraction methods that achieve the best extraction effect in the current scenario may also be used, and all of these are within the protection scope of the invention.
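
As a minimal sketch of this step (assuming the audio track has already been demuxed from the video container to a WAV file, for example with ffmpeg, since librosa and scipy.io.wavfile read audio files rather than video containers; the file name and sample rate are illustrative assumptions):

```python
# Minimal sketch: read the extracted voice track with librosa or scipy.
import librosa
from scipy.io import wavfile

# librosa resamples to the requested rate and returns float samples.
samples, sr = librosa.load("meeting.wav", sr=16000, mono=True)

# scipy.io.wavfile returns the file's native rate and raw (often int16) samples.
native_sr, raw = wavfile.read("meeting.wav")

print(sr, samples.shape, native_sr, raw.shape)
```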

S12: and slicing the voice to be converted to obtain a plurality of corresponding voice slices, and respectively inputting the plurality of obtained voice slices into the voice conversion model to obtain text data obtained by converting each voice slice by the voice conversion model.

After determining the voice to be converted, the embodiment of the application can slice the voice to be converted and then either input the resulting voice slices into the voice conversion model one by one in sequence, or input them into multiple voice conversion model instances in parallel (in which case the number of model instances equals the number of voice slices), thereby obtaining the text data produced by the voice conversion of each slice; this favors both stable conversion and efficiency.

Slicing the voice to be converted to obtain a plurality of corresponding voice slices may include: slicing the voice to be converted according to time length or occupied space size to obtain a plurality of corresponding voice slices. Specifically, the embodiment of the present invention provides two modes, slicing by time length and slicing by occupied space size, and whichever mode makes the slices convert best in the current scenario can be selected; of course, other modes that make the slices convert best in the current scenario are also within the protection scope of the invention. Slicing by time length means, for example, that if the voice to be converted is 1 hour long, every 10 minutes of it is taken as one voice slice; slicing by occupied space size means, for example, that if the voice to be converted occupies 10 MB, every 1 MB of it is taken as one voice slice. In the embodiment of the present application, slicing the voice to be converted may be implemented with Python scripts, for example using the pydub library.
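
A minimal sketch of duration-based slicing with the pydub library mentioned above; the 10-minute window mirrors the example in the text, and the file names are illustrative assumptions:

```python
# Minimal sketch: slice an audio file into fixed-duration segments with pydub.
from pydub import AudioSegment

WINDOW_MS = 10 * 60 * 1000  # 10 minutes, expressed in milliseconds

audio = AudioSegment.from_file("speech.wav")  # pydub indexes audio by ms
slices = [audio[start:start + WINDOW_MS]
          for start in range(0, len(audio), WINDOW_MS)]

for i, piece in enumerate(slices):
    piece.export(f"slice_{i:03d}.wav", format="wav")  # persist each slice
```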

In addition, respectively inputting the obtained plurality of voice slices into the voice conversion model may include: respectively inputting the obtained plurality of voice slices into a local deep learning model or a remote transfer learning model. Specifically, to ensure the accuracy and reliability of conversion, the embodiment of the present application may provide multiple voice conversion models to flexibly cope with different usage scenarios; that is, when selecting a voice conversion model, the one that provides the best voice conversion effect in the current scenario is chosen, and the voice conversion models may include a local deep learning model and a remote transfer learning model. The local deep learning model can be obtained by training a deep learning network with a large amount of labeled audio data paired with correct transcripts, which requires plenty of high-quality data and computing power to guarantee model performance; in the embodiment of the present application, one's own data or existing public corpora such as LibriSpeech, VoxForge and TED-LIUM may be used, and a GPU or a high-compute artificial intelligence server may be used to train and test the model. Common algorithm models include CNN-HMM (convolutional neural network - hidden Markov model), RNN-HMM (recurrent neural network - hidden Markov model), CTC-LSTM (connectionist temporal classification - long short-term memory) and Attention models. The advantage is that the model resides locally, which facilitates modification, parameter tuning, retraining and the like to suit different application scenarios, and returns text data at the fastest speed. The remote transfer learning model may include a T-learning (transfer learning) model and a cloud model. The T-learning model takes an open-source, well-established pretrained neural network and modifies its output layer according to one's own needs to obtain the required results, which maximally saves the equipment cost and the time cost of training. The cloud model is a network trained by a vendor and deployed in the cloud; when it is used, the audio data can be uploaded through the corresponding API together with the parameters required for authentication, such as an ID and a password, the parameters are set as needed, and the result is returned after processing finishes. These voice conversion models can be flexibly selected and configured according to one's own needs and conditions, and other settings made according to actual requirements are also within the protection scope of the invention; whichever of the models provided by the embodiments of the present application is adopted, the influence of memory overflow, bandwidth limitation and network transmission failure can be effectively avoided.
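
Since no concrete model or service is fixed here, the following is only a hedged sketch of how slices might be routed to whichever backend is selected; the transcription callable and the stand-in backend are hypothetical placeholders:

```python
# Hypothetical sketch: feed voice slices to a pluggable transcription backend.
# The backend may wrap a local deep learning model (e.g. a CTC-LSTM network
# running on a GPU) or a remote/cloud API authenticated with an ID and password.
from typing import Callable, List

def convert_slices(slice_paths: List[str],
                   transcribe: Callable[[str], str]) -> List[str]:
    """Apply the chosen speech conversion backend to each slice in order."""
    return [transcribe(path) for path in slice_paths]

# Example with a trivial stand-in backend (a real one would run ASR):
texts = convert_slices(["slice_000.wav"],
                       transcribe=lambda p: f"<transcript of {p}>")
```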

S13: and integrating the text data corresponding to each voice slice to obtain a corresponding text file.

To facilitate viewing and similar operations, the text data corresponding to the voice slices are integrated to obtain a corresponding text file, and the integrated text file is then output or stored.

Integrating the text data corresponding to each voice slice to obtain a corresponding text file may include: processing the text data corresponding to each voice slice into text data in a uniform format, and splicing the processed text data according to the positions of the corresponding voice slices in the voice to be converted to obtain a corresponding text file. Specifically, the text data output by the voice conversion model may take various forms, such as string, dict and JSON; in the embodiment of the present application, the different types of text data can first be processed into text data in a unified form (i.e., a unified format), and then all the text data are spliced according to the positions of their corresponding voice slices in the voice to be converted, that is, wherever a voice slice sits in the voice to be converted, its text data sits in the same position in the text file. Finally, a complete text file is output, which is convenient for reading, comparison, learning and the like.
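
A minimal sketch of this unification-and-splicing step, assuming purely for illustration (the text does not fix it) that dict and JSON outputs carry the transcript under a "text" key:

```python
# Minimal sketch: normalize string / dict / JSON outputs to plain text,
# then splice them in the order of their source slices.
import json

def normalize(result) -> str:
    if isinstance(result, dict):         # e.g. {"text": "..."}; key is assumed
        return str(result.get("text", ""))
    if isinstance(result, str):
        try:
            parsed = json.loads(result)  # may be a JSON-encoded payload
            if isinstance(parsed, dict):
                return str(parsed.get("text", ""))
        except ValueError:
            pass                         # already a plain string
        return result
    return str(result)

def integrate(results_in_slice_order) -> str:
    """Join per-slice text in slice order to form the final text file body."""
    return "\n".join(normalize(r) for r in results_in_slice_order)

with open("converted.txt", "w", encoding="utf-8") as fh:
    fh.write(integrate(["hello", {"text": "world"}, '{"text": "again"}']))
```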

After the audio and video needing voice conversion are determined, the voice in the audio and video is extracted and sliced to obtain a plurality of voice slices, the voice slices are converted by the voice conversion model to obtain corresponding text data, and finally all the obtained text data are integrated to obtain the corresponding text file. Therefore, the method and the device can fully automatically convert the voice in audio and video into corresponding text data, effectively improving voice conversion efficiency.

In addition, before slicing the voice to be converted to obtain a plurality of corresponding voice slices, the embodiment of the present application may further include: performing noise reduction processing on the voice to be converted by using a spectral subtraction noise reduction method or a Fourier-based noise reduction method. It should be noted that the noise reduction mode with the best noise reduction effect in the current scenario may be selected; specifically, a noise reduction script may be written in Python, with spectral subtraction and Fourier-based noise reduction as the selectable algorithms, so that noise interference is effectively suppressed, the human voice signal is amplified, and the accuracy of recognition and conversion is improved.
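
A minimal sketch of the spectral subtraction option (the Fourier-based alternative would similarly filter in the STFT domain); the assumption that the first 0.5 s of the recording is noise-only is illustrative, not part of the patent:

```python
# Minimal spectral-subtraction sketch: estimate a noise floor from a
# noise-only lead-in and subtract its magnitude from every STFT frame.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(samples: np.ndarray, sr: int,
                      noise_sec: float = 0.5) -> np.ndarray:
    freqs, times, spec = stft(samples, fs=sr, nperseg=512)
    noise_mag = np.abs(spec[:, times < noise_sec]).mean(axis=1, keepdims=True)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # subtract the noise floor
    cleaned = mag * np.exp(1j * np.angle(spec))      # keep the original phase
    _, denoised = istft(cleaned, fs=sr, nperseg=512)
    return denoised
```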

The voice conversion method provided by the embodiment of the invention can further comprise the following steps:

acquiring a test data set containing multiple segments of test audio and video and corresponding text files;

taking each segment of the test audio and video as the audio and video to be converted and performing the step of extracting the voice to be converted, until a text file obtained by voice conversion of each segment of the test audio and video is obtained;

obtaining the accuracy of voice conversion by comparing the text file obtained by converting each segment of the test audio and video with the corresponding text file in the test data set;

if the accuracy does not reach an accuracy threshold, modifying the currently used noise reduction method and/or voice extraction method and/or slicing basis and/or voice conversion model, and returning to the step of taking each segment of the test audio and video as the audio and video to be converted, until the accuracy reaches the accuracy threshold; wherein the slicing basis comprises slicing by time length and slicing by occupied space size.

The embodiment of the application can also implement a feedback-based closed-loop test, with the accuracy threshold set in advance according to the actual situation. Specifically, the voice conversion method is used to convert each test audio and video in the test data set, the obtained text file is compared with the corresponding text file in the test data set, and the percentage of matching content between the two is taken as the accuracy. When the accuracy does not reach the accuracy threshold, the parameters or network architecture of the voice conversion method are adjusted by changing the currently used noise reduction method and/or voice extraction method and/or slicing basis and/or voice conversion model, so as to ensure the accuracy of the voice conversion performed by the method. The closed-loop test may be carried out with Python's pytest tool, which can feed back a test report containing the accuracy.
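
A hedged sketch of such a closed-loop check with pytest; convert_audio_video stands in for the full pipeline described above, and the 0.9 threshold and file names are illustrative assumptions:

```python
# Hypothetical pytest sketch of the closed-loop accuracy test.
import difflib

ACCURACY_THRESHOLD = 0.9  # set in advance according to actual requirements

def convert_audio_video(path: str) -> str:
    """Placeholder for the extract -> denoise -> slice -> convert -> integrate
    pipeline described above."""
    raise NotImplementedError

def accuracy(converted: str, reference: str) -> float:
    # Fraction of matching content between converted and reference text.
    return difflib.SequenceMatcher(None, converted, reference).ratio()

def test_voice_conversion_accuracy():
    cases = [("test_clip_001.mp4", "reference_001.txt")]  # placeholder data set
    for av_path, ref_path in cases:
        converted = convert_audio_video(av_path)
        with open(ref_path, encoding="utf-8") as fh:
            reference = fh.read()
        assert accuracy(converted, reference) >= ACCURACY_THRESHOLD
```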

The implementation framework of the voice conversion method provided by the embodiment of the application can be as shown in fig. 2. It can be seen that the embodiment combines processing, conversion and integration into a whole, can automatically complete the text conversion of audio and video materials in a short time and with high efficiency, improves their usability and value, minimizes the difficulty developers face in learning from such materials, reduces the money, manpower and time a team must invest to obtain equivalent resources, and thus helps individuals and teams improve their capabilities.

An embodiment of the present invention further provides a voice conversion apparatus, as shown in fig. 3, which may include:

an extraction module 11 configured to: determine the current audio and video needing voice conversion as the audio and video to be converted, and extract the voice contained in the audio and video to be converted as the voice to be converted;

a conversion module 12 configured to: slice the voice to be converted to obtain a plurality of corresponding voice slices, and respectively input the obtained voice slices into a voice conversion model to obtain the text data converted from each voice slice by the voice conversion model;

an integration module 13 configured to: integrate the text data corresponding to each voice slice to obtain a corresponding text file.

The voice conversion apparatus provided in the embodiment of the present invention may further include:

a noise reduction module configured to: perform noise reduction processing on the voice to be converted by using a spectral subtraction noise reduction method or a Fourier-based noise reduction method before the voice to be converted is sliced to obtain a plurality of corresponding voice slices.

In an embodiment of the present invention, the extraction module of the voice conversion apparatus may include:

an extraction unit configured to: extract the voice contained in the audio and video to be converted as the voice to be converted by using the librosa library or the scipy.io.wavfile method of the scipy library.

In an embodiment of the present invention, the conversion module of the voice conversion apparatus may include:

a slicing unit configured to: slice the voice to be converted according to time length or occupied space size to obtain a plurality of corresponding voice slices.

In an embodiment of the present invention, the conversion module of the voice conversion apparatus may further include:

an input unit configured to: respectively input the obtained plurality of voice slices into a local deep learning model or a remote transfer learning model.

In an embodiment of the present invention, the integration module of the voice conversion apparatus may include:

an integration unit configured to: process the text data corresponding to each voice slice into text data in a uniform format, and splice the processed text data according to the positions of the corresponding voice slices in the voice to be converted to obtain a corresponding text file.

The voice conversion apparatus provided in the embodiment of the present invention may further include:

a test module configured to: acquire a test data set containing multiple segments of test audio and video and corresponding text files; take each segment of the test audio and video as the audio and video to be converted and perform the step of extracting the voice to be converted, until a text file obtained by voice conversion of each segment of the test audio and video is obtained; obtain the accuracy of voice conversion by comparing the text file obtained by converting each segment of the test audio and video with the corresponding text file in the test data set; and if the accuracy does not reach the accuracy threshold, modify the currently used noise reduction method and/or voice extraction method and/or slicing basis and/or voice conversion model, and return to the step of taking each segment of the test audio and video as the audio and video to be converted, until the accuracy reaches the accuracy threshold; wherein the slicing basis comprises slicing by time length and slicing by occupied space size.

An embodiment of the present invention further provides a voice conversion device, which may include:

a memory for storing a computer program;

a processor for implementing the steps of the speech conversion method as described in any one of the above when executing the computer program.

The embodiment of the invention also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the voice conversion method can be implemented.

It should be noted that for the description of the relevant parts in the speech conversion apparatus, the device and the storage medium provided in the embodiments of the present invention, reference is made to the detailed description of the corresponding parts in the speech conversion method provided in the embodiments of the present invention, and details are not repeated here. In addition, parts of the technical solutions provided in the embodiments of the present invention that are consistent with the implementation principles of the corresponding technical solutions in the prior art are not described in detail, so as to avoid redundant description.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
