Voice recognition method and device, storage medium and electronic equipment

Document No.: 196355    Publication date: 2021-11-02

Reading note: This technique, a voice recognition method, apparatus, storage medium and electronic device, was created by Yang Liu on 2021-07-28. Its main content is as follows: the application discloses a voice recognition method, apparatus, storage medium and electronic device, wherein the method includes: acquiring a multimedia data stream, the multimedia data stream including video data and audio data; acquiring voiceprint information in the audio data; searching for picture data associated with the voiceprint information; synthesizing the video data with the picture data; and outputting the audio data and the synthesized video data. By splitting the audio data out of the multimedia data stream, comparing the voiceprint information in the audio data and matching associated pictures, the application enriches video content, makes videos more engaging, improves the accuracy of voice recognition through voiceprint comparison, and improves the usability of video applications.

1. A method of speech recognition, the method comprising:

acquiring a multimedia data stream, wherein the multimedia data stream comprises video data and audio data;

acquiring voiceprint information in the audio data, and searching picture data associated with the voiceprint information;

and synthesizing the video data with the picture data, and outputting the audio data and the synthesized video data.

2. The method of claim 1, wherein the picture data is associated with sample voiceprint information, and wherein the sample voiceprint information, the picture data, and the association between the sample voiceprint information and the picture data are stored in a data set.

3. The method according to claim 2, wherein the obtaining voiceprint information in the audio data and searching for picture data associated with the voiceprint information comprises:

splitting the audio data into at least one audio segment according to a first preset duration, and acquiring target voiceprint information contained in a target audio segment of the at least one audio segment;

and if sample voiceprint information matching the target voiceprint information exists in the data set, searching the data set for the picture data associated with the sample voiceprint information.

4. The method according to claim 3, wherein the synthesizing the video data using the picture data, and outputting the audio data and the synthesized video data comprises:

acquiring an end time point corresponding to the target voiceprint information in the target audio segment;

acquiring, from the video data, a video data segment that takes the end time point as its start time and a second preset duration as its segment length;

and synthesizing the video data segment with the picture data, and outputting the audio data and the synthesized video data.

5. The method according to claim 4, wherein the synthesizing the video data segment with the picture data and outputting the audio data and the synthesized video data comprises:

acquiring a picture layer corresponding to the picture data and a video layer corresponding to the video data segment;

synthesizing the picture layer and the video layer; wherein the picture layer is superimposed on the video layer;

and outputting the audio data and the video data after the synthesis processing.

6. The method according to claim 5, wherein the synthesizing the picture layer and the video layer comprises:

acquiring, in the video layer, a first position corresponding to a face in the video data segment;

and acquiring a second position corresponding to the first position on the picture layer, and displaying the picture corresponding to the picture data at the second position.

7. The method according to claim 5, wherein the synthesizing the picture layer and the video layer comprises:

acquiring, in the video layer, a third position corresponding to the face in the video data segment;

and acquiring a fourth position corresponding to the third position on the picture layer, and displaying the picture corresponding to the picture data at a position other than the fourth position.

8. The method of claim 3, wherein, before the acquiring of the multimedia data stream comprising video data and audio data, the method further comprises:

acquiring set sound effect information associated with the sample voiceprint information, and storing the sound effect information into the data set;

the synthesizing the video data by using the picture data and outputting the audio data and the synthesized video data comprises:

searching the data set for the sound effect information associated with the sample voiceprint information, and synthesizing the audio data with the sound effect information;

and synthesizing the video data with the picture data, and outputting the synthesized audio data and the synthesized video data.

9. The method according to claim 1, wherein the obtaining voiceprint information in the audio data and searching for picture data associated with the voiceprint information comprises:

copying the audio data to obtain audio copy data;

and acquiring voiceprint information in the audio copy data, and searching picture data associated with the voiceprint information.

10. A speech recognition apparatus, characterized in that the apparatus comprises:

the data stream acquisition module is used for acquiring a multimedia data stream, and the multimedia data stream comprises video data and audio data;

the voiceprint recognition module is used for acquiring voiceprint information in the audio data and searching picture data associated with the voiceprint information;

and the synthesis processing module is used for synthesizing the video data with the picture data and outputting the audio data and the synthesized video data.

11. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 9.

12. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 9.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a speech recognition method, an apparatus, a storage medium, and an electronic device.

Background

Webcast live streams and online videos have gradually become part of people's online life. However, when watching a live stream or an online video through a video application on a terminal device, users can only watch the video content obtained from the network. In a live stream in particular, if viewers can only watch the video content captured by the broadcaster's camera, the video content is monotonous, its richness is poor, and the usability of the video application suffers.

Disclosure of Invention

The embodiments of the present application provide a voice recognition method, apparatus, storage medium and electronic device, which can enrich video content by comparing voiceprint information in audio data and matching associated pictures, make videos more engaging, improve the accuracy of voice recognition through voiceprint comparison, and improve the usability of video applications. The technical scheme is as follows:

in a first aspect, an embodiment of the present application provides a speech recognition method, where the method includes:

acquiring a multimedia data stream, wherein the multimedia data stream comprises video data and audio data;

acquiring voiceprint information in the audio data, and searching picture data associated with the voiceprint information;

and synthesizing the video data with the picture data, and outputting the audio data and the synthesized video data.

In a second aspect, an embodiment of the present application provides a speech recognition apparatus, including:

the data stream acquisition module is used for acquiring a multimedia data stream, and the multimedia data stream comprises video data and audio data;

the voiceprint recognition module is used for acquiring voiceprint information in the audio data and searching picture data associated with the voiceprint information;

and the synthesis processing module is used for synthesizing the video data with the picture data and outputting the audio data and the synthesized video data.

In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.

In a fourth aspect, an embodiment of the present application provides an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.

The beneficial effects brought by the technical scheme provided by some embodiments of the application at least comprise:

in one or more embodiments of the present application, a multimedia data stream including video data and audio data is acquired, voiceprint information in the audio data is acquired, picture data associated with the voiceprint information is searched for, the video data is synthesized with the picture data, and the audio data and the synthesized video data are output. By splitting the audio data out of the multimedia data stream and matching the voiceprint information in it against associated pictures, the video content is enriched and made more engaging; comparing voiceprint information improves the accuracy of voice recognition and the usability of video applications.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.

Fig. 1 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;

fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;

fig. 3 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;

fig. 3a is a schematic diagram illustrating an example of searching picture data according to an embodiment of the present application;

fig. 3b is a schematic diagram illustrating an example of video data segment acquisition according to an embodiment of the present application;

FIG. 3c is an exemplary schematic diagram of a synthesis process provided in an embodiment of the present application;

fig. 4 is an exemplary diagram of performing a live event according to an embodiment of the present application;

fig. 4a is a schematic flowchart of a speech recognition method according to an embodiment of the present application;

fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;

fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an operating system and a user space provided in an embodiment of the present application;

FIG. 9 is an architecture diagram of the Android operating system of FIG. 7;

FIG. 10 is an architecture diagram of the iOS operating system of FIG. 7.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the description of the present application, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In the description of the present application, it is noted that, unless explicitly stated or limited otherwise, "including" and "having" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art. Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.

The voice recognition method provided by the embodiments of the present application can be implemented as a computer program and run on a voice recognition device based on the von Neumann architecture. The computer program may be integrated into an application or run as a separate tool-type application. The voice recognition device in the embodiments of the present application may be a terminal device with a display function, such as a mobile phone, a personal computer, a tablet computer, a handheld device, a vehicle-mounted device or a wearable device, or may be a module in the terminal device capable of executing the voice recognition method; the terminal device may also be a live broadcast device that has a camera function and can upload multimedia data streams collected through the camera function to the network. Referring to fig. 1, which is a schematic structural diagram of a terminal device according to an embodiment of the present application, the terminal device includes a decapsulation module, an audio decoder, a video decoder, a voiceprint recognition module, a layer composition module, an audio output module, and a video output module. One end of the audio decoder is connected with the decapsulation module, and the other end is connected with the voiceprint recognition module; one end of the video decoder is connected with the decapsulation module, and the other end is connected with the layer composition module; the voiceprint recognition module is connected with the layer composition module and the audio output module, and the layer composition module is connected with the video output module. It can be understood that the terminal device may obtain the multimedia data stream from a video application: the video application may fetch the stream from the local storage of the terminal device or from the network, or collect it by invoking the camera function of the terminal device. The multimedia data stream encapsulates the video data and the audio data together, because the video data has images but no sound, while the audio data has sound but no images. After acquiring the multimedia data stream, the decapsulation module decapsulates it into raw video data and raw audio data, that is, separates the two; the raw video data contains the video content but cannot be played directly, and likewise the raw audio data contains the audio content but cannot be played directly. The decapsulation module transmits the raw audio data to the audio decoder, which decodes it into audio data, and transmits the raw video data to the video decoder, which decodes it into video data.
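To make the Fig. 1 data path concrete, the following is a minimal Python sketch of the same wiring, written under the assumption that decoding and recognition are black boxes; every class and function name here (Decapsulator, Decoder, run_pipeline) is a hypothetical stand-in, not part of any real media framework or of the patented implementation.

```python
# A minimal, self-contained sketch of the Fig. 1 data path.
class Decapsulator:
    """Separates a muxed stream into raw (still encoded) video and audio."""
    def split(self, stream: dict):
        return stream["raw_video"], stream["raw_audio"]

class Decoder:
    """Placeholder decoder: a real one would emit PCM samples or RGB frames."""
    def decode(self, raw):
        return raw

def run_pipeline(stream: dict, data_set: dict):
    raw_video, raw_audio = Decapsulator().split(stream)
    audio = Decoder().decode(raw_audio)
    video = Decoder().decode(raw_video)
    # Voiceprint recognition: look up a picture associated with the audio's print.
    picture = data_set.get(audio)
    # Layer composition: overlay the picture (if any) on the video layer.
    composed = (picture, video) if picture else video
    return audio, composed  # handed to the audio and video output modules

if __name__ == "__main__":
    stream = {"raw_video": "frames", "raw_audio": "hello-voiceprint"}
    print(run_pipeline(stream, {"hello-voiceprint": "balloon.png"}))
```

The point is only the topology: one demuxed stream fans out to an audio path (decode, recognize, output) and a video path (decode, compose, output).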

The user of the terminal device may input sample voiceprint information to the terminal device, for example by speaking voice information to it; the terminal device extracts the sample voiceprint information from the voice information. The user may further set, on the terminal device, picture data associated with the sample voiceprint information, and both the sample voiceprint information and the picture data are stored in a data set. The audio decoder transmits the audio data to the voiceprint recognition module, which acquires the voiceprint information in the audio data, searches the data set for sample voiceprint information matching that voiceprint information, and then looks up the picture data associated with the matched sample voiceprint information. The voiceprint recognition module transmits the picture data to the layer composition module and the audio data to the audio output module; the audio output module outputs the audio data on the terminal device through the video application and may also upload it to the network through the video application. It can be understood that the voiceprint recognition module may be an audio manager that wraps an open-source voiceprint recognition interface, through which the audio manager accesses a recognition engine, for example the open-source speech recognition framework wav2letter++. When the audio manager receives audio data, it may copy the audio data to generate audio copy data and analyze the audio copy data with wav2letter++ to search the data set for matching sample voiceprint information.
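The matching step described above could look like the following sketch, which assumes the voiceprint is reduced to a fixed-length embedding vector and compared by cosine similarity; the embed() function is a placeholder for whatever recognition engine is plugged in (the text names wav2letter++), and the 0.8 threshold is an arbitrary illustrative value.

```python
import numpy as np

def embed(samples: np.ndarray) -> np.ndarray:
    # Stand-in for a speaker-embedding model: any network producing a
    # fixed-length voiceprint vector would do here.
    v = samples[:128].astype(float)
    return v / (np.linalg.norm(v) + 1e-9)

def find_picture(audio: np.ndarray, data_set, threshold: float = 0.8):
    copy = audio.copy()  # analyze a copy; the original track is output untouched
    query = embed(copy)
    for sample_print, picture in data_set:  # (embedding, picture) pairs
        if float(np.dot(query, sample_print)) >= threshold:
            return picture
    return None
```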

The video decoder transmits the video data to the layer composition module, which composes the picture data with the video data; for example, the picture layer corresponding to the picture data may be composed with the video layer corresponding to the video data, with the picture layer superimposed on the video layer. The composed video data is then transmitted to the video output module, which may display it on the screen of the terminal device through the video application or upload it to the network through the video application.
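Superimposing the picture layer on the video layer amounts to per-pixel alpha blending. A minimal sketch, assuming the picture carries an RGBA alpha channel and both layers are NumPy arrays:

```python
import numpy as np

def compose_layers(frame_rgb: np.ndarray, picture_rgba: np.ndarray, top_left=(0, 0)):
    """Alpha-blend the picture layer onto the video layer at top_left (y, x)."""
    y, x = top_left
    h, w = picture_rgba.shape[:2]
    region = frame_rgb[y:y + h, x:x + w].astype(float)
    alpha = picture_rgba[..., 3:4].astype(float) / 255.0  # 0 = transparent, 1 = opaque
    blended = alpha * picture_rgba[..., :3] + (1.0 - alpha) * region
    frame_rgb[y:y + h, x:x + w] = blended.astype(np.uint8)
    return frame_rgb
```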

It can be understood that, if the user performs live broadcast activities with the video application on the terminal device, the multimedia data stream may be collected through the camera function of the terminal device; the associated picture data may then be searched for according to the voiceprint information in the speech the user utters during the live broadcast, the picture data and the video data may be composed, and the audio data and the composed video data may be output live to the network.

The speech recognition method provided by the present application is described in detail below with reference to specific embodiments.

Referring to fig. 2, a flow chart of a speech recognition method according to an embodiment of the present application is schematically shown. As shown in fig. 2, the method of the embodiment of the present application may include the following steps S101-S104.

S101, a multimedia data stream is acquired, where the multimedia data stream includes video data and audio data.

Specifically, the terminal device may obtain a multimedia data stream from a video application. The video application is installed on the terminal device and may be used to watch videos and live broadcasts and to upload videos and live broadcasts. The multimedia data stream may be a video stored in the local storage of the terminal device, a video obtained from the network, or a stream collected by the camera function of the terminal device. The multimedia data stream is composed of video data and audio data; the terminal device can split it into the two, and splitting the audio data out makes it convenient for the terminal device to compare the voiceprint information in the audio data, avoiding inaccurate voice recognition caused by the audio being mixed with the video data.
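As an illustration of splitting a muxed stream into its video and audio components, the following sketch shells out to the ffmpeg CLI (-vn drops video, -an drops audio, -c copy avoids re-encoding); a real player would demux in-process, and the output file names and containers here are assumptions.

```python
import subprocess

def split_stream(src: str):
    """Split a muxed file into separate audio and video tracks via ffmpeg.

    Illustrative only; assumes the source audio codec (e.g. AAC) is
    compatible with the .m4a container when stream-copied.
    """
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-c:a", "copy", "audio.m4a"],
                   check=True)
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", "video.mp4"],
                   check=True)
    return "video.mp4", "audio.m4a"
```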

S102, obtaining voiceprint information in the audio data, and searching picture data associated with the voiceprint information.

Specifically, the terminal device can obtain voiceprint information from the audio data. A voiceprint is the spectrum of a sound wave, carrying speech information, that can be displayed by an electroacoustic instrument; voiceprints are not only distinctive but also relatively stable. Performing voice recognition by comparing voiceprint information therefore improves its accuracy. Since every voiceprint has its own characteristics, recognition based on voiceprint information can not only find the picture data associated with a set sentence when that sentence appears in the audio, but also achieve the effect of finding that picture data only when the set sentence is spoken by a set user. The terminal device may obtain sample voiceprint information from input voice information, the user may further set picture data associated with the sample voiceprint information on the terminal device, both are stored in a data set, and the terminal device may search the data set for the picture data associated with the voiceprint information.
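What "voiceprint information" concretely is can vary; one simple illustrative stand-in is a time-averaged MFCC vector, sketched below with librosa (production systems would instead use a trained speaker-embedding model):

```python
import librosa
import numpy as np

def voiceprint(path: str) -> np.ndarray:
    """A toy 'voiceprint': the time-averaged MFCC vector of an utterance."""
    y, sr = librosa.load(path, sr=16000)                 # load mono audio at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # 20 coefficients per frame
    return mfcc.mean(axis=1)                             # fixed-length comparison vector
```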

And S103, synthesizing the video data by adopting the picture data, and outputting the audio data and the synthesized video data.

Specifically, the terminal device synthesizes the video data with the picture data. For example, it may compose the picture layer corresponding to the picture data with the video layer corresponding to the video data, superimposing the picture layer on the video layer, so that when the synthesized video data is displayed, the picture corresponding to the picture data overlays the video corresponding to the video data. The terminal device may use a video application to output the audio data and the synthesized video data synchronously. For example, if a user watches a video with the video application, the audio data may be output through the speaker of the terminal device while the synthesized video data is displayed on its screen; if the user performs live broadcast activities with the video application, the audio data and the synthesized video data may be uploaded to the network, and the viewers of the live broadcast receive them.

In the embodiments of the present application, a multimedia data stream including video data and audio data is acquired, voiceprint information in the audio data is acquired, picture data associated with the voiceprint information is searched for, the video data is synthesized with the picture data, and the audio data and the synthesized video data are output. By splitting the audio data out of the multimedia data stream and matching the voiceprint information in it against associated pictures, the video content is enriched and made more engaging; comparing voiceprint information improves the accuracy of voice recognition and the usability of video applications.

Referring to fig. 3, a flow chart of a speech recognition method according to an embodiment of the present application is schematically shown. As shown in fig. 3, the method of the embodiment of the present application may include the following steps S201 to S207.

S201, obtaining the input sample voiceprint information, obtaining the set picture data associated with the sample voiceprint information, and storing the sample voiceprint information and the picture data into a data set.

Specifically, the user may input voice information to the terminal device, the terminal device may obtain sample voiceprint information in the voice information, then the user may set picture data associated with the sample voiceprint information, and the terminal device stores the sample voiceprint information, the picture data, and the association relationship between the sample voiceprint information and the picture data in the data set.

S202, a multimedia data stream is acquired, where the multimedia data stream includes video data and audio data.

Specifically, the terminal device may obtain a multimedia data stream from a video application. The video application is installed on the terminal device and may be used to watch videos and live broadcasts and to upload videos and live broadcasts. The multimedia data stream may be a video stored in the local storage of the terminal device, a video obtained from the network, or a stream collected by the camera function of the terminal device. The multimedia data stream is composed of video data and audio data; the terminal device can split it into the two, and splitting the audio data out makes it convenient for the terminal device to compare the voiceprint information in the audio data, avoiding inaccurate voice recognition caused by the audio being mixed with the video data.

S203, dividing the audio data into at least one audio segment according to a first preset time length, and acquiring target voiceprint information contained in a target audio segment in the at least one audio segment.

Specifically, because the sample voiceprint information is short, the terminal device may split the audio data into at least one audio segment according to a first preset duration, for example into audio segments of 50 ms each, and then obtain the target voiceprint information contained in a target audio segment among the at least one audio segment.
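Splitting PCM audio into fixed-length segments of a first preset duration is a straightforward slicing operation; a sketch, assuming 16 kHz mono samples and a 50 ms segment length:

```python
def split_segments(samples, sample_rate: int = 16000, segment_ms: int = 50):
    """Split PCM audio into fixed-length segments (the last one may be shorter)."""
    seg_len = sample_rate * segment_ms // 1000  # samples per segment (800 at 16 kHz / 50 ms)
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
```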

Optionally, before splitting the audio data, the audio data may be copied to obtain audio copy data identical to the audio data, and the audio copy data is then split to obtain the target voiceprint information. Splitting and detection performed directly on the audio data can easily damage it; performing these operations on the audio copy data instead, while the intact audio data is still output through the video application, achieves the same voice recognition effect, avoids damaging the audio data, and improves the output quality of the audio data.

S204, if the sample voiceprint information matched with the target voiceprint information exists in the data set, searching the image data associated with the sample voiceprint information in the data set.

Specifically, if the terminal device finds sample voiceprint information matching the target voiceprint information in the data set, it searches the data set for the picture data associated with that sample voiceprint information, that is, the picture data associated with the target voiceprint information. It can be understood that the target voiceprint information matches the sample voiceprint information when the two are identical, or when their similarity is greater than or equal to a preset similarity rate; the preset similarity rate may be set on the terminal device by a user or related personnel.
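The "identical, or similarity at least a preset rate" rule might be expressed as follows, assuming voiceprints are compared as vectors under cosine similarity (the measure itself is an assumption; the text does not fix one):

```python
import numpy as np

def matches(target: np.ndarray, sample: np.ndarray, preset_similarity: float = 0.9) -> bool:
    """Target matches sample if identical or similarity >= the preset rate."""
    if np.array_equal(target, sample):
        return True
    sim = np.dot(target, sample) / (
        np.linalg.norm(target) * np.linalg.norm(sample) + 1e-9)
    return sim >= preset_similarity
```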

Referring to fig. 3a together, an exemplary schematic diagram of searching for picture data is provided for the embodiment of the present application, in which a terminal device performs copy processing on acquired audio data to obtain audio copy data that is the same as the audio data, then splits the audio copy data into at least one audio segment according to a first preset duration, and acquires voiceprint information in each audio segment. For example, target voiceprint information in a target audio clip is obtained, the terminal device compares the target voiceprint information with voiceprint information in a data set, finds sample voiceprint information matched with the target voiceprint information, and then obtains picture data associated with the sample voiceprint information in the data set.

S205, acquiring an end time point corresponding to the target voiceprint information in the target audio segment, and acquiring, from the video data, a video data segment that takes the end time point as its start time and a second preset duration as its segment length.

Specifically, the end time point corresponding to the target voiceprint information in the target audio segment, that is, the moment at which the target voiceprint information ends within the target audio segment, is obtained. Because the audio data and the video data have the same playing duration, a video data segment starting at this end time point and lasting for a second preset duration can be obtained from the video data. The second preset duration is how long the picture corresponding to the picture data is displayed in the synthesized video data. It may be a uniform value set by the user on the terminal device, in which case the video data segments corresponding to all picture data have the same length; or it may be set per picture data, in which case the segment lengths corresponding to different picture data may differ or coincide.
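Because audio and video share one timeline, mapping the voiceprint's end time to an overlay window in the video is simple arithmetic; a sketch, with all durations assumed to be in milliseconds:

```python
def video_segment_bounds(segment_index: int, segment_ms: int,
                         voiceprint_end_offset_ms: int, second_preset_ms: int):
    """Map the voiceprint's end time in its audio segment to a video window.

    Audio and video share one timeline, so the absolute end time in the audio
    is also the overlay start time in the video.
    """
    end_point = segment_index * segment_ms + voiceprint_end_offset_ms
    return end_point, end_point + second_preset_ms  # [start, stop) of the overlay
```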

Referring to fig. 3b, which provides an exemplary schematic diagram of video data segment acquisition for the embodiment of the present application, the terminal device may acquire the end time point of the target voiceprint information in the target audio segment, find the corresponding time point in the video data, and acquire a video data segment that takes the end time point as its start time and the second preset duration as its segment length.

And S206, acquiring a picture layer corresponding to the picture data and a video layer corresponding to the video data segment, and synthesizing the picture layer and the video layer.

Specifically, the terminal device may obtain the picture layer corresponding to the picture data and the video layer corresponding to the video data segment, synthesize the two, and superimpose the picture layer on the video layer. When the synthesized video data is displayed on the screen, the picture corresponding to the picture data covers the video corresponding to the video data segment; that is, the picture appears on the screen when the video data plays to the end time point, and disappears once its accumulated display time reaches the second preset duration.

Optionally, the terminal device may obtain a first position in the video layer corresponding to a face in the video data segment. Since the picture layer has the same size and specification as the video layer, a second position corresponding to the first position can be obtained on the picture layer, and the picture corresponding to the picture data is displayed at the second position, achieving the effect of displaying the picture on the face. Referring to fig. 3c, which provides an exemplary schematic diagram of the synthesis processing for the embodiment of the present application, the terminal device may find the first position corresponding to the face in the video layer, find the second position corresponding to it in the picture layer, and display the picture at the second position; after the video layer and the picture layer are synthesized, the picture is displayed at the position of the face.
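Locating the "first position" of a face in the video layer could be done with any face detector; the sketch below uses OpenCV's bundled Haar cascade purely as an illustration. Since the picture layer shares the video layer's size, the returned box indexes both layers.

```python
import cv2

def face_anchor(frame_bgr):
    """Return the bounding box of the first detected face, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
    return tuple(faces[0]) if len(faces) else None  # (x, y, w, h)
```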

Optionally, the terminal device may obtain a third position corresponding to the face in the video layer in the video data segment, and similarly may obtain a fourth position corresponding to the third position on the picture layer, and then display the picture corresponding to the picture data at another position except the fourth position, that is, display the picture at another position except the face, so as to achieve the purpose of avoiding blocking the face.

And S207, outputting the audio data and the video data after the synthesis processing.

Specifically, the terminal device may use a video application to output the audio data and the synthesized video data synchronously. For example, if a user watches a video with the video application, the audio data may be output through the speaker of the terminal device while the synthesized video data is displayed on its screen; if the user performs live broadcast activities with the video application, the audio data and the synthesized video data may be uploaded to the network, and the viewers of the live broadcast receive them.

It can be understood that, when target voiceprint information matching the sample voiceprint information is acquired in the audio data, a corresponding picture can be displayed and a corresponding sound effect can also be output, for example a sound effect simulating thunder or an aircraft flying past. The user can set sound effect information associated with the sample voiceprint information on the terminal device, and the terminal device stores the sound effect information into the data set. If the terminal device obtains target voiceprint information matching the sample voiceprint information in the audio data, it searches the data set for the sound effect information associated with that sample voiceprint information and synthesizes the audio data with it; for example, by analogy with the method of synthesizing the video data with picture data, the end time point corresponding to the target voiceprint information in the target audio segment is obtained, and the sound track corresponding to the sound effect information is inserted into the audio data with the end time point as its start time. The terminal device may then output the synthesized audio data and the synthesized video data through the video application.
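Inserting a sound track at the end time point might look like the following pydub sketch; pydub is used here only as an illustration, and the file-path arguments are placeholders:

```python
from pydub import AudioSegment

def add_sound_effect(audio_path: str, effect_path: str, end_time_ms: int) -> AudioSegment:
    """Mix a stored sound effect into the audio track, starting at the
    voiceprint's end time point (in milliseconds)."""
    track = AudioSegment.from_file(audio_path)
    effect = AudioSegment.from_file(effect_path)
    return track.overlay(effect, position=end_time_ms)
```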

In the embodiments of the present application, the input sample voiceprint information and the associated picture data set by the user are stored in a data set, so the voiceprint information and picture data are user-configurable and meet personalized needs. A multimedia data stream including video data and audio data is acquired; the audio data is copied and then split into at least one audio segment, which achieves the same voice recognition effect while preventing the splitting and detection processing from damaging the audio data, improving its output quality. The target voiceprint information contained in a target audio segment is obtained; if sample voiceprint information matching it exists in the data set, the associated picture data is looked up, the video data is synthesized with the picture data, and the audio data and the synthesized video data are output through the video application. By splitting the audio data out of the multimedia data stream and matching its voiceprint information against associated pictures, the video content is enriched and made more engaging, and comparing voiceprint information improves the accuracy of voice recognition and the usability of the video application. Moreover, the picture can be displayed at the position of a face in the video data, making the video output more engaging; sound effect information can also be set for the sample voiceprint information and output whenever matching target voiceprint information is detected, further improving the usability of the video application.

It can be understood that the terminal device provided in the embodiments of the present application may be used for live broadcast activities. Referring to fig. 4, which illustrates an example of performing a live broadcast activity, a user may place the terminal device in front of himself and capture the live video data and live audio data of the current scene with the camera function of the terminal device to generate a live data stream. The terminal device can then perform voice recognition on the collected live data stream, synthesize the live video data, and upload the live audio data and the synthesized live video data to the network, so that the viewers of the live broadcast receive them from the network; the device can also output them locally, making it convenient for the user to check his live state in real time.

Referring to fig. 4a, a flow chart of a speech recognition method according to an embodiment of the present application is schematically shown. As shown in fig. 4a, the method of the embodiment of the present application may include the following steps S301 to S307.

S301, obtaining the input sample voiceprint information, obtaining the set picture data associated with the sample voiceprint information, and storing the sample voiceprint information and the picture data into a data set.

Specifically, the user may input voice information to the terminal device, the terminal device may obtain sample voiceprint information in the voice information, then the user may set picture data associated with the sample voiceprint information, and the terminal device stores the sample voiceprint information, the picture data, and the association relationship between the sample voiceprint information and the picture data in the data set.

S302, acquiring a live broadcast data stream, wherein the live broadcast data stream comprises live broadcast video data and live broadcast audio data.

Specifically, the terminal device can use its camera function to collect the images and sounds of the scene, that is, to collect a live data stream composed of live video data and live audio data. The terminal device can split the live data stream into live video data and live audio data; splitting the live audio data out makes it convenient for the terminal device to compare the voiceprint information in it, avoiding inaccurate voice recognition caused by the audio being mixed with the live video data.

S303, splitting the live audio data into at least one audio segment according to a first preset duration, and acquiring target voiceprint information contained in a target audio segment of the at least one audio segment.

Specifically, because the sample voiceprint information is short, the terminal device may split the live audio data into at least one audio segment according to the first preset duration, for example into audio segments of 50 ms each, and then obtain the target voiceprint information contained in a target audio segment among the at least one audio segment.

Optionally, before splitting the live audio data, the live audio data may be copied to obtain audio copy data identical to it, and the audio copy data is then split to obtain the target voiceprint information. Splitting and detection performed directly on the live audio data can easily damage it; performing these operations on the audio copy data instead, while the intact live audio data is still output through the video application, achieves the same voice recognition effect, avoids damaging the live audio data, and improves its output quality.

S304, if the sample voiceprint information matched with the target voiceprint information exists in the data set, searching the image data associated with the sample voiceprint information in the data set.

Specifically, if the terminal device finds sample voiceprint information matching the target voiceprint information in the data set, it searches the data set for the picture data associated with that sample voiceprint information, that is, the picture data associated with the target voiceprint information. It can be understood that the target voiceprint information matches the sample voiceprint information when the two are identical, or when their similarity is greater than or equal to a preset similarity rate; the preset similarity rate may be set on the terminal device by a user or related personnel.

S305, acquiring an end time point corresponding to the target voiceprint information in the target audio segment, and acquiring, from the live video data, a live video clip that takes the end time point as its start time and a second preset duration as its clip length.

Specifically, the end time point corresponding to the target voiceprint information in the target audio segment, that is, the moment at which the target voiceprint information ends within that segment, is obtained. Because the live audio data and the live video data have the same playing duration, a live video clip starting at this end time point and lasting for a second preset duration can be obtained from the live video data. The second preset duration is how long the picture corresponding to the picture data is displayed in the synthesized live video data; it may be a uniform value set by the user on the terminal device, in which case the live video clips corresponding to all picture data have the same length, or it may be set per picture data, in which case the clip lengths corresponding to different picture data may differ or coincide.

S306, acquiring a picture layer corresponding to the picture data and a live broadcast layer corresponding to the live broadcast video clip, and synthesizing the picture layer and the live broadcast layer.

Specifically, the terminal device can acquire the picture layer corresponding to the picture data and the live layer corresponding to the live video clip, synthesize the two, and superimpose the picture layer on the live layer. When the synthesized live video data is displayed on the screen, the picture corresponding to the picture data covers the video corresponding to the live video clip; that is, the picture appears on the screen when the live video data plays to the end time point, and disappears once its accumulated display time reaches the second preset duration.

S307, outputting the live broadcast audio data and the live broadcast video data after the synthesis processing.

Specifically, the live audio data and the synthesized live video data can be output on the terminal device, making it convenient for the user to check his real-time live state directly on the device. They can also be uploaded to the network, from which the live audience obtains the live audio data and the synthesized live video data.
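One common (assumed) way to upload the composited live output to the network is an RTMP push via the ffmpeg CLI, as in this sketch; the URL, codecs, and source file are placeholders, and a real live pipeline would push encoded frames continuously rather than from a file:

```python
import subprocess

def push_live(video_src: str, rtmp_url: str):
    """Push a composited recording to an RTMP ingest endpoint via ffmpeg."""
    subprocess.run([
        "ffmpeg", "-re", "-i", video_src,   # -re: read input at native frame rate
        "-c:v", "libx264", "-c:a", "aac",   # typical codecs for FLV/RTMP delivery
        "-f", "flv", rtmp_url,
    ], check=True)
```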

In the embodiments of the present application, the input sample voiceprint information and the associated picture data set by the user are stored in a data set, so the voiceprint information and picture data are user-configurable and meet personalized needs. A live data stream including live video data and live audio data is acquired; the live audio data is copied and then split into at least one audio segment, which achieves the same voice recognition effect while preventing the splitting and detection processing from damaging the live audio data, improving its output quality. The target voiceprint information contained in a target audio segment is obtained; if sample voiceprint information matching it exists in the data set, the associated picture data is looked up, the live video data is synthesized with the picture data, and the audio data and the synthesized video data are output and uploaded to the network. By splitting the live audio data out of the live data stream and matching its voiceprint information against associated pictures, the live content is enriched and made more engaging, and comparing voiceprint information improves the accuracy of voice recognition and the live broadcast effect.

The speech recognition apparatus provided by the embodiment of the present application will be described in detail below with reference to fig. 5 to 6. It should be noted that, the speech recognition apparatus shown in fig. 5-6 is used for executing the method of the embodiment shown in fig. 2 and 3 of the present application, and for convenience of description, only the portion related to the embodiment of the present application is shown, and details of the specific technology are not disclosed, please refer to the embodiment shown in fig. 2 and 3 of the present application.

Referring to fig. 5, a schematic structural diagram of a speech recognition apparatus according to an exemplary embodiment of the present application is shown. The speech recognition means may be implemented as all or part of the apparatus in software, hardware or a combination of both. The apparatus 1 comprises a data stream acquisition module 11, a voiceprint recognition module 12 and a synthesis processing module 13.

A data stream obtaining module 11, configured to obtain a multimedia data stream, where the multimedia data stream includes video data and audio data;

the voiceprint recognition module 12 is configured to obtain voiceprint information in the audio data, and search for picture data associated with the voiceprint information;

and a synthesis processing module 13, configured to perform synthesis processing on the video data by using the picture data, and output the audio data and the video data after the synthesis processing.

In this embodiment, a multimedia data stream including video data and audio data is obtained, voiceprint information in the audio data is obtained, picture data associated with the voiceprint information is searched for, the video data is synthesized with the picture data, and the audio data and the synthesized video data are output. By splitting the audio data out of the multimedia data stream and matching its voiceprint information against associated pictures, the video content is enriched and made more engaging; comparing voiceprint information improves the accuracy of voice recognition and the usability of video applications.

Referring to fig. 6, a schematic structural diagram of a speech recognition apparatus according to an exemplary embodiment of the present application is shown. The speech recognition means may be implemented as all or part of the apparatus in software, hardware or a combination of both. The device 1 comprises a data stream acquisition module 11, a voiceprint recognition module 12, a synthesis processing module 13, a voiceprint information storage module 14 and an effect information storage module 15.

The voiceprint information storage module 14 is configured to obtain the input sample voiceprint information, and obtain the set picture data associated with the sample voiceprint information;

storing the sample voiceprint information, the picture data, and the association between the sample voiceprint information and the picture data into a data set;

the sound effect information storage module 15 is configured to acquire the set sound effect information associated with the sample voiceprint information, and store the sound effect information into the data set;

a data stream obtaining module 11, configured to obtain a multimedia data stream, where the multimedia data stream includes video data and audio data;

the voiceprint recognition module 12 is configured to obtain voiceprint information in the audio data, and search for picture data associated with the voiceprint information;

optionally, the voiceprint recognition module 12 is specifically configured to copy the audio data to obtain audio copy data;

and acquiring voiceprint information in the audio copy data, and searching picture data associated with the voiceprint information.

Optionally, the voiceprint recognition module 12 is specifically configured to split the audio data into at least one audio segment according to a first preset time length, and obtain target voiceprint information included in a target audio segment in the at least one audio segment;

and if sample voiceprint information matching the target voiceprint information exists in the data set, searching the data set for the picture data associated with the sample voiceprint information.

Optionally, the voiceprint recognition module 12 is specifically configured to obtain a corresponding end time point of the target voiceprint information in the target audio clip;

acquiring a video data segment which takes the ending time point as starting time and takes second preset duration as segment length from the video data;

and synthesizing the video data fragments by adopting the picture data.

A synthesis processing module 13, configured to perform synthesis processing on the video data by using the picture data, and output the audio data and the video data after the synthesis processing;

optionally, the synthesis processing module 13 is specifically configured to obtain a picture layer corresponding to the picture data and a video layer corresponding to the video data segment;

synthesizing the picture layer and the video layer; wherein the picture layer is superimposed on the video layer;

and outputting the audio data and the video data after the synthesis processing.

Optionally, the synthesis processing module 13 is specifically configured to obtain a first position, in the video layer, of a face in the video data segment;

and acquiring a second position corresponding to the first position on the picture layer, and displaying the picture corresponding to the picture data at the second position.

Optionally, the synthesis processing module 13 is specifically configured to obtain a third position, in the video layer, of a face in the video data segment;

and acquiring a fourth position corresponding to the third position on the picture layer, and displaying the picture corresponding to the picture data at other positions except the fourth position.

Optionally, the synthesis processing module 13 is specifically configured to search the sound effect information associated with the sample voiceprint information in the data set, and perform synthesis processing on the audio data by using the sound effect information;

outputting the audio data after the synthesis processing and the video data after the synthesis processing based on the video application.

In this embodiment, the input sample voiceprint information and the associated picture data set by the user are stored in a data set, so the voiceprint information and picture data are user-configurable and meet personalized needs. A multimedia data stream including video data and audio data is acquired; the audio data is copied and then split into at least one audio segment, which achieves the same voice recognition effect while preventing the splitting and detection processing from damaging the audio data, improving its output quality. The target voiceprint information contained in a target audio segment is obtained; if sample voiceprint information matching it exists in the data set, the associated picture data is looked up, the video data is synthesized with the picture data, and the audio data and the synthesized video data are output through the video application. By splitting the audio data out of the multimedia data stream and matching its voiceprint information against associated pictures, the video content is enriched and made more engaging, and comparing voiceprint information improves the accuracy of voice recognition and the usability of the video application. Moreover, the picture can be displayed at the position of a face in the video data, making the video output more engaging; sound effect information can also be set for the sample voiceprint information and output whenever matching target voiceprint information is detected, further improving the usability of the video application.

It should be noted that, when the speech recognition apparatus provided in the foregoing embodiment executes the speech recognition method, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed and completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech recognition apparatus and the speech recognition method provided by the above embodiments belong to the same concept, and details of implementation processes thereof are referred to in the method embodiments and are not described herein again.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

An embodiment of the present application further provides a computer storage medium. The computer storage medium may store a plurality of instructions adapted to be loaded by a processor to execute the speech recognition method of the embodiments shown in fig. 2 to 3c; for the specific execution process, reference may be made to the descriptions of those embodiments, which are not repeated here.

The present application further provides a computer program product storing at least one instruction, the at least one instruction being loaded by the processor to execute the speech recognition method of the embodiments shown in fig. 2 to 3c; for the specific execution process, reference may be made to the descriptions of those embodiments, which are not repeated here.

Referring to fig. 7, a block diagram of an electronic device according to an exemplary embodiment of the present application is shown. The electronic device in the present application may comprise one or more of the following components: a processor 110, a memory 120, an input device 130, an output device 140, and a bus 150. The processor 110, memory 120, input device 130, and output device 140 may be connected by a bus 150.

Processor 110 may include one or more processing cores. The processor 110 connects the various parts of the electronic device using various interfaces and lines, and performs the various functions of the electronic device 100 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 120 and by invoking the data stored in the memory 120. Optionally, the processor 110 may be implemented in hardware in at least one of the forms of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 110 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs, and the like; the GPU renders and draws display content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.

The memory 120 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 120 includes a non-transitory computer-readable medium. The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area: the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the various method embodiments described above, and the like. The operating system may be the Android system (including systems developed in depth on the basis of the Android system), the iOS system developed by Apple (including systems developed in depth on the basis of the iOS system), or another system. The data storage area may also store data created by the electronic device during use, such as a phone book, audio and video data, and chat log data.

Referring to fig. 8, the memory 120 may be divided into an operating system space, in which the operating system runs, and a user space, in which native and third-party applications run. To ensure that different third-party applications achieve a good running effect, the operating system allocates corresponding system resources to them. However, different application scenarios within the same third-party application place different demands on system resources: for example, in a local resource loading scenario, the third-party application demands a fast disk reading speed, while in an animation rendering scenario it demands high GPU performance. Because the operating system and the third-party application are independent of each other, the operating system cannot sense the application's current scenario in time and therefore cannot adapt system resources to the specific application scenario of the third-party application.

To enable the operating system to distinguish the specific application scenario of a third-party application, data communication between the third-party application and the operating system needs to be opened up, so that the operating system can obtain the current scenario information of the third-party application at any time and adapt system resources accordingly.

Taking the Android system as an example, the programs and data stored in the memory 120 are shown in fig. 9: the memory 120 may store a Linux kernel layer 320, a system runtime library layer 340, an application framework layer 360, and an application layer 380, where the Linux kernel layer 320, the system runtime library layer 340, and the application framework layer 360 belong to the operating system space, and the application layer 380 belongs to the user space. The Linux kernel layer 320 provides the underlying drivers for the various hardware of the electronic device, such as the display driver, audio driver, camera driver, Bluetooth driver, Wi-Fi driver, and power management. The system runtime library layer 340 provides the main feature support for the Android system through C/C++ libraries; for example, the SQLite library provides database support, the OpenGL/ES library provides 3D drawing support, and the Webkit library provides browser kernel support. The system runtime library layer 340 also provides the Android runtime, which mainly supplies the core libraries that allow developers to write Android applications in the Java language. The application framework layer 360 provides the various APIs that may be used to build applications, such as activity management, window management, view management, notification management, content providers, package management, session management, resource management, and location management; developers can build their own applications using these APIs. At least one application runs in the application layer 380; these may be native applications shipped with the operating system, such as a contacts program, an SMS program, a clock program, or a camera application, or third-party applications developed by third-party developers, such as a game application, an instant messaging program, a photo beautification program, or a remote diagnosis program.

Taking the iOS system as an example, the programs and data stored in the memory 120 are shown in fig. 10. The iOS system includes: a Core OS Layer 420, a Core Services Layer 440, a Media Layer 460, and a Cocoa Touch Layer 480. The Core OS Layer 420 includes the operating system kernel, drivers, and underlying program frameworks, which provide functionality closer to the hardware for use by the program frameworks in the Core Services Layer 440. The Core Services Layer 440 provides the system services and/or program frameworks required by applications, such as the Foundation framework, the account framework, the advertisement framework, the data storage framework, the network connection framework, the geographic location framework, and the motion framework. The Media Layer 460 provides audio-visual interfaces for applications, such as graphics and image interfaces, audio technology interfaces, video technology interfaces, and the audio-video wireless transmission (AirPlay) interface. The Cocoa Touch Layer 480 provides various common interface-related frameworks for application development, such as the local notification service, the remote push service, the advertising framework, the game tool framework, the messaging User Interface (UI) framework, the UIKit framework, and the map framework, and is responsible for the user's touch interaction operations on the electronic device.

In the framework illustrated in fig. 10, the frameworks related to most applications include, but are not limited to: the Foundation framework in the Core Services Layer 440 and the UIKit framework in the Cocoa Touch Layer 480. The Foundation framework provides many basic object classes and data types and supplies the most basic system services for all applications, independent of the UI. The classes provided by the UIKit framework form a basic UI class library for creating touch-based user interfaces; iOS applications can provide a UI based on the UIKit framework, which therefore supplies the application's infrastructure for building user interfaces, drawing, handling user interaction events, responding to gestures, and the like.

For the manner and principle of implementing data communication between a third-party application and the operating system in the iOS system, reference may be made to the Android system; details are not repeated here.

The input device 130 is used for receiving input instructions or data, and includes, but is not limited to, a keyboard, a mouse, a camera, a microphone, or a touch device. The output device 140 is used for outputting instructions or data, and includes, but is not limited to, a display device, a speaker, and the like. In one example, the input device 130 and the output device 140 may be combined into a touch display screen, which receives the user's touch operations on or near it with a finger, a stylus, or any other suitable object, and displays the user interfaces of the various applications. The touch display screen is typically provided on the front panel of the electronic device. It may be designed as a full screen, a curved screen, or an irregularly shaped screen, or as a combination of a full screen and a curved screen, or of an irregularly shaped screen and a curved screen, which is not limited in the embodiments of the present application.

In addition, those skilled in the art will appreciate that the configurations of the electronic devices illustrated in the above figures do not constitute a limitation on the electronic device, which may include more or fewer components than illustrated, combine certain components, or adopt a different arrangement of components. For example, the electronic device may further include a radio frequency circuit, an input unit, a sensor, an audio circuit, a Wireless Fidelity (WiFi) module, a power supply, and a Bluetooth module, which are not described herein again.

In the embodiments of the present application, the execution subject of each step may be the electronic device described above. Optionally, the execution subject of each step is the operating system of the electronic device. The operating system may be the Android system, the iOS system, or another operating system, which is not limited in the embodiments of the present application.

The electronic device of the embodiments of the present application may further be provided with a display device, which may be any device capable of realizing a display function, for example: a cathode ray tube (CRT) display, a light-emitting diode (LED) display, an electronic ink panel, a Liquid Crystal Display (LCD), or a Plasma Display Panel (PDP). A user may view displayed text, images, video, and other information on the display device of the electronic device 101. The electronic device may be a smartphone, a tablet computer, a gaming device, an AR (Augmented Reality) device, an automobile, a data storage device, an audio playback device, a video playback device, a notebook or desktop computing device, or a wearable device such as an electronic watch, electronic glasses, an electronic helmet, an electronic bracelet, an electronic necklace, or electronic clothing.

In the electronic device shown in fig. 7, the processor 110 may be configured to call the speech recognition program stored in the memory 120, and specifically perform the following operations:

acquiring a multimedia data stream, wherein the multimedia data stream comprises video data and audio data;

acquiring voiceprint information in the audio data, and searching picture data associated with the voiceprint information;

synthesizing the video data by adopting the picture data;

outputting the audio data and the video data after the synthesizing process based on the video application.

In one embodiment, before acquiring the multimedia data stream comprising video data and audio data, the processor 110 further performs the following operations:

acquiring input sample voiceprint information, and acquiring set picture data associated with the sample voiceprint information;

and storing the sample voiceprint information and the picture data into a data set.
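
As a minimal sketch, the data set can be modeled as a plain mapping from a sample voiceprint to its associated assets; the tuple-shaped voiceprint and the field names below are assumptions made for illustration:

```python
from typing import Optional

data_set: dict = {}


def register(sample_voiceprint: tuple, picture_path: str,
             sound_effect_path: Optional[str] = None) -> None:
    """Store the association between a sample voiceprint and its assets."""
    data_set[sample_voiceprint] = {
        "picture": picture_path,
        "sound_effect": sound_effect_path,  # optional, see the sound effect embodiment
    }


register((0.12, -0.57, 0.33), "stickers/cat_ears.png", "sfx/meow.wav")
```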

In an embodiment, when the processor 110 obtains the voiceprint information in the audio data and searches for the picture data associated with the voiceprint information, the following operations are specifically performed:

splitting the audio data into at least one audio fragment according to a first preset time length, and acquiring target voiceprint information contained in a target audio fragment in the at least one audio fragment;

and if sample voiceprint information matching the target voiceprint information exists in the data set, searching the data set for the picture data associated with the sample voiceprint information.
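
The matching step might be sketched with a cosine-similarity comparison; both the similarity measure and the 0.8 threshold are assumptions of this example, not values disclosed herein:

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def match_segment(target_voiceprint, data_set: dict, threshold: float = 0.8):
    """Return the picture of the best-matching sample voiceprint, if any."""
    target = np.asarray(target_voiceprint, dtype=float)
    best_key, best_sim = None, -1.0
    for sample_voiceprint in data_set:
        sim = cosine(target, np.asarray(sample_voiceprint, dtype=float))
        if sim > best_sim:
            best_key, best_sim = sample_voiceprint, sim
    if best_key is not None and best_sim >= threshold:
        return data_set[best_key]["picture"]
    return None
```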

In an embodiment, when the processor 110 performs the synthesizing process on the video data by using the picture data, the following operations are specifically performed:

acquiring a corresponding end time point of the target voiceprint information in a target audio clip;

acquiring a video data segment which takes the ending time point as starting time and takes second preset duration as segment length from the video data;

and synthesizing the video data fragments by adopting the picture data.
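
The windowing arithmetic behind this step can be illustrated as follows; all numeric values are assumed for the example:

```python
end_time_ms = 12_400       # end time point of the matched voiceprint (ms)
second_preset_ms = 3_000   # assumed "second preset duration" (ms)
fps = 30                   # assumed frame rate of the video data

start_frame = end_time_ms * fps // 1000                      # -> 372
stop_frame = (end_time_ms + second_preset_ms) * fps // 1000  # -> 462
frames_to_composite = range(start_frame, stop_frame)         # frames 372..461
```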

In an embodiment, when the processor 110 performs the synthesizing process on the video data segment by using the picture data, the following operations are specifically performed:

acquiring a picture layer corresponding to the picture data and a video layer corresponding to the video data fragment;

synthesizing the picture layer and the video layer; wherein the picture layer is superimposed on the video layer.
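
A conventional way to superimpose the picture layer on the video layer is alpha blending; the following generic sketch assumes RGBA picture data that fits entirely inside the RGB frame, and is not the concrete implementation of this application:

```python
import numpy as np


def overlay(frame: np.ndarray, picture_rgba: np.ndarray, x: int, y: int) -> np.ndarray:
    """Superimpose an RGBA picture on an RGB frame at top-left corner (x, y)."""
    h, w = picture_rgba.shape[:2]
    region = frame[y:y + h, x:x + w].astype(np.float32)      # video layer pixels
    rgb = picture_rgba[..., :3].astype(np.float32)           # picture layer colors
    alpha = picture_rgba[..., 3:4].astype(np.float32) / 255  # per-pixel opacity
    frame[y:y + h, x:x + w] = (alpha * rgb + (1 - alpha) * region).astype(np.uint8)
    return frame
```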

In an embodiment, when the processor 110 performs the synthesizing process on the picture layer and the video layer, the following operation is specifically performed:

acquiring a first position corresponding to a face in the video data segment in the video layer;

and acquiring a second position corresponding to the first position on the picture layer, and displaying the picture corresponding to the picture data at the second position.
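
One assumed placement rule, sketched for illustration only (the application does not fix a concrete mapping between the first and second positions): anchor the picture just above the detected face box.

```python
def sticker_position(face_box: tuple, sticker_size: tuple) -> tuple:
    """face_box = (x, y, w, h) on the video layer; returns the sticker's top-left."""
    fx, fy, fw, fh = face_box
    sw, sh = sticker_size
    x = fx + (fw - sw) // 2   # centred horizontally on the face
    y = max(0, fy - sh)       # placed directly above the face box
    return x, y


print(sticker_position((200, 150, 120, 120), (80, 60)))  # -> (220, 90)
```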

In one embodiment, before acquiring the multimedia data stream comprising video data and audio data, the processor 110 further performs the following operations:

and acquiring the set sound effect information associated with the sample voiceprint information, and storing the sound effect information into the data set.

In one embodiment, when outputting the audio data and the synthesized video data based on the video application, the processor 110 specifically performs the following operations:

searching the sound effect information associated with the sample voiceprint information in the data set, and synthesizing the audio data by adopting the sound effect information;

outputting the audio data after the synthesis processing and the video data after the synthesis processing based on the video application.
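
The sound effect synthesis can be illustrated by a plain sample-wise mix; the int16 sample format, the offset, and the 0.6 gain are assumptions of this sketch:

```python
import numpy as np


def mix(audio: np.ndarray, effect: np.ndarray, offset: int,
        gain: float = 0.6) -> np.ndarray:
    """Add the int16 `effect` samples into `audio` starting at `offset`."""
    out = audio.astype(np.int32)                       # widen to avoid overflow
    end = min(len(out), offset + len(effect))
    out[offset:end] += (gain * effect[:end - offset]).astype(np.int32)
    return np.clip(out, -32768, 32767).astype(np.int16)
```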

In an embodiment, when the processor 110 obtains the voiceprint information in the audio data and searches for the picture data associated with the voiceprint information, the following operations are specifically performed:

copying the audio data to obtain audio copy data;

and acquiring voiceprint information in the audio copy data, and searching picture data associated with the voiceprint information.

In this embodiment, the input sample voiceprint information and the set picture data associated with it are stored in a data set. Because both the voiceprint information and the picture data can be set by the user, personalized requirements are met. A multimedia data stream comprising video data and audio data is then obtained; the audio data is copied and the copy is split into at least one audio segment, which achieves the same recognition effect while preventing the original audio data from being damaged by the splitting and detection, thereby improving the output quality of the audio data.

Target voiceprint information contained in a target audio segment is obtained; if sample voiceprint information matching the target voiceprint information exists in the data set, the picture data associated with that sample voiceprint information is retrieved, the video data is synthesized with the picture data, and the audio data and the synthesized video data are output based on the video application. By splitting out the audio data in the multimedia data stream and matching its voiceprint information against the associated pictures, the video content is enriched and made more engaging, while comparison of voiceprint information improves the accuracy of voice recognition and the overall effect of using the video application. Furthermore, the picture can be displayed at the position of a face in the video data, making the video output more entertaining; sound effect information can also be associated with the sample voiceprint information and output whenever matching target voiceprint information is detected, further improving the usability of the video application.

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium; when the program is executed, the processes of the above method embodiments may be included. The storage medium may be a magnetic disk, an optical disk, a read-only memory, or a random access memory.

The above disclosure is merely a preferred embodiment of the present application and is not intended to limit the scope of the present application; equivalent variations and modifications made in accordance with the claims of the present application still fall within the scope of the present application.
