Voice recognition method and device, electronic equipment and storage medium

Document No.: 36634  Publication date: 2021-09-24

Note: This technique, "Voice recognition method and device, electronic equipment and storage medium", was created by Li Da on 2021-06-24. Its main content is as follows: The embodiment of the invention relates to the field of communications and discloses a voice recognition method, a voice recognition apparatus, an electronic device, and a storage medium. The voice recognition method comprises: acquiring a file to be recognized; dividing the file to be recognized to obtain a plurality of sub-files to be recognized; establishing at least two connections and uploading the sub-files to be recognized simultaneously through the at least two connections, so that a server can perform voice recognition on the sub-files to be recognized; acquiring voice recognition results of the sub-files to be recognized; and splicing the voice recognition results of the sub-files to be recognized to obtain a voice recognition result of the file to be recognized. Parallel processing of files in the voice recognition process is thereby realized, voice recognition efficiency is improved, server processing time is reduced, and user experience is improved.

1. A speech recognition method, comprising:

acquiring a file to be recognized;

dividing the file to be recognized to obtain a plurality of sub-files to be recognized;

establishing at least two connections and uploading the sub-files to be recognized simultaneously through the at least two connections, so that a server can perform voice recognition on the sub-files to be recognized;

acquiring voice recognition results of the sub-files to be recognized; and

splicing the voice recognition results of the sub-files to be recognized to obtain a voice recognition result of the file to be recognized.

2. The method according to claim 1, wherein the format of the file to be recognized is the WAV (Waveform Audio) format, and the acquiring a file to be recognized comprises:

acquiring an original file; and

when the format of the original file is not the WAV format, converting the format of the original file into the WAV format to obtain the file to be recognized.

3. The method according to claim 1, wherein before the dividing the file to be recognized to obtain a plurality of sub-files to be recognized, the method further comprises: acquiring a total playing duration of the file to be recognized;

and the dividing the file to be recognized to obtain a plurality of sub-files to be recognized comprises:

dividing the file to be recognized according to the total playing duration to obtain the sub-files to be recognized.

4. The method according to claim 3, wherein the voice recognition result of each sub-file to be recognized comprises text information and time information,

before the splicing the voice recognition results of the sub-files to be recognized to obtain the voice recognition result of the file to be recognized, the method further comprises: acquiring the recorded start time of each sub-file to be recognized within the total playing duration;

and the splicing the voice recognition results of the sub-files to be recognized comprises:

modifying the time information according to the start times; and

splicing the text information according to the modified time information.

5. The method according to claim 1, wherein the establishing at least two connections and uploading the sub-files to be recognized simultaneously through the at least two connections comprises:

creating a WebSocket connection for each sub-file to be recognized; and

simultaneously starting to upload the sub-files to be recognized through the WebSocket connections until the file data of all the sub-files to be recognized is uploaded successfully.

6. The method according to claim 5, wherein the creating a WebSocket connection for each sub-file to be recognized comprises:

creating an instance for each sub-file to be recognized; and

establishing the WebSocket connection for the sub-file to be recognized by using the instance;

and the acquiring voice recognition results of the sub-files to be recognized comprises:

receiving, through the WebSocket connection corresponding to each instance, the voice recognition result of the sub-file to be recognized returned by the server.

7. The method of claim 1, further comprising:

and storing the voice recognition result of the file to be recognized.

8. A speech recognition apparatus, comprising:

an acquisition module, configured to acquire a file to be recognized;

a division module, configured to divide the file to be recognized to obtain a plurality of sub-files to be recognized;

a communication module, configured to establish at least two connections, upload the sub-files to be recognized simultaneously through the at least two connections so that a server can perform voice recognition on the sub-files to be recognized, and acquire voice recognition results of the sub-files to be recognized; and

a processing module, configured to splice the voice recognition results of the sub-files to be recognized to obtain a voice recognition result of the file to be recognized.

9. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speech recognition method according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech recognition method of any one of claims 1 to 7.

Technical Field

The present invention relates to the field of communications, and in particular, to a voice recognition method, apparatus, electronic device, and storage medium.

Background

Voice recognition is a technology for converting voice signals into corresponding text, and is widely applied in scenarios such as long-sentence voice input, audio and video subtitling, live-broadcast quality inspection, and meeting minutes. At present, various service providers offer servers capable of performing voice recognition: when a user needs voice recognition performed on a media file containing a voice signal, the user only needs to upload the file to the server and receive the voice recognition result returned by the server. The method has two main implementation modes. The first is real-time voice recognition: after a WebSocket connection is established between the client and the server, the file is uploaded to the server through the connection and the voice recognition result returned by the server is acquired. WebSocket is a full-duplex protocol based on the Transmission Control Protocol (TCP); that is, after the connection is established, both communicating parties can continuously send data to each other. The second is uploading the complete file first and then recognizing it; in this case the file can be streamed over a WebSocket connection, or uploaded at once in a POST request over a Hypertext Transfer Protocol (HTTP) connection.

However, to enable the server to perform voice recognition on the received file, the client generally converts the format of the file to be recognized into the format specified by the server before transmission. A common conversion is to turn files in formats other than the Waveform Audio (WAV) format into WAV files with a sampling rate of 16 kHz, 16 bits per sample, and a single (mono) channel. In this case, 1 second of audio content amounts to 32 kilobytes (KB) of data. Data on a WebSocket connection is transmitted in units of frames; with a frame length of 160 milliseconds, each frame carries about 5 KB of data, i.e., about 5 KB is transmitted per send. Frames are also sent at intervals, typically one frame every 20 ms to 200 ms. If one frame is sent every 160 ms, a 1-minute, 1.8-megabyte (MB) audio file takes more than 1.1 minutes to transmit because of these intervals; similarly, a 10-minute, 18.7 MB file takes more than 10.24 minutes, and a 1-hour, 115.2 MB file takes at least 1.024 hours to finish uploading. In addition, network conditions and similar factors make uploading even less efficient, which means the server can begin receiving and recognizing the file only as fast as it is transmitted; that is, the current speech recognition approach of uploading a file over a single WebSocket connection for the server to process is very inefficient.
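A quick back-of-the-envelope check of the figures above, as a sketch: it computes only a lower bound that ignores network latency and protocol overhead, which is why real transfers exceed the audio's own duration (e.g. the 1.1 minutes quoted for a 1-minute file).

```python
# Transfer-time lower bound for frame-by-frame upload of 16 kHz / 16-bit /
# mono WAV audio, using the frame parameters given in the text.

def wav_bytes_per_second(sample_rate=16_000, sample_bits=16, channels=1):
    return sample_rate * sample_bits // 8 * channels

def min_upload_seconds(audio_seconds, frame_ms=160, interval_ms=160):
    """Lower bound on upload time when one frame is sent per interval."""
    total_bytes = wav_bytes_per_second() * audio_seconds
    frame_bytes = wav_bytes_per_second() * frame_ms // 1000
    n_frames = -(-total_bytes // frame_bytes)  # ceiling division
    return n_frames * interval_ms / 1000

print(wav_bytes_per_second())   # 32000 bytes, i.e. 32 KB per second of audio
print(min_upload_seconds(60))   # 60.0 -- at least real time, before overhead
```

Per-frame overhead and send intervals can only add to this bound, which matches the document's observation that upload time exceeds playback time.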

Disclosure of Invention

Embodiments of the present invention provide a voice recognition method, an apparatus, an electronic device, and a storage medium, which implement parallel processing of files in a voice recognition process, improve voice recognition efficiency, reduce server processing time, and improve user experience.

To achieve the above object, an embodiment of the present invention provides a speech recognition method, including: acquiring a file to be recognized; dividing the file to be recognized to obtain a plurality of sub-files to be recognized; establishing at least two connections and uploading the sub-files to be recognized simultaneously through the at least two connections, so that a server can perform voice recognition on the sub-files to be recognized; acquiring voice recognition results of the sub-files to be recognized; and splicing the voice recognition results of the sub-files to be recognized to obtain a voice recognition result of the file to be recognized.

To achieve the above object, an embodiment of the present invention further provides a speech recognition apparatus, including: an acquisition module, configured to acquire a file to be recognized; a division module, configured to divide the file to be recognized to obtain a plurality of sub-files to be recognized; a communication module, configured to establish at least two connections, upload the sub-files to be recognized simultaneously through the at least two connections so that a server can perform voice recognition on the sub-files to be recognized, and acquire voice recognition results of the sub-files to be recognized; and a processing module, configured to splice the voice recognition results of the sub-files to be recognized to obtain a voice recognition result of the file to be recognized.

To achieve the above object, an embodiment of the present invention further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the speech recognition method described above.

To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program, which when executed by a processor implements the above-described speech recognition method.

According to the voice recognition method provided by the embodiments of the present invention, after the file to be recognized is divided into a plurality of sub-files to be recognized, at least two connections are established and the sub-files are uploaded through them simultaneously, so that the server can receive and recognize the voice signals in the sub-files in parallel. This realizes parallel processing of files in the voice recognition process, improves voice recognition efficiency, reduces server processing time, and improves user experience.

In addition, in the speech recognition method provided by the embodiments of the present invention, the format of the file to be recognized is the Waveform Audio (WAV) format, and acquiring the file to be recognized includes: acquiring an original file; and, when the format of the original file is not WAV, converting it into the WAV format to obtain the file to be recognized. Uniformly converting the files to be processed into the WAV format reduces the processing load on the server and makes it easier for the server to perform voice recognition on the files.

In addition, the speech recognition method provided by the embodiments of the present invention further includes, before dividing the file to be recognized into the plurality of sub-files to be recognized: acquiring the total playing duration of the file to be recognized; dividing the file then includes: dividing the file to be recognized according to the total playing duration to obtain the sub-files to be recognized. File size is affected by factors such as encoding resolution, whereas playing duration directly describes the length of the voice signal, which makes division and recognition more convenient. Since the voice recognition results of the sub-files must be spliced later, the start time of each sub-file within the total playing duration is recorded so that it can serve as the basis for splicing.

In addition, in the speech recognition method provided by the embodiments of the present invention, the voice recognition result of each sub-file to be recognized includes text information and time information, and the method further includes, when dividing the file to be recognized according to the total playing duration: acquiring the recorded start time of each sub-file within the total playing duration. Splicing the voice recognition results of the sub-files then includes: modifying the time information according to the start times; and splicing the text information according to the modified time information. Because the time information in each sub-file's result is corrected first, splicing according to the modified time information is more accurate and reliable.

In addition, in the voice recognition method provided by the embodiments of the present invention, uploading the sub-files to be recognized simultaneously includes: creating a WebSocket connection for each sub-file to be recognized; and simultaneously starting to upload the sub-files through the WebSocket connections until the file data of all sub-files has been uploaded successfully. Uploading over WebSocket connections satisfies the need for persistent transmission during file transfer and avoids having to establish a connection repeatedly to complete the transmission, improving user experience.

In addition, in the speech recognition method provided by the embodiments of the present invention, creating a WebSocket connection for each sub-file to be recognized includes: creating an instance for each sub-file; and establishing the WebSocket connection for the sub-file by using the instance. Acquiring the voice recognition result of a sub-file then includes: receiving, through the WebSocket connection corresponding to the instance, the voice recognition result of the sub-file returned by the server. Because each instance transmits one sub-file and receives that sub-file's result, each result can be matched to its sub-file by instance. This avoids mismatches between recognition results and sub-files, guarantees the correspondence of results, and improves the accuracy of the spliced voice recognition result of the whole file.

In addition, the speech recognition method provided by the embodiments of the present invention further includes: storing the voice recognition result of the file to be recognized, which makes the result convenient for the user to use later.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not to scale unless otherwise specified.

FIG. 1 is a flow chart of a speech recognition method provided by an embodiment of the present invention;

FIG. 2 is a flow chart of a speech recognition method including a format conversion step according to another embodiment of the present invention;

FIG. 3 is a flowchart of a speech recognition method according to another embodiment of the present invention, which includes a segmentation step according to the total playing duration of the file to be recognized;

FIG. 4 is a flowchart of a speech recognition method including a step of creating a WebSocket connection according to another embodiment of the present invention;

FIG. 5 is a diagram of the correspondence between WebSocket connections and instances provided by another embodiment of the present invention;

FIG. 6 is a flowchart of a speech recognition method including a step of modifying time information according to another embodiment of the present invention;

FIG. 7 is a flowchart of a speech recognition method including a step of storing the speech recognition result of a file to be recognized according to another embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a speech recognition apparatus provided in another embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an electronic device provided in another embodiment of the present invention.

Detailed Description

As described in the Background, in the related art a user terminal uploads the file to be recognized through a single WebSocket connection, but because of the uniform fixed format, the frame-sending interval, network conditions, and similar factors, uploading is very slow; consequently the server can receive and recognize the file only as fast as it is transmitted. That is, the current speech recognition approach of uploading files over one WebSocket connection for server-side processing is very inefficient.

To realize parallel processing of files in the voice recognition process, improve voice recognition efficiency, reduce server processing time, and improve user experience, an embodiment of the present application provides a speech recognition method: acquiring a file to be recognized; dividing the file to be recognized to obtain a plurality of sub-files to be recognized; establishing at least two connections and uploading the sub-files simultaneously through them, so that the server can perform voice recognition on the sub-files; acquiring the voice recognition results of the sub-files; and splicing the results to obtain the voice recognition result of the file to be recognized.

According to the voice recognition method provided by the embodiments of the present invention, after the file to be recognized is divided into a plurality of sub-files to be recognized, at least two connections are established and the sub-files are uploaded through them simultaneously, so that the server can receive and recognize the voice signals in the sub-files in parallel. This realizes parallel processing of files in the voice recognition process, improves voice recognition efficiency, reduces server processing time, and improves user experience.
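The flow restated above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: `upload_and_recognize` is a hypothetical stand-in for the real per-connection upload/recognition round trip described in the steps below.

```python
# Minimal sketch of the claimed flow: split, upload in parallel, splice.
from concurrent.futures import ThreadPoolExecutor

def upload_and_recognize(chunk):
    # Placeholder: in the real method this streams `chunk` over its own
    # WebSocket connection and returns the server's recognition result.
    return f"<text for {chunk}>"

def recognize(file_chunks):
    # One worker (connection) per chunk; `map` returns results in chunk
    # order, so concatenating them reproduces the whole file's transcript.
    with ThreadPoolExecutor(max_workers=len(file_chunks)) as pool:
        results = list(pool.map(upload_and_recognize, file_chunks))
    return "".join(results)

print(recognize(["part0", "part1", "part2"]))
```

The key property is that the chunks travel concurrently while the final transcript keeps the original order, which is exactly what the splicing step relies on.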

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth to aid understanding of the present application, but the claimed technical solution can be implemented without these details, and various changes and modifications can be made based on the following embodiments. The embodiments are divided for convenience of description only, do not limit the specific implementation of the present invention, and may be combined with and refer to one another where no contradiction arises.

The implementation details of the speech recognition method of this embodiment are described below; they are provided only for ease of understanding and are not required to implement this embodiment.

Step 101, acquiring a file to be recognized.

In this embodiment, the file to be recognized is an audio file containing a voice signal, and its format matches the file format specified by the server that will perform voice recognition on it. For example, if the server can perform voice recognition only on files in the Waveform Audio (WAV) format, i.e., the format specified by the server is WAV, then the client must unify the files to be recognized into the WAV format before using the server. Thus, as shown in fig. 2, in some embodiments, step 101 comprises:

Step 1011, acquiring the original file.

In this embodiment, the original file is a media file, such as audio or audio/video, containing a voice signal. The format of the original file is not limited: it may be any file format capable of carrying a voice signal.

In one example, obtaining the original file may consist of the user selecting a recording, a video, or the like on a device.

Step 1012, when the format of the original file is not the WAV format, converting it into the WAV format to obtain the file to be recognized.

In this embodiment, the server supports voice recognition of audio files in the WAV format. It is therefore necessary to detect whether the original file is in the WAV format and, if it is not, to convert it into the WAV format so that the server can process it.

It should be noted that WAV is one of the most common sound file formats. A WAV file is a standard digital audio file, usable on operating systems such as Windows, Macintosh, and Linux; it is highly practical, can record mono or stereo sound, and preserves the sound without distortion. Consequently, engineers usually develop speech recognition against WAV sound files, and servers usually offer speech recognition services for WAV files. Of course, this embodiment uses the WAV format only as an example; in other embodiments, if the server supports voice recognition of files in other formats, the files to be recognized may instead be unified into audio formats such as MPEG Audio Layer 3 (MP3), Musical Instrument Digital Interface (MIDI), or Windows Media Audio (WMA).
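One common way to carry out the conversion in step 1012 is to shell out to ffmpeg; this is an assumption for illustration, since the document names no tool. The flags request the 16 kHz, mono, 16-bit PCM profile described above.

```python
# Build an ffmpeg command that converts any decodable media file into the
# 16 kHz / mono / 16-bit PCM WAV profile the server is assumed to expect.
def ffmpeg_to_wav_cmd(src, dst):
    return [
        "ffmpeg", "-y",       # -y: overwrite the output if it exists
        "-i", src,            # input in any format ffmpeg can decode
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to a single (mono) channel
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM audio codec
        dst,
    ]

cmd = ffmpeg_to_wav_cmd("meeting.mp4", "meeting.wav")
print(" ".join(cmd))
# To actually convert: subprocess.run(cmd, check=True)
```

Building the argument list separately keeps the conversion testable without requiring ffmpeg to be installed.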

Step 102, dividing the file to be recognized to obtain a plurality of sub-files to be recognized.

In this embodiment, the file may be divided according to its size or according to the total playing duration of its content. The number of sub-files to be recognized and their specific durations are not limited and can be set flexibly according to actual requirements.

In particular, since the sub-files are subsequently uploaded to and processed by the server at the same time, dividing the file evenly shortens transmission and processing time: it avoids the situation in which one sub-file is still uploading while the others have finished, leaving some connections idle and under-utilized.

Further, as shown in fig. 3, if the division is performed according to the total playing duration of the file to be recognized, in some embodiments the method further comprises, before step 102:

Step 106, acquiring the total playing duration of the file to be recognized.

At this time, as shown in fig. 3, step 102 specifically includes: dividing the file to be recognized according to the total playing duration to obtain the sub-files to be recognized.

More specifically, after the total playing duration of the file to be recognized is obtained, either some time points within the total duration are selected as division points, or it is determined, according to actual needs, resource consumption, and the like, that the file should be divided into N segments (N being an integer greater than or equal to 2); the total playing duration is then divided evenly into N segments, and the file content corresponding to each segment forms one sub-file to be recognized.

In one example, the total playing duration of the file to be recognized is 1 hour and it is determined that the file needs to be divided into 3 segments. The total playing duration is divided into 3 segments of 60 minutes / 3 = 20 minutes each, namely [0, 20), [20, 40), and [40, 60] (unit: minutes). The start time of the sub-file corresponding to the first segment is recorded as 0, that of the second as 20 minutes, and that of the third as 40 minutes.
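The even split in the example above can be sketched as a small helper that also yields each sub-file's recorded start time within the total duration (function and parameter names are illustrative, not from the document):

```python
# Split a total playing duration (seconds) into n equal time segments.
def split_by_duration(total_seconds, n_segments):
    seg = total_seconds / n_segments
    return [(i * seg, min((i + 1) * seg, total_seconds))
            for i in range(n_segments)]

segments = split_by_duration(3600, 3)   # 1 hour into 3 segments
print(segments)  # [(0.0, 1200.0), (1200.0, 2400.0), (2400.0, 3600.0)]

# The first element of each pair is the recorded start time used later
# as the basis for splicing: 0, 20 minutes, 40 minutes.
starts = [start for start, _ in segments]
```

The actual division points are a design choice; the document also allows hand-picked time points rather than an even split.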

Step 103, establishing at least two connections and uploading the sub-files to be recognized simultaneously through the at least two connections, so that the server can perform voice recognition on the sub-files to be recognized.

In this embodiment, the number of connections is not limited. Each sub-file to be recognized may have its own connection: for example, M sub-files are obtained by division and M connections are created. Alternatively, several sub-files may share one connection: for example, M sub-files are obtained and J connections are created, where M is an integer greater than 2 and J is an integer greater than or equal to 2 and less than M.

Specifically, in the case where each sub-file to be recognized has its own connection, as shown in fig. 4, step 103 specifically includes:

Step 1031, creating a WebSocket connection for each sub-file to be recognized.

Generally, when a client uploads an audio/video file for server-side voice recognition, a connection must be established between client and server; when the uploaded data stream is large, the connection must be persistent and stable so that the file can be uploaded continuously until all of its content has been sent. Moreover, the client may need to upload audio/video data while simultaneously receiving the voice recognition results returned by the server, and the server correspondingly receives data while recognizing and returning results. A WebSocket connection is therefore usually established: it allows full-duplex communication between client and server, so either party can push data to the other over the established connection, and the connection state is maintained after a single handshake. Compared with traditional network communication over stateless, connectionless, unidirectional HTTP connections, this is more efficient and more durable and stable.

Further, in some embodiments, step 1031 can be described at the code level as the following steps: create an instance for each sub-file to be recognized, and then establish a WebSocket connection for the sub-file using that instance. Specifically, after the file to be recognized is divided into a plurality of sub-files, an instance is generated for each sub-file and the start time of the sub-file is recorded in the instance; a WebSocket connection is then created for each instance, over which the audio signal of the instance's sub-file is uploaded to the server and the data returned by the server is received. The correspondence between WebSocket connections and instances is shown in fig. 5.
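The "one instance per sub-file, one WebSocket connection per instance" arrangement can be sketched as below. The connection here is a stub; a real client might use a WebSocket library such as the `websockets` package, which the document does not name, so this is an assumption.

```python
# Each instance holds one sub-file, its recorded start time, and (once
# opened) the WebSocket connection dedicated to that sub-file.
from dataclasses import dataclass

@dataclass
class RecognitionInstance:
    subfile: str               # path or buffer of the sub-file to upload
    start_seconds: float       # its start time within the whole file
    connection: object = None  # set when the WebSocket is opened

    def open_connection(self, connect=lambda: "ws-stub"):
        # `connect` stands in for an actual WebSocket handshake.
        self.connection = connect()
        return self.connection

# Three 20-minute sub-files from the 1-hour example above.
instances = [RecognitionInstance(f"chunk{i}.wav", i * 1200.0) for i in range(3)]
for inst in instances:
    inst.open_connection()
print([inst.start_seconds for inst in instances])  # [0.0, 1200.0, 2400.0]
```

Keeping the start time inside the instance is what later lets each returned result be matched to its sub-file and shifted onto the whole file's timeline.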

Step 1032, simultaneously starting to upload the sub-files to be recognized through the WebSocket connections until the file data of all the sub-files has been uploaded successfully.

Step 104, acquiring the voice recognition results of the sub-files to be recognized.

In this embodiment, the server may perform voice recognition after a sub-file has been fully received, or while receiving it; how the server performs voice recognition is not limited here. The client only needs to receive the voice recognition result returned by the server after it starts uploading the sub-files to be recognized.

Further, in some embodiments, step 104 can be described at the code level as the following step: receiving, through the WebSocket connection corresponding to the instance, the voice recognition result of the sub-file to be recognized returned by the server.

It should be noted that, to give the user a better experience (for example, so that the text information can be turned into subtitles according to the time information), the voice recognition result generally contains not only the text information converted from the voice signal but also the time information of that signal. The time information is usually the time range of the voice signal corresponding to each piece of text: for example, if a sentence in an audio file is recognized as text information A, the period that sentence occupies within the file's playing duration is the time information of A, and the time information of all text information of a sub-file constitutes the time information of that sub-file.

Step 105: splice the speech recognition results of the sub-files to be recognized to obtain the speech recognition result of the file to be recognized.

In this embodiment, the speech recognition result of a sub-file includes text information and time information. The text information is the text converted from the speech signal; the time information is usually the time sequence of the text relative to the speech signal. For example, if a sentence in an audio file is recognized as text A, the period that text A occupies within the file's total playing duration is the time information of text A, and the time information of all the text of a sub-file constitutes the time information of that sub-file.

In particular, this embodiment does not limit which server provides the speech recognition; it may be any commonly used speech recognition server. The time information returned by such servers is relative to the uploaded file, which in this embodiment is a sub-file rather than the whole file to be recognized; that is, the time information in a returned result is expressed in terms of the sub-file's duration. In some special application scenarios this cannot be used directly: for example, when the recognition result is displayed as subtitles over an audio/video, video frames must be matched to the text using the time of the complete audio/video file, i.e. the file to be recognized, not of a short segment split off before recognition. Therefore, in some embodiments the returned time information of a sub-file must first be modified. Referring to fig. 6, before step 105 the method further includes:

Step 107: acquire the recorded start time of each sub-file to be recognized within the total playing duration.

In this embodiment, when the file to be recognized is segmented at selected time points, the start time of each sub-file is determined and recorded from those points; when the file is divided evenly into N sub-files, the start time of each sub-file is determined from the total playing duration and N. Note that in this case step 102 segments the file to be recognized according to the total playing duration to obtain the sub-files.
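For the even-split case, the bookkeeping of step 107 reduces to a simple computation. `start_times` is a hypothetical helper, sketched under the assumption that sub-file i starts at i times (total duration / N):

```python
# Assumed even split: sub-file i of n covers
# [i * total / n, (i + 1) * total / n), so its start time is i * total / n.
def start_times(total_seconds: float, n: int) -> list:
    return [i * total_seconds / n for i in range(n)]

# A one-hour file split into 4 sub-files:
print(start_times(3600.0, 4))  # → [0.0, 900.0, 1800.0, 2700.0]
```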

In addition, the time information needs to be modified so that it corresponds to time within the file to be recognized; step 105 therefore specifically includes:

Step 1051: for each sub-file, modify its time information according to the corresponding start time.

In this embodiment, the time information of the speech signal within the whole file to be recognized is obtained by adding the sub-file's start time to each piece of its time information. For example, if a sub-file starts at 34 minutes and a sentence in it has time information 13 min 24 s to 14 min 2 s, the modified time information of that sentence within the whole file is 47 min 24 s to 48 min 2 s.
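This addition can be sketched as follows; `shift` and the segment fields are hypothetical names, with times expressed in seconds (34 min = 2040 s, 13 min 24 s = 804 s, 14 min 2 s = 842 s):

```python
# Sketch of step 1051: shift each segment's time span by the sub-file's
# start time, so that the span refers to the whole file's timeline.
def shift(result, start_time):
    return [{"text": seg["text"],
             "start": seg["start"] + start_time,
             "end": seg["end"] + start_time} for seg in result]

# A sub-file starting at 34 min (2040 s): a sentence at 13:24-14:02 within
# it (804 s - 842 s) becomes 47:24-48:02 (2844 s - 2882 s) in the whole file.
shifted = shift([{"text": "some sentence", "start": 804.0, "end": 842.0}], 2040.0)
```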

Step 1052: splice the text information according to the modified time information.

In this embodiment, the order of the sub-files within the file to be recognized is determined by sorting the modified time information, and the text information is then spliced in that order.
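A sketch of this splicing step under an assumed segment format (`splice` is a hypothetical name): the already-shifted results are merged and ordered by start time, so the text comes out in whole-file order with its time information still attached:

```python
# Sketch of step 1052: merge the shifted per-sub-file results and sort by
# modified start time; text and time information stay paired throughout.
def splice(shifted_results):
    merged = [seg for result in shifted_results for seg in result]
    return sorted(merged, key=lambda seg: seg["start"])

part_b = [{"text": "world", "start": 10.0, "end": 12.0}]
part_a = [{"text": "hello", "start": 0.0, "end": 2.0}]
full = splice([part_b, part_a])  # ordered: "hello" then "world"
```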

It should be noted that the time information and text information in a speech recognition result correspond one to one, so the modified time information and the text information also correspond one to one; that is, the spliced text converted from the whole file to be recognized still carries its corresponding modified time information. A spliced result looks like:

00:20:00,100-00:20:05,410

the data table is a matrix of data and is a grid virtual table for temporarily storing data.

00:20:06,870-00:20:15,890

A database: a database is a repository that organizes, stores, and manages data according to a data structure.

Further, as shown in fig. 7, after step 105 the method further includes:

and step 108, storing the voice recognition result of the file to be recognized.

Storage is performed in this embodiment so that the speech recognition result of the file to be recognized can conveniently be displayed, consulted, shared, and so on.

This embodiment does not limit the storage location or storage mode; any storage method that relevant users can subsequently access and use is acceptable.

An embodiment of the present invention further provides a speech recognition apparatus, as shown in fig. 8, including:

an obtaining module 801, configured to obtain a file to be identified.

The segmentation module 802 is configured to segment the file to be identified, and obtain a plurality of sub-files to be identified.

The communication module 803 is configured to establish at least two connections and upload the subfiles to be identified through the at least two connections at the same time, so that the server performs voice identification on the subfiles to be identified; and acquiring a voice recognition result of the subfile to be recognized.

The processing module 804 is configured to splice the speech recognition results of the sub-files to be recognized to obtain the speech recognition result of the file to be recognized.

It should be understood that the present embodiment is an apparatus embodiment corresponding to the method embodiment, and the present embodiment can be implemented in cooperation with the method embodiment. The related technical details mentioned in the method embodiment are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related art details mentioned in the present embodiment can also be applied in the method embodiment.

It should be noted that, all the modules involved in this embodiment are logic modules, and in practical application, one logic unit may be one physical unit, may also be a part of one physical unit, and may also be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, a unit that is not so closely related to solving the technical problem proposed by the present application is not introduced in the present embodiment, but this does not indicate that there is no other unit in the present embodiment.

An embodiment of the present invention further provides an electronic device, as shown in fig. 9, including:

at least one processor 901; and

a memory 902 communicatively connected to the at least one processor 901; wherein,

the memory 902 stores instructions executable by the at least one processor 901 to enable the at least one processor 901 to perform a speech recognition method provided by an embodiment of the present invention.

Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.

The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.

The embodiment of the invention also provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the method embodiments described above.

That is, as those skilled in the art can understand, all or part of the steps of the methods in the embodiments described above may be implemented by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.
