Speech recognition method, device, equipment and computer readable storage medium

Document No.: 139071 · Publication date: 2021-10-22

Note: This invention, Speech recognition method, device, equipment and computer readable storage medium, was designed and created by Qin Hongwei on 2021-07-23. Abstract: The application discloses a speech recognition method, a speech recognition device, speech recognition equipment, and a computer readable storage medium, which belong to the technical field of computers. The method comprises the following steps: acquiring voice data; recognizing first text data corresponding to the voice data; in response to the first text data including a first scene identifier, acquiring second text data corresponding to the voice data based on a first scene dictionary corresponding to the first scene identifier; and performing input based on the second text data. The method makes the recognized input text data better conform to the application scenario and improves the accuracy of speech recognition.

1. A method of speech recognition, the method comprising:

acquiring voice data;

identifying first text data corresponding to the voice data;

in response to the first text data including a first scene identifier, acquiring second text data corresponding to the voice data based on a first scene dictionary corresponding to the first scene identifier, wherein the first scene identifier is used to indicate an application scene corresponding to the voice data, and the first scene dictionary is used to indicate a text dictionary corresponding to the application scene; and

performing input based on the second text data.

2. The method of claim 1, wherein after the identifying first text data corresponding to the voice data, the method further comprises:

searching for a first scene identifier included in the first text data based on a correspondence between scene identifiers and scene dictionaries, wherein the first scene identifier is a scene identifier included in the correspondence between the scene identifiers and the scene dictionaries; and

acquiring a first scene dictionary corresponding to the first scene identifier based on the first scene identifier.

3. The method according to claim 2, wherein before the searching for a first scene identifier included in the first text data based on the correspondence between the scene identifiers and the scene dictionaries, the method further comprises:

acquiring a scene identifier and a scene dictionary corresponding to the scene identifier; and

obtaining the correspondence between the scene identifier and the scene dictionary based on the scene identifier and the scene dictionary corresponding to the scene identifier.

4. The method of any of claims 1-3, wherein the first scene dictionary comprises a mapping relationship between at least one recognition text and at least one scene text; and

the acquiring second text data corresponding to the voice data based on the first scene dictionary corresponding to the first scene identifier comprises:

searching for a first recognition text included in the first text data based on the mapping relationship between the at least one recognition text and the at least one scene text, wherein the first recognition text is a recognition text included in the mapping relationship between the at least one recognition text and the at least one scene text;

acquiring a first scene text corresponding to the first recognition text based on the first recognition text; and

acquiring second text data corresponding to the voice data based on the first scene text.

5. The method of claim 4, wherein the acquiring second text data corresponding to the voice data based on the first scene text comprises:

replacing the first recognition text in the first text data with the corresponding first scene text to obtain the second text data corresponding to the voice data.

6. A speech recognition apparatus, characterized in that the apparatus comprises:

the first acquisition module is used for acquiring voice data;

the recognition module is used for recognizing first text data corresponding to the voice data;

a second obtaining module, configured to, in response to the first text data including a first scene identifier, acquire, based on a first scene dictionary corresponding to the first scene identifier, second text data corresponding to the voice data, wherein the first scene identifier is used to indicate an application scene corresponding to the voice data, and the first scene dictionary is used to indicate a text dictionary corresponding to the application scene;

and the input module is used for performing input based on the second text data.

7. The apparatus according to claim 6, wherein the recognition module is further configured to search for a first scene identifier included in the first text data based on a correspondence between scene identifiers and scene dictionaries, wherein the first scene identifier is a scene identifier included in the correspondence between the scene identifiers and the scene dictionaries; and acquire a first scene dictionary corresponding to the first scene identifier based on the first scene identifier.

8. The apparatus according to claim 6 or 7, wherein the first scene dictionary comprises a mapping relationship between at least one recognition text and at least one scene text; and

the second obtaining module is configured to search for a first recognition text included in the first text data based on the mapping relationship between the at least one recognition text and the at least one scene text, wherein the first recognition text is a recognition text included in the mapping relationship between the at least one recognition text and the at least one scene text; acquire a first scene text corresponding to the first recognition text based on the first recognition text; and acquire second text data corresponding to the voice data based on the first scene text.

9. A computer device, comprising a processor and a memory, wherein at least one program code is stored in the memory, and the at least one program code is loaded and executed by the processor to cause the computer device to implement the speech recognition method according to any of claims 1 to 5.

10. A computer-readable storage medium having at least one program code stored therein, the at least one program code being loaded and executed by a processor to cause a computer to implement the speech recognition method according to any one of claims 1 to 5.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a speech recognition method, apparatus, device, and computer-readable storage medium.

Background

With the development of computer technology, speech recognition is becoming increasingly common in fields such as social applications, intelligent customer service, and voice assistants. In a speech input scenario, speech data needs to be converted into text data so that the text data can serve as the corresponding input text, thereby realizing speech input.

In the related art, a conventional speech recognition method is used to recognize speech data. Such a method is generally formed by combining multiple modules, such as an acoustic model, a pronunciation dictionary, and a language model, and converts speech data into text data through sequential mappings among speech features, phonemes, words, and word strings. The text data recognized in this way is then used directly as the input text to realize speech input.

However, the conventional speech recognition method may produce recognized text data that does not match the input text indicated by the speech data. For example, when the user wants to input "@", the input text recognized by a conventional speech recognition system is "at", resulting in low speech recognition accuracy.

Disclosure of Invention

The application provides a voice recognition method, a voice recognition device, voice recognition equipment and a computer-readable storage medium, which can solve the problems in the related art.

In a first aspect, a speech recognition method is provided, the method comprising: acquiring voice data; identifying first text data corresponding to the voice data; in response to the first text data including a first scene identifier, acquiring second text data corresponding to the voice data based on a first scene dictionary corresponding to the first scene identifier, wherein the first scene identifier is used to indicate an application scene corresponding to the voice data, and the first scene dictionary is used to indicate a text dictionary corresponding to the application scene; and performing input based on the second text data.

In a possible implementation manner, after the identifying first text data corresponding to the voice data, the method further includes: searching for a first scene identifier included in the first text data based on a correspondence between scene identifiers and scene dictionaries, wherein the first scene identifier is a scene identifier included in the correspondence between the scene identifiers and the scene dictionaries; and acquiring a first scene dictionary corresponding to the first scene identifier based on the first scene identifier.

In a possible implementation manner, before the searching for a first scene identifier included in the first text data based on the correspondence between the scene identifiers and the scene dictionaries, the method further includes: establishing scene identifiers and scene dictionaries respectively corresponding to the scene identifiers; and obtaining the correspondence between the scene identifiers and the scene dictionaries based on the scene identifiers and the scene dictionaries corresponding to them.

In a possible implementation manner, the first scene dictionary includes a mapping relationship between at least one recognition text and at least one scene text; and the acquiring second text data corresponding to the voice data based on the first scene dictionary corresponding to the first scene identifier includes: searching for a first recognition text included in the first text data based on the mapping relationship between the at least one recognition text and the at least one scene text, wherein the first recognition text is a recognition text included in the mapping relationship between the at least one recognition text and the at least one scene text; acquiring a first scene text corresponding to the first recognition text based on the first recognition text; and acquiring second text data corresponding to the voice data based on the first scene text.

In a possible implementation manner, the acquiring second text data corresponding to the voice data based on the first scene text includes: replacing the first recognition text in the first text data with the corresponding first scene text to obtain the second text data corresponding to the voice data.

In a second aspect, there is provided a speech recognition apparatus, the apparatus comprising:

the first acquisition module is used for acquiring voice data;

the recognition module is used for recognizing first text data corresponding to the voice data;

a second obtaining module, configured to, in response to the first text data including a first scene identifier, acquire, based on a first scene dictionary corresponding to the first scene identifier, second text data corresponding to the voice data, wherein the first scene identifier is used to indicate an application scene corresponding to the voice data, and the first scene dictionary is used to indicate a text dictionary corresponding to the application scene;

and the input module is used for performing input based on the second text data.

In a possible implementation manner, the recognition module is further configured to search for a first scene identifier included in the first text data based on a correspondence between scene identifiers and scene dictionaries, where the first scene identifier is a scene identifier included in the correspondence between the scene identifiers and the scene dictionaries; and acquire a first scene dictionary corresponding to the first scene identifier based on the first scene identifier.

In a possible implementation manner, the recognition module is further configured to establish scene identifiers and the scene dictionaries corresponding to the scene identifiers, and obtain the correspondence between the scene identifiers and the scene dictionaries based on them.

In a possible implementation, the first scene dictionary includes a mapping relationship between at least one recognition text and at least one scene text;

the second obtaining module is configured to search for a first recognition text included in the first text data based on the mapping relationship between the at least one recognition text and the at least one scene text, where the first recognition text is a recognition text included in the mapping relationship between the at least one recognition text and the at least one scene text; acquire a first scene text corresponding to the first recognition text based on the first recognition text; and acquire second text data corresponding to the voice data based on the first scene text.

In a possible implementation manner, the second obtaining module is configured to replace the first recognition text in the first text data with the corresponding first scene text to obtain the second text data corresponding to the voice data.

In a third aspect, a computer device is provided, which includes a processor and a memory, where at least one program code is stored in the memory, and the at least one program code is loaded and executed by the processor, so as to enable the computer device to implement any one of the above-mentioned speech recognition methods.

In a fourth aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to make a computer implement the speech recognition method according to any one of the above.

In a fifth aspect, there is also provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute any of the above-mentioned voice recognition methods.

The technical solutions provided by the present application can bring at least the following beneficial effects:

By applying scene identifiers and scene dictionaries, the first text data obtained by recognizing the voice data is converted into second text data, so that the second text data better conforms to the application scene corresponding to the voice data. This improves the accuracy of obtaining the corresponding text data from the voice data and thus improves the accuracy of speech recognition.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic diagram of an implementation environment of a speech recognition method according to an embodiment of the present application;

FIG. 2 is a flowchart of a speech recognition method provided by an embodiment of the present application;

FIG. 3 is an interaction diagram of a speech recognition method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

To make the objectives, technical solutions, and advantages of the present application clearer, embodiments of the present application are described in further detail below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of an implementation environment of a speech recognition method according to an embodiment of the present application. As shown in fig. 1, the implementation environment includes at least one terminal 101 and a server 102, and the at least one terminal 101 and the server 102 are connected through a network, which may be a wireless network or a wired network.

At least one terminal 101 may be a portable, pocket-sized, or handheld device, such as a smart phone, a tablet computer, a notebook computer, a smart speaker, or a smart watch, or a desktop computer. The server 102 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.

In one possible embodiment, at least one terminal 101 has a voice recognition application program installed, through which a user can convert voice data into text data for input and display the input text on a page of the application program. The speech recognition application may be a traditional application, a cloud application, an applet or application module within a host application, or a web platform, which is not limited herein.

The server 102 is used to provide a voice recognition service to the terminal 101. A user uploads voice data to be recognized through an application program in at least one terminal 101; the terminal 101 transmits the voice data to the server 102 through the network; and the server 102 recognizes the voice data and converts it into corresponding first text data. In response to the first text data including a first scene identifier, the server acquires second text data corresponding to the voice data based on a first scene dictionary corresponding to the first scene identifier, and returns the second text data to the terminal 101 through the network, so that the terminal 101 displays the second text data as input text in a page of the application program.

The method provided by the embodiment of the application can be applied to various scenes.

For example, consider the scenario of social software.

Text conversion may be performed on voice data received in social software. For example, a user receives a piece of voice information in the social software, such as a voice message sent by the other party during a chat, or a voice post published by a friend in the feed. The user can trigger a voice recognition instruction, so that the voice data is recognized using the speech recognition method provided by the embodiments of the present application and converted into text data for display, ensuring that the user can obtain the content of the message in time when it is inconvenient to listen to the voice.

The method can also be applied to the scenario of input method software.

For the voice input function provided in input method software, a user triggers voice recognition through a preset control in the input method software; the terminal sends the collected voice data to the server; the server processes the voice data to obtain corresponding text data and returns the text data to the terminal; and the terminal displays the text data as the content of the user's voice input. For example, the server may return a single piece of text data, or may return multiple similar pieces of text data determined from the voice data for the user to choose from.

The speech recognition method provided in the embodiments of the present application may also be applied to other application scenarios; the scenarios above are merely illustrative, and the method is not limited to specific application scenarios.

Please refer to fig. 2, which shows a flowchart of a speech recognition method provided in an embodiment of the present application. In the embodiments of the present application, the method is described as being performed by a server in the above-mentioned implementation environment, and the method includes the following steps 201 to 204.

Step 201, voice data is acquired.

In the embodiment of the application, a voice recognition application program can run in the terminal, and the user's voice audio can be acquired in real time through a sound receiving component such as the terminal's microphone. In a possible implementation manner, the terminal compresses the recorded voice audio, packages the compressed audio into a speech-to-text request using a network protocol, and sends the request to the server through a communication network. After receiving the speech-to-text request sent by the terminal, the server decompresses the compressed audio corresponding to the request to obtain the voice data to be recognized. Optionally, the server may also obtain the voice data from a database, which is not limited herein.
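The application does not specify the compression codec or network protocol. As a rough sketch only, assuming the terminal gzip-compresses the recorded audio and base64-encodes it into a JSON envelope (all field names here are hypothetical), the server side might unpack the request as follows:

```python
import base64
import gzip
import json

def unpack_speech_request(raw_request: bytes) -> bytes:
    """Unpack a hypothetical speech-to-text request and return the raw audio.

    Assumes the terminal gzip-compressed the recorded audio and base64-encoded
    it into a JSON envelope; the actual protocol is not specified by the
    application.
    """
    request = json.loads(raw_request.decode("utf-8"))
    compressed_audio = base64.b64decode(request["audio"])  # hypothetical field
    return gzip.decompress(compressed_audio)  # voice data to be recognized
```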

The voice data may be any content, for example, the voice data may be a word, a sentence, or the like; the voice data may be in any language, for example, chinese, english, or the like.

Step 202, identifying first text data corresponding to the voice data.

The embodiments of the present application do not limit the method used to recognize the first text data corresponding to the voice data; any method that can convert the voice data into corresponding text data may be used. Alternatively, the speech recognition method may be a template matching method, a statistical modeling method, an artificial neural network model, or a method based on a combination of vector quantization and hidden Markov models, etc.

In the embodiment of the application, a speech recognition method can be used to recognize the first text data corresponding to the voice data, and the first text data is often a transliterated text corresponding to the voice data. For example, if the voice data is "123456.com", the recognized first text data may be "123456 point com". Because the recognition does not take into account that the application scenario of the voice data is mailbox input, the recognized text does not match the text indicated by the voice data.

Step 203, in response to that the first text data includes the first scene identifier, acquiring second text data corresponding to the voice data based on the first scene dictionary corresponding to the first scene identifier.

In the embodiment of the application, after the first text data corresponding to the voice data is recognized, the method further checks whether the first text data includes a scene identifier. If the first text data includes a scene identifier, the first text data can be modified based on the scene dictionary corresponding to that scene identifier to obtain second text data corresponding to the voice data. The first scene identifier is used to indicate the application scene corresponding to the voice data, and the first scene dictionary is used to indicate the text dictionary corresponding to the application scene.

In a possible implementation, after the first text data corresponding to the voice data is recognized, the method further includes: searching for a first scene identifier included in the first text data based on the correspondence between scene identifiers and scene dictionaries; and acquiring a first scene dictionary corresponding to the first scene identifier based on the first scene identifier. The first scene identifier is a scene identifier included in the correspondence between the scene identifiers and the scene dictionaries; the number of scene identifiers it covers is not limited, and the first scene identifier includes at least one scene identifier from that correspondence.

Illustratively, the first text data corresponding to the voice data is recognized as "mailbox 123456 point com", where "mailbox" is a scene identifier included in the correspondence between scene identifiers and scene dictionaries, and thus the first scene identifier included in the first text data is determined to be "mailbox". Based on the scene identifier "mailbox", the corresponding first scene dictionary is obtained from the correspondence between scene identifiers and scene dictionaries; optionally, the first scene dictionary is a mailbox dictionary.
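A minimal sketch of this lookup, assuming the correspondence between scene identifiers and scene dictionaries is held in an in-memory mapping (the storage form is not specified by the application, and the dictionary contents shown are placeholders):

```python
from typing import Optional

# Hypothetical correspondence between scene identifiers and scene dictionaries.
SCENE_DICTIONARIES = {
    "mailbox": {"point": "."},   # placeholder mailbox dictionary
    "password": {},              # filled with real mappings in practice
}

def find_first_scene_identifier(first_text: str) -> Optional[str]:
    """Return the first known scene identifier appearing in the recognized text."""
    for identifier in SCENE_DICTIONARIES:
        if identifier in first_text:
            return identifier
    return None

# find_first_scene_identifier("mailbox 123456 point com") -> "mailbox"
```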

In a possible implementation manner, before searching for the first scene identifier included in the first text data based on the correspondence between scene identifiers and scene dictionaries, the method further includes: acquiring scene identifiers and the scene dictionaries corresponding to the scene identifiers; and obtaining the correspondence between the scene identifiers and the scene dictionaries based on them.

In one possible embodiment, in order for the result of speech recognition to take the particularities of the application scenario into account, a correspondence between at least one scene identifier and the scene dictionary corresponding to each identifier may be established based on the needs of the application scenarios. Each scene dictionary includes mapping relationships between multiple recognition texts and scene texts, so that text data obtained by transliteration with a traditional speech recognition method can be modified into text data suitable for the specific application scenario, improving the accuracy of speech recognition.

In the embodiment of the present application, the manner of establishing a scene identifier is not limited, as long as the established scene identifier can indicate the corresponding application scene and can be associated with a scene dictionary. For example, the scene identifiers may be "password", "mailbox", "number", and "symbol", which respectively correspond to application scenes for inputting a password, a mailbox address, a number, and a symbol. Alternatively, the scene identifier of the password input scene may be set to "sesame", or the scene identifier of the number input scene may be set to "arabic".

In one possible implementation, the scene dictionary corresponding to the scene identifier "password" is a "password dictionary", the scene dictionary corresponding to the scene identifier "mailbox" is a "mailbox dictionary", the scene dictionary corresponding to the scene identifier "number" is a "number dictionary", and the scene dictionary corresponding to the scene identifier "symbol" is a "symbol dictionary". Each scene identification corresponds to a scene dictionary, so that the corresponding relation between the scene identification and the scene dictionary can be obtained.

In this embodiment of the present application, each scene dictionary includes a mapping relationship between at least one recognition text and at least one scene text. This mapping relates text recognized without considering the scene to text adapted to the scene; it may be a one-to-one mapping or a many-to-one mapping, and the type of mapping is not limited in this embodiment of the present application.

For example, the scene dictionary "password dictionary" includes mappings between recognition texts and scene texts, including but not limited to: "capital F" maps to "F", "lower case a" maps to "a", "one" maps to "1", and so on.

As another example, the scene dictionary "symbol dictionary" includes mappings between recognition texts and scene texts, including but not limited to: "ellipsis" maps to "…", "semicolon" maps to ";", "question mark" maps to "?", and the like.
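Sketching such scene dictionaries as plain mappings from recognition texts to scene texts (only the example entries from the text are shown; real dictionaries would be larger):

```python
# Scene dictionaries as recognition-text -> scene-text mappings.
PASSWORD_DICTIONARY = {
    "capital F": "F",
    "lower case a": "a",
    "one": "1",
    "two": "2",
    "three": "3",
}

SYMBOL_DICTIONARY = {
    "ellipsis": "…",
    "semicolon": ";",
    "question mark": "?",
}
```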

In a possible implementation manner, acquiring second text data corresponding to the voice data based on the first scene dictionary corresponding to the first scene identifier includes: searching for a first recognition text included in the first text data based on the mapping relationship between the at least one recognition text and the at least one scene text; acquiring a first scene text corresponding to the first recognition text based on the first recognition text; and acquiring second text data corresponding to the voice data based on the first scene text. The first recognition text is a recognition text included in the mapping relationship between the at least one recognition text and the at least one scene text.

In one possible implementation, obtaining second text data corresponding to the voice data based on the first scene text includes: replacing the first recognition text in the first text data with the corresponding first scene text to obtain the second text data corresponding to the voice data.

Illustratively, the recognized first text data is "password one two three capital F". The first scene identifier included in this first text data is "password", and thus the first scene dictionary is the "password dictionary", which includes, but is not limited to, mappings of "capital F" to "F", "lower case a" to "a", "one" to "1", "two" to "2", and "three" to "3". Therefore, the first recognition texts included in the first text data can be determined to be "one", "two", "three", and "capital F"; the corresponding first scene texts obtained from the "password dictionary" are "1", "2", "3", and "F"; and by replacing the first recognition texts in the first text data with the corresponding first scene texts, the second text data corresponding to the voice data is obtained as "password 123F".

Optionally, acquiring second text data corresponding to the voice data based on the first scene text may further include: deleting the scene identifier from the first text data, and replacing the first recognition text in the first text data with the corresponding first scene text to obtain the second text data corresponding to the voice data. Illustratively, if the first text data is "password one two three capital F", then "password" is deleted, and "one", "two", "three", and "capital F" are replaced with "1", "2", "3", and "F", resulting in the second text data "123F".
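A minimal sketch of this replacement step, reusing PASSWORD_DICTIONARY from the sketch above (longest-first matching is an added assumption, so that multi-word entries such as "capital F" are matched before shorter ones):

```python
def apply_scene_dictionary(first_text: str,
                           scene_identifier: str,
                           scene_dictionary: dict,
                           drop_identifier: bool = False) -> str:
    """Replace recognition texts with scene texts to build the second text data.

    drop_identifier selects the optional variant that also deletes the scene
    identifier from the first text data.
    """
    second_text = first_text
    if drop_identifier:
        second_text = second_text.replace(scene_identifier, "").strip()
    # Replace longer recognition texts first so that multi-word entries are
    # not broken up by shorter ones (an added assumption of this sketch).
    for recognition_text in sorted(scene_dictionary, key=len, reverse=True):
        second_text = second_text.replace(
            recognition_text, scene_dictionary[recognition_text])
    return second_text

# apply_scene_dictionary("password one two three capital F", "password",
#                        PASSWORD_DICTIONARY) -> "password 1 2 3 F"
# (the original Chinese text has no spaces between words, so this corresponds
#  to the "password 123F" of the worked example; with drop_identifier=True
#  the result corresponds to "123F")
```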

Step 204, performing input based on the second text data.

In one possible embodiment, performing input based on the second text data includes: the server returns the obtained second text data to the terminal as the text data corresponding to the voice data, and the terminal enters the second text data into its display page as the input text. Optionally, the terminal displays the second text data as the input text in a page of an application program in the terminal.

In a possible implementation manner, the server may further return both the obtained first text data and the obtained second text data to the terminal, and the terminal selects which text data is displayed in the page of the application as the input text.

The embodiment of the present application provides a speech recognition method, as shown in fig. 3; the method includes the following steps 301 to 305.

Step 301, the terminal sends the acquired voice data to the server.

In one possible implementation, a voice recognition application program is installed in the terminal, and the user's voice audio can be acquired in real time through a sound receiving component such as the terminal's microphone. In a possible implementation manner, the terminal compresses the recorded voice audio, packages the compressed audio into a speech-to-text request using a network protocol, and sends the request to the server through the communication network.

Step 302, the server identifies first text data corresponding to the voice data based on the received voice data.

In a possible implementation manner, after receiving the speech-to-text request sent by the terminal, the server decompresses the compressed audio corresponding to the request to obtain the voice data to be recognized.

The embodiments of the present application do not limit the method used to recognize the first text data corresponding to the voice data; any method that can convert the voice data into corresponding text data may be used. Alternatively, the speech recognition method may be a template matching method, a statistical modeling method, an artificial neural network model, or a method based on a combination of vector quantization and hidden Markov models, etc.

Step 303, in response to the first text data including the first scene identifier, the server obtains, based on the first scene dictionary corresponding to the first scene identifier, second text data corresponding to the voice data.

The implementation of step 303 can refer to the related description of step 203, and is not described herein again.

Step 304, the server returns the second text data corresponding to the acquired voice data to the terminal.

In one possible implementation manner, the server may return both the first text data and the second text data corresponding to the acquired voice data to the terminal.

Step 305, the terminal inputs the second text data and displays the second text data in a page of the terminal.

The implementation of step 305 can refer to the related description of step 204, and is not described herein again.

According to the speech recognition method provided by the embodiments of the present application, by applying scene identifiers and scene dictionaries, the first text data obtained by recognizing the voice data is converted into second text data, so that the second text data better conforms to the application scene corresponding to the voice data. This improves the accuracy of obtaining the corresponding text data from the voice data and thus improves the accuracy of speech recognition.

Referring to fig. 4, an embodiment of the present application provides a speech recognition apparatus, including:

a first obtaining module 401, configured to obtain voice data;

a recognition module 402, configured to recognize first text data corresponding to the voice data;

a second obtaining module 403, configured to, in response to the first text data including a first scene identifier, acquire, based on a first scene dictionary corresponding to the first scene identifier, second text data corresponding to the voice data, where the first scene identifier is used to indicate an application scene corresponding to the voice data, and the first scene dictionary is used to indicate a text dictionary corresponding to the application scene;

an input module 404, configured to perform input based on the second text data.

In a possible implementation manner, the recognition module 402 is further configured to search for a first scene identifier included in the first text data based on a correspondence between scene identifiers and scene dictionaries, where the first scene identifier is a scene identifier included in the correspondence between the scene identifiers and the scene dictionaries; and acquire a first scene dictionary corresponding to the first scene identifier based on the first scene identifier.

In a possible implementation manner, the recognition module 402 is further configured to establish scene identifiers and the scene dictionaries corresponding to the scene identifiers, and obtain the correspondence between the scene identifiers and the scene dictionaries based on them.

In a possible implementation, the first scene dictionary includes a mapping relationship between at least one recognition text and at least one scene text; the second obtaining module 403 is configured to search for a first recognition text included in the first text data based on the mapping relationship between the at least one recognition text and the at least one scene text, where the first recognition text is a recognition text included in the mapping relationship between the at least one recognition text and the at least one scene text; acquire a first scene text corresponding to the first recognition text based on the first recognition text; and acquire second text data corresponding to the voice data based on the first scene text.

In a possible implementation manner, the second obtaining module 403 is configured to replace the first recognition text in the first text data with the corresponding first scene text to obtain the second text data corresponding to the voice data.

According to the speech recognition apparatus provided by the embodiments of the present application, after the first text data corresponding to the voice data is obtained through recognition, the first scene identifier included in the first text data is determined, and second text data is acquired, based on the first scene dictionary corresponding to the first scene identifier, as the text data corresponding to the voice data. Through the application of scene identifiers and scene dictionaries, the apparatus makes the recognized input text data better conform to the application scene and improves the accuracy of speech recognition.

It should be understood that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the functional modules described above is merely an example; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept; for their specific implementation processes, refer to the method embodiments, and details are not described herein again.

Referring to fig. 5, a schematic structural diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a terminal, and may be, for example: smart phones, tablet computers, vehicle-mounted terminals, notebook computers or desktop computers. A terminal may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.

Generally, a terminal includes: a processor 701 and a memory 702.

The processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit) which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 702 may include one or more computer-readable storage media, which may be non-transitory. Memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 702 is used to store at least one instruction for execution by processor 701 to implement a speech recognition method provided by method embodiments herein.

In some embodiments, the terminal may further include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 704, a display screen 705, a camera assembly 706, an audio circuit 707, a positioning component 708, and a power source 709.

The peripheral interface 703 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 701 and the memory 702. In some embodiments, processor 701, memory 702, and peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 704 communicates with communication networks and other communication devices via electromagnetic signals, converting electrical signals into electromagnetic signals for transmission and converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 704 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 704 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or Wireless Fidelity (WiFi) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.

The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to capture touch signals on or over the surface of the display screen 705. The touch signal may be input to the processor 701 as a control signal for processing. At this point, the display 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 705, disposed on the front panel of the terminal; in other embodiments, there may be at least two displays 705, respectively disposed on different surfaces of the terminal or in a folded design; in still other embodiments, the display 705 may be a flexible display disposed on a curved surface or a folded surface of the terminal. The display 705 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen. The display 705 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.

The camera assembly 706 is used to capture images or video. Optionally, camera assembly 706 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 706 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.

The audio circuitry 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 701 for processing or inputting the electric signals to the radio frequency circuit 704 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones can be arranged at different parts of the terminal respectively. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 707 may also include a headphone jack.

The positioning component 708 is used to locate the current geographic location of the terminal to implement navigation or LBS (Location Based Service). The positioning component 708 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 709 is used to supply power to various components in the terminal. The power source 709 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When power source 709 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the terminal also includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyro sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.

The acceleration sensor 711 can detect the magnitude of acceleration on three coordinate axes of a coordinate system established with the terminal. For example, the acceleration sensor 711 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 701 may control the display screen 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 711. The acceleration sensor 711 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 712 may detect a body direction and a rotation angle of the terminal, and the gyro sensor 712 may cooperate with the acceleration sensor 711 to acquire a 3D motion of the terminal by the user. From the data collected by the gyro sensor 712, the processor 701 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

Pressure sensors 713 may be disposed on the side frames of the terminal and/or underneath the display 705. When the pressure sensor 713 is arranged on the side frame of the terminal, a holding signal of a user to the terminal can be detected, and the processor 701 performs left-right hand identification or shortcut operation according to the holding signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed at a lower layer of the display screen 705, the processor 701 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 705. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 714 is used for collecting a fingerprint of a user, and the processor 701 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the identity of the user according to the collected fingerprint. When the user identity is identified as a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 714 may be disposed on the front, back, or side of the terminal. When a physical button or vendor Logo is provided on the terminal, the fingerprint sensor 714 may be integrated with the physical button or vendor Logo.

The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the display screen 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the ambient light intensity is high, the display brightness of the display screen 705 is increased; when the ambient light intensity is low, the display brightness of the display screen 705 is adjusted down. In another embodiment, processor 701 may also dynamically adjust the shooting parameters of camera assembly 706 based on the ambient light intensity collected by optical sensor 715.

A proximity sensor 716, also known as a distance sensor, is typically provided on the front panel of the terminal. The proximity sensor 716 is used to collect the distance between the user and the front face of the terminal. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front face of the terminal gradually decreases, the processor 701 controls the display screen 705 to switch from the bright screen state to the dark screen state; when the proximity sensor 716 detects that the distance between the user and the front face of the terminal gradually increases, the processor 701 controls the display 705 to switch from the dark screen state to the bright screen state.

Those skilled in the art will appreciate that the configuration shown in FIG. 5 is not intended to be limiting of computer devices and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.

Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1400 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 1401 and one or more memories 1402, where the memory 1402 stores at least one instruction that is loaded and executed by the processor 1401 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing device functions, which are not described herein again.

The server 1400 may be used to perform the steps performed by the server in the speech recognition method described above.

In an exemplary embodiment, a computer device is also provided that includes a processor and a memory having at least one program code stored therein. The at least one program code is loaded into and executed by one or more processors to cause a computer device to implement any of the speech recognition methods described above.

In an exemplary embodiment, there is also provided a computer-readable storage medium having at least one program code stored therein, the at least one program code being loaded and executed by a processor of a computer device to cause the computer to implement any of the speech recognition methods described above.

Alternatively, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform any of the speech recognition methods described above.

The present application is intended to cover various modifications, alternatives, and equivalents, which may be included within the spirit and scope of the present application.
