Voice interaction method and device and voice chip module


This technology, "Voice interaction method and device and voice chip module," was designed and created by 林云峰 and 郭万永 on 2020-05-18. Its main content is as follows: A voice interaction method, a voice interaction apparatus, and a voice chip module are disclosed. The method includes: receiving an audio input; acquiring text recognition information of the audio input; determining whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration; and if so, determining that the voice input has ended. In this way, the end of voice activity can be determined in a high-noise environment without significantly increasing the performance consumption of the device.

1. A voice interaction method, comprising:

receiving an audio input;

acquiring text recognition information of the audio input;

determining whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration;

and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, determining that the voice input has ended.

2. The method of claim 1, further comprising:

detecting whether speech exists in the received audio input;

determining whether the time during which no speech is detected is longer than a second duration;

and if the time during which no speech is detected is longer than the second duration, determining that the voice input has ended.

3. The method of claim 1, further comprising:

determining whether the time during which no new audio input is received is longer than a third duration;

and if the time during which no new audio input is received is longer than the third duration, determining that the voice input has ended.

4. The method of claim 1, wherein acquiring the text recognition information of the audio input comprises:

uploading the audio input to a server;

and receiving the text recognition information returned by the server, wherein the text recognition information is obtained by the server performing semantic recognition on the audio input.

5. The method of claim 1, further comprising:

and executing an operation instruction corresponding to the text recognition information when it is determined that the voice input has ended.

6. The method of claim 1, further comprising:

determining whether the audio input is audio data acquired in a high-noise scene;

and in the case that the audio input is audio data acquired in a high-noise scene, determining whether the voice input has ended by determining whether the time during which no newly recognized text recognition information has been acquired is longer than the first duration.

7. The method of claim 6, wherein determining whether the audio input is audio data captured in a high noise scene comprises:

and determining, based on preset high-noise scene audio data, whether the audio input is audio data acquired in a high-noise scene.

8. A voice interaction method applied to a treadmill, comprising:

collecting audio input while the treadmill is in a running state;

uploading the audio input to a server;

receiving text recognition information of the audio input returned by the server;

determining whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration;

and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, determining that the voice input has ended.

9. A voice interaction method applied to an integrated cooker, comprising:

collecting audio input while the integrated cooker is in an operating state;

uploading the audio input to a server;

receiving text recognition information of the audio input returned by the server;

determining whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration;

and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, determining that the voice input has ended.

10. A voice chip module adapted to be deployed in a device, comprising:

a communication module configured to upload audio input detected by a microphone of the device to a server and to receive text recognition information returned by the server, wherein the text recognition information is obtained by the server performing semantic recognition on the audio input;

and a voice activity end determination module configured to determine whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration, and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, to determine that the voice input has ended.

11. The voice chip module according to claim 10, further comprising:

a voice activity detection module configured to detect whether speech is present in the audio input,

wherein if the voice activity detection module does not detect speech, the voice activity end determination module determines that the voice input has ended,

and if the voice activity detection module detects speech but the communication module does not acquire new text recognition information for longer than the first duration, the voice activity end determination module determines that the voice input has ended.

12. The voice chip module according to claim 10, further comprising:

and an instruction module configured to instruct the device to execute an operation instruction corresponding to the text recognition information when it is determined that the voice input has ended.

13. A voice interaction device, comprising:

a receiving module configured to receive an audio input;

an acquisition module configured to acquire text recognition information of the audio input;

and a determining module configured to determine whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration, and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, to determine that the voice input has ended.

14. A computing device, comprising:

a processor; and

a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1 to 9.

15. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-9.

Technical Field

The present disclosure relates to the field of voice interaction, and in particular, to a voice interaction method, apparatus, and voice chip module.

Background

Voice interaction belongs to the field of human-computer interaction and is a leading-edge mode of interaction developed within that field. In the voice interaction process, it is necessary to determine whether the voice input has ended, so as to obtain the complete voice input and improve the user's interaction experience.

At present, voice activity detection schemes start from the sound signal itself and determine whether the voice input has ended by detecting whether the audio frames in the received sound signal conform to human pronunciation characteristics.

In voice interaction scenes with continuous and complex high-level noise, such as the range hood of an integrated cooker, cooking sounds, or treadmill noise, voice activity detection is likely to be unable to accurately determine that the speech has ended because of the influence of the high noise.

To solve the problem that the end of speech cannot be accurately determined in a high-noise environment, existing voice activity detection schemes require optimizing the signal processing algorithm used for detection, which inevitably and significantly increases the performance consumption of the device.

Therefore, there is a need for a scheme that can determine the end of voice activity in a high-noise environment without significantly increasing the performance consumption of the device.

Disclosure of Invention

One technical problem to be solved by the present disclosure is to provide a voice interaction scheme capable of determining the end of voice activity in a high-noise environment without significantly increasing device performance consumption.

According to a first aspect of the present disclosure, there is provided a voice interaction method, including: receiving an audio input; acquiring text recognition information of the audio input; determining whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration; and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, determining that the voice input has ended.

According to a second aspect of the present disclosure, there is provided a voice interaction method applied to a treadmill, including: collecting audio input while the treadmill is in a running state; uploading the audio input to a server; receiving text recognition information of the audio input returned by the server; determining whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration; and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, determining that the voice input has ended.

According to a third aspect of the present disclosure, there is provided a voice interaction method applied to an integrated cooker, including: collecting audio input while the integrated cooker is in an operating state; uploading the audio input to a server; receiving text recognition information of the audio input returned by the server; determining whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration; and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, determining that the voice input has ended.

According to a fourth aspect of the present disclosure, there is provided a voice chip module adapted to be deployed in a device, including: a communication module configured to upload audio input detected by a microphone of the device to a server and to receive text recognition information returned by the server, wherein the text recognition information is obtained by the server performing semantic recognition on the audio input; and a voice activity end determination module configured to determine whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration, and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, to determine that the voice input has ended.

According to a fifth aspect of the present disclosure, there is provided a voice interaction apparatus, including: a receiving module configured to receive an audio input; an acquisition module configured to acquire text recognition information of the audio input; and a determining module configured to determine whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration, and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, to determine that the voice input has ended.

According to a sixth aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of the first to third aspects as described above.

According to a seventh aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of the first to third aspects described above.

Therefore, based on the characteristic that speech recognition performed on noise data does not yield valid text recognition information, the present disclosure determines whether the voice input has ended by determining whether newly recognized text recognition information appears within a period of time. This effectively eliminates the interference of noise, allows the end of the voice input to be determined accurately in a noisy environment, and adds no extra performance consumption in the whole process.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.

FIG. 1 shows a schematic flow chart diagram of a voice interaction method according to one embodiment of the present disclosure.

Fig. 2 is a schematic diagram illustrating a method of determining whether a voice input is ended according to an embodiment of the present disclosure.

FIG. 3 shows a schematic block diagram of the structure of a voice chip module according to one embodiment of the present disclosure.

FIG. 4 shows a schematic block diagram of the structure of a voice interaction device according to one embodiment of the present disclosure.

Fig. 5 shows a schematic structural diagram of a computing device that can be used to implement the voice interaction method according to an embodiment of the present disclosure.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Tests show that speech recognition performed on noise data, such as the sound of an integrated cooker's range hood, cooking sounds, or treadmill noise, does not yield valid text recognition information.

Based on this characteristic, the present disclosure proposes that whether the voice input has ended may be determined from the semantic recognition result of the audio input, by determining whether there is newly recognized text recognition information within a period of time. In this way, the interference of noise can be effectively eliminated, and whether the voice input has ended can be accurately determined in a noisy environment.

Semantic recognition of the audio input is already required to enable voice interaction, so the scheme does not add extra performance consumption. Moreover, when the semantic recognition operation is executed at the server, the performance consumption on the device is almost zero, so the method can be applied to a wide range of high-end and low-end devices, such as a treadmill, a smart speaker, an integrated cooker, or a vehicle.

FIG. 1 shows a schematic flow chart diagram of a voice interaction method according to one embodiment of the present disclosure. The method shown in fig. 1 may be performed by a device supporting a voice interaction function, which may be any of various high-end and low-end devices operating in a complex noise environment, such as a treadmill, a smart speaker, an integrated cooker, a vehicle, and so on.

Referring to fig. 1, in step S110, an audio input is received.

Sounds in the surrounding environment may be picked up by a sound pickup component of the device, such as a microphone, to obtain the audio input. The audio input may include both ambient noise and speech uttered by a speaker.

In step S120, text recognition information of the audio input is acquired.

The text recognition information is the text recognition result obtained by performing semantic recognition on the audio input using speech recognition technology. The semantic recognition of the audio input may be executed on the device side or on the server side. If it is executed at the server, the audio input may be uploaded to the server, and the text recognition information obtained by the server performing semantic recognition on the audio input may be received from the server.
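As a non-authoritative illustration of the server-side variant, the sketch below streams captured audio to a recognition server and collects the transcripts it returns. The endpoint URL, the read_audio_chunk() helper, and the JSON response format are hypothetical assumptions made for illustration only; the disclosure does not define a specific protocol.

```python
# Hypothetical sketch: upload audio chunks as they are captured and yield the
# server's latest transcript. All names and the response format are assumptions.
import requests  # assumed HTTP client

SERVER_URL = "http://asr.example.com/recognize"  # hypothetical endpoint

def stream_and_recognize(read_audio_chunk, session_id):
    """Send each captured chunk and yield the transcript returned so far."""
    while True:
        chunk = read_audio_chunk()  # e.g. 100 ms of PCM bytes; None when capture stops
        if chunk is None:
            break
        resp = requests.post(
            SERVER_URL,
            params={"session": session_id},
            data=chunk,
            headers={"Content-Type": "application/octet-stream"},
        )
        # Assumed response shape: {"text": "<full transcript recognized so far>"}
        yield resp.json().get("text", "")
```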

In step S130, it is determined whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration.

The newly recognized text recognition information refers to text newly added compared with the previously acquired text recognition information, that is, text that differs from the previously acquired text recognition information.

The text recognition information referred to in this disclosure may refer to valid text having actual semantics, so as to exclude cases in which noise is recognized as invalid text having no actual semantics, such as "buzzing". That is, the text recognition information acquired in step S120 may refer to valid text with actual semantic content obtained by performing semantic recognition on the audio input.

Receiving the audio input (step S110) is an ongoing process; accordingly, acquiring the text recognition information of the audio input (step S120) is also an ongoing process.

While the audio input is continuously being received, semantic recognition may be continuously performed on the newly received audio input to obtain its text recognition information. If the text recognition information of the newly received audio input is empty or is invalid text with no actual meaning, this indicates that no newly recognized text recognition information has been acquired.

Alternatively, while the audio input is continuously being received, semantic recognition may be continuously performed on all audio input received so far in the current round of voice interaction, so as to continuously obtain the latest text recognition information of the total audio input. In this case, the currently acquired text recognition information may be compared with the previously acquired text recognition information; if the two are the same, there is no newly recognized text recognition information, that is, no newly recognized text recognition information has been acquired.

The time during which no newly recognized text recognition information is acquired may be timed, so as to determine whether it is longer than the first duration. The first duration may be set according to actual conditions; for example, it may be set to 1.5 seconds.

As an example, a timer may be set. The timer starts timing after text recognition information is acquired; whenever newly recognized text recognition information is acquired, the timer is reset to zero and restarted. The timer value can then be compared with the first duration in real time to determine whether the time during which no newly recognized text recognition information has been acquired is longer than the first duration.
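The snippet below is a minimal sketch of this timer, assuming a hypothetical get_latest_transcript() callable that returns the full valid transcript recognized so far (an empty string until valid text is available). The 1.5 second value is the example first duration given above; the polling interval is an illustrative assumption.

```python
# Minimal sketch of the end-of-input timer described above.
import time

FIRST_DURATION = 1.5  # seconds; example value of the first duration

def wait_for_end_of_voice_input(get_latest_transcript, poll_interval=0.1):
    """Block until no newly recognized text has appeared for FIRST_DURATION seconds."""
    last_text = get_latest_transcript()
    timer_start = time.monotonic()
    while True:
        time.sleep(poll_interval)
        text = get_latest_transcript()
        if text and text != last_text:       # newly recognized text: reset the timer
            last_text = text
            timer_start = time.monotonic()
        elif time.monotonic() - timer_start > FIRST_DURATION:
            return last_text                 # the voice input is considered ended
```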

In step S140, if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, it is determined that the voice input has ended.

When it is determined that the voice input has ended, an operation instruction corresponding to the text recognition information may be executed. The text recognition information mentioned here refers to the text recognition result obtained by performing semantic recognition on the audio input received from the moment the audio input was first detected until the voice input was determined to have ended.

The operation instruction is related to the user intention represented by the text recognition information. For example, if the device is an integrated cooker and the text recognition information is "turn off the range hood", the operation instruction is to turn off the range hood.

In the voice interaction process, the operation of performing semantic recognition on the audio input is indispensable. Generally, a microphone receives the audio input, and voice activity detection (VAD) is then used to identify the voice end point and determine whether the voice input has ended. When it is determined that the voice input has ended, a corresponding action is taken according to the semantic recognition result obtained by performing speech recognition on the complete audio input.

That is, speech recognition and detection of the end of the voice input are generally performed separately.

In contrast, based on the characteristic that speech recognition performed on noise data does not yield valid text recognition information, the present disclosure uses the text recognition information obtained by performing semantic recognition on the audio input during voice interaction to assist in determining whether the voice input has ended, thereby enabling the end of voice activity to be determined in a high-noise environment. Because performing semantic recognition on the audio input is an operation that already exists in the voice interaction process, the method does not add extra performance consumption. That is, the present disclosure can determine the end of voice activity in a high-noise environment without significantly increasing device performance consumption.

Furthermore, because semantic recognition is performed on the audio input as soon as it is detected, rather than only after the voice input is determined to have ended, the text recognition result of the voice input is available directly once the end of the voice input is determined, which improves voice interaction efficiency.

As an example, after the audio input is received, it may further be determined whether the audio input is audio data collected in a high-noise scene, that is, whether the audio input contains a large amount of noise data. In the case that the audio input is collected in a high-noise scene, steps S120 to S140 are executed, and whether the voice input has ended is determined by determining whether the time during which no newly recognized text recognition information has been acquired is longer than the first duration.

Whether the audio input is audio data acquired in a high-noise scene can be determined in various ways. For example, audio data from high-noise scenes may be collected in advance and used as preset high-noise scene audio data; whether the audio input is audio data collected in a high-noise scene may then be determined based on this preset data, for example by comparing the similarity between the preset high-noise scene audio data and the collected audio data. As another example, a noise recognition model for determining whether audio data is high-noise audio data may be trained on audio data collected in advance from high-noise scenes, and the noise recognition model may be used to determine whether the audio input is audio data collected in a high-noise scene.
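As a hedged illustration of the similarity comparison mentioned above, the sketch below compares the incoming audio with pre-collected high-noise reference audio using the cosine similarity of their average magnitude spectra. The feature choice and the threshold are illustrative assumptions, not requirements of the disclosure.

```python
# Sketch: decide whether audio resembles preset high-noise reference audio.
import numpy as np

def average_spectrum(samples, frame=1024):
    """Average magnitude spectrum over fixed-size frames."""
    n = len(samples) // frame * frame
    frames = np.asarray(samples[:n], dtype=np.float64).reshape(-1, frame)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def is_high_noise_scene(audio, noise_reference, threshold=0.9):
    """Cosine similarity between spectra; above the threshold counts as high noise."""
    a = average_spectrum(audio)
    b = average_spectrum(noise_reference)
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return similarity > threshold
```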

The process of determining whether the voice input has ended based on the text recognition information obtained by performing semantic recognition on the audio input has been described in detail above with reference to fig. 1. On the basis of the method shown in fig. 1, the present disclosure can also combine a voice activity detection mode to jointly determine whether the voice input has ended.

Fig. 2 is a schematic diagram illustrating a method of determining whether a voice input has ended according to an embodiment of the present disclosure. The method shown in fig. 2 may be performed by a device supporting a voice interaction function, which may be any of various high-end and low-end devices operating in a complex noise environment, such as a treadmill, a smart speaker, an integrated cooker, a vehicle, and so on.

As shown in fig. 2, sound in the surrounding environment may be collected using a microphone mounted on the device. Semantic recognition may be performed on the sound collected by the microphone to obtain text recognition information. The semantic recognition operation may be executed at the device side or at the server side.

In this example, whether the voice input has ended is jointly determined based on the text recognition information obtained by semantic recognition and on a voice activity detection mode. For the process of determining whether the voice input has ended using the text recognition information obtained by semantic recognition, reference may be made to the description above in conjunction with fig. 1.

Voice activity detection, also known as voice endpoint detection or voice boundary detection, refers to recognizing voice endpoints in a stream of sound signals for subsequent processing. As an example, the voice activity detection mode may include: detecting whether speech exists in the received audio input; determining whether the time during which no new speech is detected is longer than a second duration; and if the time during which no new speech is detected is longer than the second duration, determining that the voice input has ended. Detecting whether speech exists in the received audio input means detecting whether the audio input contains sound signals that conform to pronunciation characteristics, and the time during which no speech is detected refers to the continuous period over which no speech has been detected.
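As one hedged example of such detection, the sketch below treats a frame as speech when its short-time energy exceeds a threshold and then measures the trailing silence. Real VAD implementations use richer pronunciation features; the frame size, the energy threshold, and the frame length in seconds are illustrative assumptions.

```python
# Sketch: frame-energy speech detection and measurement of trailing silence.
import numpy as np

def detect_speech_frames(samples, frame=320, energy_threshold=1e-3):
    """Return one boolean per frame: True where the frame looks like speech."""
    n = len(samples) // frame * frame
    frames = np.asarray(samples[:n], dtype=np.float64).reshape(-1, frame)
    energy = np.mean(frames ** 2, axis=1)      # short-time energy per frame
    return energy > energy_threshold

def trailing_silence_seconds(speech_flags, frame_seconds=0.02):
    """Seconds elapsed since speech was last detected (length of trailing silence)."""
    count = 0
    for flag in reversed(list(speech_flags)):
        if flag:
            break
        count += 1
    return count * frame_seconds
```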

In the present disclosure, the determination based on the text recognition information obtained by semantic recognition may be given higher priority than the voice activity detection mode. If the voice activity detection mode determines that the voice input has not ended, but the semantic-recognition-based determination is that the voice input has ended, the final result is that the voice input has ended. That is, when the voice activity detection mode determines that the voice input has not ended but the time during which no newly recognized text recognition information has been acquired is longer than the first duration, it is determined that the voice input has ended.
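The minimal sketch below expresses this joint decision under the assumption that the two elapsed times are tracked elsewhere; the duration values are illustrative placeholders rather than values fixed by the disclosure.

```python
# Sketch: joint end-of-input decision. The semantic-recognition criterion can
# end the input even when VAD still reports ongoing speech activity.
def voice_input_ended(no_new_text_seconds, vad_silence_seconds,
                      first_duration=1.5, second_duration=0.8):
    # Semantic criterion: no newly recognized text for longer than the first duration.
    if no_new_text_seconds > first_duration:
        return True
    # VAD criterion: no speech detected for longer than the second duration.
    if vad_silence_seconds > second_duration:
        return True
    return False
```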

Therefore, this scheme performs auxiliary determination using the semantic recognition result returned by the server in real time; it does not increase the burden on the server, and the performance consumption at the device side is very low. The method is applicable to all high-end and low-end devices without additional cost. The semantic recognition result returned by the server in real time assists VAD in detecting the end of speaking, effectively solving, at very little performance cost, the problem that VAD cannot reliably decide when speech has stopped in complex high-noise scenes.

The method may also determine whether the time during which no new audio input is received is longer than a third duration. If the time during which no new audio input is received is longer than the third duration, it can be directly determined that the voice input has ended, and voice activity detection can be stopped at that point to reduce resource consumption.

As an example, the present disclosure may be implemented as a voice interaction method applied to a treadmill, including: collecting audio input while the treadmill is in a running state, where the collected audio input may include speech data uttered by a speaker and noise data generated by the treadmill; uploading the audio input to a server; receiving text recognition information of the audio input returned by the server; determining whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration; and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, determining that the voice input has ended. For details related to the method, see the above description, which is not repeated here.

As an example, the present disclosure may also be implemented as a voice interaction method applied to an integrated cooker, including: collecting audio input while the integrated cooker is in an operating state, where the collected audio input may include speech data uttered by a speaker and noise data generated by the integrated cooker during operation; uploading the audio input to a server; receiving text recognition information of the audio input returned by the server; determining whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration; and if the time during which no newly recognized text recognition information has been acquired is longer than the first duration, determining that the voice input has ended. For details related to the method, see the above description, which is not repeated here.

The voice interaction method may be applied to an AI module supporting the voice interaction function. The AI module is an artificial intelligence module and may consist of a chip, or of a chip together with a hardware PCB. The AI module may communicate with an artificial intelligence server over a network, the AI module may be deployed in a device, and the device may communicate with the AI module in a wired or wireless manner.

As an example, the voice interaction method of the present disclosure may be implemented as a voice chip module, and the voice chip module may be deployed in, but not limited to, a treadmill, a smart speaker, an integrated cooker, a vehicle, and the like, so that the device has a voice interaction function.

FIG. 3 shows a schematic block diagram of the structure of a voice chip module according to one embodiment of the present disclosure. The functional modules of the voice chip module can be implemented by hardware, software or a combination of hardware and software for implementing the principles of the present disclosure. It will be appreciated by those skilled in the art that the functional blocks described in fig. 3 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.

The following briefly describes the functional modules that the voice chip module may have and the operations that each functional module may perform. For related details, reference may be made to the above description, which is not repeated here.

Referring to fig. 3, the voice chip module 300 may include a communication module 310 and an end of voice activity determination module 320.

The communication module 310 may upload the audio input detected by the microphone of the device to the server and receive the text recognition information returned by the server, where the text recognition information is obtained by performing semantic recognition on the audio input by the server.

The voice activity end determination module 320 is configured to determine whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration, and if so, to determine that the voice input has ended.

The voice chip module 300 may further include a voice activity detection module configured to detect whether speech is present in the audio input. If the voice activity detection module does not detect speech, the voice activity end determination module determines that the voice input has ended. If the voice activity detection module detects speech but the communication module does not acquire new text recognition information for longer than the first duration, the voice activity end determination module also determines that the voice input has ended.

The voice chip module 300 may further include an instruction module configured to instruct a device to execute an operation instruction corresponding to the text recognition information if it is determined that the voice input is ended.

As an example, after receiving the audio input, the voice chip module 300 may first determine whether the audio input is audio data collected in a high-noise scene, for example based on preset high-noise scene audio data. In the case that the audio input is collected in a high-noise scene, whether the voice input has ended is determined by determining whether the time during which no newly recognized text recognition information has been acquired is longer than the first duration.

The voice interaction method can be realized as a voice interaction device. FIG. 4 shows a schematic block diagram of the structure of a voice interaction device according to one embodiment of the present disclosure. Wherein the functional blocks of the voice interaction device can be implemented by hardware, software, or a combination of hardware and software that implement the principles of the present disclosure. It will be appreciated by those skilled in the art that the functional blocks described in fig. 4 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.

In the following, functional modules that the voice interaction apparatus can have and operations that each functional module can perform are briefly described, and details related thereto may be referred to the above description, and are not repeated here.

Referring to fig. 4, the voice interaction apparatus 400 includes a receiving module 410, an obtaining module 420, and a determining module 430.

The receiving module 410 is configured to receive an audio input. The obtaining module 420 is configured to obtain text recognition information of the audio input. The determining module 430 is configured to determine whether the time during which no newly recognized text recognition information has been acquired is longer than a first duration, and if so, to determine that the voice input has ended.

Alternatively, the obtaining module 420 may upload the audio input to the server, and receive text recognition information returned by the server, where the text recognition information is obtained by performing semantic recognition on the audio input by the server.

As an example, the voice interaction apparatus 400 may further include a detection module. The detection module is configured to detect whether speech exists in the received audio input. The determining module 430 may further determine whether the time during which no speech is detected is longer than a second duration, and if the time during which no speech is detected is longer than the second duration, determine that the voice input has ended.

As an example, the determining module 430 may further determine whether the time during which no new audio input is received is longer than a third duration, and if the time during which no new audio input is received is longer than the third duration, determine that the voice input has ended.

As an example, after the receiving module 410 receives the audio input, the determining module 430 may first determine whether the audio input is audio data collected in a high-noise scene, for example based on preset high-noise scene audio data. In the case that the audio input is collected in a high-noise scene, whether the voice input has ended is determined by determining whether the time during which no newly recognized text recognition information has been acquired is longer than the first duration.

Fig. 5 shows a schematic structural diagram of a computing device that can be used to implement the voice interaction method according to an embodiment of the present disclosure.

Referring to fig. 5, computing device 500 includes memory 510 and processor 520.

The processor 520 may be a multi-core processor or may include a plurality of processors. In some embodiments, the processor 520 may include a general-purpose main processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, the processor 520 may be implemented using custom circuitry, such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).

The memory 510 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 520 or other modules of the computer. The permanent storage device may be a readable and writable storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, the permanent storage device is a mass storage device (e.g., a magnetic or optical disk, or flash memory); in other embodiments, it may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable volatile memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. Furthermore, the memory 510 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) as well as magnetic and/or optical disks. In some embodiments, the memory 510 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., an SD card, a mini SD card, or a Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.

The memory 510 has stored thereon executable code that, when processed by the processor 520, causes the processor 520 to perform the voice interaction methods described above.

The voice interaction method, the voice chip module, the apparatus and the computing device according to the present disclosure have been described in detail above with reference to the accompanying drawings.

Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the above-mentioned steps defined in the above-mentioned method of the present disclosure.

Alternatively, the present disclosure may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the various steps of the above-described method according to the present disclosure.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
