Device and system for detecting voice recognition accuracy

Document No.: 193319 Publication date: 2021-11-02

Reading note: This technology, "Device and system for detecting voice recognition accuracy" (一种检测语音识别准确率的装置和系统), was designed by 韦胜钰, 叶超, 蔡佳, 黄林轶, 徐华伟 and 刘斌辉 on 2021-06-09. Main content: The present disclosure relates to an apparatus and system for detecting speech recognition accuracy. The apparatus comprises a voice playing device, an audio acquisition device, a network monitoring device, and an audio analysis device. The network monitoring device monitors the network connection state of the audio analysis device and, when the network connection state is lower than a preset value, sends stop-playing information to the audio analysis device. The audio analysis device is electrically connected to the voice playing device, the audio acquisition device, and the network monitoring device; it determines the speech recognition accuracy of the device under test according to the response audio and a preset response audio and, after receiving the stop-playing information, deletes or stops receiving the response audio. With the audio analysis device, embodiments of the present disclosure can test the speech recognition accuracy of the device under test automatically, without manual participation and with a short test cycle.

1. An apparatus for detecting speech recognition accuracy, comprising:

the voice playing device is used for receiving the audio file sent by the audio analysis device and playing the audio file;

the audio acquisition equipment is used for acquiring response audio fed back by the equipment to be tested after receiving the audio file and sending the response audio to the audio analysis equipment;

the network monitoring equipment is used for monitoring the network connection state among the voice playing equipment, the audio acquisition equipment and the audio analysis equipment, and sending information for stopping playing to the audio analysis equipment under the condition that the network connection state does not accord with preset conditions;

and the audio analysis equipment is electrically connected with the voice playing equipment, the audio acquisition equipment and the network monitoring equipment and is used for stopping sending the audio file to the voice playing equipment if the information for stopping playing is received, stopping receiving the response audio sent by the audio acquisition equipment and deleting the received response audio, and when the playing number or duration of the audio file reaches a detection condition, determining the accuracy of the voice recognition of the equipment to be tested according to the received response audio and the preset response audio of which the network state accords with the preset condition.

2. The apparatus of claim 1, further comprising:

a plurality of noise generating devices arranged symmetrically around the device under test, all on the same horizontal reference plane, wherein, in a coordinate system with the device under test as the origin, the angle between two adjacent noise generating devices is at least a preset angle, and the preset angle is determined as follows: 360 divided by the number of the devices to be tested.

3. The apparatus of claim 1 or 2, further comprising:

and the noise measuring equipment is used for measuring a noise signal within a preset range of the equipment to be tested and sending information for stopping playing to the audio analysis equipment under the condition that the noise signal is greater than a preset value.

4. The apparatus of claim 1, wherein the audio file is configured to be converted from text corpora in a plurality of dialects.

5. The apparatus according to claim 1, wherein the audio analysis device is configured to obtain an end time when the audio file is played by the voice playing device and a start time when the audio collection device collects the response audio, and determine the response time of the device under test according to the end time and the start time.

6. The apparatus of claim 1, wherein the determining the speech recognition accuracy of the device under test according to the response audio and a preset response audio comprises:

inputting the response audio and the preset response audio into an audio similarity model, and outputting whether the response audio is similar to the preset response audio, wherein the audio similarity model is set to be obtained through training according to the similarity relation between a first audio sample and a second audio sample;

and determining the voice recognition accuracy of the equipment to be tested according to the output result.

7. The apparatus of claim 1, further comprising:

the image acquisition equipment is used for acquiring a response image fed back by the equipment to be tested after receiving the audio file and sending the response image to the audio analysis equipment;

and the audio analysis equipment is used for determining the accuracy of the voice recognition of the equipment to be tested according to the response image and a preset response image.

8. The apparatus of claim 7, wherein the determining the accuracy of the speech recognition of the device under test according to the response image and a preset response image comprises:

inputting the response image and the preset response image into an image similarity model, and outputting whether the response image is a similar image of the preset response image, wherein the image similarity model is set to be obtained according to the similarity relation training of a first image sample and a second image sample;

and determining the voice recognition accuracy of the equipment to be tested according to the output result.

9. A method for detecting speech recognition accuracy, comprising:

playing the audio file;

receiving a response audio fed back by the equipment to be tested, wherein the response audio is an audio generated by the equipment to be tested after receiving the audio file;

monitoring the network connection state among voice playing equipment, audio acquisition equipment and audio analysis equipment, and if the network connection state does not meet the preset condition, stopping playing the audio file, stopping receiving the response audio and deleting the received response audio;

and when the playing number or duration of the audio files reaches a detection condition, determining the accuracy of the voice recognition of the equipment to be tested according to the received response audio and the preset response audio of which the network state meets the preset condition.

10. The method of claim 9, further comprising, after said playing the audio file:

receiving a response image fed back by the equipment to be tested, wherein the response image is an image generated by the equipment to be tested after receiving the audio file;

and when the playing number or duration of the audio files reaches a detection condition, determining the accuracy of the voice recognition of the equipment to be tested according to the received response image and the preset response image of which the network state meets the preset condition.

11. A system for detecting speech recognition accuracy, comprising:

the equipment to be tested has a voice interaction function;

and the device of any one of claims 1 to 8.

Technical Field

The present disclosure relates to the field of speech recognition technology, and in particular, to a device and system for detecting speech recognition accuracy.

Background

With the development of science and technology, more and more smart devices have appeared, such as smart speakers, smart TVs, smart navigation devices, and smart home devices, bringing great convenience to people's work and daily life. In the related art, evaluating the speech recognition performance of a smart device still requires test personnel to make the judgment, which is time-consuming and costly.

Therefore, there is a need for an apparatus and system for automatically detecting speech recognition accuracy.

Disclosure of Invention

To overcome at least one of the problems in the related art, the present disclosure provides an apparatus and system for detecting speech recognition accuracy.

According to a first aspect of the embodiments of the present disclosure, there is provided an apparatus for detecting speech recognition accuracy, including:

the voice playing device is used for receiving the audio file sent by the audio analysis device and playing the audio file;

the audio acquisition equipment is used for acquiring response audio fed back by the equipment to be tested after receiving the audio file and sending the response audio to the audio analysis equipment;

the network monitoring equipment is used for monitoring the network connection state among the voice playing equipment, the audio acquisition equipment and the audio analysis equipment, and sending information for stopping playing to the audio analysis equipment under the condition that the network connection state is lower than a preset value;

and the audio analysis equipment is electrically connected with the voice playing equipment, the audio acquisition equipment and the network monitoring equipment and is used for stopping sending the audio file to the voice playing equipment if the information for stopping playing is received, stopping receiving the response audio sent by the audio acquisition equipment and deleting the received response audio, and when the playing number or duration of the audio file reaches a detection condition, determining the accuracy of the voice recognition of the equipment to be tested according to the received response audio and the preset response audio of which the network state accords with the preset condition.

In one possible implementation, the apparatus further includes:

a plurality of noise generating devices arranged symmetrically around the device under test, all on the same horizontal reference plane, wherein, in a coordinate system with the device under test as the origin, the angle between two adjacent noise generating devices is at least a preset angle, and the preset angle is determined as follows: 360 divided by the number of the devices to be tested.

In one possible implementation, the apparatus further includes:

and the noise measuring equipment is used for measuring a noise signal within a preset range of the equipment to be tested and sending information for stopping playing to the audio analysis equipment under the condition that the noise signal is greater than a preset value.

In one possible implementation, the audio file is configured to be converted from text corpora in multiple dialects.

In a possible implementation manner, the audio analysis device is configured to obtain an end time at which the voice playing device finishes playing the audio file and a start time at which the audio acquisition device starts acquiring the response audio, and to determine a response time of the device under test according to the end time and the start time.

In a possible implementation manner, the determining the speech recognition accuracy of the device under test according to the response audio and a preset response audio includes:

inputting the response audio and the preset response audio into an audio similarity model, and outputting whether the response audio is similar to the preset response audio, wherein the audio similarity model is set to be obtained through training according to the similarity relation between a first audio sample and a second audio sample;

and determining the voice recognition accuracy of the equipment to be tested according to the output result.

In one possible implementation manner, the apparatus further includes:

the image acquisition equipment is used for acquiring a response image fed back by the equipment to be tested after receiving the audio file and sending the response image to the audio analysis equipment;

and the audio analysis equipment is used for determining the accuracy of the voice recognition of the equipment to be tested according to the response image and a preset response image.

In a possible implementation manner, the determining, according to the response image and a preset response image, the accuracy of speech recognition of the device under test includes:

inputting the response image and the preset response image into an image similarity model, and outputting whether the response image is a similar image of the preset response image, wherein the image similarity model is set to be obtained according to the similarity relation training of a first image sample and a second image sample;

and determining the voice recognition accuracy of the equipment to be tested according to the output result.

According to a second aspect of the embodiments of the present disclosure, there is provided a method for detecting speech recognition accuracy, including:

playing the audio file;

receiving a response audio fed back by the equipment to be tested, wherein the response audio is an audio generated by the equipment to be tested after receiving the audio file;

monitoring the network connection state among voice playing equipment, audio acquisition equipment and audio analysis equipment, and if the network connection state does not meet the preset condition, stopping playing the audio file, stopping receiving the response audio and deleting the received response audio;

and when the playing number or duration of the audio files reaches a detection condition, determining the accuracy of the voice recognition of the equipment to be tested according to the received response audio and the preset response audio of which the network state meets the preset condition.

In a possible implementation manner, after the playing the audio file, the method further includes:

receiving a response image fed back by the equipment to be tested, wherein the response image is an image generated by the equipment to be tested after receiving the audio file;

and when the playing number or duration of the audio files reaches a detection condition, determining the accuracy of the voice recognition of the equipment to be tested according to the received response image and the preset response image of which the network state meets the preset condition.

According to a third aspect of the embodiments of the present disclosure, there is provided a system for detecting speech recognition accuracy, including:

the equipment to be tested has a voice interaction function;

and the apparatus for detecting speech recognition accuracy according to any embodiment of the present disclosure.

The voice recognition accuracy of the equipment to be tested can be automatically tested by using the audio analysis equipment, manual participation is not needed, and the test period is short. And the network monitoring equipment sends information for stopping playing to the audio analysis equipment when detecting that the connection state of the network does not accord with the preset condition, the audio analysis equipment stops sending audio files to the voice playing equipment after receiving the information for stopping playing, stops receiving the response audio sent by the audio acquisition equipment, and deletes the received response audio. Therefore, the embodiment of the disclosure can prevent the condition of inaccurate response audio or long response time caused by network reasons, and can improve the accuracy of the test by using the response audio of which the network state meets the preset condition.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

Fig. 1 is a diagram illustrating an application scenario of an apparatus for detecting speech recognition accuracy according to an exemplary embodiment.

FIG. 2 is a schematic block diagram illustrating an apparatus for detecting speech recognition accuracy in accordance with an exemplary embodiment.

FIG. 3 is a flow chart illustrating a method of detecting speech recognition accuracy, according to an example embodiment.

FIG. 4 is a flow chart illustrating a method of detecting speech recognition accuracy, according to an example embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

In order to facilitate those skilled in the art to understand the technical solutions provided by the embodiments of the present disclosure, a technical environment for implementing the technical solutions is described below.

Regarding speech recognition testing, series standards such as GB/T 36464.2-2018, "Information technology: Intelligent speech interaction system, Part 2: Smart home", specify basic test methods such as voice wake-up and voiceprint recognition. However, the test environment specified by these standards is limited to a single setting and cannot simulate real usage scenarios, so test results differ to some extent from consumers' actual experience. The related art uses semi-automatic test systems: for items such as response time and response accuracy, test personnel must still judge the result of every voice interaction, which is time-consuming and costly. The test environment also depends on the network, and the related art cannot exclude anomalies such as response delays and response errors caused by network congestion or other test environment factors, which distort the test results. In addition, the related art cannot reproduce realistic, complex environmental noise, nor can it test the response time of the smart device.

Fig. 1 is a diagram illustrating an application scenario of an apparatus for detecting speech recognition accuracy according to an exemplary embodiment. FIG. 3 is a flow chart illustrating a method of detecting speech recognition accuracy, according to an example embodiment. Referring to Fig. 1 and Fig. 3, the apparatus includes an audio analyzer 100, a smart product 106 to be tested, an image acquisition terminal 102, an audio acquisition terminal 103, a voice playing device 101, a noise generating device 104, a noise measuring instrument 105, and a network monitor 106. The audio analyzer 100 stores the audio files used for testing (to-be-tested corpus tag) and plays them through the voice playing device 101. After receiving the played audio, the smart product 106 to be tested responds with a spoken announcement or a feedback image. The audio acquisition terminal 103 and the image acquisition terminal 102 capture the fed-back voice and image, respectively, and send them to the audio analyzer 100. The audio analyzer 100 compares the fed-back voice and image with the preset voice and image, judges their similarity, and determines the speech recognition accuracy. The network monitor 106 monitors the network connection state between the audio analyzer 100 and the other devices and terminates the test promptly when the network quality is poor, so that the accuracy of the test result is not affected. The noise measuring instrument (noise measuring device) 105 measures the noise signal within a preset range of the device under test and sends stop-playing information to the audio analysis device when the noise signal is greater than a preset value, likewise protecting the accuracy of the test result.

FIG. 2 is a schematic block diagram illustrating an apparatus for detecting speech recognition accuracy in accordance with an exemplary embodiment. Referring to fig. 2, the apparatus includes:

the voice playing device 203 is used for receiving the audio file sent by the audio analyzing device and playing the audio file;

the audio acquisition device 205 is configured to acquire a response audio fed back by the device to be tested after receiving the audio file, and send the response audio to the audio analysis device;

the network monitoring device 207 is configured to monitor a network connection state among the voice playing device, the audio acquisition device, and the audio analysis device, and send information for stopping playing to the audio analysis device when the network connection state does not meet a preset condition;

the audio analysis equipment 201 is electrically connected with the voice playing equipment, the audio acquisition equipment and the network monitoring equipment, and is used for stopping sending audio files to the voice playing equipment, stopping receiving response audio sent by the audio acquisition equipment and deleting the received response audio if the information for stopping playing is received, and determining the accuracy of voice recognition of the equipment to be tested according to the received response audio and the preset response audio of which the network state accords with the preset condition when the playing number or duration of the audio files reaches the detection condition.

In the embodiment of the present disclosure, the voice playing device 203 may include an electronic device with a loudspeaker function, such as a sound box, a mobile phone, or a speaker. During testing, the voice playing device 203 may be placed within a preset distance of the device under test. The audio acquisition device 205 may include an electronic device with a recording function, such as a voice recorder pen, a microphone, or a mobile phone. The network monitoring device 207 may include a wireless network tester and a wired network tester. It may determine the Ethernet link rate (10 Mbit/s, 100 Mbit/s, or 1 Gbit/s) and the operating mode of the network, such as half-duplex or full-duplex. The network monitoring device 207 may also have a ping function for testing network connectivity and locating network failure points. The network monitoring device 207 detects the network connection state of the audio analysis device, which may include the network bandwidth, the network signal quality, the network rate, and the like. The preset conditions may include the network bandwidth being higher than a preset bandwidth, the network signal quality being higher than a preset signal quality, the network rate being higher than a preset rate, and the like.
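The preset-condition check described above can be sketched as follows. This is only an illustrative sketch: the class names, field names, units, and the `"stop_playing"` message string are assumptions introduced here, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetworkState:
    """One measurement of the network connection state (hypothetical fields)."""
    bandwidth_mbps: float
    signal_quality: float  # assumed to be normalized to 0.0 .. 1.0
    rate_mbps: float


@dataclass(frozen=True)
class PresetConditions:
    """Thresholds that the measured state must exceed (hypothetical fields)."""
    min_bandwidth_mbps: float
    min_signal_quality: float
    min_rate_mbps: float


def meets_preset_conditions(state: NetworkState, preset: PresetConditions) -> bool:
    # Every measured quantity must be higher than its preset threshold.
    return (
        state.bandwidth_mbps > preset.min_bandwidth_mbps
        and state.signal_quality > preset.min_signal_quality
        and state.rate_mbps > preset.min_rate_mbps
    )


def monitor_message(state: NetworkState, preset: PresetConditions) -> Optional[str]:
    # The monitor only speaks up when the conditions are NOT met.
    return None if meets_preset_conditions(state, preset) else "stop_playing"
```

In such a sketch, the audio analysis device would poll `monitor_message` and treat any non-`None` result as the information for stopping playing.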

In the embodiment of the present disclosure, the audio analysis device 201 may include various computers or server devices, such as a mobile phone, a tablet, or a notebook computer. The audio analysis device plays the stored audio files through the voice playing device, and the device under test gives feedback after receiving them. For example, the voice playing device plays "minim, minim" or "hello, bose", and the device under test, after receiving the audio, responds with feedback such as "Here", "I'm here", or "Master, what are your instructions?". The audio analysis device 201 collects the response audio by means of the audio acquisition device, compares it with the pre-stored preset response audio, determines the similarity between the two, and thereby determines the speech recognition accuracy of the device under test. In one example, the response audio and the preset response audio may be converted into text, and their similarity determined using a pre-trained semantic similarity model. In another example, the response audio and the preset response audio may be input into a pre-trained speech similarity model, which outputs their similarity. It should be noted that the voice playing device in the embodiment of the present disclosure may play multiple audio files, and a preset interval may be set between the playing of successive audio files. The audio files may be played in sequence, in a loop, or randomly. Correspondingly, determining the speech recognition accuracy of the device under test according to the response audio and the preset response audio may include comparing a single response audio with the preset response audio, or comparing multiple response audios with the preset response audio and averaging the results of the multiple comparisons.
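The test flow described above (play, capture, compare, average, and abort on a stop-playing condition) might look roughly like the sketch below. The function name, the callback parameters, and the choice to discard every stored result when the network check fails are assumptions made for illustration; the similarity judgment itself is passed in as a callback and is not implemented here.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def run_accuracy_test(
    cases: Sequence[Tuple[str, str]],          # (audio file, preset response)
    play_and_capture: Callable[[str], str],    # plays a file, returns the response
    is_similar: Callable[[str, str], bool],    # similarity judgment (assumed given)
    network_ok: Callable[[], bool],            # True while preset conditions hold
    required_plays: int,                       # detection condition: number of plays
) -> Optional[float]:
    """Return the averaged accuracy, or None if the test was stopped or incomplete."""
    scores: List[float] = []
    for audio_file, preset_response in cases:
        if not network_ok():
            # Stop playing, stop receiving, and delete what was received.
            scores.clear()
            return None
        response = play_and_capture(audio_file)
        scores.append(1.0 if is_similar(response, preset_response) else 0.0)
        if len(scores) >= required_plays:
            # Detection condition reached: average the comparison results.
            return sum(scores) / len(scores)
    return None  # ran out of audio files before the detection condition was met
```

A real monitor would run concurrently with playback rather than being polled once per file; polling keeps the sketch short.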

The voice recognition accuracy of the equipment to be tested can be automatically tested by using the audio analysis equipment, manual participation is not needed, and the test period is short. And the network monitoring equipment sends information for stopping playing to the audio analysis equipment when detecting that the connection state of the network does not accord with the preset condition, the audio analysis equipment stops sending audio files to the voice playing equipment after receiving the information for stopping playing, stops receiving the response audio sent by the audio acquisition equipment, and deletes the received response audio. Therefore, the embodiment of the disclosure can prevent the condition of inaccurate response audio or long response time caused by network reasons, and can improve the accuracy of the test by using the response audio of which the network state meets the preset condition.

In a possible implementation manner, the apparatus for detecting speech recognition accuracy may further include a plurality of noise generating devices arranged around the device under test, all on the same horizontal reference plane. In a coordinate system with the device under test as the origin, the angle between two adjacent noise generating devices is at least a preset angle, and the preset angle is determined as follows: 360 divided by the number of the devices to be tested.

In the embodiment of the present disclosure, the noise generating device may be of various types, such as a diode noise generator, a gas-discharge-tube noise generator, or a solid-state noise source using the reverse current of a crystal diode. A plurality of noise generating devices may be arranged symmetrically around the device under test to simulate environmental noise sources from multiple directions, which better approximates real-life scenarios and improves the applicability of the measurement results.
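As a small worked example of the symmetric layout, the sketch below computes a preset angle and the corresponding positions on the horizontal reference plane. It assumes that the divisor in the rule above is the number of noise generating devices, which is what a symmetric arrangement around a single device under test implies. The function names and the common radius are likewise assumptions, since the text does not say that the generators are equidistant from the origin.

```python
import math
from typing import List, Tuple


def preset_angle_deg(num_generators: int) -> float:
    """360 degrees divided by the number of noise generating devices (assumed divisor)."""
    if num_generators < 1:
        raise ValueError("at least one noise generating device is required")
    return 360.0 / num_generators


def generator_positions(num_generators: int, radius_m: float) -> List[Tuple[float, float]]:
    """(x, y) positions on the horizontal plane, device under test at the origin."""
    step = math.radians(preset_angle_deg(num_generators))
    return [
        (radius_m * math.cos(i * step), radius_m * math.sin(i * step))
        for i in range(num_generators)
    ]
```

With four generators the preset angle is 90 degrees, so the generators sit on the positive x axis, positive y axis, negative x axis, and negative y axis.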

In a possible implementation manner, the apparatus for detecting speech recognition accuracy may further include a noise measurement device, configured to measure a noise signal within a preset range of the device to be detected, and send information of stopping playing to the audio analysis device when the noise signal is greater than a preset value.

In the embodiment of the present disclosure, the noise measurement device is used for measuring noise in a test environment, and may include a sound level meter, a spectrum analyzer, and the like. The noise may be generated by a noise generating device or may be objectively present noise in the test environment. The noise measurement equipment is used for measuring a noise signal within a preset range of the equipment to be measured, and sending information for stopping playing to the audio analysis equipment under the condition that the noise signal is larger than a preset value. And after receiving the information of stopping playing, the audio analysis equipment deletes or stops receiving the response audio, and the measurement is invalid.

The embodiment of the disclosure can prevent the condition of inaccurate response audio or long response time caused by overlarge noise, improve the accuracy of the test, and can support 24-hour repeatability test by monitoring the test environment.

In one possible implementation, the audio file is configured to be converted from text corpora in multiple dialects.

In the embodiment of the present disclosure, the audio file may be produced by speech synthesis from the corpus text to be tested. In one example, the corpus text to be tested may be converted into dialects, which may include dialects of China or of various other countries. For example, if the product is used in China, the dialects may include northeast, south-of-the-river, Shanghai, and Sichuan dialects, among others.

According to the embodiment of the disclosure, the corpus text to be tested is converted into multiple dialects, so that the test environment is closer to the actual application scene, and the accuracy of the test result is improved.

In a possible implementation manner, the audio analysis device is configured to obtain an end time at which the voice playing device finishes playing the audio file and a start time at which the audio acquisition device starts acquiring the response audio, and to determine a response time of the device under test according to the end time and the start time.

In the embodiment of the present disclosure, the audio file may include a plurality of audio clips. Take human-machine interaction during navigation as a specific application scenario. Audio file 1: "Xiaode, Xiaode!"; the device under test replies: "Master, what are your instructions?". Audio file 2: "Navigate to the zoo"; the device under test replies: "There are three routes to the zoo. Which one do you choose?". Audio file 3: "Choose the first one". After all the test audio clips have been played, the average response time and the maximum response time are calculated. In the embodiment of the present disclosure, the end time of the audio file and the start time of the response audio may be obtained by means of timestamps. For example, the time at which the voice playing device finishes playing the audio and the time at which the audio acquisition device starts capturing the response audio fed back by the device under test are recorded and transmitted, as timestamps, to the audio analysis device together with the captured response audio.

In the embodiment of the disclosure, the audio analysis device determines the response time of the device to be tested by using the end time of the audio file and the start time of the response audio, and may determine the response time of the device to be tested without human participation.
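The response-time calculation can be illustrated with a short sketch. The `Exchange` record and its field names are hypothetical; the only facts taken from the description are that each response time is the start time of the response audio minus the end time of playback, and that the average and maximum are computed after all clips have been played.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Exchange:
    """Timestamps (in seconds) for one played clip and its response."""
    playback_end_s: float     # when the voice playing device finished playing
    response_start_s: float   # when the acquisition device began capturing the reply


def response_time(exchange: Exchange) -> float:
    delta = exchange.response_start_s - exchange.playback_end_s
    if delta < 0:
        raise ValueError("response started before playback ended")
    return delta


def summarize(exchanges: Sequence[Exchange]) -> Tuple[float, float]:
    """Return (average response time, maximum response time)."""
    times = [response_time(e) for e in exchanges]
    if not times:
        raise ValueError("no exchanges recorded")
    return sum(times) / len(times), max(times)
```

Both timestamps need to come from the same clock, or from synchronized clocks, for the difference to be meaningful.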

In a possible implementation manner, the determining the speech recognition accuracy of the device under test according to the response audio and a preset response audio includes:

inputting the response audio and the preset response audio into an audio similarity model, and outputting whether the response audio is similar to the preset response audio, wherein the audio similarity model is set to be obtained through training according to the similarity relation between a first audio sample and a second audio sample;

and determining the voice recognition accuracy of the equipment to be tested according to the output result.
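One plausible way to turn the model outputs into an accuracy figure is sketched below. The source does not give a formula, so the definition used here (the fraction of test segments whose response audio was judged similar to the preset response audio) is an assumption, and the name `recognition_accuracy` is hypothetical.

```python
def recognition_accuracy(similar_flags):
    """Fraction of responses judged similar to their preset responses.

    similar_flags holds one boolean per test segment: True when the
    similarity model judged the response audio similar to the preset one.
    """
    if not similar_flags:
        raise ValueError("no test results to evaluate")
    return sum(1 for flag in similar_flags if flag) / len(similar_flags)


# Hypothetical model outputs for four test segments.
outputs = [True, True, False, True]
accuracy = recognition_accuracy(outputs)
print(accuracy)
```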

In the embodiment of the disclosure, the audio similarity model may be trained in advance by a deep learning method. The training method includes the following steps. Audio features of a first audio sample and a second audio sample are acquired, where the second audio sample carries a preset label indicating whether it is similar or dissimilar to the first audio sample. An audio similarity model containing network parameters is constructed. The first audio sample and the second audio sample are input into the audio similarity model to obtain a prediction result, and the network parameters are iteratively adjusted based on the difference between the prediction result and the label of the second audio sample until the difference meets a preset requirement. In one example, noise reduction may be applied to the response audio before it is input into the audio similarity model.
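The iterate-until-the-difference-meets-the-requirement loop can be illustrated with a deliberately small stand-in. The sketch below is not a deep network: it substitutes a single-layer logistic model over the absolute difference of two feature vectors, trained by plain gradient descent. The feature values, the learning rate, the tolerance and all function names are hypothetical.

```python
import math


def pair_features(a, b):
    """Element-wise absolute difference between two feature vectors."""
    return [abs(x - y) for x, y in zip(a, b)]


def predict(weights, bias, a, b):
    """Predicted probability that samples a and b are similar."""
    z = bias + sum(w * f for w, f in zip(weights, pair_features(a, b)))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, lr=0.5, tolerance=0.1, max_epochs=5000):
    """Adjust the parameters until the mean loss drops below the tolerance."""
    dim = len(pairs[0][0])
    weights, bias = [0.0] * dim, 0.0
    n = len(pairs)
    for _ in range(max_epochs):
        loss, grad_w, grad_b = 0.0, [0.0] * dim, 0.0
        for (a, b), y in zip(pairs, labels):
            p = min(max(predict(weights, bias, a, b), 1e-12), 1 - 1e-12)
            # Cross-entropy between the prediction and the preset label.
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            for i, f in enumerate(pair_features(a, b)):
                grad_w[i] += (p - y) * f
            grad_b += p - y
        if loss / n < tolerance:  # the difference meets the preset requirement
            break
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical audio feature pairs; label 1 = similar, 0 = dissimilar.
pairs = [
    ([0.1, 0.2], [0.1, 0.25]),
    ([0.9, 0.4], [0.85, 0.4]),
    ([0.1, 0.2], [0.9, 0.8]),
    ([0.7, 0.1], [0.1, 0.9]),
]
labels = [1, 1, 0, 0]

weights, bias = train(pairs, labels)
print(predict(weights, bias, [0.5, 0.5], [0.52, 0.5]) > 0.5)
```

A real system would replace the hand-made features and the logistic layer with learned audio embeddings, but the structure of the loop (predict, compare with the label, adjust, stop at a threshold) stays the same.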

In the embodiment of the disclosure, a deep learning method is used to compare the similarity between the response audio and the preset response audio and thereby judge the speech recognition accuracy of the device under test, which gives high judgment accuracy. From the deep learning judgment results, indicators such as the interaction rejection rate and the false wake-up rate of the device under test can also be derived by analysis. This provides a fully automatic, objective means of testing speech recognition.

In a possible implementation manner, the apparatus for detecting speech recognition accuracy further includes an image acquisition device, configured to acquire a response image fed back by the device under test after receiving the audio file, and to send the response image to the audio analysis device;

and the audio analysis equipment is used for determining the accuracy of the voice recognition of the equipment to be tested according to the response image and a preset response image.

In the embodiment of the disclosure, after the device under test receives the audio, it may give its feedback as an image. For example, when the voice playing device plays "please start up", the device under test displays a "start up" screen. The image acquisition device may include a camera. In one embodiment, the device under test may give feedback as both an image and voice after receiving the audio; in this case the image acquisition device and the audio acquisition device can operate simultaneously. By adding the image acquisition device, the embodiment of the disclosure can also test speech recognition whose feedback takes the form of an image, which makes the test content richer. To determine the speech recognition accuracy of the device under test, a deep learning method may be used to compare the similarity between the response image and the preset response image. Alternatively, a distance or correlation measure between the response image and the preset response image may be calculated, such as the Euclidean distance, Manhattan distance, Pearson correlation coefficient, Hamming distance, or Mahalanobis distance; when the distance is smaller than a preset value, the two images are considered similar.
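The distance-based comparison can be sketched as below for three of the measures named above. The sketch assumes that each image has already been reduced to a flat list of grayscale pixel values of equal length; the function names, pixel values and threshold are hypothetical.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared pixel differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute pixel differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of pixel positions whose values differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def is_similar(a, b, distance, threshold):
    """Two images count as similar when their distance is below the threshold."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return distance(a, b) < threshold


# Hypothetical 2x2 grayscale images, flattened row by row.
preset = [0, 0, 255, 255]
response = [0, 3, 255, 251]   # nearly the same screen
other = [255, 255, 0, 0]      # a different screen

print(euclidean(preset, response), manhattan(preset, response), hamming(preset, response))
print(is_similar(preset, response, euclidean, 10.0))
print(is_similar(preset, other, euclidean, 10.0))
```

The threshold depends on the chosen measure and on image size, so in practice it would be calibrated per measure rather than shared.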

In a possible implementation manner, the determining, according to the response image and a preset response image, the accuracy of speech recognition of the device under test includes:

inputting the response image and the preset response image into an image similarity model, and outputting whether the response image is a similar image of the preset response image, wherein the image similarity model is set to be obtained according to the similarity relation training of a first image sample and a second image sample;

and determining the voice recognition accuracy of the equipment to be tested according to the output result.

In the embodiment of the disclosure, the image similarity model may be trained in advance by a deep learning method. The training method includes the following steps. Image features of a first image sample and a second image sample are acquired, where the second image sample carries a preset label indicating whether it is similar or dissimilar to the first image sample. An image similarity model containing network parameters is constructed. The first image sample and the second image sample are input into the image similarity model to obtain a prediction result, and the network parameters are iteratively adjusted based on the difference between the prediction result and the label of the second image sample until the difference meets a preset requirement. In one example, noise reduction may be applied to the response image before it is input into the image similarity model.

In the embodiment of the disclosure, a deep learning method is used to compare the similarity between the response image and the preset response image and thereby judge the speech recognition accuracy of the device under test, which gives high judgment accuracy.

FIG. 4 is a flowchart of one embodiment of a method for detecting speech recognition accuracy provided by the present disclosure. Although the present disclosure provides method steps as illustrated in the following examples or figures, the method may include more or fewer steps on the basis of routine, non-inventive effort. For steps that have no logically necessary causal relationship, the order of execution is not limited to that provided by the disclosed embodiments.

Specifically, an embodiment of the method for detecting speech recognition accuracy provided by the present disclosure is shown in FIG. 4. The method may be applied to interaction among multiple terminal devices and includes:

step S401, playing an audio file;

step S403, receiving a response audio fed back by the device to be tested, wherein the response audio is an audio generated by the device to be tested after receiving the audio file;

step S405, monitoring the network connection state among the voice playing device, the audio acquisition device and the audio analysis device, and, if the network connection state does not meet the preset condition, stopping playing the audio file, stopping receiving the response audio, and deleting the received response audio;

step S407, when the number or duration of audio files played reaches the detection condition, determining the speech recognition accuracy of the device under test according to the preset response audio and the response audio received while the network connection state met the preset condition.
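Steps S401 to S407 can be sketched as a single loop. This is one possible reading, not the disclosed implementation: it assumes the network check is made once per audio file, that a failed check discards only that file's response, and that the detection condition is "all listed files have been played". The callables `play`, `capture`, `network_ok` and `is_similar` are hypothetical stand-ins for the devices and the similarity model.

```python
def run_detection(clips, play, capture, network_ok, is_similar):
    """Return the accuracy over responses captured under a good connection.

    clips is a list of (audio_file, preset_response) pairs.
    Returns None when no response survived the network check.
    """
    valid = []
    for audio, preset in clips:
        play(audio)                    # S401: play the audio file
        response = capture(audio)      # S403: receive the response audio
        if not network_ok():           # S405: connection below the preset condition
            continue                   # discard the response just received
        valid.append((response, preset))
    if not valid:
        return None
    # S407: compare the retained responses with their preset responses.
    hits = sum(1 for response, preset in valid if is_similar(response, preset))
    return hits / len(valid)


# Hypothetical stubs: strings stand in for audio, equality for the similarity model.
clips = [("wake.wav", "greeting"), ("navigate.wav", "routes"), ("choose.wav", "confirm")]
fake_responses = {"wake.wav": "greeting", "navigate.wav": "garbled", "choose.wav": "confirm"}
network_states = iter([True, True, False])   # the connection drops during the third file
played = []

accuracy = run_detection(
    clips,
    play=played.append,
    capture=lambda audio: fake_responses[audio],
    network_ok=lambda: next(network_states),
    is_similar=lambda response, preset: response == preset,
)
print(played, accuracy)
```

In this run the third response is discarded because the connection check failed, so the accuracy is computed over the first two responses only.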

In a possible implementation manner, after the playing the audio file, the method further includes:

receiving a response image fed back by the equipment to be tested, wherein the response image is an image generated by the equipment to be tested after receiving the audio file;

and when the number or duration of audio files played reaches the detection condition, determining the speech recognition accuracy of the device under test according to the preset response image and the response image received while the network connection state met the preset condition.

With regard to the apparatus in the above-described embodiments, the specific manner in which each device performs the operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

In one possible implementation, there is provided a system for detecting speech recognition accuracy, comprising:

a device under test having a voice interaction function; and

the apparatus for detecting speech recognition accuracy according to any embodiment of the present disclosure.

In the embodiment of the present disclosure, the voice interaction function may include the ability of the device under test to exchange information with a human through natural language. The device under test may be used in a home environment and may include various household appliances, for example televisions, stereos, lamps, air conditioners, refrigerators, electric rice cookers, soymilk makers, washing machines, and the like. The device under test may be used in vehicle-mounted scenarios and may include various vehicle-mounted equipment, for example navigation systems, air conditioners, air purifiers, windshield wipers, intelligent driving systems, and the like. The device under test may include electronic devices such as computers, tablets, and mobile phones. The device under test may be used in medical scenarios, for example electronic medical record entry equipment, registration equipment, payment equipment, and the like. The device under test may be used in enterprise scenarios and may include office equipment such as intelligent customer service systems. The device under test may be used in education and travel scenarios, for example various teaching equipment, smart earphones, and the like.

It should be noted that the kind of the device under test is not limited to the above examples. Those skilled in the art may make other modifications in light of the technical spirit of the present application, and any such modification falls within the scope of the present application as long as the functions and effects it achieves are the same as or similar to those of the present application.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
