Corpus labeling system and electronic equipment

Document No.: 1614258 | Published: 2020-01-10

Abstract: This technology, "Corpus labeling system and electronic equipment", was created by 于博文 and 郭慧 on 2019-09-23. The application discloses a corpus labeling system and an electronic device, and relates to the field of artificial intelligence. The system comprises: an auxiliary labeling component, configured to display an auxiliary labeling interface in response to a labeling request for corpus data to be labeled, where the interface displays at least prompt information representing information about the labeling results of corpus data associated with the corpus data to be labeled; a labeling component, configured to respond to an input operation representing a labeling result and to display the target labeling result at the position corresponding to the corpus data to be labeled; and a quality detection component, configured to start a detection mechanism in response to a save operation on the target labeling result, obtain feedback information for the target labeling result, and determine from the feedback information whether the target labeling result meets the accuracy requirement. The application thereby provides a corpus labeling platform at the system architecture level, with the aim of improving corpus labeling quality.

1. A corpus annotation system, comprising at least:

an auxiliary labeling component, configured to display an auxiliary labeling interface in response to a labeling request for corpus data to be labeled, wherein the auxiliary labeling interface displays at least prompt information, and the prompt information represents information about a labeling result of corpus data associated with the corpus data to be labeled;

a labeling component, configured to respond to an input operation representing a labeling result and to display a target labeling result at a position corresponding to the corpus data to be labeled;

and a quality detection component, configured to start a detection mechanism in response to a save operation on the target labeling result, acquire feedback information for the target labeling result, and determine, based on the feedback information, whether the target labeling result meets an accuracy requirement.

2. The system according to claim 1, wherein the labeling component is further configured to display a corpus data list, and the corpus data list includes corpus data to be labeled whose corpus labeling result does not satisfy the accuracy requirement.

3. The system of claim 1, wherein the auxiliary labeling component is further configured to:

acquire a historical labeling result of the corpus data associated with the corpus data to be labeled, and use the associated corpus data and its historical labeling result as the prompt information.

4. The system of claim 1, wherein the auxiliary labeling component is further configured to:

acquire a preset corpus data set matching the feature information of the corpus data to be labeled, wherein the feature information of the preset corpus data items in the set matches one another, and the labeling results of the preset corpus data items are the same;

and use the preset corpus data set and its labeling result as the prompt information.

5. The system of claim 1, wherein the auxiliary labeling component is further configured to:

select, from a labeling template, reference corpus data associated with the corpus data to be labeled together with the labeling result of that reference corpus data, wherein the labeling template represents a correspondence between reference corpus data and labeling results;

and use the reference corpus data and the labeling result selected from the labeling template as the prompt information.

6. The system of claim 1, wherein the auxiliary labeling component is further configured to:

acquire a labeling result of the corpus data associated with the corpus data to be labeled;

determine semantic features of the corpus data to be labeled based on the labeling result of the associated corpus data;

and use the semantic features of the corpus data to be labeled as the prompt information.

7. The system of any one of claims 1 to 6, wherein the quality detection component is further configured to:

detect whether a preset labeling result exists for the corpus data to be labeled;

upon determining that the preset labeling result exists, match the target labeling result against the preset labeling result;

and use the matching result as the feedback information for the target labeling result, to determine whether the target labeling result meets the accuracy requirement.

8. The system of any one of claims 1 to 6, wherein the quality detection component is further configured to:

obtain satisfaction information for the target labeling result of the corpus data to be labeled;

and use that satisfaction information as the feedback information, to determine whether the target labeling result meets the accuracy requirement.

9. The system of any one of claims 1 to 6, wherein the quality detection component is further configured to:

determine, based on the feature information of the corpus data to be labeled, a corpus set to which the corpus data to be labeled belongs, wherein the feature information of the corpus data contained in the corpus set matches the feature information of the corpus data to be labeled;

obtain satisfaction information for the corpus set to which the corpus data to be labeled belongs;

and use the satisfaction information for that corpus set as the feedback information.

10. The system of claim 1, further comprising an alarm component configured to output warning information upon determining that the target labeling result does not meet the accuracy requirement.

11. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to implement the functionality of the corpus annotation system according to any one of claims 1-10.

Technical Field

The application relates to the field of data processing, in particular to the field of artificial intelligence.

Background

In recent years, artificial intelligence technology has developed rapidly and has gradually entered people's daily lives. A basic requirement of artificial intelligence is that a machine can receive and process information as a human does. Because language is the most important carrier of information, language processing has become a leading research direction in the field. Training a language model requires a large amount of corpus labeling in order to improve model quality. However, current research on corpus labeling focuses on labeling methods assisted by natural language understanding, and has not been raised to the system architecture level.

Disclosure of Invention

The embodiments of the application provide a corpus labeling system and an electronic device, offering a corpus labeling platform at the system architecture level with the aim of improving corpus labeling quality.

An embodiment of the present application provides a corpus labeling system, which includes at least:

an auxiliary labeling component, configured to display an auxiliary labeling interface in response to a labeling request for corpus data to be labeled, wherein the auxiliary labeling interface displays at least prompt information, and the prompt information represents information about a labeling result of corpus data associated with the corpus data to be labeled;

a labeling component, configured to respond to an input operation representing a labeling result and to display a target labeling result at a position corresponding to the corpus data to be labeled;

and a quality detection component, configured to start a detection mechanism in response to a save operation on the target labeling result, acquire feedback information for the target labeling result, and determine, based on the feedback information, whether the target labeling result meets an accuracy requirement.

The embodiment of the application thus provides a corpus labeling platform at the system architecture level, namely the corpus labeling system. Before labeling begins, the auxiliary labeling component presents prompt information showing the corpus data associated with the corpus data to be labeled and the labeling results of that associated corpus data. Labeling personnel can consult this information, which lays a foundation for improving labeling quality. The system also includes a quality detection component: once a target labeling result for the corpus data to be labeled is saved, a detection mechanism is started, feedback information for the target labeling result is obtained, and the quality detection component checks whether the target labeling result meets the requirement, which further supports labeling quality.

In one embodiment, the labeling component is further configured to display a corpus data list, where the corpus data list includes corpus data to be labeled whose corpus labeling result does not meet the accuracy requirement.

Here, to avoid unnecessary labeling work and wasted manual labeling resources, only corpus data that has already been labeled by an automatic labeling method and whose labeling result does not meet the accuracy requirement is labeled manually. In other words, the corpus data to be labeled in this embodiment is corpus data whose automatic labeling result falls short of the accuracy requirement. This avoids wasting manual labeling resources while still using manual labeling to improve labeling quality, which meets engineering requirements and lays a foundation for subsequent online engineering services.

In one embodiment, the auxiliary labeling component is further configured to:

acquire a historical labeling result of the corpus data associated with the corpus data to be labeled, and use the associated corpus data and its historical labeling result as the prompt information.

In this embodiment, the associated corpus data and its historical labeling result are displayed as the prompt information, so that labeling personnel can refer to the historical labeling result of the associated corpus data when completing the current labeling operation, which lays a foundation for improving labeling quality.

In one embodiment, the auxiliary labeling component is further configured to:

acquire a preset corpus data set matching the feature information of the corpus data to be labeled, wherein the feature information of the preset corpus data items in the set matches one another, and the labeling results of the preset corpus data items are the same;

and use the preset corpus data set and its labeling result as the prompt information.

In this embodiment, the preset corpus data set with matching feature information and the labeling result of that set are displayed as the prompt information, so that labeling personnel can refer to both when completing the current labeling operation, which lays a foundation for improving labeling quality. This implementation also enriches the prompt information along a different dimension, giving labeling personnel multi-dimensional reference opinions.

In one embodiment, the auxiliary labeling component is further configured to:

select, from a labeling template, reference corpus data associated with the corpus data to be labeled together with the labeling result of that reference corpus data, wherein the labeling template represents a correspondence between reference corpus data and labeling results;

and use the reference corpus data and the labeling result selected from the labeling template as the prompt information.

In this embodiment, a labeling template is provided, and the reference corpus data in the template and its labeling results are used as prompts. That is, the reference corpus data associated with the corpus data to be labeled and its labeling result are selected from the labeling template and displayed in the prompt information, assisting labeling personnel in completing the current labeling operation and laying a foundation for improving labeling quality. This implementation also enriches the prompt information along a different dimension, giving labeling personnel multi-dimensional reference opinions.

In one embodiment, the auxiliary labeling component is further configured to:

acquire a labeling result of the corpus data associated with the corpus data to be labeled;

determine semantic features of the corpus data to be labeled based on the labeling result of the associated corpus data;

and use the semantic features of the corpus data to be labeled as the prompt information.

In this embodiment, the prompt information is enriched along the dimension of semantic features. The labeling result of the corpus data associated with the corpus data to be labeled is obtained first, and is then analyzed to derive the semantic features of the corpus data to be labeled, that is, its true intention. These semantic features are used as the prompt information to assist labeling personnel in completing the labeling process, which lays a foundation for improving labeling quality and gives labeling personnel multi-dimensional reference opinions.

In one embodiment, the quality detection component is further configured to:

detect whether a preset labeling result exists for the corpus data to be labeled;

upon determining that the preset labeling result exists, match the target labeling result against the preset labeling result;

and use the matching result as the feedback information for the target labeling result, to determine whether the target labeling result meets the accuracy requirement.

In this embodiment, the quality detection component matches the target labeling result against the preset labeling result to determine whether the target labeling result meets the requirement. This makes it possible to monitor the results produced by labeling personnel and lays a foundation for improving labeling quality. The approach is also simple and feasible, which supports subsequent online engineering services. Specifically, to give feedback on labeling results in real time during the labeling process, labeled data with preset labeling results that already meet the accuracy requirement can be mixed into the corpus data to be labeled. After a save operation on a target labeling result is detected, the quality detection component first determines whether a preset labeling result exists for the corpus data to be labeled. If so, it matches the target labeling result against the preset labeling result and uses the matching result to decide whether the target labeling result meets the requirement. Labeling personnel can thus be monitored in real time during labeling, which further supports labeling quality from the monitoring perspective.
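The preset-result check described above can be sketched as follows. This is a minimal illustration rather than the application's implementation: the `PRESET_LABELS` store, the corpus identifiers, the label strings, and the three verdict strings are all hypothetical, and "matching" is assumed here to mean exact equality.

```python
from typing import Dict, Optional

# Hypothetical store of preset labeling results: known-accurate labels for
# items that were mixed into the corpus data to be labeled.
PRESET_LABELS: Dict[str, str] = {
    "utt-001": "play_music",
    "utt-007": "set_alarm",
}


def check_against_preset(corpus_id: str, target_label: str) -> Optional[bool]:
    """Return True or False when a preset label exists (match or mismatch),
    and None when no preset label exists for this item."""
    preset = PRESET_LABELS.get(corpus_id)
    if preset is None:
        return None  # nothing to compare against, so this check gives no feedback
    # Assumption: matching means exact equality of label strings.
    return preset == target_label


def on_save(corpus_id: str, target_label: str) -> str:
    """Run when a target labeling result is saved; map the match to a verdict."""
    match = check_against_preset(corpus_id, target_label)
    if match is None:
        return "unchecked"
    return "accurate" if match else "inaccurate"
```

For example, `on_save("utt-001", "play_music")` yields `"accurate"`, while an item without a preset result yields `"unchecked"` and would need another feedback source.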

In one embodiment, the quality detection component is further configured to:

obtain satisfaction information for the target labeling result of the corpus data to be labeled;

and use that satisfaction information as the feedback information, to determine whether the target labeling result meets the accuracy requirement.

In this embodiment, online satisfaction information, namely satisfaction information for the target labeling result, is used to detect whether the target labeling result meets the requirement, which further supports labeling quality from the monitoring perspective.

In one embodiment, the quality detection component is further configured to:

determine, based on the feature information of the corpus data to be labeled, a corpus set to which the corpus data to be labeled belongs, wherein the feature information of the corpus data contained in the corpus set matches the feature information of the corpus data to be labeled, and the labeling result of the corpus data contained in the corpus set matches the target labeling result of the corpus data to be labeled;

obtain satisfaction information for the corpus set to which the corpus data to be labeled belongs;

and use the satisfaction information for that corpus set as the feedback information.

In this embodiment, online satisfaction information, namely satisfaction information for the corpus set, is used to detect whether the target labeling result meets the requirement, which further supports labeling quality from the monitoring perspective. Because the feature information of the corpus data to be labeled matches that of the corpus data in the corpus set, and the target labeling result of the corpus data to be labeled also matches the labeling result of the corpus data in the corpus set, the satisfaction information for the corpus set can serve as feedback information for the target labeling result of the corpus data to be labeled.
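One way to picture set-level satisfaction feedback is sketched below. The application does not specify how satisfaction information is represented or thresholded, so everything here is an assumption for illustration: satisfaction is modeled as a list of boolean votes, the `CorpusSet` type and the `threshold` value of 0.8 are hypothetical, and a set with no votes is treated as not meeting the requirement.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorpusSet:
    """Hypothetical group of corpus items with matching features and labels."""
    label: str
    satisfaction_votes: List[bool]  # assumed online feedback: True = satisfied


def satisfaction_rate(corpus_set: CorpusSet) -> float:
    """Fraction of satisfied votes; 0.0 when there is no feedback yet."""
    votes = corpus_set.satisfaction_votes
    return sum(votes) / len(votes) if votes else 0.0


def meets_accuracy(corpus_set: CorpusSet, threshold: float = 0.8) -> bool:
    """Use the set-level satisfaction rate as feedback for the target label.
    The threshold is an illustrative assumption, not a value from the source."""
    return satisfaction_rate(corpus_set) >= threshold
```

A set with nine satisfied votes out of ten passes the assumed threshold, while three out of four does not.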

In one embodiment, the system further includes an alarm component configured to output warning information upon determining that the target labeling result does not meet the accuracy requirement.

In this embodiment, the alarm component outputs warning information once a target labeling result is found not to meet the accuracy requirement, so that labeling personnel are alerted, which further supports labeling quality.

In a second aspect, an embodiment of the present application provides an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to implement the functionality of the corpus labeling system.

An embodiment of the above application has the following advantages or benefits:

the corpus labeling system provides a corpus labeling platform at the system architecture level. The prompt information of the auxiliary labeling component gives labeling personnel the associated corpus data and its labeling results to consult before labeling, and the quality detection component checks each saved target labeling result against feedback information. Together these lay a foundation for improving labeling quality.

Other effects of the above optional implementations are described below with reference to specific embodiments.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. In the drawings:

FIG. 1 is a schematic diagram of a corpus annotation system according to a first embodiment of the present application;

FIG. 2 is a diagram of a corpus annotation system according to a second embodiment of the present application;

FIG. 3 is a diagram illustrating a scenario of a specific application according to an embodiment of the present application;

FIG. 4 is a block diagram of an electronic device for implementing the corpus tagging system according to an embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Take existing smart speaker products (with or without a screen) as an example. Such products receive tens of millions or even billions of spoken requests from users every day. To understand and satisfy these requests, correct labels must be established for the language expressions that users issue online, so that dictionaries and language models can be repaired, improving user satisfaction and the product experience. Corpus labeling is clearly the most basic step in this process, and corpus labeling quality plays a crucial role in subsequent dictionary repair and/or language model processing. On this basis, the embodiment of the application provides a corpus labeling system that aims to monitor and manage the labeling quality of labeling personnel and improve the accuracy of corpus labeling, laying a foundation for delivering reliable corpus labeling results to the model or dictionary.

Specifically, as shown in FIG. 1, the system includes at least:

an auxiliary labeling component 101, configured to display an auxiliary labeling interface in response to a labeling request for corpus data to be labeled, where the auxiliary labeling interface displays at least prompt information, and the prompt information represents information about a labeling result of corpus data associated with the corpus data to be labeled;

a labeling component 102, configured to respond to an input operation representing a labeling result and to display a target labeling result at a position corresponding to the corpus data to be labeled;

and a quality detection component 103, configured to start a detection mechanism in response to a save operation on the target labeling result, acquire feedback information for the target labeling result, and determine, based on the feedback information, whether the target labeling result meets the accuracy requirement.

It should be noted that, the components of the embodiments of the present application may be specifically integrated into one device, or may be respectively integrated into different devices, and the embodiments of the present application are not limited thereto as long as the technical solutions of the embodiments of the present application can be implemented.
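The division of responsibilities among components 101 to 103 can be sketched as three small classes. This is only an illustrative outline under stated assumptions: the class and method names, the dictionary-based hint store, and the callable feedback source are all hypothetical and do not come from the application, which leaves the concrete integration open.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AuxiliaryLabelingComponent:
    """Component 101: returns prompt information for a labeling request."""
    # Hypothetical lookup: corpus text -> hints drawn from associated corpus data.
    hints: Dict[str, List[str]] = field(default_factory=dict)

    def show_prompt(self, text: str) -> List[str]:
        return self.hints.get(text, [])


@dataclass
class LabelingComponent:
    """Component 102: records the target labeling result for each item."""
    labels: Dict[str, str] = field(default_factory=dict)

    def apply_label(self, text: str, label: str) -> None:
        self.labels[text] = label


@dataclass
class QualityDetectionComponent:
    """Component 103: on save, fetches feedback and decides on accuracy."""
    # Hypothetical feedback source: returns True when feedback is positive.
    get_feedback: Callable[[str, str], bool]

    def on_save(self, text: str, label: str) -> bool:
        return self.get_feedback(text, label)
```

Because each class holds only its own state, the three could run in one process or be placed behind separate services, in line with the note above that the components may be integrated into one device or several.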

In a specific example, as shown in FIG. 2, the system further includes an alarm component 104, configured to output warning information upon determining that the target labeling result does not meet the accuracy requirement. That is, in this example the alarm component outputs warning information once a target labeling result is found not to meet the accuracy requirement, so that labeling personnel are alerted, which further supports labeling quality.

In a specific example, to avoid unnecessary labeling work and wasted manual labeling resources, only corpus data that has already been labeled by an automatic labeling method and whose labeling result does not meet the accuracy requirement is labeled manually. In other words, the corpus data to be labeled in this example is corpus data whose automatic labeling result falls short of the accuracy requirement. This avoids wasting manual labeling resources while still using manual labeling to improve labeling quality, which meets engineering requirements and lays a foundation for subsequent online engineering services. It should be noted that the automatic labeling method may be any existing corpus labeling method, and the embodiment of the present application does not limit this. Specifically, the labeling component is further configured to display a corpus data list, where the corpus data list includes corpus data to be labeled whose corpus labeling result does not meet the accuracy requirement.
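Building the corpus data list from automatically labeled data might look like the sketch below. The `AutoLabeled` record and its `meets_accuracy` flag are hypothetical: the application does not say how the accuracy of an automatic labeling result is judged, so the flag is assumed to be supplied by some upstream check.

```python
from typing import List, NamedTuple


class AutoLabeled(NamedTuple):
    """Hypothetical record for one automatically labeled corpus item."""
    text: str
    auto_label: str
    meets_accuracy: bool  # assumed to be set by an upstream accuracy check


def build_worklist(items: List[AutoLabeled]) -> List[AutoLabeled]:
    """Keep only items whose automatic labeling result fell short, so that
    manual effort is spent only where it is needed."""
    return [item for item in items if not item.meets_accuracy]
```

Items that already meet the requirement never reach labeling personnel, which is the resource saving the paragraph above describes.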

In the embodiment of the present application, the following manners may be adopted to enrich the content of the prompt information. Specifically:

Manner one: the auxiliary labeling component is further configured to acquire a historical labeling result of the corpus data associated with the corpus data to be labeled, and to use the associated corpus data and its historical labeling result as the prompt information.

That is, in this manner the associated corpus data and its historical labeling result are displayed as the prompt information, so that labeling personnel can refer to the historical labeling result of the associated corpus data when completing the current labeling operation, which lays a foundation for improving labeling quality.

In practical applications, the historical labeling result is one that has passed a satisfaction check, that is, it has been determined to meet the accuracy requirement on the basis of satisfaction information. The information provided to labeling personnel is therefore sufficiently accurate, which lays a foundation for improving labeling quality.

Manner two: the auxiliary labeling component is further configured to acquire a preset corpus data set matching the feature information of the corpus data to be labeled, where the feature information of the preset corpus data items in the set matches one another and the labeling results of the preset corpus data items are the same or related, and to use the preset corpus data set and its labeling result as the prompt information. The feature information may specifically be semantic features and/or text features corresponding to the corpus data to be labeled.

That is, in this manner the preset corpus data set with matching feature information and the labeling result of that set are displayed as the prompt information, so that labeling personnel can refer to both when completing the current labeling operation, which lays a foundation for improving labeling quality. This implementation also enriches the prompt information along a different dimension, giving labeling personnel multi-dimensional reference opinions.

In practical applications, the labeling result corresponding to the preset corpus data set is one that has passed a satisfaction check, that is, it has been determined to meet the accuracy requirement on the basis of satisfaction information. The information provided to labeling personnel is therefore sufficiently accurate, which lays a foundation for improving labeling quality.

In practical applications, a similarity calculation may be used to determine the preset corpus data set matching the feature information of the corpus data to be labeled; the embodiment of the present application does not limit the specific calculation.
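Since the specific similarity calculation is left open, the sketch below swaps in one simple choice, Jaccard similarity over whitespace tokens, purely for illustration. The `PRESET_SETS` contents, the label names, and the `threshold` of 0.3 are hypothetical, and a real system would likely use richer semantic or text features.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical preset corpus data sets; the items in each set share one label.
PRESET_SETS: Dict[str, List[str]] = {
    "play_music": ["play some music", "play a song", "play music please"],
    "set_alarm": ["set an alarm", "wake me up at seven"],
}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase whitespace-token sets of a and b."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def best_matching_set(
    text: str, threshold: float = 0.3
) -> Optional[Tuple[str, List[str]]]:
    """Return (label, examples) of the most similar preset set, or None when
    no set reaches the assumed similarity threshold."""
    best_label: Optional[str] = None
    best_score = 0.0
    for label, examples in PRESET_SETS.items():
        score = max(jaccard(text, example) for example in examples)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None or best_score < threshold:
        return None
    return best_label, PRESET_SETS[best_label]
```

The returned label and example list together correspond to the preset corpus data set and its labeling result that Manner two shows as prompt information.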

Manner three: the auxiliary labeling component is further configured to select, from a labeling template, reference corpus data associated with the corpus data to be labeled together with the labeling result of that reference corpus data, where the labeling template represents a correspondence between reference corpus data and labeling results, and to use the reference corpus data and the labeling result selected from the labeling template as the prompt information.

That is, in this manner a labeling template is provided, and the reference corpus data in the template and its labeling results are used as prompts. The reference corpus data associated with the corpus data to be labeled and its labeling result are selected from the labeling template and displayed in the prompt information, assisting labeling personnel in completing the current labeling operation and laying a foundation for improving labeling quality. This implementation also enriches the prompt information along a different dimension, giving labeling personnel multi-dimensional reference opinions.

In a specific example, the labeling result of the reference corpus data in the labeling template is one that has passed a satisfaction check, that is, it has been determined to meet the accuracy requirement on the basis of satisfaction information. The information provided to labeling personnel is therefore sufficiently accurate, which lays a foundation for improving labeling quality.

In practical applications, the labeling template may further include a word segmentation model for a fixed grammar and the word segmentation mode used by that model, as well as the labeling results of the corpus data obtained by segmenting the reference corpus data with the word segmentation model.

The fourth method is as follows: the auxiliary labeling component is further configured to: acquire the labeling result of corpus data associated with the corpus data to be labeled; determine semantic features of the corpus data to be labeled based on that labeling result; and use the semantic features of the corpus data to be labeled as the prompt information.

That is, this method enriches the prompt information along the dimension of semantic features. The labeling result of the corpus data associated with the corpus data to be labeled is obtained first and is then analyzed to derive the semantic features of the corpus data to be labeled, that is, its true intention. These semantic features are used as the prompt information to assist labeling personnel in completing the labeling, which lays a foundation for improving labeling quality. In addition, this implementation enriches the prompt information along a further dimension, so that labeling personnel receive multi-dimensional reference opinions, which further lays a foundation for improving labeling quality.

In practical applications, the labeling result of the corpus data associated with the corpus data to be labeled has passed a satisfaction check; that is, it has been determined to meet the accuracy requirement on the basis of satisfaction information. The information provided to labeling personnel is therefore sufficiently accurate, which lays a foundation for improving labeling quality.

It should be noted that, in practical applications, any one of the four methods may be used alone, or any two or more of them may be combined; the embodiments of the present application are not limited in this respect.

In the embodiments of the present application, the quality detection component can implement the quality detection process in the following ways. Specifically:

The first method is as follows: the quality detection component is further configured to: detect whether a preset labeling result exists for the corpus data to be labeled; when a preset labeling result is determined to exist, match the target labeling result against the preset labeling result; and use the matching result as the feedback information for the target labeling result, so as to determine whether the target labeling result meets the accuracy requirement.

That is, in this method the quality detection component matches the target labeling result against the preset labeling result to determine whether the target labeling result meets the requirement. This makes it possible to monitor the results produced by labeling personnel and lays a foundation for improving labeling quality. The method is also simple and feasible, which lays a foundation for subsequent online engineering services. Specifically, to give labeling personnel real-time feedback while they work, labeled data that already has a preset labeling result can be mixed into the corpus data to be labeled, where the preset labeling result meets the accuracy requirement. After a save operation on the target labeling result is detected, the quality detection component first determines whether a preset labeling result exists for the corpus data to be labeled. If so, it matches the target labeling result against the preset labeling result and uses the matching result to determine whether the target labeling result meets the requirement. This allows labeling personnel to be monitored in real time during labeling and further lays a foundation for improving labeling quality from the monitoring perspective.
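The check described above can be sketched roughly as follows. This is a minimal illustration rather than the patent's implementation: the names `GOLD`, `build_queue` and `check_on_save`, the dictionary-shaped labeling results, and the use of exact equality as the matching rule are all assumptions made for the example.

```python
import random
from typing import Optional

# Hypothetical gold items: corpus text -> preset labeling result known to be accurate.
GOLD = {
    "play blue and white porcelain": {"intent": "play_song", "song": "blue and white porcelain"},
}

def build_queue(to_label, gold, seed=0):
    """Mix gold items into the ordinary queue so they are labeled like any other item."""
    queue = list(to_label) + list(gold)  # list(gold) yields the gold texts (dict keys)
    random.Random(seed).shuffle(queue)
    return queue

def check_on_save(text, target_label, gold) -> Optional[bool]:
    """Return None when no preset result exists, else whether the target result matches it."""
    preset = gold.get(text)
    if preset is None:
        return None  # ordinary item: no real-time feedback is possible
    return target_label == preset  # assumed matching rule: exact equality

queue = build_queue(["volume to ten", "what is the weather"], GOLD)
```

Returning `None` for items without a preset result keeps "no feedback available" distinct from "labeled incorrectly".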

The second method is as follows: the quality detection component is further configured to: obtain satisfaction information for the target labeling result of the corpus data to be labeled; and use that satisfaction information as the feedback information, so as to determine whether the target labeling result meets the accuracy requirement. That is, this implementation uses online satisfaction information, namely the satisfaction information that users feed back on the target labeling result, to detect whether the target labeling result meets the requirement, which further lays a foundation for improving labeling quality from the monitoring perspective.

The third method is as follows: the quality detection component is further configured to: determine, based on the feature information of the corpus data to be labeled, the corpus set to which the corpus data to be labeled belongs; obtain satisfaction information for that corpus set; and use the satisfaction information of the corpus set as the feedback information. Here, the feature information of the corpus data contained in the corpus set matches the feature information of the corpus data to be labeled, and the labeling result of the corpus data contained in the corpus set matches the target labeling result of the corpus data to be labeled.

That is, this method uses online satisfaction information, namely the satisfaction information of the corpus set, to detect whether the target labeling result meets the requirement, which further lays a foundation for improving labeling quality from the monitoring perspective. Because the feature information of the corpus data to be labeled matches that of the corpus data in the corpus set, the target labeling result produced during labeling also matches the labeling results of the corpus data in the set. The satisfaction information for the corpus set can therefore serve as feedback on the target labeling result of the corpus data to be labeled, which achieves the purpose of monitoring.
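As a rough sketch of how set-level satisfaction could stand in for feedback on a single labeling result, assume that satisfaction information is available as counts of satisfied and total online interactions per corpus set. The function name `cluster_feedback`, the count-based representation and the 0.8 threshold are illustrative assumptions, not values given in the source.

```python
def cluster_feedback(cluster_stats, cluster_id, min_rate=0.8):
    """Return (satisfaction_rate, meets_requirement) for a corpus set, or None if no data exists."""
    stats = cluster_stats.get(cluster_id)
    if not stats or stats["total"] == 0:
        return None  # no online feedback has been collected for this set yet
    rate = stats["satisfied"] / stats["total"]
    return rate, rate >= min_rate

# Hypothetical online statistics for one corpus set.
stats = {"song=blue and white porcelain": {"satisfied": 45, "total": 50}}
feedback = cluster_feedback(stats, "song=blue and white porcelain")
```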

It should be noted that, in practical applications, any one of the three methods may be used alone, or any two or more of them may be combined; the embodiments of the present application are not limited in this respect.

The embodiments of the present application thus provide a corpus labeling platform at the system architecture level, namely a corpus labeling system. Before labeling, the prompt information of the auxiliary labeling component prompts the labeling personnel: it shows corpus data associated with the corpus data to be labeled, together with the labeling results of that associated corpus data, so that labeling personnel can conveniently refer to it, which lays a foundation for improving labeling quality. In addition, the corpus labeling system is provided with a quality detection component. After the target labeling result for the corpus data to be labeled is saved, a detection mechanism is started and feedback information for the target labeling result is obtained, so that the quality detection component can detect whether the target labeling result meets the requirement, which further lays a foundation for improving labeling quality.

The embodiments of the present application are described in further detail below with reference to fig. 3. The corpus data to be labeled in the corpus labeling system may be online data from smart speaker products; that is, real voice interaction data from online users is given to labeling personnel for corpus labeling. For example, in practical applications the corpus data to be labeled corresponds to audio data of a smart speaker: audio data that a user inputs to the smart speaker is acquired and parsed into text data, and the text data is used as the corpus data to be labeled. In another example, the text data is input into an automatic labeling system; the text data whose automatically output labeling results are found, based on a detection result, not to meet the accuracy requirement is used as the corpus data to be labeled; and the reliable labeling results obtained after labeling are output to a model or an online dictionary, thereby improving the quality of the model or dictionary.
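The second example above, in which only automatically labeled text that fails the accuracy requirement is routed to manual labeling, might look like the sketch below. The source does not say how the detection result is computed; using a per-item confidence score with a 0.9 threshold is an assumption made purely for illustration, as are the field names.

```python
def select_for_manual_labeling(auto_results, min_confidence=0.9):
    """Keep the texts whose automatic labeling result falls below the assumed confidence threshold."""
    return [r["text"] for r in auto_results if r["confidence"] < min_confidence]

# Hypothetical output of an automatic labeling system.
auto_results = [
    {"text": "play blue and white porcelain", "label": "play_song", "confidence": 0.98},
    {"text": "a little bit", "label": "unknown", "confidence": 0.41},
]
to_label = select_for_manual_labeling(auto_results)
```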

The corpus labeling system in this example comprises four components, namely an auxiliary labeling component, a labeling component, a quality detection component and an alarm component. The function of each component is explained in detail below with reference to the example:

First, the auxiliary labeling component. Its main function is to provide labeling personnel with diversified labeling references and thereby improve, on the subjective side, the quality of their work. Specifically, as shown in fig. 3, the query corpus to be labeled is obtained, and the query corpus to be labeled and the labeling reference information are displayed. The component provides four labeling references, respectively:

Labeling reference one: auxiliary labeling based on historical data. The query corpus currently to be labeled (corresponding to the corpus data to be labeled) may have been correctly labeled by others before, so the correctly labeled historical result can be used as a reference. It should be noted that this is given only as a reference; labeling personnel are not required to follow it entirely, because the same query may express different intentions in different scenes and environments.

Labeling reference two: auxiliary labeling based on a template. A template with a fixed grammar is set up, and the query corpus to be labeled can be segmented automatically using the fixed grammar in the template. Furthermore, the template corresponds to a model, which can analyze the query corpus segmented by the template, label the segmented query corpus, and thus obtain its labeling result. The template's labeling can therefore also serve as auxiliary information for labeling personnel to refer to.
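One simple way to picture a fixed-grammar template is a regular expression with named slots. The sketch below makes that assumption; the patterns, the intent names and the function `label_with_template` are hypothetical, and the model that the source pairs with each template is replaced here by a fixed intent name per pattern.

```python
import re

# Hypothetical templates: each fixed grammar is a regex whose named groups are the segments.
TEMPLATES = [
    ("play_song", re.compile(r"^(?:i want to listen to|play) (?P<song>.+)$")),
    ("set_volume", re.compile(r"^volume to (?P<level>\w+)$")),
]

def label_with_template(query):
    """Segment the query with the first matching template and return a suggested labeling result."""
    normalized = query.strip().lower()
    for intent, pattern in TEMPLATES:
        match = pattern.match(normalized)
        if match:
            return {"intent": intent, "slots": match.groupdict()}
    return None  # no template applies; the annotator labels without this reference

suggestion = label_with_template("I want to listen to Blue and White Porcelain")
```

The returned dictionary is only a suggestion shown to the annotator, in keeping with the reference-only role described above.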

Labeling reference three: auxiliary labeling based on similar online query corpora. For example, a set of queries similar to the query corpus to be labeled is obtained using query similarity calculation methods such as query rewriting and edit distance, and the labeling results of the similar query set assist labeling personnel in completing the labeling. For example, if the query corpus to be labeled is "I want to listen to the blue and white porcelain", the similar query set is {"I want to listen to the blue and white porcelain", "I want to listen to the light and white porcelain"} and so on; the similar queries form a group, and their labeling results are basically consistent.
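Of the similarity methods named above, edit distance is the easiest to show concretely. The following sketch computes the standard Levenshtein distance and keeps candidates within a relative threshold; the function names and the 20% threshold are assumptions for illustration, and query rewriting is not covered.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep when equal
            ))
        prev = cur
    return prev[-1]

def similar_queries(query, candidates, max_ratio=0.2):
    """Return candidates whose distance to the query is at most max_ratio of the query length."""
    limit = max_ratio * max(len(query), 1)
    return [c for c in candidates if edit_distance(query, c) <= limit]

similar = similar_queries(
    "i want to listen to the blue and white porcelain",
    ["i want to listen to blue and white porcelain", "volume to ten"],
)
```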

Labeling reference four: auxiliary labeling based on preceding and following co-occurrence information. In general, the queries that precede and follow the same query may differ from user to user, so the main intention of a query can be determined from statistics over the preceding and/or following queries of different users who issued that query, which assists labeling personnel in completing the labeling. For example, if the query corpus to be labeled is "a little bit", the preceding and following co-occurrence information can reveal its real intention, namely adjusting the sound a little bit.
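The statistic described above can be approximated by a simple vote over the intents of neighboring queries. In this sketch a session is assumed to be an ordered list of `(query text, intent)` pairs, with `None` for queries that are not yet labeled; the data layout, the intent names and the function `main_intent` are all hypothetical.

```python
from collections import Counter

def main_intent(query, sessions):
    """Vote over the known intents of the queries directly before and after `query`."""
    votes = Counter()
    for session in sessions:
        for i, (text, _intent) in enumerate(session):
            if text != query:
                continue
            for j in (i - 1, i + 1):  # predecessor and successor positions
                if 0 <= j < len(session) and session[j][1] is not None:
                    votes[session[j][1]] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical sessions from three different users.
sessions = [
    [("turn up the sound", "adjust_volume"), ("a little bit", None)],
    [("a little bit", None), ("volume to ten", "adjust_volume")],
    [("make the light brighter", "adjust_brightness"), ("a little bit", None)],
]
hint = main_intent("a little bit", sessions)
```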

Second, the labeling component, which responds to a labeling person's input of a labeling result and displays the target labeling result at the position corresponding to the query corpus to be labeled.

Third, the quality detection component, which starts a detection mechanism after a save operation for the target labeling result is detected. The component can implement two detection mechanisms: one detects and feeds back in real time, and the other feeds back with a delay. The detection results of the two mechanisms are then summarized. Specifically:

1. A real-time quality detection function, which monitors the quality of the labeling personnel's labeled data in real time. It comprises two methods:

The first method: correct corpora that have been labeled manually, or corpora classified by the labeling model with high confidence, are mixed into the corpora to be labeled in routine manual labeling. Once a labeling person finishes labeling such a corpus, the accuracy of the labeling result can be determined immediately, so that feedback can be given at once when the result is inaccurate, which achieves real-time quality monitoring.

The second method: manual auditors are assigned to audit part of the labeling personnel's work, such as semantic parsing and query rewriting, so that the accuracy of the labeling results is fed back in real time and quality is monitored in real time.

2. A delayed quality detection function, which monitors the quality of the labeling personnel's labeled data with a delay. It comprises the following methods:

The first method: online effect satisfaction tracking, in which entity clusters are used to check the labeled data of the query corpus. Take a high-popularity entity cluster such as song = "blue and white porcelain": all variant expressions involving "blue and white porcelain", such as "listen to blue and white porcelain", "watch blue and white porcelain", "give me a blue and white porcelain" and "sing blue and white porcelain", belong to the entity cluster song = "blue and white porcelain". On this basis, when "blue and white porcelain" appears in the query corpus to be labeled, labeling personnel can label it according to this entity cluster. The online satisfaction classification model is then used to perform regression on the satisfaction data of the entity cluster, yielding satisfaction information for the cluster, which is fed back to labeling personnel with a delay so as to check the labeling quality.

The second method: online effect satisfaction tracking, in which generalized query clusters are used to check the labeled data of the query corpus to be labeled. For example, "volume to ten", "volume to twenty" and "volume to thirty" belong to one category of query, and a generalized query cluster can be obtained from this common feature. If the query corpus to be labeled contains similar content, it can be labeled according to the generalized query cluster, and the satisfaction information of the cluster can then be fed back to labeling personnel with a delay so as to check the labeling quality.
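A generalized query cluster of the kind just described can be formed by replacing the varying part of each query with a slot and grouping on the result. The sketch below assumes that numbers are the only varying part and recognizes digits plus a small, hypothetical list of number words; `generalize` and `build_clusters` are illustrative names.

```python
import re
from collections import defaultdict

# Digits or a few spelled-out numbers; the word list is illustrative, not exhaustive.
NUMBER = re.compile(r"\b(?:\d+|ten|twenty|thirty|forty|fifty)\b")

def generalize(query):
    """Replace concrete numbers with a <NUM> slot so that variants share one cluster key."""
    return NUMBER.sub("<NUM>", query.strip().lower())

def build_clusters(queries):
    """Group the original queries by their generalized form."""
    clusters = defaultdict(list)
    for q in queries:
        clusters[generalize(q)].append(q)
    return dict(clusters)

clusters = build_clusters(["volume to ten", "volume to twenty", "Volume to 30", "play a song"])
```

Satisfaction information collected online could then be aggregated per cluster key and fed back with a delay, as the text describes.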

Fourth, the alarm component, which outputs early-warning information when a labeling result is determined not to meet the accuracy requirement.

In practical applications, labeling personnel who repeatedly fail to meet the labeling accuracy requirement are given training, and the training documents can be derived from the detection results of the quality detection component.

In this way, the corpus labeling system can improve the labeling quality of corpus labeling personnel with limited manpower, and the reliable labeling results are then output to a model or an online dictionary to improve the quality of the model or dictionary.
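The alarm and training behavior described above could be sketched as a small counter per labeling person. The class name `AlarmComponent`, the message wording and the failure limit are assumptions for illustration; the source does not state how many failures count as "repeatedly".

```python
from collections import Counter

class AlarmComponent:
    """Counts failed checks per annotator and emits warning messages (illustrative only)."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures  # assumed limit before training is suggested
        self.failures = Counter()

    def report(self, annotator, meets_accuracy):
        """Record one detection result and return a list of messages, possibly empty."""
        if meets_accuracy:
            return []
        self.failures[annotator] += 1
        messages = [f"warning: labeling result from {annotator} did not meet the accuracy requirement"]
        if self.failures[annotator] >= self.max_failures:
            messages.append(f"training recommended for {annotator}")
        return messages

alarm = AlarmComponent(max_failures=2)
first = alarm.report("annotator_1", False)
second = alarm.report("annotator_1", False)
```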

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

Fig. 4 is a block diagram of an electronic device of a corpus tagging system according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 4, the electronic device includes: one or more processors 401, a memory 402, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information for a Graphical User Interface (GUI) on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 401 is taken as an example.

Memory 402 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the corpus tagging system functions provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the corpus tagging system functions provided herein.

The memory 402, which is a non-transitory computer readable storage medium, can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the corpus annotation system in the embodiments of the present application (e.g., the auxiliary annotation component 101, the annotation component 102, the quality detection component 103, and the alert component 104 shown in fig. 2). The processor 401 executes various functional applications of the server and data processing by running the non-transitory software programs, instructions, and modules stored in the memory 402, that is, it realizes the functions of the corpus tagging system in the above-described embodiment.

The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the corpus tagging system, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, and these remote memories may be connected to the electronic device corresponding to the corpus annotation system via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device of the corpus tagging system may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.

The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the corpus tagging system. Examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, and a joystick. The output device 404 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, Application Specific Integrated Circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode Ray Tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

The embodiments of the present application provide a corpus labeling platform at the system architecture level, namely a corpus labeling system. Before labeling, the prompt information of the auxiliary labeling component prompts the labeling personnel: it shows corpus data associated with the corpus data to be labeled, together with the labeling results of that associated corpus data, so that labeling personnel can conveniently refer to it, which lays a foundation for improving labeling quality. In addition, the corpus labeling system is provided with a quality detection component. After the target labeling result for the corpus data to be labeled is saved, a detection mechanism is started and feedback information for the target labeling result is obtained, so that the quality detection component can detect whether the target labeling result meets the requirement, which further lays a foundation for improving labeling quality.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
