Online manual Chinese text annotation system

Document No.: 1556937  Publication date: 2020-01-21

Summary: The online manual Chinese text annotation system (在线人工中文文本标注系统) was designed by Luo Guan (罗冠), Wu Chaochen (吴超尘), and Hu Weiming (胡卫明) on 2019-09-12. The invention belongs to the field of text annotation and relates in particular to an online manual Chinese text annotation system. It aims to solve the problem that existing manual Chinese text annotation systems do not support multi-person collaborative annotation. The system comprises a server, a first client, and a second client, each client being connected to the server. The server comprises a database; the first client comprises a management module; the second client comprises an annotation module, a re-segmentation module, and a switching module. The management module is configured to split the text to be annotated and upload it to the database. The database is configured to associate annotation projects with annotation users according to an allocation instruction. The annotation module is configured to annotate the sentences in an annotation project. The re-segmentation module is configured to re-segment the sentences of an annotation project according to an input instruction. The switching module is configured to switch between the working states of the annotation module and the re-segmentation module. The invention enables multi-person collaborative annotation and improves the accuracy and efficiency of text annotation.

1. An online manual Chinese text annotation system, characterized in that the annotation system comprises a server, one or more first clients, and one or more second clients; the first clients and the second clients are each connected to the server, and the server comprises a database; each first client comprises a management module; each second client comprises an annotation module, a re-segmentation module, and a switching module;

the management module is configured to acquire a text to be annotated, split the text to be annotated into a plurality of annotation projects according to an input splitting instruction, and upload the text to be annotated, composed of the plurality of annotation projects, to the database; each annotation project comprises one or more segmented sentences;

the database is configured to store the text to be annotated, composed of the plurality of annotation projects, and to associate each annotation project with an annotation user according to a task allocation instruction input through the first client and/or the second client;

the annotation module is configured to acquire from the database the annotation project associated with the corresponding annotation user, annotate the sentences in the annotation project according to an input annotation instruction, and send the annotated project to the database;

the re-segmentation module is configured to re-segment a sentence of the annotation project according to an input instruction to obtain a sentence with a recombined word sequence;

and the switching module is configured to switch, for the acquired annotation project, between the working state of the annotation module and the working state of the re-segmentation module.

2. The online manual Chinese text annotation system of claim 1, wherein, in the annotation module, "annotating the sentences in the annotation project according to an input annotation instruction" comprises: obtaining, according to the input annotation instruction, an annotation label for each word in the sentence; the sentence is a segmented sentence comprising one or more words.

3. The online manual Chinese text annotation system of claim 2, wherein the input annotation instruction is a keyboard input instruction that is mapped to an annotation label by a JavaScript-based keyboard response function.

4. The online manual Chinese text annotation system of claim 1, wherein the annotation module is further configured to switch, via a preset switching instruction, between the sentences in the annotation project of each second client or between the words in a sentence.

5. The system of claim 2, wherein the annotation module is further configured to obtain and display all of the annotation labels.

6. The online manual Chinese text annotation system of claim 1, wherein, in the re-segmentation module, "re-segmenting a sentence of the annotation project according to an input instruction to obtain a sentence with a recombined word sequence" comprises: re-selecting, through the input instruction, the check boxes located between adjacent Chinese characters in the sentence, and recombining the word sequence according to the selection result to obtain the re-segmented sentence.

7. The system of claim 1, wherein the database is a relational database whose stored data comprise: user information, annotation sentence information, user annotation information, and project information.

8. The online manual Chinese text annotation system of claim 7, wherein the user information comprises a user id, a user name, a password, and a user type; the annotation sentence information comprises a sentence id, a sentence text, a word segmentation identifier, an initial label, and a project id; the user annotation information comprises a user id, a sentence id, a word segmentation identifier, and a user label; the project information comprises a project id and the user ids having access rights to the project.

9. The system of claim 1, wherein the management module is further configured to generate database statements according to an input instruction of the first client and thereby add an annotation user, delete an annotation user, query the progress of the annotation project of an annotation user, delete an annotation project, and modify the access rights of an annotation project.

10. The online manual Chinese text annotation system of claim 1, wherein the pages corresponding to the modules of the first client and the second client are HTML pages; when an HTML page interacts with the server, it accesses the server through an Ajax request and receives the request processing result; otherwise, dynamic interaction within the HTML page is handled by the jQuery framework.

11. The system of claim 1, wherein, in the annotation module, when the second client accesses its annotation project, the server caches all sentences of that annotation project from the database into the memory of the client.

12. The system of claim 1, wherein, if the database of the server holds multiple annotation results and/or segmentation results for the same sentence, the annotation results and/or segmentation results of the sentence are calibrated according to the priority of the annotation users.

Technical Field

The invention belongs to the field of text annotation and relates in particular to an online manual Chinese text annotation system.

Background

With the rapid development of artificial intelligence and natural language processing, these technologies are increasingly applied in education, medicine, scientific research, commerce, and other fields. Machine-learning-based natural language processing generally requires high-quality, manually annotated data for model training, but annotated Chinese corpora are currently scarce, and the scale and quality of most corpora fall short of what high-quality commercial models require. Annotated Chinese corpora therefore often have to be produced manually.

Traditionally, sentences are annotated manually in text editors such as Notepad++, Visual Studio Code, or Notepad. These editors are designed for saving, browsing, adding to, and modifying text, so annotating with them is time-consuming and error-prone. In Chinese entity annotation, for example, the annotator must locate the current sentence and then find the word to annotate, and can easily miss some of the words or sentences. Annotators also need to switch frequently between the content being annotated, usually by dragging the scroll bar or opening another file, which costs time and is tiring.

Compared with text editors, existing standalone annotation software improves annotation efficiency and accuracy and offers project management functions. It is suitable only for small teams of one or two annotators, however, and cannot support collaboration among dozens of annotators or more. For example, when a sentence needs to be annotated jointly, a standalone program cannot be networked, so the texts annotated by two people must be copied to the same computer for comparison; and if a word segmentation error is found during annotation, it cannot be corrected directly. The invention therefore provides an online manual Chinese text annotation system.

Disclosure of Invention

To solve the above problem in the prior art, namely that existing manual Chinese text annotation systems cannot support multi-user collaborative annotation, a first aspect of the invention provides an online manual Chinese text annotation system comprising a server, one or more first clients, and one or more second clients; the first clients and the second clients are each connected to the server, and the server comprises a database; each first client comprises a management module; each second client comprises an annotation module, a re-segmentation module, and a switching module;

the management module is configured to acquire a text to be annotated, split the text to be annotated into a plurality of annotation projects according to an input splitting instruction, and upload the text to be annotated, composed of the plurality of annotation projects, to the database; each annotation project comprises one or more segmented sentences;

the database is configured to store the text to be annotated, composed of the plurality of annotation projects, and to associate each annotation project with an annotation user according to a task allocation instruction input through the first client and/or the second client;

the annotation module is configured to acquire from the database the annotation project associated with the corresponding annotation user, annotate the sentences in the annotation project according to an input annotation instruction, and send the annotated project to the database;

the re-segmentation module is configured to re-segment a sentence of the annotation project according to an input instruction to obtain a sentence with a recombined word sequence;

and the switching module is configured to switch, for the acquired annotation project, between the working state of the annotation module and the working state of the re-segmentation module.

In some preferred embodiments, "annotating the sentences in the annotation project according to an input annotation instruction" in the annotation module is performed as follows: an annotation label is obtained for each word in the sentence according to the input annotation instruction; the sentence is a segmented sentence comprising one or more words.

In some preferred embodiments, the input annotation instruction is a keyboard input instruction that is mapped to an annotation label by a JavaScript-based keyboard response function.

In some preferred embodiments, the annotation module is further configured to switch, via a preset switching instruction, between the sentences in the annotation project of each second client or between the words in a sentence.

In some preferred embodiments, the annotation module is further configured to obtain and display all the annotation labels.

In some preferred embodiments, "re-segmenting a sentence of the annotation project according to an input instruction to obtain a sentence with a recombined word sequence" in the re-segmentation module is performed as follows: the check boxes located between adjacent Chinese characters in the sentence are re-selected through the input instruction, and the word sequence is recombined according to the selection result to obtain the re-segmented sentence.

In some preferred embodiments, the database is a relational database whose stored data comprise: user information, annotation sentence information, user annotation information, and project information.

In some preferred embodiments, the user information comprises a user id, a user name, a password, and a user type; the annotation sentence information comprises a sentence id, a sentence text, a word segmentation identifier, an initial label, and a project id; the user annotation information comprises a user id, a sentence id, a word segmentation identifier, and a user label; the project information comprises a project id and the user ids having access rights to the project.

In some preferred embodiments, the management module is further configured to generate database statements according to an input instruction of the first client and thereby add an annotation user, delete an annotation user, query the progress of the annotation project of an annotation user, delete an annotation project, and modify the access rights of an annotation project.

In some preferred embodiments, the pages corresponding to the modules of the first client and the second client are HTML pages; when an HTML page interacts with the server, it accesses the server through an Ajax request and receives the request processing result; otherwise, dynamic interaction within the HTML page is handled by the jQuery framework.

In some preferred embodiments, in the annotation module, when the second client accesses its annotation project, the server caches all sentences of that annotation project from the database into the memory of the client.

In some preferred embodiments, if the database of the server holds multiple annotation results and/or segmentation results for the same sentence, the annotation results and/or segmentation results of the sentence are calibrated according to the priority of the annotation users.

The beneficial effects of the invention are as follows:

The invention enables multi-person collaborative annotation and improves the accuracy and efficiency of text annotation. Because the annotation system is online, several users can independently annotate and segment the same sentence, and the differences between their results can be calibrated online. In the annotation module, the invention provides functions for quickly assigning a label to each word in a sentence of the annotation project through a preset annotation instruction, for quickly switching between the sentences of an annotation project or the words of a sentence through a switching instruction, and for displaying all annotation labels and their corresponding annotation instructions on the current annotation page. These functions save annotation time and annotator effort and improve annotation efficiency. In the re-segmentation module, the word sequence is recombined by selecting check boxes, so that the segmentation can be corrected and the sentence annotated again afterwards, which improves annotation accuracy. The added management module allows annotation projects and annotation users to be managed uniformly, online and in real time.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram of the functional architecture of a client of the online manual Chinese text annotation system according to one embodiment of the invention;

FIG. 2 is an exemplary diagram of the system architecture of the online manual Chinese text annotation system according to one embodiment of the invention;

FIG. 3 is an exemplary diagram of an annotation user interface in accordance with one embodiment of the invention;

FIG. 4 is an exemplary diagram of an administrator user interface of one embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

As shown in FIG. 1 and FIG. 2, the online manual Chinese text annotation system of the invention comprises a server, one or more first clients, and one or more second clients; the first clients and the second clients are each connected to the server, and the server comprises a database; each first client comprises a management module; each second client comprises an annotation module, a re-segmentation module, and a switching module;

the management module is configured to acquire a text to be annotated, split the text to be annotated into a plurality of annotation projects according to an input splitting instruction, and upload the text to be annotated, composed of the plurality of annotation projects, to the database; each annotation project comprises one or more segmented sentences;

the database is configured to store the text to be annotated, composed of the plurality of annotation projects, and to associate each annotation project with an annotation user according to a task allocation instruction input through the first client and/or the second client;

the annotation module is configured to acquire from the database the annotation project associated with the corresponding annotation user, annotate the sentences in the annotation project according to an input annotation instruction, and send the annotated project to the database;

the re-segmentation module is configured to re-segment a sentence of the annotation project according to an input instruction to obtain a sentence with a recombined word sequence;

and the switching module is configured to switch, for the acquired annotation project, between the working state of the annotation module and the working state of the re-segmentation module.

To explain the online manual Chinese text annotation system of the invention more clearly, each module in an embodiment of the invention is described in detail below with reference to the drawings.

The online manual Chinese text annotation system of the invention comprises a server, one or more first clients, and one or more second clients; the first clients and the second clients are each connected to the server, and the server comprises a database; each first client comprises a management module; each second client comprises an annotation module, a re-segmentation module, and a switching module. The modules are shown in FIG. 1 and described in detail as follows:

1. Annotation module

In entity annotation, an annotation label preferably consists of one or more English letters or Chinese characters, although other symbols may also be used. Existing annotation software requires the annotation user to type each label in full. In this embodiment, a keyboard input instruction corresponding to each annotation label, i.e., the annotation instruction, is set in advance through a JavaScript keyboard response function, so that during annotation a single key press assigns the annotation label to a word in a sentence of the annotation project. The shortcut keys are preferably the numeric keys, although other keys on the keyboard may be used instead. An annotation user is the second-client code or user identity code used to log in through the second client, i.e., the person who annotates the annotation project. Personnel in the system are divided into annotation users and administrator users; the first client is the client coded for administrator users, and the second client is the client coded for annotation users.

In addition, to improve annotation efficiency and allow fast switching between the sentences of an annotation project or the words of a sentence, the annotation module in this embodiment is further configured with a switching instruction. Preferably, the switching instruction jumps automatically to the next sentence or word; the words in a sentence can also be switched with the left and right arrow keys. The sentence or word to be processed therefore never has to be selected by mouse click, which saves annotation time.
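The shortcut and switching behaviour described above can be illustrated with a minimal sketch. It is written in Python rather than the JavaScript the embodiment uses, and the label names (`PER`, `LOC`, `ORG`, `O`), the `SentenceAnnotation` class, and the choice to advance the cursor after each label are assumptions made for illustration; the patent names no concrete labels or data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical shortcut table: the patent prefers numeric keys but names no labels.
KEY_TO_LABEL: Dict[str, str] = {"1": "PER", "2": "LOC", "3": "ORG", "0": "O"}


@dataclass
class SentenceAnnotation:
    """Annotation state for one segmented sentence."""
    words: List[str]
    labels: List[Optional[str]] = field(default_factory=list)
    cursor: int = 0  # index of the word currently being annotated

    def __post_init__(self) -> None:
        if not self.labels:
            self.labels = [None] * len(self.words)

    def press(self, key: str) -> None:
        """Apply one key press: a shortcut key labels the current word and
        jumps to the next one; the arrow keys only move the cursor."""
        if not self.words:
            return
        last = len(self.words) - 1
        if key in KEY_TO_LABEL:
            self.labels[self.cursor] = KEY_TO_LABEL[key]
            self.cursor = min(self.cursor + 1, last)  # assumed auto-jump
        elif key == "ArrowLeft":
            self.cursor = max(self.cursor - 1, 0)
        elif key == "ArrowRight":
            self.cursor = min(self.cursor + 1, last)
```

In a browser, the same table would typically be consulted inside a keyboard event handler, so that a single key press replaces typing the full label.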

Once the annotator has jumped to a word, annotation speed in traditional methods depends on how familiar the annotation user is with the annotation labels. In entity annotation, however, the candidate labels often number in the dozens or more, and a user unfamiliar with them must look them up repeatedly, which greatly reduces efficiency. To solve this problem, a label display is added to the annotation module: all preset annotation labels and their corresponding annotation instructions are obtained and displayed on the current client annotation page, so that the user can check the content, a brief description, and the annotation instruction of each label at any time.

The invention also provides the user client with an interactive page accessed through a browser. Through this page the user accesses content on the remote server, and the layout of the page elements can be adjusted to the user's screen size, so that users with different browser sizes can see every element of the page. Users often encounter slow networks during annotation, especially when collaborating across provinces or countries; frequent interaction between the page and the server then degrades the experience, lengthens waiting time, and puts pressure on the server. The invention limits the number of interactions with the server through a front-end/back-end separated architecture and preloading, which both improves the user experience and lightens the load on the server. The specific steps are as follows:

First, the pages of the first client and the second client are written as static pure-HTML web pages, and the dynamic functions that do not involve the server are implemented mainly with jQuery. For functions that do interact with the server, the HTML page requests data from the server via Ajax, without a page refresh, and processes the returned result.

Second, the annotation text is divided into a plurality of annotation projects, each comprising one or more segmented sentences. When a user opens an annotation project, all sentences in the project are loaded into the user's cache, so that the user can browse or switch between the sentences of a project without interacting with the server. The system also lets the user check the annotation progress of each project.
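The preloading step can be sketched as a small client-side cache. This is only an illustration under stated assumptions: `ProjectCache`, `fetch_project`, and the `server_calls` counter are hypothetical names, and the real system performs the fetch with an Ajax request from the browser rather than a Python callable.

```python
from typing import Callable, List


class ProjectCache:
    """Holds every sentence of the open project so that browsing is local."""

    def __init__(self, fetch_project: Callable[[int], List[str]]) -> None:
        self._fetch = fetch_project      # stand-in for the Ajax request
        self._sentences: List[str] = []
        self._index = 0
        self.server_calls = 0            # counts round trips, for illustration

    def open_project(self, project_id: int) -> None:
        """One round trip loads all sentences of the project."""
        self._sentences = self._fetch(project_id)
        self.server_calls += 1
        self._index = 0

    def current(self) -> str:
        return self._sentences[self._index]

    def next(self) -> str:
        """Move to the next sentence without contacting the server."""
        self._index = min(self._index + 1, len(self._sentences) - 1)
        return self.current()

    def previous(self) -> str:
        """Move to the previous sentence without contacting the server."""
        self._index = max(self._index - 1, 0)
        return self.current()
```

However many times `next` or `previous` is called, `server_calls` stays at one per opened project, which is the effect the preloading step aims for.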

2. Re-segmentation module

In this embodiment, the re-segmentation module is configured to re-select, through the input instruction, the check boxes located between adjacent Chinese characters in a sentence of the annotation project, and to recombine the word sequence according to the selection result to obtain the re-segmented sentence.

The re-segmentation module is used mainly to correct word segmentation. Existing word segmentation tools cannot segment every Chinese sentence correctly, especially in highly specialized texts such as professional books on medicine, law, or engineering. The invention therefore adds a segmentation correction function on top of annotation. When an annotation user finds during annotation that a sentence in an annotation project is segmented incorrectly, the user can switch from the annotation module to the re-segmentation module through the switching module and enter the modification mode. The check boxes between adjacent Chinese characters are selected by mouse click, and the word sequence is recombined from the selection result; that is, a split is added or removed between two characters, and a checked box marks a word boundary. After the annotation user saves the new segmentation, the system is updated and the newly segmented sentence overwrites the old one.
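The check-box mechanism amounts to keeping one boolean per gap between adjacent characters. The following sketch shows one way to recombine the word sequence from those flags; the function names `resegment` and `boundaries_of` are hypothetical, and the patent does not specify how the check-box state is stored.

```python
from typing import List


def resegment(text: str, checked: List[bool]) -> List[str]:
    """Rebuild the word sequence from check-box states.

    checked[i] is True when the box between text[i] and text[i + 1]
    is selected, i.e. a word boundary lies there.
    """
    if len(checked) != max(len(text) - 1, 0):
        raise ValueError("need exactly one check box per gap between characters")
    words: List[str] = []
    start = 0
    for i, is_boundary in enumerate(checked):
        if is_boundary:
            words.append(text[start:i + 1])
            start = i + 1
    if text:
        words.append(text[start:])  # the last word has no box after it
    return words


def boundaries_of(words: List[str]) -> List[bool]:
    """Initial check-box states for an existing segmentation (non-empty words)."""
    flags: List[bool] = []
    for word in words:
        flags.extend([False] * (len(word) - 1) + [True])
    return flags[:-1]  # drop the flag after the final character
```

For example, if a segmenter wrongly produced 南京 / 市长 / 江大桥, the annotator would clear and set boxes until the flags read `[False, False, True, False, True, False]`, and `resegment` would then yield 南京市 / 长江 / 大桥.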

Suppose the first client grants annotation users A and B access to the sentences of an annotation project. A and B each access the sentences and annotate them; during annotation both find that the default Chinese word segmentation in the annotation text is inaccurate, so each re-segments the sentences in the system and completes the annotation. A and B then save their annotated content back to the server. For quality control, the first client calibrates the annotation results and/or segmentation results according to the preset priorities of the annotation users: the results of a higher-priority annotation user override those of a lower-priority annotation user.
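The calibration rule can be sketched as picking, for each sentence, the result of the highest-priority user. The `UserAnnotation` record, the convention that a larger number means higher priority, and the tie-breaking rule (the first result seen is kept) are all assumptions; the patent only states that higher-priority results override lower-priority ones.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class UserAnnotation:
    user_id: int
    sentence_id: int
    segmentation: List[str]   # words after the user's re-segmentation
    labels: List[str]         # one label per word


def calibrate(annotations: Iterable[UserAnnotation],
              priority: Dict[int, int]) -> Dict[int, UserAnnotation]:
    """Keep, per sentence, the result of the highest-priority user.

    Assumes a larger number means higher priority; on a tie the
    result encountered first is kept.
    """
    best: Dict[int, UserAnnotation] = {}
    for ann in annotations:
        kept = best.get(ann.sentence_id)
        if kept is None or priority[ann.user_id] > priority[kept.user_id]:
            best[ann.sentence_id] = ann
    return best
```

With users A (priority 1) and B (priority 2) both submitting sentence 10, the calibrated result for sentence 10 is B's segmentation and labels.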

3. Switching module

The switching module is configured to switch, for the acquired annotation project, between the working state of the annotation module and the working state of the re-segmentation module. Its main use is when an annotation user finds that the segmentation of a sentence in an annotation project is wrong and wants to re-segment it: the switching module toggles between the two working states.

4. Management module

The management module is configured to acquire the text to be annotated, split it into a plurality of annotation projects according to an input splitting instruction, and upload the text to be annotated, composed of the plurality of annotation projects, to the database; each annotation project comprises one or more segmented sentences. This configuration constitutes the adding of annotation projects. The management module is further configured to generate database statements according to input instructions of the first client in order to add an annotation user, delete an annotation user, query the progress of an annotation user's annotation projects, delete an annotation project, and modify the access rights of an annotation project. The management module can be divided into project management and user management.

Project management comprises adding an annotation project, deleting an annotation project, modifying the correspondence between annotation projects and annotation users, querying the progress of an annotation project, and modifying the access rights of an annotation project; an annotation project comprises a plurality of annotation texts;

user management comprises adding an annotation user, deleting an annotation user, and querying the progress of the annotation projects assigned to an annotation user.
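As a rough illustration of the splitting step performed when annotation projects are added, the sketch below cuts a list of segmented sentences into fixed-size projects. Treating the splitting instruction as a "sentences per project" count is an assumption; the patent does not say what form the instruction takes, and `split_into_projects` is a hypothetical name.

```python
from typing import List

Sentence = List[str]  # one segmented sentence, as a list of words


def split_into_projects(sentences: List[Sentence],
                        sentences_per_project: int) -> List[List[Sentence]]:
    """Split the text to be annotated into projects of at most N sentences."""
    if sentences_per_project < 1:
        raise ValueError("sentences_per_project must be at least 1")
    return [
        sentences[i:i + sentences_per_project]
        for i in range(0, len(sentences), sentences_per_project)
    ]
```

Each resulting project contains one or more segmented sentences, matching the description above; the last project may be shorter than the others.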

To make the functions of the clients easier to understand, this embodiment provides example diagrams of the annotation-user and administrator-user interfaces of the annotation system.

FIG. 3 is an example of the annotation user interface. On the left is the management module, where the user can modify user information and switch annotation projects. In the middle is the annotation and word segmentation module, where the user annotates sentences. At the bottom is the control module, where the user can quickly switch between the sentences of an annotation project. The "annotator" shown in the interface is the annotation user. On the right is the annotation information display module, which obtains and displays all preset annotation labels and their corresponding annotation instructions. The "word segmentation mode" control corresponds to the switching module.

FIG. 4 is an example of the administrator user interface. On the left is the project selection and user management module, where the administrator can select a project to view and add new annotation users. At the top and in the middle are the project progress modules, where the administrator can check which annotation users are assigned to each project and how far each has progressed, and browse the texts the users have finished annotating. At the bottom is the project control module, where the administrator can download, add, delete, and modify projects and adjust users' access rights to them.

In the invention, the server responds to user requests from the clients and interacts with the database. The server is divided mainly into the following parts:

the data persistence layer, which encapsulates all tasks that interact with the database, including the database connection parameters;

the service layer, which mainly contains the business logic and the data processing algorithms; it further parses the requests passed on by the controller and calls the data persistence layer to interact with the database;

and the controller, which responds to user requests, calls the service layer to process them, and returns the processed data to the user client.
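The three-part division can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class names, the dictionary standing in for the database, the request fields, and the validation rules in the service layer are all assumptions.

```python
from typing import Dict, List, Optional, Set, Tuple


class AnnotationDao:
    """Data persistence layer: the only code that touches storage.
    A dict stands in for the relational database here."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[int, int], List[str]] = {}

    def save(self, user_id: int, sentence_id: int, labels: List[str]) -> None:
        self._rows[(user_id, sentence_id)] = list(labels)

    def load(self, user_id: int, sentence_id: int) -> Optional[List[str]]:
        return self._rows.get((user_id, sentence_id))


class AnnotationService:
    """Service layer: business rules, then delegation to the DAO."""

    def __init__(self, dao: AnnotationDao, allowed_labels: Set[str]) -> None:
        self._dao = dao
        self._allowed = allowed_labels

    def submit(self, user_id: int, sentence_id: int,
               words: List[str], labels: List[str]) -> None:
        if len(words) != len(labels):
            raise ValueError("one label is required per word")
        unknown = set(labels) - self._allowed
        if unknown:
            raise ValueError("unknown labels: " + ", ".join(sorted(unknown)))
        self._dao.save(user_id, sentence_id, labels)


class AnnotationController:
    """Controller: turns a client request into a service call and a reply."""

    def __init__(self, service: AnnotationService) -> None:
        self._service = service

    def handle_submit(self, request: dict) -> dict:
        try:
            self._service.submit(request["user_id"], request["sentence_id"],
                                 request["words"], request["labels"])
            return {"status": "ok"}
        except (KeyError, ValueError) as exc:
            return {"status": "error", "message": str(exc)}
```

Keeping storage access inside the persistence layer means the controller and service code would not need to change if the database or its connection parameters did.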

In the invention, the database is configured to store the text to be annotated, composed of a plurality of annotation projects, and to associate each annotation project with an annotation user according to the task allocation instruction input through the first client and/or the second client. It is a relational database whose stored data comprise: user information, annotation sentence information, user annotation information, and project information.

The user information comprises a user id, a user name, a password, and a user type; the annotation sentence information comprises a sentence id, a sentence text, a word segmentation identifier, an initial label, and a project id; the user annotation information comprises a user id, a sentence id, a word segmentation identifier, and a user label; the project information comprises a project id and the user ids having access rights to the project.
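The four kinds of stored data map naturally onto four relational tables. The sketch below uses SQLite through Python's standard `sqlite3` module purely for illustration; the patent does not name a database product, and the table names, column names, column types, and helper functions are assumptions derived from the field lists above.

```python
import sqlite3
from typing import List

# Hypothetical schema; column names follow the fields listed in the text.
SCHEMA = """
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    user_name  TEXT NOT NULL UNIQUE,
    password   TEXT NOT NULL,          -- a hash would be stored in practice
    user_type  TEXT NOT NULL           -- e.g. 'admin' or 'annotator'
);
CREATE TABLE sentences (
    sentence_id   INTEGER PRIMARY KEY,
    sentence_text TEXT NOT NULL,
    segmentation  TEXT NOT NULL,       -- word segmentation identifier
    initial_label TEXT,
    project_id    INTEGER NOT NULL
);
CREATE TABLE user_annotations (
    user_id      INTEGER NOT NULL REFERENCES users(user_id),
    sentence_id  INTEGER NOT NULL REFERENCES sentences(sentence_id),
    segmentation TEXT NOT NULL,
    user_label   TEXT NOT NULL,
    PRIMARY KEY (user_id, sentence_id)
);
CREATE TABLE project_access (
    project_id INTEGER NOT NULL,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    PRIMARY KEY (project_id, user_id)
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the four tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def grant_access(conn: sqlite3.Connection, project_id: int, user_id: int) -> None:
    """Associate an annotation project with an annotation user."""
    conn.execute(
        "INSERT OR IGNORE INTO project_access (project_id, user_id) VALUES (?, ?)",
        (project_id, user_id),
    )


def projects_for_user(conn: sqlite3.Connection, user_id: int) -> List[int]:
    """List the projects a user may access, in ascending id order."""
    rows = conn.execute(
        "SELECT project_id FROM project_access WHERE user_id = ? ORDER BY project_id",
        (user_id,),
    )
    return [row[0] for row in rows]
```

Storing each user's segmentation and labels in a separate `user_annotations` row per sentence is what allows several users to annotate the same sentence independently and have their results compared later.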

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

It should be noted that the online manual Chinese text annotation system provided in the foregoing embodiment is illustrated only by the above division into functional modules. In practical applications, the functions may be allocated to different functional modules as needed; that is, the modules or steps in the embodiments of the invention may be further decomposed or combined. For example, the modules of the foregoing embodiments may be combined into one module or split further into multiple sub-modules in order to perform all or part of the functions described above. The names of the modules and steps in the embodiments of the invention serve only to distinguish them and are not to be construed as unduly limiting the invention.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
