Information processing system, information processing method, and non-transitory computer-readable recording medium


This technology, "Information processing system, information processing method, and non-transitory computer-readable recording medium", was designed and created by 曾俋颖 and 汤珮茹 on 2019-10-08. Abstract: An information processing system, an information processing method, and a non-transitory computer-readable recording medium are provided. The information processing system includes at least one processor, a communication interface, and a database. The communication interface is coupled to the at least one processor. The database is coupled to the at least one processor and is configured to store at least one text received from the communication interface. The at least one processor is configured to: obtain a plurality of training words by using basic feature information of a plurality of words of the at least one text; classify the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; match a plurality of keywords in the first list and the second list against a text to be labeled, and respectively calculate confidence values of the text to be labeled with respect to the first list and the second list; and label the text to be labeled as the first category or the second category according to the confidence values.

1. An information processing system comprising:

at least one processor;

a communication interface coupled to the at least one processor; and

a database coupled to the at least one processor and configured to store at least one text received from the communication interface, wherein the at least one processor is configured to:

obtain a plurality of training words by using basic feature information of a plurality of words of the at least one text;

classify the training words to establish a first list corresponding to a first category and a second list corresponding to a second category;

match a plurality of keywords in the first list and the second list against a text to be labeled, and respectively calculate confidence values of the text to be labeled with respect to the first list and the second list; and

label the text to be labeled as the first category or the second category according to the confidence values.

2. The information processing system of claim 1, wherein the at least one processor is further configured to:

calculate a reference value of the training words by using the basic feature information and a probability value of the basic feature information.

3. The information processing system of claim 2, wherein the at least one processor is further configured to:

set the training words as the keywords of the first list in response to the reference value meeting a first threshold; and

set the training words as the keywords of the second list in response to the reference value meeting a second threshold.

4. The information processing system of claim 2, wherein the at least one processor is further configured to:

calculate the reference value of each of the training words by using occurrence frequency and popularity information of the training words; and

set the training words whose reference values meet a first threshold as the keywords of the first list, and set the training words whose reference values meet a second threshold as the keywords of the second list.

5. The information processing system of claim 1, wherein the at least one processor is further configured to:

set the keywords of the first list and the keywords of the second list as a plurality of node values of a dictionary tree; and

compare a plurality of words of the text to be labeled by using the plurality of node values.

6. The information processing system of claim 5, wherein the at least one processor is further configured to:

record a first matching number and set the first matching number as the confidence value of the first list, wherein the first matching number is the number of matches between the node values corresponding to the first list and the words of the text to be labeled; and

record a second matching number and set the second matching number as the confidence value of the second list, wherein the second matching number is the number of matches between the node values corresponding to the second list and the words of the text to be labeled.

7. The information processing system of claim 6, wherein the at least one processor is further configured to:

label the text to be labeled as the first category or the second category according to the maximum of the confidence value of the first list and the confidence value of the second list.

8. The information processing system of claim 1, wherein the at least one processor is further configured to:

receive a new text through the communication interface;

obtain a plurality of first keywords and/or a plurality of second keywords in the new text by using the basic feature information of a plurality of words in the new text; and

update the first list corresponding to the first category according to the first keywords and/or update the second list corresponding to the second category according to the second keywords.

9. The information processing system of claim 1, wherein the at least one processor is further configured to:

receive a new text through the communication interface;

obtain a plurality of third keywords in the new text by using the basic feature information of a plurality of words in the new text; and

establish a third list corresponding to a third category according to the third keywords in the new text.

10. The information processing system of claim 8, further comprising:

a user interface coupled to the at least one processor, wherein the user interface is configured to receive an operation command for the at least one processor to execute, so as to:

modify the first category into a fourth category, so that the first list corresponding to the fourth category comprises the plurality of first keywords; and/or

modify the second category into a fifth category, so that the second list corresponding to the fifth category comprises the plurality of second keywords.

11. An information processing method, comprising:

obtaining a plurality of training words by using basic feature information of a plurality of words of at least one text;

classifying the training words to establish a first list corresponding to a first category and a second list corresponding to a second category;

matching a plurality of keywords in the first list and the second list against a text to be labeled, and respectively calculating confidence values of the text to be labeled with respect to the first list and the second list; and

labeling the text to be labeled as the first category or the second category according to the confidence values.

12. The information processing method of claim 11, further comprising:

calculating a reference value of the training words by using the basic feature information and a probability value of the basic feature information.

13. The information processing method of claim 12, further comprising:

setting the training words as the keywords of the first list in response to the reference value meeting a first threshold; and

setting the training words as the keywords of the second list in response to the reference value meeting a second threshold.

14. The information processing method of claim 12, further comprising:

calculating the reference value of each training word by using the occurrence frequency and the popularity information of the training words; and

setting the training words whose reference values meet a first threshold as the keywords of the first list, and setting the training words whose reference values meet a second threshold as the keywords of the second list.

15. The information processing method of claim 11, further comprising:

setting the keywords of the first list and the keywords of the second list as a plurality of node values of a dictionary tree; and

comparing a plurality of words of the text to be labeled by using the plurality of node values.

16. The information processing method of claim 15, further comprising:

recording a first matching number and setting the first matching number as the confidence value of the first list, wherein the first matching number is the number of matches between the node values corresponding to the first list and the words of the text to be labeled; and

recording a second matching number and setting the second matching number as the confidence value of the second list, wherein the second matching number is the number of matches between the node values corresponding to the second list and the words of the text to be labeled.

17. The information processing method of claim 16, further comprising:

labeling the text to be labeled as the first category or the second category according to the maximum of the confidence value of the first list and the confidence value of the second list.

18. The information processing method of claim 11, further comprising:

obtaining a plurality of first keywords and/or a plurality of second keywords in a new text by using the basic feature information of a plurality of words in the new text; and

updating the first list corresponding to the first category according to the first keywords and/or updating the second list corresponding to the second category according to the second keywords.

19. The information processing method of claim 11, further comprising:

obtaining a plurality of third keywords in a new text by using the basic feature information of a plurality of words in the new text; and

establishing a third list corresponding to a third category according to the third keywords in the new text.

20. The information processing method of claim 18, further comprising:

modifying the first category into a fourth category, so that the first list corresponding to the fourth category comprises the plurality of first keywords; and/or

modifying the second category into a fifth category, so that the second list corresponding to the fifth category comprises the plurality of second keywords.

21. A non-transitory computer-readable recording medium storing a plurality of program codes, wherein, when the program codes are loaded into at least one processor, the at least one processor executes the program codes to perform the following steps:

obtaining a plurality of training words by using basic feature information of a plurality of words of at least one text;

classifying the training words to establish a first list corresponding to a first category and a second list corresponding to a second category;

matching a plurality of keywords in the first list and the second list against a text to be labeled, and respectively calculating confidence values of the text to be labeled with respect to the first list and the second list; and

labeling the text to be labeled as the first category or the second category according to the confidence values.

Technical Field

The present disclosure relates to a processing system and a processing method, and more particularly, to an information processing system and an information processing method.

Background

In a conventional text labeling method, articles are read manually (e.g., by analysts) and then labeled one by one according to the analysts' experience. However, such an approach is rather time-consuming, and the labeling results are highly dependent on the analysts' experience. Furthermore, since the articles must be read by analysts, there is considerable risk in terms of data privacy.

On the other hand, a mechanism that trains a classification model by machine learning requires a large number of accurately labeled articles to ensure the accuracy of the classification model. If the labeled articles are insufficient in number or poor in quality, the accuracy will also be low. Therefore, how to improve classification accuracy and data confidentiality at the same time is a technical problem that urgently needs to be solved in the field of text classification.

Disclosure of Invention

This summary is intended to provide a simplified overview of the disclosure in order to give the reader a basic understanding. It is not an extensive overview of the disclosure, and it is intended neither to identify key or critical elements of the embodiments nor to delineate the scope of the disclosure.

According to an embodiment of the present disclosure, an information processing system is disclosed, which includes at least one processor, a communication interface, and a database. The communication interface is coupled to the at least one processor. The database is coupled to the at least one processor and configured to store at least one text received from the communication interface. The at least one processor is configured to: obtain a plurality of training words by using basic feature information of a plurality of words of the at least one text; classify the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; match a plurality of keywords in the first list and the second list against a text to be labeled, and respectively calculate confidence values of the text to be labeled with respect to the first list and the second list; and label the text to be labeled as the first category or the second category according to the confidence values.

According to another embodiment, an information processing method is disclosed, including: obtaining a plurality of training words by using basic feature information of a plurality of words of at least one text; classifying the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; matching a plurality of keywords in the first list and the second list against a text to be labeled, and respectively calculating confidence values of the text to be labeled with respect to the first list and the second list; and labeling the text to be labeled as the first category or the second category according to the confidence values.

According to another embodiment, a non-transitory computer-readable recording medium storing a plurality of program codes is disclosed. When the program codes are loaded into at least one processor, the at least one processor executes the program codes to perform the following steps: obtaining a plurality of training words by using basic feature information of a plurality of words of at least one text; classifying the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; matching a plurality of keywords in the first list and the second list against a text to be labeled, and respectively calculating confidence values of the text to be labeled with respect to the first list and the second list; and labeling the text to be labeled as the first category or the second category according to the confidence values.

Drawings

The following detailed description, when read in conjunction with the accompanying drawings, facilitates a better understanding of the embodiments of the disclosure. It should be noted that the features in the drawings are not necessarily drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.

FIG. 1 is a functional block diagram illustrating an information handling system according to some embodiments of the present disclosure.

Fig. 2 is a flow chart illustrating an information processing method according to some embodiments of the present disclosure.

Fig. 3 is a flow chart illustrating an information processing method according to further embodiments of the present disclosure.

Description of reference numerals:

100 information processing system

110 processor

120 communication interface

130 database

140 user interface

Steps S210 to S240, S310 to S330

Detailed Description

The following disclosure provides many different embodiments, or examples, for implementing different features of the disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. Of course, these examples are merely illustrative and are not intended to be limiting. For example, forming a first feature over or on a second feature in the description that follows may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which additional features may be formed between the first and second features such that the first and second features may not be in direct contact. Additionally, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.

Referring to FIG. 1, a functional block diagram illustrating an information processing system 100 according to some embodiments of the present disclosure is shown. As shown in FIG. 1, the information processing system 100 includes a processor 110, a communication interface 120, and a database 130. In some embodiments, data processing may be performed by at least one processor 110, such that the information processing system 100 operates in a multithreading environment. For convenience of explanation, the present disclosure is described below with reference to an embodiment having a single processor 110.

The communication interface 120 is coupled to the processor 110 and configured to transmit text data to and receive text data from another device or system (not shown). In some embodiments, the communication interface 120 may be, but is not limited to, a communication chip supporting the Global System for Mobile Communications (GSM), Long Term Evolution (LTE), Worldwide Interoperability for Microwave Access (WiMAX), Wireless Fidelity (Wi-Fi), Bluetooth technology, or a wired network.

The database 130 is coupled to the processor 110. In some embodiments, the information processing system 100 may use an external database (not shown) outside the system, which is communicatively coupled to the processor 110 via the communication interface 120 to access data outside the system.

In some embodiments, the database 130 is configured to store at least one text received through the communication interface 120. The text may be a document in any language.

Referring to fig. 2, a flow chart illustrating an information processing method according to some embodiments of the present disclosure is shown. The information processing method of fig. 2 may be performed by the information processing system 100 of fig. 1. For the purpose of describing the information processing method of fig. 2, various related terms or elements will be described with reference to fig. 1.

In step S210, the processor 110 obtains a plurality of training words by using the basic feature information of a plurality of words of at least one text.

In some embodiments, the processor 110 uses words in the text as a basis for training keywords of the dictionary.

First, the processor 110 parses the words in the text through natural language processing techniques, for example, by finding words or word breaks in the text. Next, the processor 110 obtains the basic feature information of the words according to a predetermined database (not shown). The basic feature information may be, but is not limited to, the mutual information (MI), entropy, term frequency (TF), combination variance (AV), and context relationship (position) of the words. In some embodiments, the processor 110 uses a composite weight calculation formula, such as formula (1), to calculate the reference value of each word.

W(new word) = α×W_MI + β×W_entropy + γ×W_TF + δ×W_AV + ε×W_position, 0 < α, β, γ, δ, ε < 1 … formula (1)

In formula (1), W(new word) is the reference value of a word, W_MI is the mutual information of the word, W_entropy is the entropy value of the word, W_TF is the term frequency of the word, W_AV is the variation value between the word and its left and right neighboring words, W_position is the value of the relative relationship between the word and its context, and α, β, γ, δ, and ε are probability values. The mutual information is an estimate of the degree of closeness or relevance between a word and its adjacent words, and the entropy value is an estimate of the degree of freedom between the word and its adjacent words. Mutual information and entropy belong to information theory and are therefore not described in detail here.

Therefore, by adjusting the probability value of each piece of basic feature information in formula (1), different probability values can serve as references for subsequently finding the keywords.
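As a concrete illustration, the following is a minimal sketch of the composite weight calculation in formula (1); the feature scores and the probability values are hypothetical numbers, assuming the five basic features of a candidate word have already been obtained from the predetermined database.

```python
# Minimal sketch of formula (1); all numbers below are hypothetical.

FEATURES = ("MI", "entropy", "TF", "AV", "position")

def reference_value(features: dict, probs: dict) -> float:
    """W(new word) = alpha*W_MI + beta*W_entropy + gamma*W_TF
    + delta*W_AV + epsilon*W_position, each coefficient in (0, 1)."""
    return sum(probs[name] * features[name] for name in FEATURES)

# Hypothetical feature scores for one candidate word.
word_features = {"MI": 0.72, "entropy": 0.55, "TF": 0.31, "AV": 0.44, "position": 0.60}
# Probability values (alpha, beta, gamma, delta, epsilon), each between 0 and 1.
probs = {"MI": 0.3, "entropy": 0.2, "TF": 0.2, "AV": 0.2, "position": 0.1}

print(reference_value(word_features, probs))  # reference value W(new word)
```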

In step S220, the processor 110 classifies the training words to create a plurality of lists respectively corresponding to a plurality of categories.

In some embodiments, the processor 110 may set different thresholds to determine the classification of keywords. For example, suppose the training words detected in the text are "artificial intelligence server", "intelligent robot", "virtual assistant", "natural language", and "home appliance", and only the reference values of the first four are greater than a first threshold; those four training words are then set as keywords in a first list related to artificial intelligence (the first category). As another example, if the training words detected in the text are "financial transaction", "bitcoin", "smart contract", and "bank", and only the reference values of the first three are greater than a second threshold, the training words "financial transaction", "bitcoin", and "smart contract" are set as keywords in a second list related to blockchain (the second category). By analogy, the processor 110 may build many different lists.

In some embodiments, the keywords of the first list constitute a dictionary associated with artificial intelligence, and the keywords of the second list constitute a dictionary associated with blockchain. In this way, the information processing system 100 can classify or label texts to be classified based on these dictionary files. It is worth mentioning that the terms "list" and "dictionary" are used interchangeably in this disclosure.
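A short sketch of the threshold-based classification of step S220 follows. The text above does not spell out how a training word is associated with a candidate category, so the candidate categories, reference values, and thresholds below are illustrative assumptions.

```python
# Illustrative sketch of step S220: training words whose reference values meet
# the threshold of their candidate category become keywords of that category's list.

candidates = [
    # (training word, hypothetical reference value, hypothetical candidate category)
    ("artificial intelligence server", 0.82, "artificial intelligence"),
    ("intelligent robot",              0.75, "artificial intelligence"),
    ("virtual assistant",              0.71, "artificial intelligence"),
    ("natural language",               0.66, "artificial intelligence"),
    ("home appliance",                 0.40, "artificial intelligence"),  # filtered out
    ("financial transaction",          0.64, "blockchain"),
    ("bitcoin",                        0.58, "blockchain"),
    ("smart contract",                 0.55, "blockchain"),
    ("bank",                           0.32, "blockchain"),               # filtered out
]

# First and second thresholds, one per category (hypothetical values).
thresholds = {"artificial intelligence": 0.6, "blockchain": 0.5}

lists = {}
for word, value, category in candidates:
    if value >= thresholds[category]:
        lists.setdefault(category, []).append(word)

print(lists)  # first list: four AI keywords; second list: three blockchain keywords
```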

In step S225, the processor 110 determines whether the training of the dictionary is completed.

In some embodiments, steps S210 to S220 may be regarded as one loop, and in the list-building method of the present disclosure this loop may be executed repeatedly, obtaining training words again and again from the words of the same or different texts, so that the keywords classified into each category's list become more correct. For example, in loop L1, the training word "bank" may be classified into the second list of the blockchain category as one of its keywords. However, in loop L2, the training word "bank", which fits the "blockchain" category less well, may be removed from the second list. In this way, multiple loops are executed to continually update and optimize the keyword lists.

In some embodiments, the information processing method of the present disclosure uses a keyword extraction algorithm to reduce the time required to train words and to improve the precision of the training words. For example, the keyword extraction algorithm is the TextRank algorithm, as shown in formula (2).

WS(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} [ W_ji / Σ_{V_k ∈ Out(V_j)} W_jk ] × WS(V_j) … formula (2)

In formula (2), V_i, V_j, and V_k are different nodes, WS(V_i) is the weight value of node V_i, W_ji is the edge weight from node V_j to node V_i, In(V_i) is the set of all nodes pointing to node V_i, Out(V_j) is the set of all nodes pointed to by node V_j, and d is an adjustment factor (e.g., 0.85).

In some embodiments, the occurrence frequency and popularity information about words obtained from the term frequency-inverse document frequency (TF-IDF) technique are applied to the edge weights W_ji in formula (2) of the keyword extraction algorithm. Therefore, when the weight value of each node is calculated, the occurrence frequency and the popularity of different words are taken into account, which accelerates the convergence of the iterative calculation in formula (2). For example, the processor 110 calculates the weight values of N training words using formula (2). After the weight values are sorted (e.g., from largest to smallest), the first few (e.g., 50) training words are set as keywords and can be added to the lists.
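The following toy example illustrates the weighted TextRank iteration of formula (2); the graph nodes and edge weights, which in the method above would come from co-occurrence statistics scaled by TF-IDF, are hypothetical.

```python
# Toy TextRank iteration per formula (2). Edge weights w_ji are hypothetical,
# standing in for co-occurrence counts scaled by TF-IDF as described above.

d = 0.85  # adjustment factor

# edges[(j, i)] = w_ji, weight of the edge from node V_j to node V_i
edges = {
    ("bitcoin", "smart contract"): 1.2, ("smart contract", "bitcoin"): 1.2,
    ("bitcoin", "financial transaction"): 0.8, ("financial transaction", "bitcoin"): 0.8,
    ("smart contract", "financial transaction"): 0.5, ("financial transaction", "smart contract"): 0.5,
}
nodes = {n for edge in edges for n in edge}
# Sum of outgoing edge weights per node: the denominator over Out(V_j).
# (Every node here has outgoing edges, so no division by zero.)
out_sum = {j: sum(w for (src, _), w in edges.items() if src == j) for j in nodes}

ws = {n: 1.0 for n in nodes}  # initial weight of every node
for _ in range(30):  # iterate toward convergence
    ws = {i: (1 - d) + d * sum(w / out_sum[j] * ws[j]
                               for (j, dst), w in edges.items() if dst == i)
          for i in nodes}

keywords = sorted(ws, key=ws.get, reverse=True)  # keep the top-ranked words
print(keywords)
```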

In step S230, the processor 110 uses the keywords of the lists to perform matching in the text to be labeled, so as to calculate the confidence values of the lists.

In some embodiments, the present disclosure uses a multiple word multiple dictionary (MSMD) algorithm for text labeling. For example, the plurality of lists obtained in step S220 serve as a plurality of dictionaries D[1, …, d], where the dictionaries (e.g., dictionary 1 to dictionary d) correspond to mutually exclusive categories. Each dictionary contains a plurality of strings S[1, …, s]. In the matching process, the processor 110 takes a main string T from the text to be labeled and determines, one dictionary at a time, whether each dictionary is a matching category for T, for example, by searching whether the dictionary has a keyword that completely matches the main string T.

For example, the processor 110 sets the keywords in the first list as a plurality of first node values (also called first template strings) of a dictionary tree (trie), and sets the keywords in the second list as a plurality of second node values (also called second template strings) of the same dictionary tree. In other words, all keywords are integrated into one dictionary tree.

Then, the processor 110 compares the plurality of words of the text to be labeled with the first node values and the second node values at the same time. During the matching process, the first template strings of the dictionary tree are searched against the main string T of the text to be labeled, and each word of the main string T is compared with the first template strings one by one. In one embodiment, when the main string T completely matches any of the first template strings, the processor 110 records the matched template string, the number of times it appears in the text to be labeled, and the positions at which it appears in the text to be labeled. Similarly, each word of the main string T is compared with the second template strings one by one. When the main string T completely matches any of the second template strings, the processor 110 records the matched template string, the number of times it appears in the text to be labeled, and the positions at which it appears in the text to be labeled.

In some embodiments, the data structure of the dictionary tree stores strings with the same prefix in shared nodes (e.g., each character is stored in one node, so that the height of the dictionary tree is the longest string length plus one), and thus each string corresponds to a unique node. When searching the dictionary tree for the main string T, the search starts from the root node and proceeds layer by layer through the child nodes. On the other hand, because indexes (pointers) are used to record the strings in the dictionary tree, the processor 110 uses finite state machine control (such as the Aho-Corasick algorithm) and modifies the indexes while searching the dictionary tree against each pre-constructed template string. When any word in the main string T fails to match, the finite state machine falls back to another branch of the dictionary tree to avoid re-matching the same prefix, which reduces the time for searching the main string T and improves the efficiency of searching the dictionary tree.
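As an illustration of this matching structure, the sketch below merges the keywords of all lists into one trie and adds failure links in the spirit of the Aho-Corasick algorithm, so that a single pass over the main string matches every dictionary at once. This is a simplified sketch under assumed keyword lists, not the patented implementation.

```python
# Simplified Aho-Corasick-style sketch: one trie over all dictionaries,
# with failure links to avoid re-matching shared prefixes. Illustrative only.

from collections import deque

def build_automaton(keyword_lists: dict) -> list:
    # Each node: child map, failure link, and names of lists whose keyword ends here.
    trie = [{"next": {}, "fail": 0, "out": []}]
    for list_name, keywords in keyword_lists.items():
        for kw in keywords:
            node = 0
            for ch in kw:
                if ch not in trie[node]["next"]:
                    trie[node]["next"][ch] = len(trie)
                    trie.append({"next": {}, "fail": 0, "out": []})
                node = trie[node]["next"][ch]
            trie[node]["out"].append(list_name)
    queue = deque(trie[0]["next"].values())  # depth-1 nodes keep fail = root
    while queue:
        n = queue.popleft()
        for ch, child in trie[n]["next"].items():
            f = trie[n]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[child]["fail"] = trie[f]["next"].get(ch, 0)
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]
            queue.append(child)
    return trie

def match_counts(trie: list, text: str) -> dict:
    # Scan the main string T once; count matches per dictionary (list).
    counts, node = {}, 0
    for ch in text:
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]  # fall back instead of re-matching the prefix
        node = trie[node]["next"].get(ch, 0)
        for list_name in trie[node]["out"]:
            counts[list_name] = counts.get(list_name, 0) + 1
    return counts

trie = build_automaton({"artificial intelligence": ["intelligent robot", "virtual assistant"],
                        "blockchain": ["bitcoin", "smart contract"]})
print(match_counts(trie, "the intelligent robot signed a smart contract with bitcoin"))
# {'artificial intelligence': 1, 'blockchain': 2}
```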

It should be noted that the present disclosure is not limited to the dictionary tree algorithm, and any multi-string search algorithm falls within the scope of the present disclosure.

In addition, the dictionary tree is built from all keywords of all dictionaries according to the same-prefix rule. Since the dictionary tree contains all keywords of all dictionaries, a main string T can be matched against all dictionaries at the same time during the matching process. Compared with the common approach (in which keyword matching can be performed on only one dictionary at a time), matching multiple dictionaries simultaneously greatly improves the efficiency of keyword matching.

In the following, two dictionaries (lists) are integrated into a dictionary tree: the keywords corresponding to the first list are a plurality of first nodes of the dictionary tree, and the keywords corresponding to the second list are a plurality of second nodes of the dictionary tree.

In some embodiments, the processor 110 records the number of words of the text to be labeled that match the first node values (i.e., a first matching number), and records the number of words of the text to be labeled that match the second node values (i.e., a second matching number). Next, the processor 110 sets the first matching number as the confidence value of the first list and sets the second matching number as the confidence value of the second list.

In step S240, the processor 110 labels the text to be labeled as at least one of the categories according to the confidence values.

In some embodiments, the processor 110 takes the maximum of the confidence value of the first list and the confidence value of the second list. For example, if the confidence value of the first list is the maximum, the text to be labeled is labeled with the category corresponding to the first list (e.g., artificial intelligence). As another example, if the confidence value of the second list is the maximum, the text to be labeled is labeled with the category corresponding to the second list (e.g., blockchain). In another embodiment, the text to be labeled can also be labeled with more than one category.
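A minimal sketch of step S240 follows, assuming the per-list match counts (for example, those produced by the automaton sketch above) are used directly as the confidence values; the multi-category rule with a floor value is one hypothetical reading of the last sentence.

```python
# Sketch of step S240: pick the category with the maximum confidence value.
# The confidence values below are hypothetical match counts.

def label_text(confidence: dict, multi_label: bool = False, floor: int = 1) -> list:
    if not confidence:
        return []  # nothing matched; leave the text unlabeled
    if multi_label:
        # Hypothetical multi-category rule: keep every list whose confidence
        # value reaches the floor, per the "more than one category" embodiment.
        return [c for c, v in confidence.items() if v >= floor]
    return [max(confidence, key=confidence.get)]

print(label_text({"artificial intelligence": 7, "blockchain": 2}))
# ['artificial intelligence']
```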

Referring to fig. 3, a flow chart illustrating an information processing method according to further embodiments of the present disclosure is shown. This information processing method can further update the existing lists, so that the keywords of each category are more accurate.

In step S310, the processor 110 obtains at least one of a plurality of first keywords, a plurality of second keywords, and a plurality of third keywords by using the basic feature information of the words in the new text. For the step of obtaining the keywords, refer to the aforementioned steps S210 to S220; it is not repeated here.

In some embodiments, the processor 110 may receive the new text through the communication interface 120. The new text may be any text that may be used to train all lists, such as text already stored in the database 130, text to be labeled as described above, text that has not been utilized in the training procedure, and so forth.

In some embodiments, if keywords that can be classified into an existing category are found in the new text, step S320 is executed.

In step S320, the processor 110 updates the first list corresponding to the first category according to the first keywords and/or updates the second list corresponding to the second category according to the second keywords.

In another embodiment, if keywords (e.g., third keywords) that cannot be classified into any existing category are found in the new text, step S330 is executed.

In step S330, the processor 110 creates a third list corresponding to a third category according to the third keywords.

For example, if the keywords detected in the new text are "tablet", "display", "optical film", "glass screen", and so on, they belong neither to artificial intelligence (the first category) nor to blockchain (the second category). Thus, the processor 110 creates a third list corresponding to electronic information (the third category).
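The sketch below summarizes steps S320 and S330: keywords extracted from a new text either extend an existing list or establish a list for a new category. The routing of keywords to categories is assumed to come from a classification like step S220, and all names are illustrative.

```python
# Sketch of steps S320/S330: update existing lists or create a new one.

lists = {
    "artificial intelligence": {"intelligent robot", "virtual assistant"},
    "blockchain": {"bitcoin", "smart contract"},
}

def absorb(lists: dict, keywords_by_category: dict) -> None:
    for category, keywords in keywords_by_category.items():
        if category in lists:
            lists[category] |= set(keywords)   # step S320: update an existing list
        else:
            lists[category] = set(keywords)    # step S330: establish a new list

# Keywords from a new text that fit no existing category create a third list.
absorb(lists, {"electronic information": {"tablet", "display", "optical film", "glass screen"}})
print(sorted(lists))  # ['artificial intelligence', 'blockchain', 'electronic information']
```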

Referring back to FIG. 1, the information processing system 100 further includes a user interface 140. The user interface 140 is coupled to the processor 110. The user interface 140 may be a graphical user interface, a keyboard, a screen, a mouse, or the like, for providing the user with the relevant operations. For example, the graphical user interface displays the established lists and their keywords.

Referring to Table One, Table One is a schematic diagram of a plurality of lists and their keywords.

Table One: a plurality of lists (hereinafter referred to as a dictionary file)

In some embodiments, the plurality of lists of the present disclosure can provide corresponding services for different labeling requirements. For example, if the texts to be labeled are a plurality of texts from YAHOO news, the information processing system 100 can label all the YAHOO news texts using a dictionary file such as Table One, as described above. For example, a first news text is labeled as an article related to "blockchain" and "big data", while a second text is labeled as an article related to "semiconductor".

In other embodiments, if the texts to be labeled are a plurality of texts from a different news provider, the user interface 140 may be configured to receive an operation instruction for the processor 110 to perform category modification. For example, artificial intelligence (the first category) may be modified into intelligent appliances (a fourth category), so that the intelligent appliances category contains all the keywords of artificial intelligence. By analogy, blockchain (the second category) may be modified into e-commerce (a fifth category), so that the e-commerce category contains all the keywords of blockchain.

In other embodiments, the user interface 140 allows users (e.g., domain experts) to evaluate whether the lists and keywords of the dictionary file are correct and whether the classified texts are also correctly labeled. If inappropriate portions are found, the domain experts can correct the erroneous portions through the user interface 140 to avoid duplicate labels or inconsistent standards.

In this way, after completing one stage of training to create a dictionary file, the information processing system 100 of the present disclosure can serve text providers with different labeling requirements. Therefore, when the labeling service is provided to different text providers, the existing dictionary file can be applied to them without retraining the dictionary file for each text provider (perhaps only fine-tuning it). In other words, by swapping the dictionary categories and the input texts, the system can be converted rapidly across different fields and data sources, improving working efficiency.

In some embodiments, the texts of a plurality of (e.g., 195) corporate websites are labeled based on the five category labels in the dictionary file of Table One. The texts of a predetermined portion (e.g., 15) of the corporate websites have already been classified into some of the labels, so the text labeling steps described above are performed on another portion (e.g., 80) of the corporate websites. For example, the training procedure (e.g., the aforementioned steps S210 to S225) is performed on the labeled website texts of the 15 companies to obtain a dictionary file (e.g., Table One). Then, by using the labeling steps (e.g., the aforementioned steps S230 to S240), the website texts of the 80 companies are labeled with keywords, so as to obtain a labeling result with a first accuracy.

On the other hand, the optimization steps (e.g., steps S310 to S330) may be performed using the website texts of the 80 companies, training the categories of the dictionary file and their keywords again to obtain an optimized dictionary file. Then, the text labeling steps (e.g., steps S230 to S240) are performed on the remaining portion (e.g., 100) of the corporate websites, yielding a labeling result with a second accuracy higher than the first accuracy. By analogy, the optimization can continue, so that each round of text labeling further optimizes the dictionary file and improves the accuracy of the next round of text labeling.

In summary, the information processing system and the information processing method of the present disclosure provide a highly flexible text labeling method, which uses basic feature information to find new words and combines term frequency-inverse document frequency (TF-IDF) with a keyword extraction algorithm to improve the efficiency of setting keywords. In contrast to typical text labeling, which must be completed manually, the present disclosure can continually train and refine the categories of the dictionary. In addition, the automatic labeling approach can achieve online data labeling and data protection at the same time, avoiding the data leakage problem caused by manual labeling.

The foregoing outlines features of several embodiments so that those skilled in the art may better understand the implementations of the present disclosure. Those skilled in the art should appreciate that the present disclosure may be readily utilized as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.
