Method, device and equipment for determining sensitive information and storage medium

Document No.: 1378986 Publication date: 2020-08-14

Reading note: This technology, "Method, device and equipment for determining sensitive information and storage medium", was designed by 李瑞男, 狄潇然 and 张亚泽 on 2020-04-27. Its main content is as follows. The application discloses a method, an apparatus, a device and a storage medium for determining sensitive information. The method comprises: obtaining a plurality of target word segments in text information; determining the sensitive scene corresponding to each of the target word segments; and then determining whether the text information is sensitive information according to the number of word segments corresponding to each sensitive scene. In this way, even if text information that is sensitive contains no significant sensitive word, it can still be identified from the sensitive scenes corresponding to its word segments.

1. A method of determining sensitive information, the method comprising:

acquiring a plurality of target word segments in the text information;

determining a sensitive scene corresponding to each of the plurality of target word segments;

and determining whether the text information is sensitive information according to the number of word segments corresponding to each sensitive scene.

2. The method according to claim 1, wherein the determining whether the text information is sensitive information according to the number of word segments corresponding to each sensitive scene comprises:

determining a first target sensitive scene to which a first word segment belongs, wherein the first word segment is any one of the plurality of target word segments;

determining a second target sensitive scene having a dependency relationship with the first target sensitive scene, the second target sensitive scene comprising one or more sensitive scenes;

determining the number of word segments among the plurality of target word segments that belong to the second target sensitive scene as the number of dependent word segments corresponding to the first word segment;

calculating the sum of the numbers of dependent word segments corresponding to the respective target word segments to obtain a first sum;

and when the first sum is greater than a sensitivity threshold, determining that the text information is sensitive information.

3. The method of claim 2, further comprising:

calculating the sum of the numbers of second target sensitive scenes corresponding to the respective target word segments to obtain a second sum;

and calculating the sensitivity threshold according to the second sum.

4. The method of claim 3, wherein calculating the sensitivity threshold based on the second sum comprises:

and calculating the sensitivity threshold according to the second sum and a preset adjustment factor.

5. The method of claim 1, wherein the acquiring a plurality of target word segments in the text information comprises:

performing word segmentation on the text information to obtain a word segment set, wherein the word segment set comprises a plurality of word segments;

and matching the word segments in the word segment set against the non-significant sensitive words in a first preset lexicon, and determining the successfully matched word segments as the target word segments, so as to obtain the plurality of target word segments.

6. The method of claim 1, wherein the determining a sensitive scene corresponding to each of the plurality of target word segments comprises:

determining the sensitive scene corresponding to each of the plurality of target word segments according to a pre-established correspondence between word segments and sensitive scenes.

7. The method of claim 1, further comprising:

matching each word segment in the text information against the significant sensitive words in a second preset lexicon;

and when any word segment in the text information matches a significant sensitive word in the second preset lexicon, determining that the text information is sensitive information.

8. An apparatus for determining sensitive information, the apparatus comprising:

an acquisition module, configured to acquire a plurality of target word segments in the text information;

a first determining module, configured to determine a sensitive scene corresponding to each of the plurality of target word segments;

and a second determining module, configured to determine whether the text information is sensitive information according to the number of word segments corresponding to each sensitive scene.

9. A device, comprising: a processor and a memory;

the memory being configured to store instructions or a computer program;

the processor being configured to execute the instructions or the computer program so as to perform the method of determining sensitive information of any of claims 1-7.

10. A computer-readable storage medium, characterized in that it comprises instructions or a computer program which, when run on a computer, cause the computer to carry out the method of determining sensitive information according to any of the preceding claims 1-7.

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining sensitive information.

Background

With the development of the internet, a great amount of text information exists on the network, such as articles in which netizens express their own views and comments on events, news and the like. Some of this text information may contain sensitive information, such as violence or political bias. To clean up the network environment, it is often necessary to filter out sensitive information in the network.

Currently, words that are significantly sensitive are usually used to determine whether text information is sensitive information. When text information contains such a significant sensitive word, it is judged to be sensitive information and is filtered out of the body of text information. However, in practical applications, filtering based on significant sensitive words works poorly: a large amount of sensitive information remains in the remaining text information and cannot be filtered out.

Disclosure of Invention

The embodiment of the application provides a method and a device for determining sensitive information, so as to determine the sensitive information in text information, and further realize the filtering of the sensitive information in the text information.

In a first aspect, an embodiment of the present application provides a method for determining sensitive information, where the method includes:

acquiring a plurality of target word segments in the text information;

determining a sensitive scene corresponding to each of the plurality of target word segments;

and determining whether the text information is sensitive information according to the number of word segments corresponding to each sensitive scene.

In a possible implementation, the determining whether the text information is sensitive information according to the number of word segments corresponding to each sensitive scene includes:

determining a first target sensitive scene to which a first word segment belongs, wherein the first word segment is any one of the plurality of target word segments;

determining a second target sensitive scene having a dependency relationship with the first target sensitive scene, the second target sensitive scene comprising one or more sensitive scenes;

determining the number of word segments among the plurality of target word segments that belong to the second target sensitive scene as the number of dependent word segments corresponding to the first word segment;

calculating the sum of the numbers of dependent word segments corresponding to the respective target word segments to obtain a first sum;

and when the first sum is greater than a sensitivity threshold, determining that the text information is sensitive information.

In one possible embodiment, the method further comprises:

calculating the sum of the numbers of second target sensitive scenes corresponding to the respective target word segments to obtain a second sum;

and calculating the sensitivity threshold according to the second sum.

In a possible implementation, the calculating the sensitivity threshold according to the second sum includes:

and calculating the sensitivity threshold according to the second sum and a preset adjustment factor.

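In one possible implementation, the acquiring a plurality of target word segments in the text information includes:

performing word segmentation on the text information to obtain a word segment set, wherein the word segment set comprises a plurality of word segments;

and matching the word segments in the word segment set against the non-significant sensitive words in a first preset lexicon, and determining the successfully matched word segments as the target word segments, so as to obtain the plurality of target word segments.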

In a possible implementation, the determining a sensitive scene corresponding to each of the plurality of target word segments includes:

determining the sensitive scene corresponding to each of the plurality of target word segments according to a pre-established correspondence between word segments and sensitive scenes.

In one possible embodiment, the method further comprises:

matching each word segment in the text information against the significant sensitive words in a second preset lexicon;

and when any word segment in the text information matches a significant sensitive word in the second preset lexicon, determining that the text information is sensitive information.

In a second aspect, an embodiment of the present application further provides an apparatus for determining sensitive information, where the apparatus includes:

an acquisition module, configured to acquire a plurality of target word segments in the text information;

a first determining module, configured to determine a sensitive scene corresponding to each of the plurality of target word segments;

and a second determining module, configured to determine whether the text information is sensitive information according to the number of word segments corresponding to each sensitive scene.

In one possible implementation, the second determining module includes:

a first determining unit, configured to determine a first target sensitive scene to which a first word segment belongs, wherein the first word segment is any one of the plurality of target word segments;

a second determining unit, configured to determine a second target sensitive scene having a dependency relationship with the first target sensitive scene, where the second target sensitive scene includes one or more sensitive scenes;

a third determining unit, configured to determine the number of word segments among the plurality of target word segments that belong to the second target sensitive scene, and to take that number as the number of dependent word segments corresponding to the first word segment;

a calculation unit, configured to calculate the sum of the numbers of dependent word segments corresponding to the respective target word segments to obtain a first sum;

and a fourth determining unit, configured to determine that the text information is sensitive information when the first sum is greater than a sensitivity threshold.

In a possible embodiment, the apparatus further comprises:

a first calculation module, configured to calculate the sum of the numbers of second target sensitive scenes corresponding to the respective target word segments to obtain a second sum;

and a second calculation module, configured to calculate the sensitivity threshold according to the second sum.

In a possible implementation manner, the second calculating module is specifically configured to calculate the sensitivity threshold according to the second sum and a preset adjustment factor.

In a possible implementation, the obtaining module includes:

a word segmentation unit, configured to perform word segmentation on the text information to obtain a word segment set, where the word segment set includes a plurality of word segments;

and a matching unit, configured to match the word segments in the word segment set against the non-significant sensitive words in a first preset lexicon, and to determine the successfully matched word segments as the target word segments, so as to obtain the plurality of target word segments.

In a possible implementation, the first determining module is specifically configured to determine the sensitive scene corresponding to each of the plurality of target word segments according to a pre-established correspondence between word segments and sensitive scenes.

In a possible embodiment, the apparatus further comprises:

a matching module, configured to match each word segment in the text information against the significant sensitive words in a second preset lexicon;

and a third determining module, configured to determine that the text information is sensitive information when any word segment in the text information matches a significant sensitive word in the second preset lexicon.

In a third aspect, an embodiment of the present application further provides a device, including: a processor and a memory;

the memory is configured to store instructions or a computer program;

the processor is configured to execute the instructions or the computer program so as to perform the method for determining sensitive information according to any implementation of the first aspect.

In a fourth aspect, the present application further provides a computer-readable storage medium, which includes instructions or a computer program that, when run on a computer, cause the computer to execute the method for determining sensitive information according to any implementation of the first aspect.

In the implementations of the embodiments of the present application, a plurality of target word segments in text information are obtained, the sensitive scene corresponding to each of the target word segments is determined, and whether the text information is sensitive information is then determined according to the number of word segments corresponding to each sensitive scene. For example, when many word segments correspond to a certain sensitive scene, the content of the text information is likely to describe that sensitive scene, so the text information may be regarded as sensitive information. In this way, even if text information that is sensitive does not contain any significant sensitive word, it can still be identified according to the sensitive scenes corresponding to its word segments.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art according to the drawings.

FIG. 1 is a schematic diagram of an application scenario in an embodiment of the present application;

FIG. 2 is a schematic flow chart illustrating a method for determining sensitive information according to an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of an apparatus for determining sensitive information according to an embodiment of the present disclosure;

FIG. 4 is a schematic hardware structure diagram of an apparatus in an embodiment of the present application.

Detailed Description

At present, when determining whether text information in a network belongs to sensitive information, words contained in the text information are generally matched with significant sensitive words, if matching is successful, the text information is determined to be sensitive information, and if matching is unsuccessful, that is, the text information does not contain significant sensitive words, the text information is determined not to be sensitive information.

However, the inventors have found that text information may still be sensitive information even if it contains no significant sensitive word. For example, suppose the text information is "take a business to do away with him tonight". It expresses the meaning of using a weapon to take the life of a person (the "he" of the text), so it is a description of a violent nature and should be determined to be sensitive information. However, none of the word segments in the text, namely "tonight", "take", "family" (the colloquial word rendered as "business" in the sentence, which here refers to a weapon), "dry" (the verb rendered as "do away with") and "he", is suitable as a significant sensitive word. For example, in another sentence, "dry this bottle of wine tonight", the word "dry" means "drink up", which in practical applications usually is not sensitive content; the other words are similar. Therefore, a method that determines sensitive information based on significant sensitive words cannot filter out text information that contains no significant sensitive word but is nevertheless sensitive, so a good deal of sensitive information still remains in the remaining text information.

Based on this, the embodiment of the present application provides a method for determining sensitive information, which aims to identify text information that contains no significant sensitive word but is nevertheless sensitive information. Specifically, a plurality of target word segments in the text information are obtained, the sensitive scene corresponding to each of the target word segments is determined, and whether the text information is sensitive information is then determined according to the number of word segments corresponding to each sensitive scene. For example, when many word segments correspond to a certain sensitive scene, the content of the text information is likely to describe that sensitive scene, so the text information may be regarded as sensitive information. In this way, even if text information that is sensitive does not contain any significant sensitive word, it can still be identified according to the sensitive scenes corresponding to its word segments.

As an example, the embodiment of the present application may be applied to the exemplary application scenario shown in FIG. 1. In this scenario, the user 101 may edit and upload text information, such as an opinion on a certain news event, on the terminal 102. The terminal 102 may transmit the text information edited by the user to the server 103. The server 103 may obtain a plurality of target word segments in the text information (for example, by means of Chinese word segmentation), determine the sensitive scene corresponding to each of the target word segments, determine whether the text information edited by the user 101 is sensitive information according to the number of word segments corresponding to each sensitive scene, and feed the determination result back to the terminal 102. If the terminal 102 determines from the result that the text information is sensitive information, it may notify the user 101 that the upload is refused, or prompt the user 101 to edit the text information again.

It is to be understood that the above scenario is only one example of a scenario provided in the embodiment of the present application, and the embodiment of the present application is not limited to this scenario. For example, in other possible application scenarios, the terminal 102 may perform the above-mentioned process of determining the sensitive information without being implemented by the server 103; alternatively, the above-mentioned process of determining the sensitive information is completed by the terminal 102 and the server 103 in cooperation, and the like. In summary, the present application may be applicable in any applicable scenario and is not limited to the scenario examples described above.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, various non-limiting embodiments accompanying the present application examples are described below with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to FIG. 2, FIG. 2 shows a schematic flowchart of a method for determining sensitive information in an embodiment of the present application. The method may be applied to the server 103 shown in FIG. 1, may also be applied to the terminal 102, or may be completed by the terminal 102 and the server 103 in cooperation. The method may specifically include:

S201: Acquire a plurality of target word segments in the text information.

In practical application, text information that a user has published on the network, or text information that needs to be checked before being uploaded to the network, can be used as the text information to be examined in this embodiment. Thus, once the text information is determined to be sensitive information, it can be deleted from the network or its upload to the network can be refused.

In this embodiment, for text information to be examined, a plurality of target word segments in the text information may be obtained first, where the target word segments may be all of the word segments in the text information or only some of them.

In an exemplary embodiment, word segmentation may be performed on the text information. For example, when the text information is Chinese text, the sentences in it may be segmented with the ANSJ Chinese word segmentation tool to obtain a word segment set, which generally includes a plurality of word segments. Each word segment in the set can then be matched against the non-significant sensitive words in the pre-established first preset lexicon, and the successfully matched word segments are determined as the target word segments, thereby obtaining a plurality of target word segments. For example, the target word segments that match non-significant sensitive words can be found with a multi-pattern string matching algorithm.

A non-significant sensitive word in this embodiment is a word from which alone it cannot be determined whether the text information containing it is sensitive information. The word "dry" in the foregoing example is such a word: it may appear in sensitive information or in non-sensitive information, so whether the text information is sensitive cannot be determined from that word alone. Correspondingly, a significant sensitive word in the present application is a word that, on its own, is sufficient to determine that the text information containing it is sensitive information.

In practical application, a first preset lexicon may be established in advance, in which each entry is a non-significant sensitive word as described above. These words may be determined manually by technicians based on practical experience; for example, the non-significant sensitive words may include the word "dry" mentioned above. Of course, in other embodiments the word "dry" may instead be treated directly as a significant sensitive word. Moreover, the entries in the first preset lexicon can be added, deleted or modified as practical application requires.

It should be noted that the target word segments determined by matching against the non-significant sensitive words in the first preset lexicon may be only some of the word segments in the text information. For example, still taking the text "take a business to do away with him tonight", the determined target word segments may be "family" and "dry", while the remaining word segments, such as "take" and "he", need not be used as target word segments in this embodiment. In this way, whether the text information is sensitive information can be determined based only on the determined target word segments ("family" and "dry") rather than on all word segments in the text information, which reduces the amount of calculation required. Of course, in other possible embodiments, all word segments in the text information may be used as the target word segments, that is, whether the text information is sensitive information is determined based on all of its word segments.
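As a rough illustration of S201, the following Python sketch filters an already segmented sentence against a first preset lexicon. The lexicon contents, the function name `get_target_segments` and the English stand-in tokens are assumptions made for illustration only; the embodiment itself mentions a Chinese word segmentation tool such as ANSJ and a multi-pattern string matching algorithm, whereas this sketch uses a plain set lookup.

```python
from typing import Iterable, List, Set

# Hypothetical first preset lexicon of non-significant sensitive words.
# The English tokens stand in for the segmented words of the running example.
FIRST_PRESET_LEXICON: Set[str] = {"family", "dry"}


def get_target_segments(segments: Iterable[str], lexicon: Set[str]) -> List[str]:
    """Return the word segments that match a non-significant sensitive word.

    Order and duplicates are kept, because later steps count occurrences.
    """
    return [seg for seg in segments if seg in lexicon]


# The output of a word segmenter is assumed to be available already.
segments = ["tonight", "take", "family", "dry", "he"]
targets = get_target_segments(segments, FIRST_PRESET_LEXICON)
print(targets)
```

Passing the whole segment list through unchanged would correspond to the alternative embodiment in which all word segments serve as target word segments.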

S202: Determine a sensitive scene corresponding to each of the plurality of target word segments.

In this embodiment, the sensitive scene may be, for example, one or more of a violence scene, a pornography scene, a political scene, a negation scene, an illegal independence scene, a split scene, a weapon scene, a place scene, a vehicle scene, a profession scene, and the like. In practical applications, the sensitive scenes may further include other possible scenes; the enumeration here is only for simple description and does not limit the possible implementations of the sensitive scenes in this embodiment.

Each target word segment may have sensitive scenes corresponding to it. For example, the sensitive scenes corresponding to a target word segment may include a first target sensitive scene, to which the target word segment belongs, and a second target sensitive scene, which has a dependency relationship with the first target sensitive scene. Then, for each target word segment, the corresponding first target sensitive scene and second target sensitive scene may be determined. In practical applications, the second target sensitive scene may include one or more sensitive scenes, and may be the same as or different from the first target sensitive scene.

A dependency relationship between the first target sensitive scene and the second target sensitive scene means that when word segments belonging to the first target sensitive scene and word segments belonging to the second target sensitive scene occur in the same text information, that text information is likely to be sensitive information. For example, if the target word segment is "cannonball", it belongs to the weapon scene, and it is usually difficult to determine directly from "cannonball" alone whether the text information is sensitive. If the text information also mentions a place, a vehicle, a profession and the like, however, it is usually likely to be sensitive information. Therefore, in this embodiment, while the first target sensitive scene (the weapon scene) to which "cannonball" belongs is determined, the second target sensitive scenes that have a dependency relationship with it (such as the place scene, the vehicle scene and the profession scene) may also be determined.

In specific implementation, the sensitive scene corresponding to each of the plurality of target word segments can be determined according to a pre-established correspondence between word segments and sensitive scenes. The correspondence can be preset and adjusted by technicians according to experience in practical application.
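One possible way to hold the pre-established correspondence is a pair of lookup tables, sketched below in Python. Only "cannonball" and the weapon, place, vehicle and profession scenes come from the description; the other words, the table names and the function `scenes_for` are hypothetical, and the dependency entries for the place and vehicle scenes are assumptions.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical correspondence: word segment -> scene it belongs to.
WORD_TO_SCENE: Dict[str, str] = {
    "cannonball": "weapon",
    "station": "place",    # invented example word
    "truck": "vehicle",    # invented example word
}

# Hypothetical dependency relationship: scene -> scenes it depends on.
SCENE_DEPENDENCIES: Dict[str, FrozenSet[str]] = {
    "weapon": frozenset({"place", "vehicle", "profession"}),
    "place": frozenset({"weapon"}),
    "vehicle": frozenset({"weapon"}),
}


def scenes_for(word: str) -> Tuple[str, FrozenSet[str]]:
    """Return (first target sensitive scene, second target sensitive scenes)."""
    first = WORD_TO_SCENE[word]
    second = SCENE_DEPENDENCIES.get(first, frozenset())
    return first, second


first_scene, second_scenes = scenes_for("cannonball")
```

Keeping the two tables separate lets technicians adjust the dependency relationships between scenes without touching the word entries, and vice versa.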

S203: Determine whether the text information is sensitive information according to the number of word segments corresponding to each sensitive scene.

In this embodiment, after the sensitive scenes corresponding to each target word segment are determined, the number of target word segments corresponding to each sensitive scene may be counted. For example, if word segment A in the text information belongs to a first target sensitive scene and corresponds to a second target sensitive scene, and word segment B belongs to the second target sensitive scene and corresponds to the first target sensitive scene, then the number of word segments corresponding to the first target sensitive scene is 2 (namely A and B), and the number of word segments corresponding to the second target sensitive scene is also 2. Whether the text information is sensitive information can then be determined according to the number of word segments corresponding to each sensitive scene.

It can be understood that if many word segments correspond to a certain sensitive scene (which may serve both as a first target sensitive scene and as a second target sensitive scene), much of the content of the text information is describing that sensitive scene; that is, the text information is more likely to be sensitive information of that scene, and it may then be determined that the text information is sensitive information.

In an exemplary embodiment, the number of segmented words corresponding to each sensitive scene may be compared with a preset threshold, when there is at least one sensitive scene whose number of segmented words is greater than the preset threshold, it may be determined that the text information belongs to the sensitive information, and when none of the sensitive scenes has a number of segmented words greater than the preset threshold, it may be determined that the text information does not belong to the sensitive information.
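The per-scene counting and the preset-threshold comparison described above can be sketched as follows. The sketch reuses the A/B example, counting each target word segment toward the scene it belongs to and toward each scene it depends on. The table `SCENES`, the function names and the threshold values are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Hypothetical table: word -> (scene it belongs to, scenes it depends on).
SCENES: Dict[str, Tuple[str, Set[str]]] = {
    "A": ("scene_1", {"scene_2"}),
    "B": ("scene_2", {"scene_1"}),
}


def count_per_scene(targets: Iterable[str]) -> Counter:
    """Count, for every sensitive scene, the target segments corresponding to it."""
    counts: Counter = Counter()
    for word in targets:
        first, second = SCENES[word]
        # A segment corresponds to its own scene and to each dependent scene.
        for scene in {first} | second:
            counts[scene] += 1
    return counts


def is_sensitive(targets: Iterable[str], preset_threshold: int) -> bool:
    """Sensitive if at least one scene has more segments than the threshold."""
    counts = count_per_scene(targets)
    return any(n > preset_threshold for n in counts.values())


counts = count_per_scene(["A", "B"])
```

With these tables both scenes receive a count of 2, so the text would be flagged under a preset threshold of 1 but not under a threshold of 2.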

In addition, the embodiments of the present application provide another exemplary implementation for determining whether the text information is sensitive information according to the number of participles corresponding to the sensitive scenes. Specifically, for any participle among the target participles (hereinafter referred to as a first participle), a first target sensitive scene to which the first participle belongs may be determined, and a second target sensitive scene having a dependency relationship with the first target sensitive scene may be determined, where the second target sensitive scene includes one or more sensitive scenes. Then, the number of participles in the target participle set that belong to the second target sensitive scene is determined and taken as the number of dependent participles corresponding to the first participle; in this way, the number of dependent participles corresponding to each of the target participles can be obtained. Next, the sum of the numbers of dependent participles corresponding to the respective target participles is calculated to obtain a first sum, and the first sum is compared with a sensitivity threshold: when the first sum is greater than the sensitivity threshold, the text information can be determined to be sensitive information, and when the first sum is not greater than the sensitivity threshold, the text information can be determined not to be sensitive information.

For example, for a text to be detected, all non-significant sensitive words word_1, word_2, ..., word_m contained in the text are matched according to a first preset word bank, and the numbers of second target sensitive scenes corresponding to these non-significant sensitive words are denoted a₁, a₂, ..., a_m, respectively. Let word_1_rely_counts be the number of non-significant sensitive words appearing in the sentence that belong to the second target sensitive scenes corresponding to word_1 (that is, the number of dependent participles corresponding to word_1), let word_2_rely_counts be the corresponding number for word_2, and so on up to word_m_rely_counts for word_m. The sum t of the numbers of dependent participles corresponding to the m participles is:

t = word_1_rely_counts + word_2_rely_counts + ... + word_m_rely_counts

At this time, if t is greater than the sensitivity threshold T, it is determined that the text to be detected belongs to the sensitive information, and if t is not greater than the sensitivity threshold T, it is determined that the text to be detected does not belong to the sensitive information.
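The computation of the first sum can be sketched as follows. This is an illustrative sketch under stated assumptions: each participle is assumed to belong to exactly one sensitive scene (`word_to_scene`), the dependency relationships are given as a hypothetical `scene_dependencies` mapping from a scene to the set of scenes it depends on, and all names and sample data are invented for illustration.

```python
def dependent_participle_counts(target_words, word_to_scene, scene_dependencies):
    """For each target participle, count the target participles that belong to
    a second target sensitive scene of that participle's own scene."""
    counts = []
    for word in target_words:
        first_scene = word_to_scene[word]
        second_scenes = scene_dependencies.get(first_scene, set())
        rely_count = sum(
            1 for other in target_words if word_to_scene[other] in second_scenes
        )
        counts.append(rely_count)
    return counts


def is_sensitive_by_first_sum(target_words, word_to_scene, scene_dependencies, threshold):
    """Compare the first sum (the total of all dependent counts) with the threshold."""
    first_sum = sum(
        dependent_participle_counts(target_words, word_to_scene, scene_dependencies)
    )
    return first_sum > threshold


# Hypothetical sample: scene_a and scene_b depend on each other.
target_words = ["word_1", "word_2", "word_3"]
word_to_scene = {"word_1": "scene_a", "word_2": "scene_b", "word_3": "scene_b"}
scene_dependencies = {"scene_a": {"scene_b"}, "scene_b": {"scene_a"}}

rely_counts = dependent_participle_counts(target_words, word_to_scene, scene_dependencies)
first_sum = sum(rely_counts)
```

In this sample, word_1 has two dependent participles (word_2 and word_3), while word_2 and word_3 each have one (word_1), so the first sum is 4. A sensitivity threshold of 3 would therefore mark the text as sensitive.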

In a further possible embodiment, the sensitivity threshold may also be calculated according to the text information, so that different sensitivity thresholds may be calculated based on different (length and/or content) text information, thereby improving the flexibility and universality of implementation of the scheme.

As an example of determining the sensitivity threshold, the sum of the numbers of second target sensitive scenes corresponding to each of the plurality of target participles may be calculated to obtain a second sum, and the sensitivity threshold may be calculated based on the second sum, for example by taking the product of the second sum and a preset coefficient as the sensitivity threshold. This embodiment also provides another specific implementation of determining the sensitivity threshold according to the second sum, in which the sensitivity threshold T may be calculated as T = (a₁ + a₂ + ... + a_m)² × rate, where a₁, a₂, ..., a_m are the numbers of second target sensitive scenes corresponding to the respective target participles (assuming that the plurality of target participles include m participles); for example, a₁ is the number of second target sensitive scenes corresponding to word_1 and a_m is the number of second target sensitive scenes corresponding to word_m. Here, rate is a preset adjustment factor. That is, the sensitivity threshold may be obtained by squaring the second sum and multiplying the result by the preset adjustment factor.
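The two threshold variants described above (the product of the second sum and a preset coefficient, and the square of the second sum multiplied by a preset adjustment factor) can be sketched as follows. The function names and the sample values are assumptions chosen for illustration; the application does not specify concrete values for the coefficient or the adjustment factor.

```python
def threshold_from_product(second_scene_counts, coefficient):
    """Sensitivity threshold as (second sum) * preset coefficient."""
    second_sum = sum(second_scene_counts)
    return second_sum * coefficient


def threshold_from_square(second_scene_counts, rate):
    """Sensitivity threshold as (second sum) squared * preset adjustment factor."""
    second_sum = sum(second_scene_counts)
    return (second_sum ** 2) * rate


# Hypothetical sample: three participles, each with one second target
# sensitive scene, and an illustrative adjustment factor of 0.25.
second_scene_counts = [1, 1, 1]
T_square = threshold_from_square(second_scene_counts, rate=0.25)
T_product = threshold_from_product(second_scene_counts, coefficient=0.5)
```

Here the second sum is 3, so the squared variant gives 9 × 0.25 = 2.25 and the product variant gives 3 × 0.5 = 1.5. Because the threshold is derived from the participles found in the text, it changes with the length and content of the text.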

Further, the above process may be used to determine whether the text information is sensitive information when the text information does not include a significant sensitive word. If the text information includes a significant sensitive word, the text information can be directly determined to be sensitive information, and it is then unnecessary to make the determination according to the sensitive scene corresponding to each participle. Specifically, after the text information to be checked is determined, each participle in the text information can be matched against the significant sensitive words in a second preset lexicon. When any participle in the text information matches a significant sensitive word in the second preset lexicon, the text information contains a significant sensitive word and can be directly determined to be sensitive information; when none of the participles matches a significant sensitive word in the second preset lexicon, step S202 and step S203 may then be performed to determine whether the text information is sensitive information. Similar to the establishment of the first preset lexicon, the significant sensitive words in the second preset lexicon can be determined manually by technicians based on practical experience and added to the second preset lexicon, and they can be added, deleted or modified according to the requirements of the actual application.
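The pre-check against significant sensitive words can be sketched as a short-circuit in front of the scene-based determination. This is a minimal sketch: the function name is hypothetical, the second preset lexicon is modeled as a plain Python set, and `scene_based_check` is an assumed callable standing in for steps S202 and S203.

```python
def determine_sensitive(participles, significant_words, scene_based_check):
    """Return True immediately if any participle is a significant sensitive word;
    otherwise fall back to the scene-based determination (steps S202 and S203).

    significant_words models the second preset lexicon as a set (assumption).
    scene_based_check is a hypothetical callable taking the participles.
    """
    if any(word in significant_words for word in participles):
        return True
    return scene_based_check(participles)


# Hypothetical lexicon and fallback used only for illustration.
second_lexicon = {"forbidden_word"}


def never_sensitive(participles):
    return False


direct_hit = determine_sensitive(["hello", "forbidden_word"], second_lexicon, never_sensitive)
no_hit = determine_sensitive(["hello", "world"], second_lexicon, never_sensitive)
```

Modeling the lexicon as a set keeps each membership test cheap, and the scene-based fallback is only invoked when no significant sensitive word is found.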

In this embodiment, a plurality of target segmented words in the text information may be obtained, a sensitive scene corresponding to each segmented word in the plurality of target segmented words may be determined, and then, according to the number of segmented words corresponding to each sensitive scene, whether the text information is sensitive information may be determined. It can be seen that, in this embodiment, whether the text information is the sensitive information is determined according to the number of the participles corresponding to each sensitive scene in the text information, for example, when the number of the participles in a certain sensitive scene is large, it may be considered that the content described by the text information has a high possibility as the content describing the sensitive scene, so that the text information may be considered as the sensitive information, and thus, even if the text information belonging to the sensitive information does not include a significant sensitive word, the sensitive information may be identified according to the sensitive scene corresponding to each participle in the text information.

In addition, the embodiment of the application also provides a device for determining the sensitive information. Referring to fig. 3, fig. 3 is a schematic structural diagram illustrating an apparatus for determining sensitive information according to an embodiment of the present application, where the apparatus 300 includes:

an obtaining module 301, configured to obtain multiple target segments in text information;

a first determining module 302, configured to determine a sensitive scene corresponding to each of the multiple target participles;

the second determining module 303 is configured to determine whether the text information is sensitive information according to the number of the segmented words corresponding to each sensitive scene.

In a possible implementation, the second determining module 303 includes:

the calculation unit is used for calculating the sum of the number of the dependent participles corresponding to each participle in the target participles to obtain a first sum value;

and the fourth determining unit is used for determining that the text information is sensitive information when the first sum is greater than a sensitivity threshold.

In a possible implementation, the apparatus 300 further comprises:

and the second calculation module is used for calculating the sensitivity threshold according to the second sum.

In a possible implementation manner, the second calculating module is specifically configured to calculate the sensitivity threshold according to the second sum and a preset adjustment factor.

In a possible implementation, the obtaining module 301 includes:

the word segmentation unit is used for segmenting words of the text information to obtain a word segmentation set, and the word segmentation set comprises a plurality of words;

In a possible implementation manner, the first determining module 302 is specifically configured to determine, according to a correspondence between pre-established participles and sensitive scenes, the sensitive scenes corresponding to the respective participles in the multiple target participles.

In a possible implementation, the apparatus 300 further comprises:

the matching module is used for matching each participle in the text information with the obvious sensitive word in a second preset word bank;

It should be noted that, for the contents of information interaction, execution process, and the like between the modules and units of the apparatus, since the same concept is based on the method embodiment in the embodiment of the present application, the technical effect brought by the contents is the same as that of the method embodiment in the embodiment of the present application, and specific contents may refer to the description in the foregoing method embodiment in the embodiment of the present application, and are not described herein again.

In this embodiment, whether the text information is sensitive information is determined according to the number of the participles corresponding to each sensitive scene in the text information, for example, when the number of the participles in a certain sensitive scene is large, it may be considered that the content described by the text information has a high possibility to be the content describing the sensitive scene, so that the text information may be considered as the sensitive information, and thus, even if the text information belonging to the sensitive information does not include significant sensitive words, the sensitive information may be identified according to the sensitive scene corresponding to each participle in the text information.

In addition, the embodiment of the application also provides a device. Referring to fig. 4, fig. 4 shows a hardware structure diagram of a device in an embodiment of the present application, and the device 400 may include a processor 401 and a memory 402.

Wherein the memory 402 is configured to store instructions or computer programs;

the processor 401 is configured to execute the method for determining sensitive information in the above method embodiments according to the instructions or the computer program.

Specifically, the processor 401 may execute the following steps according to instructions or a computer program:

acquiring a plurality of target word segments in the text information;

determining a sensitive scene corresponding to each participle in the plurality of target participles;

and determining whether the text information is sensitive information or not according to the number of the participles corresponding to each sensitive scene.

In one possible implementation, the processor 401 may specifically perform the following steps according to instructions or a computer program:

determining a first target sensitive scene to which a first word segmentation belongs, wherein the first word segmentation is any one word segmentation in the plurality of target word segmentations;

determining a second target-sensitive scene having a dependency relationship with the first target-sensitive scene, the second target-sensitive scene comprising one or more sensitive scenes;

determining the number of the participles belonging to the second target sensitive scene in the target participle set as the number of the dependent participles corresponding to the first participle;

calculating the sum of the number of the dependent participles corresponding to each participle in the target participles to obtain a first sum value;

and when the first sum is larger than a sensitivity threshold, determining that the text information is sensitive information.

In one possible implementation, the processor 401 may further perform the following steps according to the instructions or the computer program:

calculating the sum of the number of second target sensitive scenes corresponding to each participle in the plurality of target participles to obtain a second sum value;

and calculating the sensitivity threshold according to the second sum.

In one possible implementation, the processor 401 may specifically perform the following steps according to instructions or a computer program:

and calculating the sensitivity threshold according to the second sum and a preset adjusting factor.

In one possible implementation, the processor 401 may specifically perform the following steps according to the computer program:

performing word segmentation on the text information to obtain a word segmentation set, wherein the word segmentation set comprises a plurality of words;

In one possible implementation, the processor 401 may specifically perform the following steps according to instructions or a computer program:

and determining the sensitive scene corresponding to each participle in the target participles according to the corresponding relation between the pre-established participles and the sensitive scene.

In one possible implementation, the processor 401 may further perform the following steps according to the computer program:

matching each participle in the text information with a significant sensitive word in a second preset word bank;

and when the participles in the text information have participles matched with the obvious sensitive words in the second preset lexicon, determining the text information as sensitive information.

It should be noted that, for the specific implementation content of the processor 401 in the foregoing device, since the implementation content is based on the same concept as that of the method embodiment in the embodiment of the present application, the technical effect brought by the implementation content is the same as that of the method embodiment in the embodiment of the present application, and the specific content may refer to the description in the foregoing method embodiment in the embodiment of the present application, and is not described herein again.

In addition, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, where the computer program is used to execute the method for determining sensitive information described in the above method embodiment.

In the names of "first determination module", "first participle", "first calculation module", etc., mentioned in the embodiments of the present application, "first" is used merely as a name identification, and does not represent first in order. The same applies to "second", "third", "fourth", etc.

As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only an exemplary embodiment of the present application, and is not intended to limit the scope of the present application.
