Log association method and device and electronic equipment

Document No.: 490600 Published: 2022-01-04

Reading note: This technology, Log association method and device and electronic equipment (一种日志关联的方法、装置及电子设备), was designed and created by Zhang Runzi (张润滋), Wu Fudi (吴复迪), Wang Xingkai (王星凯), Liu Wenmao (刘文懋), and Gu Dujuan (顾杜娟) on 2021-09-24. Main content: The application discloses a log association method, a log association apparatus, and an electronic device. The method comprises: obtaining, based on a machine learning model, a plurality of sequence pairs of a preset time window length, and detecting the sequence pairs based on the machine learning model; when the first vector sequence in a sequence pair is detected to be abnormal, obtaining a first prediction result corresponding to the first vector sequence, and obtaining a target log associated with the first vector sequence according to the first prediction result; and when the second vector sequence in a sequence pair is detected to be abnormal, obtaining a second prediction result corresponding to the second vector sequence, and obtaining a target log associated with the second vector sequence according to the second prediction result. Based on this method, the behavior logs on the terminal side can be associated and located according to a network alarm on the network side, and a possible network alarm on the network side can be associated according to the behavior logs on the terminal side. This solves the problem that it is difficult to completely and accurately locate and trace the corresponding associated logs.

1. A method of log association, the method comprising:

acquiring a plurality of sequence pairs with a preset time window length, wherein each sequence pair represents a one-to-one corresponding relation between a first vector sequence and a second vector sequence, the first vector sequence represents a plurality of sequences of network alarms subjected to vectorization processing and having the same IP address, and the second vector sequence represents a plurality of sequences of terminal entities subjected to vectorization processing and having the same host sequence number;

when the first vector sequence in the sequence pair is detected to be abnormal, a first prediction result corresponding to the first vector sequence is obtained, and a target log associated with the first vector sequence is obtained according to the first prediction result;

when the second vector sequence in the sequence pair is detected to be abnormal, a second prediction result corresponding to the second vector sequence is obtained, and a target log associated with the second vector sequence is obtained according to the second prediction result.

2. The method of claim 1, wherein prior to said obtaining a plurality of sequence pairs of a preset time window length, further comprising:

acquiring a plurality of first vector sequences and a plurality of second vector sequences;

according to a first host sequence number in the second vector sequence, obtaining a first IP address corresponding to the first host sequence number, and extracting all first vector sequences with the first IP address;

aggregating a second vector sequence corresponding to the first host sequence number and a first vector sequence corresponding to the first IP address to obtain a sequence pair data set, wherein the sequence pair data set represents a set consisting of sequence pairs, and the sequence pairs represent one-to-one correspondence between the first vector sequence and the second vector sequence;

and segmenting the sequence pair data set according to the length of the preset time window to obtain a plurality of sequence pairs with a plurality of preset time window lengths.

3. The method of claim 2, prior to the obtaining the first plurality of vector sequences and the second plurality of vector sequences, further comprising:

acquiring first data and second data, wherein the first data represent a plurality of network alarms acquired by network side equipment, and the second data represent a plurality of behavior logs acquired by a terminal side;

dividing the network alarms with the same IP address in the first data into a sequence to obtain a plurality of first sequences, wherein the first sequences represent the plurality of network alarms with the same IP address;

dividing the behavior logs with the same host identity in the second data into a sequence to obtain a plurality of second sequences, wherein the second sequences represent the plurality of behavior logs with the same host identity;

vectorizing the plurality of first sequences to obtain a plurality of first vector sequences;

and vectorizing the plurality of second sequences to obtain a plurality of second vector sequences.

4. The method as claimed in claim 1, wherein when it is detected that there is an abnormality in the first vector sequence in the sequence pair, obtaining a first prediction result corresponding to the first vector sequence, and obtaining a target log associated with the first vector sequence according to the first prediction result, comprises:

when the first vector sequence in the sequence pair is detected to be abnormal, a first prediction result corresponding to the first vector sequence is obtained, wherein the first prediction result represents a second vector sequence predicted according to the first vector sequence;

extracting N second vector sequences corresponding to the first vector sequence in the sequence pair according to the first vector sequence, wherein N is a positive integer greater than or equal to 1;

respectively carrying out similarity comparison on the N second vector sequences and the first prediction result to obtain N similar values;

sorting the N similar values by magnitude, and extracting the first M similar values from the N similar values, wherein M is a positive integer greater than or equal to 1;

and obtaining M corresponding second vector sequences as target logs associated with the first vector sequences according to the first M similar values.

5. The method of claim 1, wherein when it is detected that the second vector sequence in the sequence pair is abnormal, acquiring a second prediction result corresponding to the second vector sequence, and obtaining a target log associated with the second vector sequence according to the second prediction result, comprises:

when the second vector sequence in the sequence pair is detected to be abnormal, a second prediction result corresponding to the second vector sequence is obtained, wherein the second prediction result represents a first vector sequence predicted according to the second vector sequence;

and using the second prediction result as a target log associated with the second vector sequence.

6. The method of claim 5, wherein after the obtaining a second prediction corresponding to the second vector sequence when the second vector sequence in the sequence pair is detected to be abnormal, further comprising:

extracting n second vector sequences corresponding to the second vector sequences in the sequence pairs according to the second vector sequences, wherein n is a positive integer greater than or equal to 1;

respectively carrying out similarity comparison on the n second vector sequences and the second prediction result to obtain n similar values;

sorting the n similar values by magnitude, and extracting the first m similar values from the n similar values, wherein m is a positive integer greater than or equal to 1;

and obtaining m corresponding first vector sequences as target logs associated with the second vector sequence according to the first m similar values.

7. An apparatus for log association, the apparatus comprising:

an acquisition module, configured to acquire a plurality of sequence pairs with a preset time window length, wherein each sequence pair represents a one-to-one corresponding relation between a first vector sequence and a second vector sequence, the first vector sequence represents a plurality of sequences of network alarms subjected to vectorization processing and having the same IP address, and the second vector sequence represents a plurality of sequences of terminal entities subjected to vectorization processing and having the same host sequence number;

the first detection module is used for acquiring a first prediction result corresponding to the first vector sequence when the first vector sequence in the sequence pair is detected to be abnormal, and obtaining a target log associated with the first vector sequence according to the first prediction result;

and the second detection module is used for acquiring a second prediction result corresponding to the second vector sequence when the second vector sequence in the sequence pair is detected to be abnormal, and obtaining a target log associated with the second vector sequence according to the second prediction result.

8. The apparatus according to claim 7, wherein the first detecting module is specifically configured to, when it is detected that the first vector sequence in the sequence pair is abnormal, obtain a first prediction result corresponding to the first vector sequence, where the first prediction result characterizes a second vector sequence predicted from the first vector sequence;

respectively carrying out similarity comparison on the N second vector sequences and the first prediction result to obtain N similar values;

and obtaining M corresponding second vector sequences as target logs associated with the first vector sequences according to the first M similar values.

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the method steps of any one of claims 1-6 when executing the computer program stored on the memory.

10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1-6.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for log association, and an electronic device.

Background

With the rapid development of informatization, a centralized data aggregation platform (such as a security information and event management platform) needs to ingest large-scale multi-source heterogeneous data. Because such data is collected by different devices, each with its own parsing mechanism, correlation analysis of the data collected by the different devices has become a current challenge for accurately detecting attack behaviors, tracing attack paths, and effectively reconstructing the attack intent of an attacker.

Specifically, the multi-source heterogeneous data may be a network log and a network alarm collected by a network-side device (e.g., IPS, WAF, etc.), or a behavior log and a behavior alarm collected by a terminal-side device (e.g., EDR, etc.). Generally, the network alarm represents a network log with a higher risk level, and the behavior alarm represents a behavior log with a higher risk level. In the actual acquisition process, the network side device generally acquires network alarms, and the terminal side device generally acquires behavior logs.

In the current multi-source heterogeneous data, in order to perform correlation analysis on a network alarm acquired by network side equipment and a behavior log acquired by terminal side equipment, a cross-network and terminal weak correlation method is provided in the existing scheme.

The weak association method works as follows: the corresponding log type is located according to a preset association relationship between network alarm types and behavior log types and a preset time interval. For example, when a network alarm falling within the preset time interval is collected, the corresponding behavior log type can be obtained according to the type of the collected network alarm; when a behavior log falling within the preset time interval is collected, the corresponding network alarm type can be obtained according to the type of the collected behavior log.

However, in the above solution, the manually preset association rules are very coarse, so only a coarse-grained association (e.g., the corresponding type) can be obtained. The association is also limited by the preset time interval (e.g., when the number of behavior logs associated with a certain network alarm is unusually large, only the behavior logs within the preset time interval can be associated).

In view of this, in the prior art, when association analysis is performed on raw data collected by different devices, it is difficult to completely and accurately locate and trace the corresponding associated logs.

Disclosure of Invention

The application provides a log association method, a log association apparatus, and an electronic device, which are used to associate and locate the behavior logs on the terminal side according to a network alarm on the network side, and to associate a possible network alarm on the network side according to the behavior logs on the terminal side.

In a first aspect, the present application provides a method for log association, where the method includes:

By the method, the behavior logs traced to the terminal side can be accurately associated according to the specific network alarm of the network side, and a large number of irrelevant and normal behavior logs are shielded; meanwhile, the possible network alarm of the network side is associated according to the behavior log of the terminal side, and the expert user is effectively assisted to judge the possible network alarm.

In one possible design, before the obtaining the plurality of sequence pairs of the preset time window length, the method further includes:

acquiring a plurality of first vector sequences and a plurality of second vector sequences;

and segmenting the sequence pair data set according to the length of the preset time window to obtain a plurality of sequence pairs with a plurality of preset time window lengths.

By the method, the actual association between the network alarm of the specific network side and the behavior log of the specific terminal side can be obtained.

In one possible design, before the obtaining the first vector sequences and the second vector sequences, the method further includes:

vectorizing the plurality of first sequences to obtain a plurality of first vector sequences;

and vectorizing the plurality of second sequences to obtain a plurality of second vector sequences.

By the method, the vector sequence of the network alarm of the specific network side and the vector sequence of the behavior log of the specific terminal side can be obtained.

In one possible design, when it is detected that there is an abnormality in the first vector sequence in the sequence pair, obtaining a first prediction result corresponding to the first vector sequence, and obtaining a target log associated with the first vector sequence according to the first prediction result, includes:

respectively carrying out similarity comparison on the N second vector sequences and the first prediction result to obtain N similar values;

and obtaining M corresponding second vector sequences as target logs associated with the first vector sequences according to the first M similar values.

By the method, based on the first prediction result obtained by machine learning, the behavior logs traced to the terminal side can be accurately correlated according to the specific network alarm of the network side, a large number of irrelevant and normal behavior logs are shielded, and expert users are effectively assisted to study and judge the correlated behavior logs.
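By way of non-limiting illustration, the similarity ranking in this design can be sketched in Python as follows. The application does not specify the similarity measure; this sketch assumes cosine similarity between mean-pooled vector sequences, and the names `mean_pool`, `cosine`, and `top_m_indices` are hypothetical.

```python
import math


def mean_pool(vector_sequence):
    """Average a non-empty sequence of equal-length vectors into one vector."""
    n = len(vector_sequence)
    return [sum(column) / n for column in zip(*vector_sequence)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_m_indices(prediction, candidates, m):
    """Return indices of the M candidates most similar to the prediction.

    prediction: the predicted second vector sequence (first prediction result).
    candidates: the N actual second vector sequences from the sequence pair.
    """
    target = mean_pool(prediction)
    scored = [(cosine(target, mean_pool(c)), i) for i, c in enumerate(candidates)]
    # Sort by similarity, largest first, and keep the first M entries.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:m]]
```

The returned indices identify which of the N second vector sequences would be reported as target logs under these assumptions; any other similarity measure could be substituted for `cosine`.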

In a possible design, when it is detected that the second vector sequence in the sequence pair is abnormal, obtaining a second prediction result corresponding to the second vector sequence, and obtaining a target log associated with the second vector sequence according to the second prediction result, includes:

and using the second prediction result as a target log associated with the second vector sequence.

By the method, based on the second prediction result obtained by machine learning, a possible network alarm of the network side can be associated according to the behavior log of the terminal side, and the expert user is effectively assisted to study and judge the possible network alarm.

In one possible design, after the obtaining a second prediction result corresponding to the second vector sequence when the second vector sequence in the sequence pair is detected to have an abnormality, the method further includes:

respectively carrying out similarity comparison on the n second vector sequences and the second prediction result to obtain n similar values;

and obtaining m corresponding first vector sequences as target logs associated with the second vector sequence according to the first m similar values.

By the method, based on the second prediction result obtained by machine learning, the network alarm traced to the actually existing network side can be associated according to the behavior log of the terminal side, and the expert user is effectively assisted to study and judge the actually existing network alarm.

In a second aspect, the present application provides an apparatus for log association, the apparatus comprising:

In one possible design, the apparatus is further configured to, before the acquisition module acquires the plurality of sequence pairs: acquire a plurality of first vector sequences and a plurality of second vector sequences; obtain, according to a first host sequence number in the second vector sequence, a first IP address corresponding to the first host sequence number, and extract all first vector sequences having the first IP address; aggregate the second vector sequence corresponding to the first host sequence number and the first vector sequences corresponding to the first IP address to obtain a sequence pair data set, wherein the sequence pair data set represents a set consisting of sequence pairs, and the sequence pairs represent a one-to-one correspondence between the first vector sequence and the second vector sequence; and segment the sequence pair data set according to the preset time window length to obtain a plurality of sequence pairs of the preset time window length.

In one possible design, the apparatus is further configured to, before acquiring the plurality of first vector sequences and the plurality of second vector sequences: acquire first data and second data, where the first data represents a plurality of network alarms collected by a network side device, and the second data represents a plurality of behavior logs collected on the terminal side; divide the network alarms with the same IP address in the first data into one sequence to obtain a plurality of first sequences, wherein each first sequence represents a plurality of network alarms with the same IP address; divide the behavior logs with the same host identity in the second data into one sequence to obtain a plurality of second sequences, wherein each second sequence represents a plurality of behavior logs with the same host identity; vectorize the plurality of first sequences to obtain a plurality of first vector sequences; and vectorize the plurality of second sequences to obtain a plurality of second vector sequences.

In a possible design, the first detection module is specifically configured to: when it is detected that the first vector sequence in the sequence pair is abnormal, obtain a first prediction result corresponding to the first vector sequence, where the first prediction result represents a second vector sequence predicted according to the first vector sequence; extract, according to the first vector sequence, N second vector sequences corresponding to the first vector sequence in the sequence pair, wherein N is a positive integer greater than or equal to 1; compare each of the N second vector sequences with the first prediction result for similarity to obtain N similar values; sort the N similar values by magnitude and extract the first M similar values from the N similar values, wherein M is a positive integer greater than or equal to 1; and obtain, according to the first M similar values, the M corresponding second vector sequences as target logs associated with the first vector sequence.

In a possible design, the second detection module is specifically configured to, when it is detected that the second vector sequence in the sequence pair is abnormal, obtain a second prediction result corresponding to the second vector sequence, where the second prediction result represents a first vector sequence predicted according to the second vector sequence; and using the second prediction result as a target log associated with the second vector sequence.

In one possible design, the second detection module is further configured to: extract, according to the second vector sequence, n second vector sequences corresponding to the second vector sequence in the sequence pair, where n is a positive integer greater than or equal to 1; compare each of the n second vector sequences with the second prediction result for similarity to obtain n similar values; sort the n similar values by magnitude and extract the first m similar values from the n similar values, wherein m is a positive integer greater than or equal to 1; and obtain, according to the first m similar values, the m corresponding first vector sequences as target logs associated with the second vector sequence.

In a third aspect, the present application provides an electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the above-mentioned method steps of log association when executing the computer program stored in the memory.

In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the above-mentioned method steps of log association.

For the technical effects of the second to fourth aspects and of their possible designs, refer to the above description of the technical effects of the first aspect and of its possible designs; they are not repeated here.

Drawings

FIG. 1 is a flow chart of a method for log association provided herein;

FIG. 2 is a schematic diagram of a second vector sequence generation provided herein;

FIG. 3 is a schematic diagram of generating a plurality of sequence pairs according to the present application;

FIG. 4 is a schematic diagram of a log association apparatus provided in the present application;

FIG. 5 is a schematic diagram of a structure of an electronic device provided in the present application.

Detailed Description

The embodiment of the application provides a log association method, a log association apparatus, and an electronic device, and solves the problem that, when association analysis is performed on raw data collected by different devices, it is difficult to completely and accurately locate and trace the corresponding associated logs.

The method provided by the embodiment of the application is further described in detail with reference to the attached drawings.

Referring to fig. 1, an embodiment of the present application provides a method for log association, where a specific flow is as follows:

Step 101: Vectorization processing is carried out on the acquired first data and second data to obtain a plurality of first vector sequences corresponding to the first data and a plurality of second vector sequences corresponding to the second data;

First data and second data are obtained, wherein the first data are data of network alarms collected by network side equipment, and the second data are data of behavior logs collected on the terminal side.

The network alarm in the first data at least comprises a source IP address, a destination IP address, an alarm type, a time stamp and the like of the network alarm.

The types of the behavior logs in the second data at least comprise a process behavior type, a file operation type, a registry operation type and the like, and the fields contained in each behavior log at least comprise a source entity, a destination entity, a log type and a timestamp.

For example, in a behavior log recording a process creation, the source entity corresponds to the parent process name, the destination entity corresponds to the child process name, the log type is process creation, and the timestamp is the time at which the log was created.

Notably, each behavior log also contains at least one of a source entity or a destination entity.

After the first data and the second data are obtained, grouping division is respectively carried out on the network alarms in the first data and the behavior logs in the second data, finally, the network alarms with the same IP address in the first data are divided into a plurality of first sequences, and the behavior logs with the same host identification in the second data are divided into a plurality of second sequences.

For the first data, the network alarms in the first data are first grouped according to their IP address (namely, the source IP address or the destination IP address). Then, according to the extracted timestamps of the network alarms, the alarm types corresponding to the network alarms having a given IP address are arranged in chronological order to form a new sequence, namely a first sequence.

It should be emphasized that the first data may be divided into a plurality of first sequences, which together form the alarm sequence corpus of the first data. The network alarms in each first sequence have the same IP address, each first sequence may be composed of the alarm types corresponding to a plurality of network alarms, and different first sequences correspond to different IP addresses.
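As a non-limiting sketch of the grouping described above, the following Python function builds one first sequence per IP address. The field names (`src_ip`, `dst_ip`, `alarm_type`, `timestamp`) and the `ip_field` parameter are hypothetical; the application only states that grouping uses the source or destination IP address.

```python
from collections import defaultdict


def build_first_sequences(alarms, ip_field="src_ip"):
    """Group alarms by IP address and order each group by timestamp.

    alarms: iterable of dicts with hypothetical keys
            'src_ip', 'dst_ip', 'alarm_type', 'timestamp'.
    ip_field: which address to group by (assumption: caller chooses).
    Returns {ip_address: [alarm_type, ...]} in chronological order.
    """
    grouped = defaultdict(list)
    for alarm in alarms:
        grouped[alarm[ip_field]].append((alarm["timestamp"], alarm["alarm_type"]))
    return {
        ip: [alarm_type for _, alarm_type in sorted(items, key=lambda x: x[0])]
        for ip, items in grouped.items()
    }
```

Each value in the returned dictionary is one first sequence; the set of all values plays the role of the alarm sequence corpus.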

For the second data, the behavior logs in the second data are first converted, through a traceability graph template, into a traceability graph consisting of entity points and entity association edges. That is, in the second data, a complete traceability graph is generated from all the behavior logs on each host (i.e., those having the same host identity), and a plurality of second sequences are generated by performing random walks on the resulting traceability graph.

It should be noted that the traceability graph template takes the source entity of a behavior log as a starting point and the destination entity as an end point, and forms an edge from the starting point to the end point. The random walk may adopt any probability model or other method; the embodiment of the present application does not limit the probability model used in the random walk.

In addition, in the process of generating a complete traceability graph by using the template, information such as the process name, file name, and startup item name of each behavior log needs to be retained, and the complete path information of the behavior log also needs to be retained.

Moreover, the entity of a behavior log can be uniquely determined according to the path information and the name information in the behavior log. For example, a single entity of a behavior log may be represented by "C:\windows\system32\svchost.exe".

Specifically, referring to FIG. 2, the behavior logs of the host corresponding to the first host identifier are extracted from the second data. As shown in the table in FIG. 2, each behavior log is presented by its source entity, destination entity, log type, and timestamp, and a complete traceability graph is formed by extracting the source entity and the destination entity of each behavior log.

Then, random walks are performed on the complete traceability graph, and Na random-walk sequences are generated, each limited to the length L. Two walks are illustrated in the traceability graph shown in FIG. 2: one visits three entity points, and the other visits four entity points.

Corresponding to these two walks, FIG. 2 also lists the sequences formed by the 2 random walks. Taking the sequence formed by the first random walk as an example, the sequence is:

"C:\windows\system32\svchost.exe, C:\windows\user\word.exe, C:\user\tmp\downloader.php, 192.168.1.1".

wherein "C:\windows\system32\svchost.exe", "C:\windows\user\word.exe", "C:\user\tmp\downloader.php", and "192.168.1.1" respectively correspond to four different associated entity nodes in the traceability graph.

Further, as explained above, the embodiment of the present application does not limit the probability model used in the random walk, so the Na generated random-walk sequences may contain duplicates. The sequences formed by the Na random walks therefore need to be de-duplicated to obtain a plurality of non-repeated second sequences, namely the behavior entity sequence corpus of the host.
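The graph construction, random walk, and de-duplication described above can be sketched in Python as follows. This is a minimal illustration only: the uniform choice of start node and next node is an assumption (the application allows any probability model), and the names `build_graph`, `random_walk_sequences`, and the log keys `source` and `destination` are hypothetical.

```python
import random
from collections import defaultdict


def build_graph(logs):
    """Build a directed graph: one edge from source entity to destination entity per log."""
    graph = defaultdict(list)
    for log in logs:
        graph[log["source"]].append(log["destination"])
    return graph


def random_walk_sequences(graph, num_walks, max_len, seed=0):
    """Generate num_walks random walks of at most max_len nodes, de-duplicated.

    Assumption: the start node and each next node are chosen uniformly at random.
    """
    rng = random.Random(seed)
    starts = sorted(graph)  # nodes that have at least one outgoing edge
    if not starts:
        return []
    walks = set()
    for _ in range(num_walks):
        node = rng.choice(starts)
        walk = [node]
        # Extend the walk until the length limit or a node with no outgoing edge.
        while len(walk) < max_len and graph.get(node):
            node = rng.choice(graph[node])
            walk.append(node)
        walks.add(tuple(walk))  # the set removes repeated sequences
    return sorted(walks)
```

Here `num_walks` corresponds to Na and `max_len` to the length limit L; the returned list stands in for the host's behavior entity sequence corpus.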

Step 102: performing polymerization segmentation on the first vector sequence and the second vector sequence to obtain a plurality of sequence pairs with a plurality of preset time window lengths;

after obtaining a plurality of first sequences of the first data and a plurality of second sequences of the second data by the preprocessing method, respectively performing vectorization processing on the plurality of first sequences and the plurality of second sequences: vectorizing the plurality of first sequences to obtain a plurality of first vector sequences; and vectorizing the plurality of second sequences to obtain a plurality of second vector sequences.

For the plurality of first sequences, a method analogous to natural language processing is adopted. Taking a single first sequence as an example, the alarm type corresponding to each network alarm in the first sequence may be regarded as a word; since a single first sequence may include a plurality of network alarms, the single first sequence corresponds to an article composed of a word sequence.

At this time, through the pre-trained first model, the alarm type corresponding to each network alarm in the first sequence can be regarded as a word, each word is converted into a vector, that is, each alarm type is converted into a dense vector, and a first vector sequence of the first sequence subjected to vectorization processing is obtained.

Specifically, the pre-trained first model may be a vectorized representation model of the alarm types of network alarms, learned by a word embedding method such as word2vec. By using the first model, the vectorized representation of a single alarm type, namely a single dense vector, can be obtained from the alarm type of a single input network alarm.

In addition, if the distance between the vectors corresponding to the two alarm types is smaller, the semantics between the two alarm types are closer, namely, the two alarm types are more similar.
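The lookup and the distance comparison can be illustrated with the following Python sketch. The embedding table is hypothetical and stands in for the output of the pre-trained first model; its values are invented for illustration, and Euclidean distance is assumed as the distance measure.

```python
import math

# Hypothetical embedding table standing in for the pre-trained first model.
# The alarm type names and vector values are invented for illustration only.
EMBEDDING = {
    "port_scan": [0.9, 0.1],
    "host_scan": [0.8, 0.2],
    "sql_injection": [0.1, 0.9],
}


def vectorize_sequence(sequence, embedding=EMBEDDING):
    """Convert a first sequence (list of alarm types) into a first vector sequence."""
    return [embedding[alarm_type] for alarm_type in sequence]


def distance(type_a, type_b, embedding=EMBEDDING):
    """Euclidean distance between the vectors of two alarm types (assumed measure)."""
    return math.dist(embedding[type_a], embedding[type_b])
```

With this table, `distance("port_scan", "host_scan")` is smaller than `distance("port_scan", "sql_injection")`, which matches the statement that a smaller distance indicates more similar alarm types.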

For the plurality of second sequences, a method analogous to natural language processing is likewise adopted. Taking a single second sequence as an example, the file path and the file name corresponding to each entity in the second sequence may be regarded as a word. Because a single second sequence may include a plurality of entities, it may likewise be regarded as an article composed of a sequence of words.

The pre-trained second model then converts the entity corresponding to each behavior log in the second sequence into a dense vector, which yields the second vector sequence, i.e. the vectorized form of the second sequence.

Specifically, the pre-trained second model may be a vectorized representation model of behavior log entities, learned by a word embedding method such as word2vec. Given the entity of a single behavior log as input, the second model outputs the vectorized representation of that entity, namely a single dense vector.

In addition, the smaller the distance between the vectors corresponding to the entities of two behavior logs, the closer the semantics of the two entities, that is, the more similar the entities of the two behavior logs are.

Then, after the plurality of first vector sequences and the plurality of second vector sequences are obtained by the above method, they are aggregated according to the correspondence between the IP address of a network alarm and the terminal entity (i.e., the entity described above), so that the actual association between the first vector sequences and the second vector sequences is established.

Specifically, the first host sequence number of a given second vector sequence is obtained first, and the first host on which that second vector sequence is located is found; the IP address associated with the first host is then looked up, and the first vector sequence having that IP address is extracted. The second vector sequence corresponding to the first host and the extracted first vector sequence are then placed in one group. All obtained first vector sequences and second vector sequences are aggregated in the same way, which finally yields a sequence pair data set.

It is noted that the relationship obtained by aggregation may be one-to-many, many-to-one, or many-to-many, that is: one first vector sequence corresponds to a plurality of second vector sequences, a plurality of first vector sequences correspond to one second vector sequence, or a plurality of first vector sequences correspond to a plurality of second vector sequences. The sequence pair data set may therefore comprise a plurality of sequence pairs, each of which represents a one-to-one correspondence between one first vector sequence and one second vector sequence.

For example, fig. 3 shows one network alarm sequence (a first vector sequence) and three terminal entity sequences (second vector sequences). Because the one first vector sequence corresponds to three second vector sequences, the aggregation result is represented in the data set as three sequence pairs.
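The aggregation above, including the one-to-three case of fig. 3, can be sketched as follows. This is only an illustration under stated assumptions: the vector sequences are abbreviated to short labels, and the host-to-IP mapping `host_to_ips`, the host identifier and the IP address are hypothetical values, since the text does not say where that mapping comes from.

```python
# Hypothetical inputs: vector sequences are abbreviated to labels so that the
# pairing logic is easy to follow.
first_vector_sequences = {           # keyed by IP address (network side)
    "10.0.0.5": "alarm_seq_A",
}
second_vector_sequences = [          # (host sequence number, entity sequence)
    ("host-01", "entity_seq_1"),
    ("host-01", "entity_seq_2"),
    ("host-01", "entity_seq_3"),
]
host_to_ips = {"host-01": ["10.0.0.5"]}   # assumed host -> IP mapping

def aggregate(first_by_ip, second_by_host, host_ip_map):
    """Flatten host/IP associations into one-to-one sequence pairs."""
    pairs = []
    for host, second_seq in second_by_host:
        for ip in host_ip_map.get(host, []):
            if ip in first_by_ip:
                pairs.append((first_by_ip[ip], second_seq))
    return pairs

sequence_pair_data_set = aggregate(
    first_vector_sequences, second_vector_sequences, host_to_ips
)
```

Here `sequence_pair_data_set` holds three pairs that all share `alarm_seq_A`, so a one-to-many relationship is stored as several one-to-one pairs.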

Further, the sequence pair data set is segmented according to the preset time window length to obtain the final sequence pair data set. To ensure the integrity of the resulting sequence pairs, any sequence pair in which no alarm was generated within the preset time window length (i.e. a sequence pair whose first vector sequence is empty) is detected and discarded.

By the above method, a plurality of sequence pairs with the preset time window length are obtained.
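The segmentation and the discarding of empty pairs might look like the sketch below. It assumes that every alarm and entity carries a timestamp in seconds and that windows are aligned to multiples of the window length; the function name `segment_pair`, the 60-second window and the sample data are hypothetical.

```python
from collections import defaultdict

def segment_pair(alarms, entities, window_seconds):
    """Split one sequence pair into fixed-length time windows.

    `alarms` and `entities` are lists of (timestamp_seconds, item) tuples.
    Windows in which no alarm occurred are discarded.
    """
    alarm_windows = defaultdict(list)
    entity_windows = defaultdict(list)
    for ts, item in alarms:
        alarm_windows[int(ts // window_seconds)].append(item)
    for ts, item in entities:
        entity_windows[int(ts // window_seconds)].append(item)

    segmented = []
    for index in sorted(set(alarm_windows) | set(entity_windows)):
        if not alarm_windows.get(index):
            continue  # empty first vector sequence: discard this pair
        segmented.append((alarm_windows[index], entity_windows.get(index, [])))
    return segmented

alarms = [(5, "a1"), (50, "a2"), (130, "a3")]
entities = [(10, "e1"), (70, "e2"), (80, "e3"), (125, "e4")]
windowed_pairs = segment_pair(alarms, entities, window_seconds=60)
```

In this example the middle window (60 to 120 seconds) contains entities but no alarm, so it is dropped and two windowed pairs remain.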

Step 103: acquiring a plurality of sequence pairs with preset time window length;

In this embodiment, each sequence pair may represent a one-to-one correspondence between a first vector sequence and a second vector sequence, where the first vector sequence represents a sequence of a plurality of vectorized network alarms having the same IP address, and the second vector sequence represents a sequence of a plurality of vectorized terminal entities having the same host sequence number.

Step 104: when the first vector sequence in the sequence pair is detected to be abnormal, a first prediction result corresponding to the first vector sequence is obtained, and a target log associated with the first vector sequence is obtained according to the first prediction result;

A pre-trained third model detects whether the first vector sequence in the sequence pair is abnormal. If so, a fourth model is used to obtain the first prediction result for the abnormal first vector sequence, where the first prediction result is a second vector sequence predicted from the abnormal first vector sequence.

Then, the first prediction result is compared for similarity with each of the M actual second vector sequences in all sequence pairs that contain the abnormal first vector sequence, which yields M similarity values. The similarity values are sorted so that the most similar sequences come first, and the N actual second vector sequences corresponding to the first N similarity values are taken as the target logs associated with the abnormal first vector sequence, where M and N are positive integers greater than or equal to 1.

Specifically, the third model is a baseline model generated by a Local Outlier Factor (LOF) model; it may be trained on the plurality of first sequences and is used here to detect whether the first vector sequence is abnormal.
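For readers unfamiliar with LOF, the following sketch computes LOF scores directly from the standard definition. It is an illustration rather than the third model itself: it assumes each sequence has already been reduced to one fixed-length point (for example by averaging its vectors, which the text does not specify), it takes exactly the k nearest neighbours and ignores distance ties, and the threshold of 1.5 is a hypothetical choice.

```python
import math

def local_outlier_factors(points, k):
    """Return the LOF score of every point (about 1 = inlier, larger = outlier).

    Assumes len(points) > k and no k+1 identical points (zero distances would
    make the local density infinite).
    """
    n = len(points)
    # k nearest neighbours of each point as (distance, index), nearest first.
    neighbours = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append(dists[:k])
    k_distance = [nbrs[-1][0] for nbrs in neighbours]

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        return len(reach) / sum(reach)

    densities = [lrd(i) for i in range(n)]
    # LOF: mean ratio of the neighbours' densities to the point's own density.
    return [
        sum(densities[j] for _, j in neighbours[i]) / (k * densities[i])
        for i in range(n)
    ]

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = local_outlier_factors(points, k=2)
anomalous = [i for i, score in enumerate(scores) if score > 1.5]
```

The four clustered points receive a score of about 1, while the isolated point at (10, 10) receives a much larger score and is the only index in `anomalous`.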

The fourth model may be a translation model between first vector sequences and second vector sequences, trained on the sequence pair data set by a machine translation model such as Seq2Seq. Given a first vector sequence as input, the fourth model outputs the most probable second vector sequence; given a second vector sequence as input, it outputs the most probable first vector sequence. The output sequence is the prediction result of the fourth model.

The similarity comparison strategy may be a sequence comparison algorithm based on information entropy, a traversal-based sequence comparison algorithm, or the like. Taking the algorithm based on information entropy as an example, the similarity value in this embodiment is the relative entropy: the greater the value, the greater the difference between the two sequences; the smaller the value, the smaller the difference. Under this method, the similarity values are therefore sorted in ascending order.
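The entropy-based ranking can be sketched as follows. The text does not say how a sequence is turned into a probability distribution, so this example assumes, for illustration only, that each sequence is summarized by the smoothed relative frequency of its entities over a fixed vocabulary; the function names, the vocabulary and the candidate sequences are all hypothetical.

```python
import math
from collections import Counter

def distribution(sequence, vocabulary, smoothing=1e-6):
    """Smoothed relative frequency of each vocabulary item in the sequence."""
    counts = Counter(sequence)
    total = len(sequence) + smoothing * len(vocabulary)
    return [(counts[item] + smoothing) / total for item in vocabulary]

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(P || Q); 0 means identical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def top_n_matches(predicted, candidates, vocabulary, n):
    """Rank candidates by ascending relative entropy to the prediction."""
    p = distribution(predicted, vocabulary)
    scored = [
        (relative_entropy(p, distribution(seq, vocabulary)), name)
        for name, seq in candidates.items()
    ]
    return [name for _, name in sorted(scored)[:n]]

vocabulary = ["cmd.exe", "powershell.exe", "notepad.exe"]
predicted = ["cmd.exe", "powershell.exe", "powershell.exe"]
candidates = {
    "seq_1": ["cmd.exe", "powershell.exe", "powershell.exe"],
    "seq_2": ["cmd.exe", "cmd.exe", "powershell.exe"],
    "seq_3": ["notepad.exe", "notepad.exe", "notepad.exe"],
}
target_logs = top_n_matches(predicted, candidates, vocabulary, n=2)
```

Because a smaller relative entropy means a smaller difference, the ascending sort places `seq_1` (identical to the prediction) first and `seq_2` second, and `seq_3` is left out of the top two.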

The target logs associated with the abnormal first vector sequence are the N second vector sequences obtained according to the first prediction result, from which N key terminal entity sequences can be obtained.

Step 105: when the second vector sequence in the sequence pair is detected to be abnormal, a second prediction result corresponding to the second vector sequence is obtained, and a target log associated with the second vector sequence is obtained according to the second prediction result.

A pre-trained fifth model detects whether the second vector sequence in the sequence pair is abnormal. If so, the fourth model is used to obtain the second prediction result for the abnormal second vector sequence, where the second prediction result is a first vector sequence predicted from the abnormal second vector sequence.

Specifically, the fifth model is a baseline model generated by a Local Outlier Factor (LOF) model; it may be trained on the plurality of second sequences and is used here to detect whether the second vector sequence is abnormal.

Furthermore, the second prediction result of the abnormal second vector sequence can be compared for similarity with each of the m actual first vector sequences in all sequence pairs that contain the abnormal second vector sequence, which yields m similarity values. The similarity values are sorted so that the most similar sequences come first, and the n actual first vector sequences corresponding to the first n similarity values are taken as the target logs associated with the abnormal second vector sequence, where m and n are positive integers greater than or equal to 1.

For the fourth model and the similarity comparison strategy, see the description of step 104.

The target logs associated with the abnormal second vector sequence are the n first vector sequences obtained according to the second prediction result, that is, the n network alarm sequences most similar to the second prediction result.

The method provided by the embodiment of the present application solves the prior-art problem that, when correlation analysis is performed on original data collected by different devices, it is difficult to completely and accurately locate and trace the corresponding associated logs.

Taking the big data fusion platform of a security operation center as an example, the direct technical effects of the embodiment of the present application are illustrated in the following two aspects:

In the first aspect, given network alarms on the network side, abnormal network alarms can be located quickly and effectively and associated with the key entity logs on the terminal side, so that the behavior path and log content of those key entity logs can be traced.

In the second aspect, given behavior logs on the terminal side, the entities of abnormal behavior logs, such as processes and files, can be located quickly and effectively, and a possible network alarm sequence can be obtained. This sequence helps an operator judge which network-side alarm types the abnormal terminal-side behavior logs may cause; in particular, when only terminal logs are collected, it gives experts a basis for anticipating malicious network behavior.

Based on the same inventive concept, the present application further provides a log association apparatus. The apparatus is configured to associate network alarms on the network side with behavior logs on the terminal side, and to associate behavior logs on the terminal side with possible network alarms on the network side. It thereby solves the prior-art problem that, when correlation analysis is performed on original data collected by different devices, it is difficult to completely and accurately locate and trace the corresponding associated logs, and it effectively assists expert users in studying and judging behavior logs or network alarms. As shown in fig. 4, the apparatus includes:

an obtaining module 401, configured to obtain a plurality of sequence pairs with a preset time window length, where each sequence pair represents a one-to-one correspondence between a first vector sequence and a second vector sequence, the first vector sequence represents a sequence of a plurality of vectorized network alarms having the same IP address, and the second vector sequence represents a sequence of a plurality of vectorized terminal entities having the same host sequence number;

a first detection module 402, configured to, when it is detected that the first vector sequence in the sequence pair is abnormal, obtain a first prediction result corresponding to the first vector sequence, and obtain a target log associated with the first vector sequence according to the first prediction result;

a second detection module 403, configured to, when it is detected that the second vector sequence in the sequence pair is abnormal, obtain a second prediction result corresponding to the second vector sequence, and obtain a target log associated with the second vector sequence according to the second prediction result.

In a possible design, before obtaining the plurality of sequence pairs, the obtaining module 401 is further configured to: obtain a plurality of first vector sequences and a plurality of second vector sequences; obtain, according to a first host sequence number in a second vector sequence, a first IP address corresponding to the first host sequence number, and extract all first vector sequences having the first IP address; aggregate the second vector sequence corresponding to the first host sequence number with the first vector sequences corresponding to the first IP address to obtain a sequence pair data set, where the sequence pair data set is a set of sequence pairs and each sequence pair represents a one-to-one correspondence between a first vector sequence and a second vector sequence; and segment the sequence pair data set according to the preset time window length to obtain a plurality of sequence pairs having the preset time window length.

In a possible design, before obtaining the plurality of first vector sequences and the plurality of second vector sequences, the obtaining module 401 is further configured to: obtain first data and second data, where the first data represents a plurality of network alarms collected by a network side device, and the second data represents a plurality of behavior logs collected on the terminal side; divide the network alarms having the same IP address in the first data into one sequence to obtain a plurality of first sequences, where each first sequence represents a plurality of network alarms having the same IP address; divide the behavior logs having the same host identity in the second data into one sequence to obtain a plurality of second sequences, where each second sequence represents a plurality of behavior logs having the same host identity; vectorize the plurality of first sequences to obtain a plurality of first vector sequences; and vectorize the plurality of second sequences to obtain a plurality of second vector sequences.
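The division of raw records into first and second sequences can be sketched as a simple grouping step. The record fields (`timestamp`, `ip`, `host_id`, `alarm_type`, `entity`), the sample values and the ordering by timestamp are assumptions made for this illustration; the text only requires that records sharing an IP address or a host identity end up in the same sequence.

```python
from collections import defaultdict

def group_into_sequences(records, key):
    """Group records by `key`, keeping each group in timestamp order."""
    groups = defaultdict(list)
    for record in sorted(records, key=lambda r: r["timestamp"]):
        groups[record[key]].append(record)
    return dict(groups)

network_alarms = [  # hypothetical first data
    {"timestamp": 3, "ip": "10.0.0.5", "alarm_type": "sql_injection"},
    {"timestamp": 1, "ip": "10.0.0.5", "alarm_type": "port_scan"},
    {"timestamp": 2, "ip": "10.0.0.9", "alarm_type": "host_sweep"},
]
behavior_logs = [  # hypothetical second data
    {"timestamp": 1, "host_id": "host-01", "entity": "C:/Windows/System32/cmd.exe"},
    {"timestamp": 2, "host_id": "host-01", "entity": "C:/Temp/payload.dll"},
]

first_sequences = group_into_sequences(network_alarms, "ip")
second_sequences = group_into_sequences(behavior_logs, "host_id")
```

Each value in `first_sequences` is one first sequence (all alarms of one IP address), and each value in `second_sequences` is one second sequence (all behavior logs of one host).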

In a possible design, the first detection module 402 is specifically configured to: when it is detected that the first vector sequence in the sequence pair is abnormal, obtain a first prediction result corresponding to the first vector sequence, where the first prediction result represents a second vector sequence predicted according to the first vector sequence; extract, according to the first vector sequence, the N second vector sequences corresponding to the first vector sequence in the sequence pairs, where N is a positive integer greater than or equal to 1; compare each of the N second vector sequences with the first prediction result for similarity to obtain N similarity values; sort the N similarity values by magnitude and extract the first M similarity values from them, where M is a positive integer greater than or equal to 1; and obtain, according to the first M similarity values, the M corresponding second vector sequences as the target logs associated with the first vector sequence.

In a possible design, the second detecting module 403 is specifically configured to, when it is detected that the second vector sequence in the sequence pair is abnormal, obtain a second prediction result corresponding to the second vector sequence, where the second prediction result represents a first vector sequence predicted according to the second vector sequence; and using the second prediction result as a target log associated with the second vector sequence.

In one possible design, the second detection module 403 is further configured to: extract, according to the second vector sequence, the n first vector sequences corresponding to the second vector sequence in the sequence pairs, where n is a positive integer greater than or equal to 1; compare each of the n first vector sequences with the second prediction result for similarity to obtain n similarity values; sort the n similarity values by magnitude and extract the first m similarity values from them, where m is a positive integer greater than or equal to 1; and obtain, according to the first m similarity values, the m corresponding first vector sequences as the target logs associated with the second vector sequence.

Based on this apparatus, network alarms on the network side can be associated with behavior logs on the terminal side, and behavior logs on the terminal side can be associated with possible network alarms on the network side. This solves the prior-art problem that, when correlation analysis is performed on original data collected by different devices, it is difficult to completely and accurately locate and trace the corresponding associated logs, and it effectively assists expert users in studying and judging behavior logs or network alarms.

Based on the same inventive concept, an embodiment of the present application further provides an electronic device that may implement the functions of the log association apparatus described above. With reference to fig. 5, the electronic device includes:

At least one processor 501 and a memory 502 connected to the at least one processor 501. The embodiment of the present application does not limit the specific connection medium between the processor 501 and the memory 502; fig. 5 takes as an example the case where they are connected through a bus 500. The bus 500 is shown in fig. 5 by a thick line, and the connection manner between other components is merely illustrative and not limiting. The bus 500 may be divided into an address bus, a data bus, a control bus, etc.; it is shown with only one thick line in fig. 5 for ease of illustration, which does not mean that there is only one bus or one type of bus. The processor 501 may also be referred to as a controller; the name is not limited.

In the embodiment of the present application, the memory 502 stores instructions executable by the at least one processor 501, and the at least one processor 501 can execute the log association method discussed above by executing the instructions stored in the memory 502. The processor 501 may implement the functions of the various modules in the apparatus shown in fig. 4.

The processor 501 is a control center of the apparatus, and may connect various parts of the entire control device by using various interfaces and lines, and perform various functions and process data of the apparatus by operating or executing instructions stored in the memory 502 and calling data stored in the memory 502, thereby performing overall monitoring of the apparatus.

In one possible design, processor 501 may include one or more processing units and processor 501 may integrate an application processor that handles primarily operating systems, user interfaces, application programs, and the like, and a modem processor that handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 501. In some embodiments, processor 501 and memory 502 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.

The processor 501 may be a general-purpose processor, such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, that may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the log association method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.

Memory 502, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 502 may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory 502 may also be, without being limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 502 in the embodiments of the present application may also be circuitry or any other device capable of performing a storage function, for storing program instructions and/or data.

By programming the processor 501, the code corresponding to the log association method described in the foregoing embodiment may be solidified into the chip, so that the chip can execute the steps of the log association method of the embodiment shown in fig. 1 when running. How to program the processor 501 is well known to those skilled in the art and will not be described in detail herein.

Based on the same inventive concept, the present application also provides a storage medium storing computer instructions, which when executed on a computer, cause the computer to perform the log association method discussed above.

In some possible embodiments, the various aspects of the log association method provided by the present application may also be implemented in the form of a program product comprising program code for causing a control apparatus to perform the steps in the log association method according to various exemplary embodiments of the present application described above in this specification, when the program product is run on a device.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
