Threat detection platform for real-time detection, characterization, and remediation of email-based threats

Document No.: 74781 | Publication date: 2021-10-01

Note: This technology, "Threat detection platform for real-time detection, characterization, and remediation of email-based threats," was created on 2019-12-18 by Sanjay Jayakumar, Joshua Bratman, Dmitry Chesik, Abhijit Bagri, Evan Reiser, S·. Abstract: Conventional email filtering services are not suited to identifying sophisticated malicious emails, and thus may allow such emails to slip into an inbox. Introduced here are threat detection platforms designed to take an integrated approach to detecting security threats. For example, upon receiving input from an individual indicating that access to past emails received by employees of an enterprise is permitted, the threat detection platform may download those past emails in order to build a machine learning (ML) model that understands the norms of communication with internal contacts (e.g., other employees) and/or external contacts (e.g., vendors). By applying the ML model to incoming email, the threat detection platform can identify security threats in a targeted fashion and in real time.

1. A computer-implemented method, comprising:

establishing a connection via an application programming interface with a storage medium comprising a series of past communications received by an employee of an enterprise;

downloading, via the application programming interface, a first portion of the series of past communications corresponding to a first time interval into a local processing environment;

constructing a Machine Learning (ML) model for the employee by providing the first portion of the series of past communications as training data to the ML model;

receiving, via the application programming interface, a communication addressed to the employee; and

determining whether the communication represents a security risk by applying the ML model to the communication.

2. The computer-implemented method of claim 1, further comprising:

receiving input from an administrator associated with the enterprise indicating permission to access the series of past communications;

wherein the determining is performed in response to receiving the input.

3. The computer-implemented method of claim 1, wherein the series of past communications comprises a plurality of emails delivered to the employee.

4. The computer-implemented method of claim 1, further comprising:

examining each past communication in the first portion of the series of past communications to determine an attribute; and

providing the attributes derived from the first portion of the series of past communications as training data to the ML model.

5. The computer-implemented method of claim 1, further comprising:

determining that the communication represents a security risk based on an output generated by the ML model; and

characterizing the security risk along a plurality of dimensions.

6. The computer-implemented method of claim 5, wherein the plurality of dimensions comprises:

a party being attacked,

an attack vector,

a party being impersonated,

an impersonation strategy, and

an attack goal.

7. The computer-implemented method of claim 1, wherein the storage medium is a computer server managed by an entity other than the enterprise.

8. The computer-implemented method of claim 1, wherein the first portion of the series of past communications includes all emails received by the employee during the first time interval.

9. The computer-implemented method of claim 1, further comprising:

downloading, via the application programming interface, a second portion of the series of past emails corresponding to a second time interval into the local processing environment, the second time interval preceding the first time interval; and

determining whether any emails received during the second time interval represent a security risk by applying the ML model to the second portion of the series of past emails.

10. The computer-implemented method of claim 1, further comprising:

examining the communication to determine a plurality of attributes; and

generating a statistical description of the communication,

wherein the statistical description comprises a risk score for each pair of attributes included in the plurality of attributes, each risk score based on a risk of historical communications involving the corresponding pair of attributes.

11. A non-transitory computer-readable medium having stored thereon instructions that, when executed by a processor, cause the processor to perform operations comprising:

receiving an email addressed to an employee of an enterprise;

applying a first model to the email to produce a first output representing whether the email represents non-malicious email,

wherein the first model is trained using past emails addressed to the employee that have been certified as non-malicious emails;

determining that the email is likely to be a malicious email based on the first output;

applying a second model to the email to produce a second output representing whether the email represents a malicious email of a given type; and

performing an action with respect to the email based on the second output.

12. The non-transitory computer-readable medium of claim 11,

wherein the second output indicates that the email is not the malicious email of the given type, and

wherein performing the action comprises:

forwarding the email to an inbox of the employee.

13. The non-transitory computer-readable medium of claim 11, wherein the second model is one of a plurality of models that are applied to the email in response to determining that the email is likely to be a malicious email.

14. The non-transitory computer-readable medium of claim 13, wherein each model of the plurality of models is associated with a different type of malicious email.

15. The non-transitory computer-readable medium of claim 14, wherein the plurality of models, when applied to the email, produce a plurality of outputs, and wherein the operations further comprise:

applying a third model designed to aggregate the plurality of outputs produced by the plurality of models into an understandable visualization component.

16. The non-transitory computer-readable medium of claim 11,

wherein the second output indicates that the email includes a link to a Hypertext Markup Language (HTML) resource, and

wherein performing the action comprises:

following the link to access the HTML resource using a virtual web browser,

extracting a Document Object Model (DOM) of the HTML resource through the virtual web browser, and

analyzing the DOM to determine whether the link represents a security threat.

17. The non-transitory computer-readable medium of claim 11,

wherein the second output indicates that the email includes a primary link pointing to a resource hosted by a network-accessible hosting service, and

wherein performing the action comprises:

following the primary link to access the resource using a virtual web browser,

examining, by the virtual web browser, the contents of the resource to discover whether there are any secondary links to secondary resources,

for each of the secondary links,

following the secondary link to access the corresponding secondary resource using the virtual web browser, and

analyzing the content of the corresponding secondary resource to determine whether the secondary link represents a security threat; and

determining whether the primary link represents a security threat based on whether any of the secondary links is determined to represent a security threat.

18. The non-transitory computer-readable medium of claim 11,

wherein the second output indicates that the email includes a link to a Hypertext Markup Language (HTML) resource, and

wherein performing the action comprises:

following the link to access the HTML resource using a virtual web browser,

capturing a screenshot of the HTML resource through the virtual web browser,

applying a computer vision algorithm designed to identify similarities between the screenshot and a library of verified login sites, and

determining whether the link represents a security threat based on an output generated by the computer vision algorithm.

19. The non-transitory computer-readable medium of claim 12,

wherein the second output indicates that the email includes an attachment, and

wherein performing the action comprises:

opening the attachment in a secure processing environment, and

determining whether the attachment represents a security threat based on an analysis of the attachment's contents.

20. A computer-implemented method, comprising:

receiving input representing permission to access past emails delivered to employees of an enterprise within a given time interval;

establishing a connection, via an application programming interface, with a storage medium that includes the past emails;

downloading the past emails into a local processing environment via the application programming interface; and

constructing a Machine Learning (ML) model for identifying abnormal communication activity by providing the past emails as training data to the ML model.

21. The computer-implemented method of claim 20, further comprising:

examining each past email downloaded into the local processing environment to identify a sender identification and a sender email address; and

populating an entry in a database such that a sender identification is associated with a corresponding sender email address identified in the past email.

22. The computer-implemented method of claim 21, further comprising:

receiving an email addressed to the employee;

examining the email to determine a sender identification and a sender email address; and

determining whether the email represents a security threat based on whether the sender identification and the sender email address identified in the email match an entry in the database.

23. The computer-implemented method of claim 20, further comprising:

receiving an email addressed to the employee; and

determining whether the email represents abnormal communication activity by applying the ML model to the email.

24. The computer-implemented method of claim 23, wherein the output of the ML model, when applied to the email, indicates that the email represents anomalous communication activity due to the presence of a previously unknown sender identification, a previously unknown sender email address, or a previously unknown combination of sender identification and sender email address.

25. The computer-implemented method of claim 23, further comprising:

in response to determining that the email represents anomalous communication activity,

uploading information related to the email to a federated database to protect a plurality of enterprises from security threats.

26. A non-transitory computer-readable medium having instructions stored thereon, which, when executed by a processor, cause the processor to perform operations comprising:

collecting data related to incoming and/or outgoing emails corresponding to a customer over a past time interval;

generating a communication description for the customer based on the data;

receiving an incoming email addressed to the customer;

obtaining one or more attributes of the incoming email; and

determining whether the incoming email deviates from past email activity by comparing the one or more attributes to the communication description.

27. The non-transitory computer-readable medium of claim 26, wherein the customer is an enterprise for which the communication description is generated.

28. The non-transitory computer-readable medium of claim 26, wherein the customer is an employee of an enterprise for which the communication description is generated.

29. The non-transitory computer-readable medium of claim 26, wherein the generating comprises:

obtaining at least one attribute from each email corresponding to the past time interval; and

constructing the communication description based on the obtained attributes.

30. The non-transitory computer-readable medium of claim 26, the operations further comprising:

providing the deviation of the incoming email as input to a Machine Learning (ML) model; and

determining whether the incoming email represents a security risk based on output generated by the ML model.

31. The non-transitory computer-readable medium of claim 30, the operations further comprising:

performing a remedial action in response to determining that the incoming email represents a security risk.

32. The non-transitory computer-readable medium of claim 26, wherein the one or more attributes comprise a primary attribute and a secondary attribute.

33. The non-transitory computer-readable medium of claim 32, wherein the obtaining comprises:

extracting the primary attributes from the incoming email; and

determining the secondary attribute based on the primary attribute and additional information associated with the customer.

34. A computer-implemented method, comprising:

receiving input indicating permission to access email delivered to an employee of an enterprise;

acquiring an incoming e-mail addressed to the employee;

extracting a primary attribute from the incoming email by parsing the content of the incoming email and/or metadata associated with the incoming email;

obtaining a secondary attribute based on the primary attribute; and

determining whether the incoming email deviates from past email activity by comparing the primary and secondary attributes to a communication description associated with the employee.

35. The computer-implemented method of claim 34, further comprising:

establishing a connection, via an application programming interface, with an email system employed by the enterprise.

36. The computer-implemented method of claim 34, wherein the communication description comprises primary and secondary attributes of past emails delivered to the employee and determined to represent secure communications.

37. The computer-implemented method of claim 36, wherein the determining comprises:

discovering that the primary attribute, the secondary attribute, or a combination of the primary attribute and the secondary attribute are not included in the communication description.

38. The computer-implemented method of claim 34, wherein the primary attribute is a sender display name, a sender username, a Sender Policy Framework (SPF) state, a DomainKeys Identified Mail (DKIM) state, a number of attachments, a number of links in the body of the incoming email, a country of origin, information in a header of the incoming email, or an identifier embedded in metadata associated with the incoming email.

39. The computer-implemented method of claim 37, further comprising:

determining that the incoming email does not represent a security risk; and

updating the communication description by creating an entry that programmatically associates the primary and secondary attributes.

40. A computer-implemented method, comprising:

determining, by the threat detection platform, that a communication event involving the transmission of an email is currently occurring;

obtaining, by the threat detection platform, information related to the email;

resolving, by the threat detection platform, entities involved in the communication event by examining the information; and

compiling, by the threat detection platform, corpus statistics for entities determined to be involved in the communication event.

41. The computer-implemented method of claim 40, wherein the determining is accomplished by examining incoming emails received by an email system that is programmatically integrated with the threat detection platform.

42. The computer-implemented method of claim 41, wherein programmatic integration of the threat detection platform with an email system ensures that all external and internal emails are routed through the threat detection platform for inspection.

43. The computer-implemented method of claim 40, wherein the information is obtained from the email.

44. The computer-implemented method of claim 40, further comprising:

augmenting, by the threat detection platform, the information with a manually monitored data set;

wherein the resolving is performed on the augmented information.

45. The computer-implemented method of claim 40, wherein the resolving comprises:

determining an identity of a sender based on a source of the incoming email, content of the incoming email, or metadata accompanying the incoming email; and

determining an identity of a recipient based on a destination of the incoming email, content of the incoming email, or metadata accompanying the incoming email.

46. The computer-implemented method of claim 40, further comprising:

causing the corpus statistics to be displayed in the form of an entity risk graph.

47. The computer-implemented method of claim 46, wherein the entity comprises a sender of the email, a recipient of the email, a domain found in the email, a link found in the email, an Internet Protocol (IP) address found in metadata accompanying the email, a source of the email, a topic determined based on the email content, or any combination thereof.

48. The computer-implemented method of claim 46, wherein the entity risk graph comprises historical combinations of the entities and a respective risk score for each historical combination.

49. The computer-implemented method of claim 46, wherein each entity is represented in the entity risk graph as a separate node, and wherein each connection between a pair of nodes represents a risk of an event involving a pair of entities associated with the pair of nodes based on a past communication event.

50. A non-transitory computer-readable medium having instructions stored thereon, which, when executed by a processor, cause the processor to perform operations comprising:

acquiring an incoming e-mail addressed to an enterprise employee;

extracting features in the form of primary attributes and secondary attributes for the incoming email;

employing a Machine Learning (ML) model that consumes the extracted features to determine whether there are any indicators of compromise representing a security threat;

generating a signature for each of the indicators of compromise; and

causing a database to ingest each signature for discovery of future attacks having the same characteristics.

51. The non-transitory computer-readable medium of claim 50, wherein each indicator of compromise is an Internet Protocol (IP) address, an email address, a Uniform Resource Locator (URL), or a domain.

52. The non-transitory computer-readable medium of claim 50, the operations further comprising: performing deep feature extraction to reduce the likelihood of harm from complex security threats.

53. The non-transitory computer-readable medium of claim 52, wherein the performing comprises:

applying a deep learning model to understand the content, sentiment, and/or tone of the incoming email.

54. The non-transitory computer-readable medium of claim 52, wherein the performing comprises:

accessing a landing page by interacting with a link embedded in the incoming email; and

comparing the landing page, using a computer vision algorithm, to a set of known landing pages that are verified as authentic.

55. The non-transitory computer-readable medium of claim 52, wherein the performing comprises:

employing a crawling algorithm to extract information about secondary links that are embedded in attachments to the incoming email or accessible via a website pointed to by a primary link in the incoming email.

56. A computer-implemented method, comprising:

obtaining first data associated with a first batch of past emails received by an employee of an enterprise;

generating a first batch of events representing the first batch of past emails;

obtaining second data associated with a second batch of past emails that were tagged by one or more administrators,

wherein each past email in the second batch of past emails is associated with a tag that specifies a risk to the enterprise;

generating a second batch of events representing the second batch of past emails; and

storing the first and second batches of events in a database.

57. The computer-implemented method of claim 56, wherein the generating comprises:

converting the first data associated with each past email in the first batch of past emails into a predefined schema that defines an event.

58. The computer-implemented method of claim 56, further comprising:

receiving an input representing a query for an event having a given attribute; and

examining the database to identify any events having the given attribute.

59. The computer-implemented method of claim 58, further comprising:

determining a count of identified events; and

causing the count to be displayed on an interface that submitted the query.

60. The computer-implemented method of claim 57, further comprising:

calculating a risk metric for each past email in the first batch of past emails; and

attaching the calculated risk metric for each past email in the first batch of past emails to the respective predefined schema.

61. The computer-implemented method of claim 60, further comprising:

receiving input representing a query to identify events that do not represent a threat to enterprise security;

examining the database to identify any events that do not represent a threat to enterprise security; and

causing the identified event to be displayed on an interface that submitted the query.

62. The computer-implemented method of claim 60, wherein the examining comprises:

parsing the database to determine whether any past emails in the first batch of past emails are associated with a risk metric that is below a threshold; and

parsing the database to determine whether any past emails in the second batch of past emails are associated with tags that represent no risk.

63. The computer-implemented method of claim 56, further comprising:

acquiring an incoming e-mail addressed to an enterprise employee;

parsing the incoming email to identify attributes of the email;

examining the database to identify any events having the attributes; and

evaluating a risk posed by the incoming email based on the identified event.

64. A computer-implemented method, comprising:

acquiring a series of e-mails sent to enterprise employees;

identifying entities involved in the series of emails by examining each email;

creating a series of signatures for the series of emails,

wherein each signature of the series of signatures is associated with a respective email of the series of emails, and

wherein each signature identifies one or more entities involved in the respective email;

obtaining corpus statistics for the entities determined to be involved in the series of emails;

indexing the corpus statistics by date; and

storing the series of signatures and the indexed corpus statistics in a date-partitioned data structure.

65. The computer-implemented method of claim 64, further comprising:

acquiring an incoming e-mail addressed to an enterprise employee;

identifying at least one entity involved in the incoming email by examining the incoming email; and

comparing the at least one entity to the date-partitioned data structure to determine whether the at least one entity matches any of the series of signatures.

66. The computer-implemented method of claim 65, further comprising:

determining that the at least one entity matches a signature in the series of signatures; and

evaluating a risk posed by the incoming email based on the signature.

67. The computer-implemented method of claim 66, wherein the evaluating comprises:

determining any risk posed by past emails corresponding to the signature; and

calculating a risk metric for the incoming email based on the determined risk of past emails.

68. The computer-implemented method of claim 65, further comprising:

determining a similarity between the at least one entity and the series of signatures by employing a Machine Learning (ML) algorithm that probabilistically compares the at least one entity with each signature in the series of signatures; and

evaluating a risk posed by an incoming email based on an output generated by the ML algorithm.

Technical Field

Various embodiments are directed to computer programs and related computer-implemented techniques for detecting email-based threats in the security field.

Background

Employees of a business organization (or simply "business") often receive malicious emails in an inbox. Some of these malicious emails are quite complex. For example, malicious email that constitutes an attack on an employee or business may be designed to bypass existing safeguards, reach the employee's inbox, and then be opened. Such emails typically arrive without the knowledge of the corporate security team.

Many employees act upon receiving a malicious email, which puts data (e.g., their own personal data or enterprise data) at risk. For example, an employee may click on a link embedded in a malicious email, provide her credentials, send confidential information, or wire money to an unauthorized entity (also referred to as an "attacker" or "adversary") responsible for generating the malicious email. Such actions may result in the installation of malware, the theft of credentials, the compromise of employee email accounts, the leakage of data, or the theft of money.

Once a breach is discovered, the enterprise will face serious consequences. These consequences include:

the direct costs of the breach, especially if money is wired directly to the adversary;

the indirect costs of the breach, such as infected hardware and the labor required to remediate the attack; and/or

fines assessed by regulatory bodies in the event of data theft.

Conventional email security software struggles to handle attacks involving complex malicious emails for a number of reasons.

First, complex malicious emails typically involve an active adversary who is responsible for crafting personalized messages. This is in contrast to less complex email-based attacks, in which a single person may send thousands or millions of generic, non-personalized emails in an attempt to succeed through sheer volume. Here, each complex attack is new, unique, and personalized (e.g., to the employee or business). Thus, an employee will not observe the same complex attack multiple times.

Second, complex malicious emails generally do not include any attack signatures. As used herein, the term "attack signature" refers to features previously observed in one or more emails that were determined to be malicious. Traditional solutions typically rely on attack signatures and pattern matching, but complex malicious emails can be personalized to circumvent these traditional solutions. Moreover, some complex malicious emails do not contain any links or attachments. Rather, a complex malicious email might contain only text, such as "Hey, can you help me handle a task?" Upon receiving a reply, the adversary may instruct the employee to wire money or share data. Further, when an employee's email account has been compromised, all email will originate from the actual email account, making malicious activity extremely difficult to detect.

Third, businesses handle large volumes of mail, and mail delivery is time sensitive. For most emails, the decision as to whether an email constitutes fraud should be made quickly, since email security software should not add delay to the email stream. However, in most cases, conventional email security software will indefinitely delay delivery of emails that are determined to represent security threats.

Fourth, enterprises handle relatively small numbers of complex malicious emails within a given time frame. For example, a business may observe only a few examples of complex malicious emails within a week. Thus, because compromises caused by complex attacks are so rare, there is little data that can be ingested by Machine Learning (ML) models designed to identify complex malicious emails.

Accordingly, there is a need in the security art for computer programs and related computer-implemented techniques that detect email-based threats and then mitigate those threats.

SUMMARY

A significant portion of targeted attacks on businesses and their employees start with email, and these security threats are constantly evolving. As discussed above, the need to detect and resolve complex email-based threats is becoming increasingly apparent. Conventional email security software does not adequately address the need to accurately, quickly, and consistently detect complex malicious emails before they reach the inbox.

While the solution should handle a number of different attack types, two specific attack types present particular challenges in detection and resolution. The first type of attack is email account compromise (also known as "account takeover"). In this form of attack, an adversary uses stolen credentials to access an employee's account, and then uses that access to steal money or data from the business, or sends email from the account in an attempt to steal money or data from the business or other employees. The second type of attack is business email compromise. In this form of attack, an adversary impersonates an employee or a partner (e.g., a vendor). For example, an adversary may cause an incoming email to appear to have been written by an employee (e.g., by changing the display name). This form of attack is typically intended to trick the business into paying a legitimate or fictitious invoice, or to steal data.

The threat detection platform described herein is designed to collect and examine email to identify security threats faced by an enterprise. Threat detection platforms (also referred to as "email security platforms") may be designed to handle the above types of attacks, as well as other types of attacks, such as phishing (e.g., campaign-based attacks), spearphishing (e.g., personalized attacks), scams (e.g., cryptocurrency, gift card, and wire transfer scams), financial/data theft (e.g., vendor, partner, and customer impersonation), and many other types of attacks, including those never seen before.

At a high level, the techniques described herein may be used to build a model that represents the normal email behavior of a business (or of individual employees of a business), and then look for deviations by applying the model to incoming emails to identify anomalies. By determining what constitutes normal behavioral characteristics and/or normal email content, businesses can be protected from new, complex attacks such as employee impersonation, vendor impersonation, fraudulent invoices, email account compromise, and account takeover. Moreover, normalizing, structuring, and storing data related to email may allow other high-value data sets to be created. For example, the threat detection platform may obtain valuable information about Enterprise Resource Planning (ERP) from email data. As discussed further below, the techniques described herein may utilize machine learning, heuristics, rules, human-in-the-loop feedback and labeling, or some other technique to detect attacks (e.g., in real time or near real time) based on features extracted from a communication (e.g., an email) and/or the context of the communication (e.g., recipient, sender, content, etc.).
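To make this normality-modeling idea concrete, the following is a minimal sketch in Python of a per-employee communication profile and a deviation check. It is illustrative only; the CommunicationProfile class, its fields, and the scoring rule are assumptions, not the platform's actual implementation.

```python
# Minimal sketch (hypothetical names): build a per-employee profile from
# historical emails, then flag incoming emails whose (display name, sender
# address) combination deviates from it.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class CommunicationProfile:
    """Counts of (display name, sender address) pairs seen in past email."""
    pair_counts: Counter = field(default_factory=Counter)
    total: int = 0

    def ingest(self, display_name: str, sender_address: str) -> None:
        self.pair_counts[(display_name.lower(), sender_address.lower())] += 1
        self.total += 1

    def deviation_score(self, display_name: str, sender_address: str) -> float:
        """Return a low value for a frequently seen pair, approaching 1.0
        for a never-before-seen combination (a crude anomaly signal)."""
        seen = self.pair_counts[(display_name.lower(), sender_address.lower())]
        return 1.0 / (1.0 + seen)

# Usage: train on past emails, then score an incoming one.
profile = CommunicationProfile()
for name, addr in [("Jane Roe", "jane@vendor.com")] * 40:
    profile.ingest(name, addr)

print(profile.deviation_score("Jane Roe", "jane@vendor.com"))   # low: familiar
print(profile.deviation_score("Jane Roe", "jane@evil.example")) # 1.0: unseen pair
```

A production system would model far more than sender identity, but the pattern is the same: accumulate historical statistics, then score incoming mail against them.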

Once a security threat is detected, remediation actions may be taken. The remediation action deemed appropriate (if any) may depend on the type of security threat detected. For example, the threat detection platform may perform different remediation operations for a malicious email containing an embedded link than for a malicious email carrying an attachment. As part of the threat detection, identification, and remediation process, the threat detection platform may consider as input: user actions; emails reported by users; Machine Learning (ML) training data, including manually labeled emails, historical threat information, and ratings; threat detection likelihoods based on models of known attack types; and heuristics, including rules for blacklisting and/or whitelisting emails that meet certain conditions.
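The following hypothetical sketch shows one way these heterogeneous inputs might be combined: enterprise-defined blacklist/whitelist rules short-circuit the decision, and an ML-derived score is consulted otherwise. All names and thresholds are illustrative assumptions.

```python
# Heuristic allow/deny rules take precedence; the ML score decides the rest,
# with an uncertain band held for human-in-the-loop review.
def triage(email: dict, blacklist: set, whitelist: set, ml_score: float) -> str:
    sender = email["sender"].lower()
    if sender in blacklist:          # enterprise-defined deny rule
        return "quarantine"
    if sender in whitelist:          # enterprise-defined allow rule
        return "deliver"
    if ml_score > 0.9:               # high-confidence model verdict
        return "quarantine"
    if ml_score > 0.5:               # uncertain: hold for human review
        return "review"
    return "deliver"

print(triage({"sender": "ceo@look-alike.example"}, set(), set(), ml_score=0.95))
```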

Drawings

The various features of this technique will become more apparent to those skilled in the art upon review of the following detailed description when taken in conjunction with the accompanying drawings. Embodiments of the present technology are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements.

Fig. 1 illustrates how a conventional filtering service examines incoming emails to determine which, if any, should be prevented from reaching their intended destination.

Fig. 2 illustrates how a threat detection platform applies a multi-layered, integrated model that includes multiple sub-models to incoming emails received via the internet to determine which, if any, emails should be prevented from reaching their intended destinations.

FIG. 3 depicts an example of a system for detecting email-based threats, including a customer network (also referred to as an "enterprise network") and a threat detection platform.

FIG. 4 depicts a flow diagram of a process for detecting an email-based threat by monitoring incoming emails, determining attributes of the emails, detecting attacks based on the determined attributes, and optionally performing remediation steps.

FIG. 5 depicts an example of a hierarchical diagram of possible attack types generated by a Machine Learning (ML) model for a particular customer.

FIG. 6 depicts an example of a threat detection platform including multiple analysis modules and multiple extractors (e.g., multiple primary extractors and multiple secondary extractors) that work in conjunction with one another.

Fig. 7 depicts how most incoming messages are classified as non-malicious, while a small portion of incoming messages are classified as malicious.

FIG. 8A includes a high-level schematic diagram of a detection architecture of a threat detection platform, according to some embodiments.

FIG. 8B includes an example of a more detailed process by which the threat detection platform may process data related to past emails (here, obtained from Microsoft Office 365), extract primary attributes from the past emails, generate corpus statistics based on the primary attributes, obtain secondary attributes based on the primary attributes and the corpus statistics, train an ML model with the primary attributes and/or the secondary attributes, and then use the ML model to score incoming emails based on the risk posed to the enterprise.

FIG. 9 depicts an example of an incoming email that may be examined by a threat detection platform.

FIG. 10A depicts how information gathered from incoming emails can be used to determine different entities.

FIG. 10B depicts an example of how the threat detection platform may perform an entity resolution procedure to determine the sender identification of an incoming email.

FIG. 11 depicts how an entity risk graph can contain historical combinations of entities found in an incoming email and risk scores associated with those historical combinations.

Fig. 12 depicts an example of an entity risk graph.

FIG. 13 provides an example matrix of the stages that a threat detection platform may perform in processing data, extracting features, determining whether an event represents an attack, and the like.

FIGS. 14A-H depict examples of different data structures that a threat detection platform may create/populate as it processes data, extracts features, determines whether an event represents an attack, and the like.

FIG. 15 includes a high-level system diagram of a threat intelligence system to which a threat detection platform belongs.

FIG. 16 illustrates how a threat detection platform derives/infers attributes from data obtained from multiple sources, provides those attributes as inputs to ML models, and then examines the outputs produced by the ML models to determine whether a security threat exists.

FIG. 17 includes a high-level architectural description of a threat detection platform capable of generating/updating data for processing incoming emails in real-time via batch execution.

Fig. 18A includes a high-level diagram of a process by which a threat detection platform may perform threat intelligence.

FIG. 18B includes a high-level diagram of a process by which a threat detection platform may generate signatures for the threats posed by incoming emails.

FIG. 19A includes a high-level diagram of a process by which a threat detection platform may index corpus statistics to create a date-partitioned database of signatures and corpus statistics that may be used to more efficiently identify unsafe entities.

Fig. 19B depicts an example database that includes signatures and corpus statistics.

FIG. 20 illustrates an example of how a threat detection platform detects employee account compromise (EAC).

FIG. 21 depicts a high level flow chart of a process for scoring threats posed by incoming emails.

FIG. 22 depicts a flowchart of a process for applying a personalized Machine Learning (ML) model to emails received by employees of an enterprise to detect security threats.

FIG. 23 depicts a flowchart of a process for detecting and characterizing email-based security threats in real-time.

Fig. 24 is a block diagram illustrating an example of a processing system in which at least some of the operations described herein may be implemented.

The various embodiments depicted in the drawings are for illustrative purposes only. One skilled in the art will recognize that alternative embodiments may be employed without departing from the principles of the present technology. Thus, although specific embodiments have been illustrated in the accompanying drawings, the present technology is susceptible to various modifications.

Detailed Description

The threat detection platform described herein is directed to collecting and inspecting email to identify security threats faced by an enterprise. At a high level, the techniques described herein may be used to build a model that represents the normal email behavior of an enterprise (or of individual employees of an enterprise), and then look for deviations by applying the model to incoming emails to identify anomalies. By determining what constitutes normal behavioral characteristics and/or normal email content, businesses can be protected from new, complex attacks such as employee impersonation, vendor impersonation, fraudulent invoices, email account compromise, and account takeover. As discussed further below, the techniques described herein may utilize machine learning, heuristics, rules, human-in-the-loop feedback and labeling, or some other technique to detect attacks (e.g., in real time or near real time) based on features extracted from a communication (e.g., an email), attributes of the communication (e.g., recipient, sender, content, etc.), and/or data sets/information unrelated to the communication. For example, detecting the complex attacks that plague an enterprise may require knowledge gathered from multiple data sets. These data sets may include employee login data, security events, calendars, contact information, Human Resources (HR) information, and the like. Each of these data sets provides a different dimension of the normality of employee behavior and can be used to detect the most complex attacks.
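As an illustration of how non-email data sets can add dimensions of normality, the sketch below joins sign-in telemetry and HR data onto an email event before scoring. Every field name is a hypothetical assumption, not the platform's actual data schema.

```python
# Join non-email data sets onto an email event to add behavioral dimensions.
def enrich(email_event: dict, login_data: dict, hr_data: dict) -> dict:
    employee = email_event["recipient"]
    features = dict(email_event)
    # Dimension from sign-in telemetry: where did the claimed sender last
    # log in from?
    features["sender_login_country"] = login_data.get(
        email_event["sender"], {}).get("last_country", "unknown")
    # Dimension from HR data: does the sender actually manage the recipient?
    features["sender_is_manager"] = (
        hr_data.get(employee, {}).get("manager") == email_event["sender"])
    return features

event = {"sender": "boss@corp.example", "recipient": "amy@corp.example"}
print(enrich(event,
             {"boss@corp.example": {"last_country": "US"}},
             {"amy@corp.example": {"manager": "boss@corp.example"}}))
```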

Once a security threat is detected, remedial action may be taken. The remedial action deemed appropriate (if any) may depend on the type of security threat detected. For example, the threat detection platform may perform different remedial operations when discovering malicious emails containing embedded links instead of malicious emails with attachments.

Embodiments may be described with reference to particular network configurations, attack types, and the like. However, those skilled in the art will recognize that these features are equally applicable to other network configurations, attack types, and the like. For example, while certain embodiments may be described in the context of a spearphishing attack, related features may be used in connection with other types of attacks.

Furthermore, the techniques may be implemented using dedicated hardware (e.g., circuitry), programmable circuitry that is suitably programmed in software and/or firmware, or a combination of dedicated hardware and programmable circuitry. Accordingly, embodiments may include a machine-readable medium having instructions operable to program a computing device to perform a process for receiving input representing permission to access an email message delivered to or sent by a corporate employee within a given time interval, establishing a connection with a storage medium including the email message, downloading the email message to a local processing environment, building an ML model for identifying anomalous communication behavior based on characteristics of the email message, and so forth.
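A minimal sketch of that permission-connect-download-train flow appears below. The MailboxClient class is a stand-in for whatever provider API is actually used, and the "model" here is just a sender-frequency table; both are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

class MailboxClient:
    """Hypothetical stand-in for a provider email API reached over a network."""
    def __init__(self, messages):
        self._messages = messages  # in a real system: the remote mailbox store

    def fetch_messages(self, employee, since):
        # Corresponds to "downloading into a local processing environment".
        return [m for m in self._messages
                if m["to"] == employee and m["date"] >= since]

def build_behavior_model(client, employee, days=180):
    """Download a time window of past email and fit a toy model of normal
    senders; a real system would train an ML model on extracted features."""
    since = datetime.now(timezone.utc) - timedelta(days=days)
    return Counter(m["from"] for m in client.fetch_messages(employee, since))

client = MailboxClient([{"to": "amy@corp.example",
                         "from": "bob@corp.example",
                         "date": datetime.now(timezone.utc)}])
print(build_behavior_model(client, "amy@corp.example"))
```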

Terminology

Reference in the specification to "an embodiment" or "one embodiment" means that a particular feature, function, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of such phrases are not necessarily referring to the same embodiment, nor are they necessarily referring to mutually exclusive alternative embodiments.

The terms "comprises" and "comprising" are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense (i.e., "including but not limited to") unless the context clearly requires otherwise. The terms "connected," "coupled," or any variant thereof, are intended to encompass any connection or coupling, either direct or indirect, between two or more elements. The coupling/connection may be physical, logical, or a combination thereof. For example, multiple devices may be electrically or communicatively coupled to each other, although not sharing a physical connection.

The term "based on" is also to be understood in an inclusive sense, rather than an exclusive or exhaustive sense. Thus, unless otherwise specified, the term "based on" means "based at least in part on.

The term "module" generally refers to software components, hardware components, and/or firmware components. A module is generally a functional component that can generate useful data or other output based on specified inputs. The modules may be self-contained. The computer program may include one or more modules. Thus, a computer program may comprise multiple modules responsible for accomplishing different tasks or a single module responsible for accomplishing all tasks.

When referring to a list of items, the word "or" is intended to encompass all of the following interpretations: any item in the list, all items in the list, and any combination of items in the list.

The order of steps performed in any process described herein is exemplary. However, unless contrary to physical possibility, the steps may be performed in a variety of orders and combinations. For example, steps may be added or removed in the processes described herein. Similarly, steps may be replaced or reordered. Thus, any description of a process is intended to be open-ended.

Technical overview

Most email platforms provide basic filtering services. Fig. 1 illustrates how a conventional filtering service examines incoming emails to determine which, if any, should be prevented from reaching their intended destination. In some cases, a business applies an anti-spam filter to incoming emails received via the internet, while in other cases another entity (e.g., an email service) applies an anti-spam filter to incoming emails on behalf of the business. Emails received via the internet 102 may be referred to as "external emails". Meanwhile, the term "internal email" may be used to refer to those emails sent within an enterprise. An example of internal email is enterprise internal email (e.g., email from one employee to another) that is delivered directly to a recipient mailbox instead of being routed through a mail exchanger (MX) record, an external gateway, or the like.

In general, the anti-spam filter 104 is designed to capture and quarantine malicious emails using blacklists of senders, sender email addresses, and websites that have been detected in past malicious emails, and/or an enterprise-defined policy framework. The term "anti-spam filter" as used herein may refer to any legacy email security mechanism capable of filtering incoming emails, including a Secure Email Gateway (SEG) (also referred to as a "gateway"). For example, the enterprise 108 (or an email service) may maintain a list of sender email addresses from which malicious emails were received in the past. As another example, a business may decide to enforce a policy that prohibits employees from receiving emails from a given domain. Malicious emails captured by the anti-spam filter 104 can be quarantined so as to remain hidden from the intended recipient, while non-malicious emails can be stored on the email server 106 (e.g., a cloud-based email server) for subsequent access by the intended recipient. An email server (also referred to as a "mail server") facilitates the delivery of email from senders to recipients. Typically, as an email travels toward its intended destination, it is transmitted through a series of email servers. This series of email servers allows email to be sent between different email address domains.
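For concreteness, the blacklist-style logic of a conventional filter can be sketched as follows (hypothetical names and data throughout). Note that a personalized attack sent from a never-before-seen address passes this filter untouched, which is exactly the weakness discussed below.

```python
# Conventional blacklist filtering: quarantine mail whose sender address or
# embedded domain appears on an enterprise-maintained blacklist.
BLOCKED_SENDERS = {"phisher@bad.example"}
BLOCKED_DOMAINS = {"bad.example"}

def conventional_filter(sender: str, link_domains: list[str]) -> str:
    domain = sender.rsplit("@", 1)[-1].lower()
    if sender.lower() in BLOCKED_SENDERS or domain in BLOCKED_DOMAINS:
        return "quarantine"
    if any(d.lower() in BLOCKED_DOMAINS for d in link_domains):
        return "quarantine"
    return "deliver"  # a new, personalized attack sails through

print(conventional_filter("phisher@bad.example", []))         # quarantine
print(conventional_filter("new-attacker@fresh.example", []))  # deliver
```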

Email servers can be divided into two broad categories: outgoing mail servers and incoming mail servers. The outgoing mail server may be a Simple Mail Transfer Protocol (SMTP) server. The incoming mail server is typically a Post Office Protocol version 3 (POP3) server or an Internet Message Access Protocol (IMAP) server. POP3 servers are known for storing sent/received messages on local hard disks, while IMAP servers store copies of messages on the server (although most POP3 servers can also store messages on the server). Thus, the location of email received by an enterprise may depend on the type of incoming mail server the enterprise uses.
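As a concrete example of retrieving mail from an IMAP-style incoming mail server, the sketch below uses Python's standard imaplib module. The host and credentials are placeholders; a real deployment would use credentials granted by an administrator.

```python
import email
import imaplib

# Placeholders only; substitute the enterprise's mail host and credentials.
HOST, USER, PASSWORD = "imap.mail.example", "amy@corp.example", "app-password"

with imaplib.IMAP4_SSL(HOST) as conn:
    conn.login(USER, PASSWORD)
    conn.select("INBOX", readonly=True)   # do not mutate the mailbox
    _, data = conn.search(None, "ALL")    # message sequence numbers
    for num in data[0].split()[-5:]:      # last five messages
        _, parts = conn.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(parts[0][1])
        print(msg["From"], "->", msg["Subject"])
```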

As mentioned above, this arrangement is not suitable for identifying complex malicious emails. Thus, conventional filtering services often allow complex malicious emails to slip into an employee's inbox. Accordingly, described herein is a threat detection platform designed to improve upon conventional filtering services. FIG. 2 illustrates how threat detection platform 214 applies a multi-layered ensemble model, comprising multiple sub-models, to incoming emails received via Internet 202 to determine which, if any, emails should be prevented from reaching their intended destinations.

Initially, threat detection platform 214 may receive an email addressed to an employee of a business. Upon receiving the email, threat detection platform 214 may apply the first model 204 to the email to produce a first output representing whether the email is non-malicious. The first model may be trained using past emails, addressed to employees of the enterprise, that have been proven to be non-malicious. Accordingly, the first model 204 may be referred to as a "positive security model." The first model 204 serves as the first level of threat detection and, as such, may be tailored/designed to allow most email (e.g., more than 90%, 95%, or 99% of all incoming email) to reach the email server 206. Typically, the first model 204 is designed so that the initial threat determination is made fairly quickly (e.g., in less than 100, 50, or 25 milliseconds). Thus, the first model 204 may be responsible for performing load shedding.
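The first-phase decision can be pictured as a simple threshold on the positive-security model's output, as in the hypothetical sketch below. The 0.99 threshold is an illustrative assumption, chosen so that the vast majority of mail is certified quickly.

```python
# First phase ("load shedding"): certify most mail as non-malicious so it
# reaches the mail server with minimal delay; escalate the remainder.
def first_phase(p_non_malicious: float, threshold: float = 0.99) -> str:
    # Tuned so that well over 90% of incoming mail passes straight through.
    return "deliver" if p_non_malicious >= threshold else "escalate"

print(first_phase(0.999))  # deliver: certified non-malicious
print(first_phase(0.60))   # escalate: potentially malicious
```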

If the email cannot be certified by the first model 204 as non-malicious, threat detection platform 214 may apply the second model 208 to the email. For purposes of illustration, the emails forwarded to the second model 208 may be referred to as "malicious emails." However, these emails are more accurately described as potentially malicious emails, because the first model 204 can only establish whether an email is non-malicious. Once applied to the email, the second model 208 may produce a second output that represents whether the email represents a given type of malicious email. Generally, the second model 208 is part of a model ensemble that is applied to emails in response to determining that the email may represent malicious email. Each model in the ensemble may be associated with a different type of security threat. For example, the ensemble may include various models for determining whether an email includes a request for data/funds, a link to a hypertext markup language (HTML) resource, an attachment, and so on. As discussed further below, the second model 208 may be designed to determine different facets of a security threat in response to determining that an email is likely malicious. For example, the second model 208 may discover facets of the security threat, such as the strategy, the target, the impersonated party, the attack vector, and the attacked party, and then upload this information to a description associated with the intended recipient and/or the enterprise.
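One way to picture the ensemble is as a mapping from attack type to a dedicated scoring function, applied only to emails the first phase could not certify. The sketch below is hypothetical; the attack types and scoring rules are placeholders.

```python
# Second-phase ensemble: one scoring function per attack type.
def score_invoice_fraud(email: dict) -> float:
    return 0.9 if "invoice" in email.get("subject", "").lower() else 0.1

def score_credential_phishing(email: dict) -> float:
    return 0.9 if any("login" in link for link in email.get("links", [])) else 0.1

ATTACK_MODELS = {
    "invoice_fraud": score_invoice_fraud,
    "credential_phishing": score_credential_phishing,
}

def second_phase(email: dict) -> dict:
    """Return a per-attack-type score from the ensemble of sub-models."""
    return {attack: model(email) for attack, model in ATTACK_MODELS.items()}

print(second_phase({"subject": "Overdue invoice",
                    "links": ["http://login.bad.example"]}))
```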

The threat detection platform 214 may then apply a third model 210, which is designed to convert the output produced by the second model 208 into an understandable visualization component 212. In embodiments where the second model 208 is part of a model ensemble, the third model 210 may aggregate the outputs produced by the models in the ensemble, characterize the attack based on the aggregated outputs, and then convert the aggregated outputs into an interpretable explanation. For example, the third model 210 may generate a notification that identifies the type of security threat posed by the email, whether remediation is required, and the like. As another example, the third model 210 may generate a human-readable explanation (e.g., comprising text, graphics, or some combination thereof) using the facets, model features, and/or the features that were most discriminative in triggering the determination that the email represents an attack. Interpretable explanations may be created so that a security professional responsible for handling/mitigating security threats can more easily understand why the second model 208 flagged an incoming email as representing an attack.
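A minimal sketch of this aggregation-and-explanation step follows; the facet names, threshold, and message format are illustrative assumptions rather than the platform's actual output.

```python
# Third model: reduce the ensemble's per-attack scores to a verdict plus a
# short human-readable explanation naming the facets that drove the decision.
def explain(scores: dict, facets: dict, threshold: float = 0.9) -> str:
    attack, top = max(scores.items(), key=lambda kv: kv[1])
    if top < threshold:
        return "No attack detected."
    drivers = ", ".join(f"{name}={value}" for name, value in facets.items())
    return (f"Flagged as {attack} (score {top:.2f}). "
            f"Most discriminative facets: {drivers}. Remediation recommended.")

print(explain({"credential_phishing": 0.93, "invoice_fraud": 0.20},
              {"impersonated_party": "IT helpdesk", "attack_vector": "link"}))
```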

Those skilled in the art will appreciate that the output of one model may be the entry condition of another model. In other words, the order of the models employed by threat detection platform 214 may filter which emails are sent to which models in an effort to reduce analysis time. Thus, threat detection platform 214 may take a layered, bi-phasal approach to examining incoming emails.

The multi-layered ensemble model may include different types of models, such as a Gradient Boosting Decision Tree (GBDT) model, a logistic regression model, and/or a deep learning model. As discussed further below, each type of attack is generally scored by a respective model, and thus threat detection platform 214 may employ different types of models based on the type of attack detected.
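For example, a per-attack-type model registry might pair a gradient boosting classifier with one attack type and a logistic regression with another, as in the sketch below. It uses scikit-learn (assumed available); the features and labels are synthetic stand-ins for extracted email attributes.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 5))                   # 5 extracted attributes per email
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # synthetic "is attack" label

# Hypothetical pairing of attack types with model families.
MODELS_BY_ATTACK = {
    "invoice_fraud": GradientBoostingClassifier().fit(X, y),
    "credential_phishing": LogisticRegression().fit(X, y),
}

incoming = rng.random((1, 5))
for attack, model in MODELS_BY_ATTACK.items():
    print(attack, model.predict_proba(incoming)[0, 1])
```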

This approach may be referred to as a "bi-phasal approach" because it allows emails that are determined to be non-malicious to be routed to the email server 206 with little delay, while additional time is spent analyzing emails that are determined to be malicious (or at least potentially malicious).

Threat detection platform for detecting email-based threats

Fig. 3 depicts an example of a system 300 for detecting email-based threats, the system 300 including a customer network 316 (also referred to as an "enterprise network") and a threat detection platform 302. As shown in FIG. 3, threat detection platform 302 may include a description generator 304, a training module 306, a monitoring module 308, a threat detection data store 310, an analysis module 312, and a remediation engine 314. Some embodiments of threat detection platform 302 include a subset of these components, while other embodiments of threat detection platform 302 include additional components not shown in Fig. 3.

The system 300 may be used to obtain email usage data for a customer (also referred to as an "enterprise"), generate a description comprising a number of received or inferred behavioral characteristics based on the email usage data, monitor incoming emails, determine, for each email, whether the email represents a security threat using a set of attack detectors (e.g., based on deviations from behavioral characteristics or normal content, such as by feeding the deviations into an ML model), flag a possible attack if the detectors indicate one, and, if flagged, optionally perform one or more remediation steps on the email. The remediation steps may be performed according to a customer-specified remediation policy and/or a default remediation policy. As used herein, the term "customer" may refer to a collection of users of an organization (e.g., a company or business), a business unit, an individual (e.g., associated with one or more email addresses), a team, or any other suitable user of the threat detection platform 302. Although embodiments may be described in the context of an enterprise, those skilled in the art will recognize that the related techniques may be applied to other types of customers. As discussed further below, the system 300 may train one or more ML models to function as detectors capable of detecting a variety of email attack types that may be present in incoming email, based on deviations from customer behavioral characteristics, normal email content, and the like.
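The interplay between a customer-specified remediation policy and a default policy can be sketched as a simple lookup with precedence, as below; the threat types and actions are illustrative assumptions.

```python
# Default remediation policy, overridable per customer.
DEFAULT_POLICY = {
    "credential_phishing": "quarantine",
    "invoice_fraud": "hold_for_review",
    "malicious_attachment": "quarantine",
}

def remediate(threat_type: str, customer_policy: dict) -> str:
    # Customer-specified rules take precedence; fall back to the defaults.
    return customer_policy.get(threat_type,
                               DEFAULT_POLICY.get(threat_type, "alert_only"))

print(remediate("invoice_fraud", {"invoice_fraud": "quarantine"}))  # customer rule
print(remediate("credential_phishing", {}))                         # default rule
```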

In some embodiments, the system 300 detects an attack based on the entire email (e.g., including body content). However, in other embodiments, the system 300 is designed to detect an attack based only on email metadata (e.g., information about email headers, senders, etc.) or some other suitable data.

All or portions of system 300 may be implemented in an entity's email environment (e.g., customer network 316), in a remote computing system (e.g., through which incoming emails and/or data about incoming emails may be routed for analysis), in an entity's gateway, or in another suitable location. The remote computing system may belong to, or be maintained by, the entity, a third-party system, or another suitable user. The system 300 may be integrated into an entity's email system inline (e.g., at a secure email gateway), via an Application Programming Interface (API) (e.g., where the system receives email data via a Microsoft API), or in another suitable manner. Thus, the system 300 may supplement and/or replace other communication security systems employed by the entity.

In a first variation, the system 300 is maintained by a third party (also referred to as a "threat detection service") that has access to the email of multiple entities. In this variation, the system 300 may route the email, the extracted features (e.g., primary attribute values), the derived information (e.g., secondary attribute values), and/or other suitable communications to a remote computing system maintained/managed by the third party. For example, the remote computing system may be an instance on Amazon Web Services (AWS). In this variation, the system 300 may maintain one or more databases for each entity, including, for example, organizational charts, attribute references, and the like. Additionally or alternatively, system 300 may maintain federated databases, such as a detector database, a legitimate vendor database, and the like, that are shared among multiple entities. In this variation, the third party may maintain different instances of the system 300 for different entities, or a single instance for multiple entities. The data hosted in these instances may be obfuscated, encrypted, hashed, de-personalized (e.g., by removing Personally Identifiable Information (PII)), or otherwise kept secure or confidential.
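One common way to de-personalize such shared records is to replace identifiers with salted hashes before they leave an entity's instance. The sketch below illustrates the idea; the salt handling and record shape are assumptions, not the platform's actual scheme.

```python
import hashlib

SALT = b"per-deployment-secret"  # placeholder; kept out of the shared database

def pseudonymize(value: str) -> str:
    """Hash an identifier so records can be matched without exposing PII."""
    return hashlib.sha256(SALT + value.lower().encode("utf-8")).hexdigest()

record = {"sender": pseudonymize("attacker@bad.example"),
          "attack_type": "invoice_fraud"}  # attack metadata stays shareable
print(record)
```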

In a second variation, the system is maintained by the entity whose email is being monitored (e.g., remotely or on-premises), and all of the data may be hosted by the entity's computing system. In this variation, data to be shared among multiple entities (e.g., detector database updates and new attack signatures) may be shared with a remote computing system maintained by a third party. Such data may be obfuscated, encrypted, hashed, de-personalized (e.g., by removing PII), or otherwise secured or kept confidential. However, any other suitable computing and ownership configuration may be used to maintain or execute the system 300.

As shown in FIG. 3, the description generator 304, training module 306, monitoring module 308, threat detection data store 310, analysis module 312, and remediation engine 314 may be part of the threat detection platform 302. Alternatively, these components may be used and/or implemented separately. Threat detection platform 302 may be implemented by a threat detection service (also referred to as a "computer security service"), a customer (e.g., an enterprise, organization, or individual that owns an account with, or implements, the threat detection service), an entity or individual associated with (or acting on behalf of) the customer, a trusted third party, or any other suitable service or party. In some embodiments, one or more aspects of system 300 may be enabled by a network-accessible computer program operating on a computer server or distributed computing system. For example, an individual may interact with threat detection platform 302 via a web browser executing on a computing device.

Customer network 316 may be an enterprise network, a mobile network, a wired network, a wireless spectrum network, or any other communication network maintained by the customer or by a network operator associated with the customer. As described above, the customer may be an individual, a business, or another suitable entity. For example, an enterprise may utilize the services of a computer security company for, at a minimum, email threat detection. The enterprise may grant permission to the computer security company to monitor the customer network 316, including monitoring incoming emails at the customer network 316, analyzing those emails for potential threats, and performing some remedial action if a threat is detected. In some embodiments, the enterprise also grants permission to the computer security company to collect or receive various pieces of data about the enterprise in order to build a description that specifies the normalcy, behavioral characteristics, and normal email content of the enterprise.

Threat detection data store 310 may include one or more databases in which customer data, threat analytics data, remediation policy information, customer behavioral characteristics or normalcy, normal customer email content, and other data may be stored. The data may be: determined by the system 300 (e.g., computed or learned from data obtained, received, or collected from the customer network 316 or the entity's email provider), received from a user, obtained from an external database (e.g., Microsoft Office®), or otherwise determined. In some embodiments, the threat detection data store 310 also stores output from the threat detection platform 302, including human-readable information about actions taken and threats detected. Various other data may also be stored.

Customer data may include, but is not limited to: email usage data; organizational data, such as members/employees and their roles; customer behavioral characteristics or normalcy (e.g., determined based on historical email); attack history (e.g., determined based on historical email, determined by applying an attribute extractor and/or analysis module to historical email, etc.); entity descriptions; normal customer email content; email addresses and/or telephone numbers of organization members; identification of the entities and/or individuals with which organization members often communicate, both internally and externally; email volume per day over multiple periods of time; the topics discussed most frequently, and how frequently; and so on.

The system 300 may optionally include a description generator 304 that generates one or more entity descriptions (also referred to as "customer descriptions") based on past emails and/or email usage data associated with the entities. In one variation, the system 300 includes a single description generator 304; in another variation, the system 300 includes a plurality of description generators 304, each of which extracts one or more attributes of an entity description. However, system 300 may include any suitable number of description generators in any suitable configuration.

Entity descriptions may be generated per customer, per business unit, per individual (e.g., per employee or email recipient), per email address, per organization type, or per any other suitable entity or group of entities. The entity description is preferably used as a reference for entity communication activities (e.g., email activities), but may be used in other ways. Further, descriptions may be generated for entities outside of the customer, and these descriptions may be federated across the customer base for use by all entities whose email is monitored by the system 300. For example, a description of a trusted third party (e.g., Oracle), of a representative of a trusted third party (e.g., a sales representative at Oracle), or of a financial institution (e.g., having a known routing number, useful for detecting fraudulent invoice payments) may be federated across the customer base. Thus, the system 300 can build a federated network of descriptions that model businesses, suppliers, customers, or people.

The entity description may include primary attributes, secondary attributes, or any other suitable characteristics. The associated values may be: a median, a mean, a standard deviation, a range, a threshold, or any other suitable value or set of values (e.g., for the entity description, for extraction from new emails, etc.). The entity description may include time series (e.g., trends or values for a particular cycle, such as months of a year), static values, or values with other suitable contextual dependencies.

The primary attributes are preferably attributes or features extracted directly from a communication, but may be determined in other ways. The primary attributes may be extracted by one or more primary attribute extractors, each of which extracts one or more primary attributes from the communication, as shown in FIG. 6, but may be extracted in other manners. A primary attribute extractor may be global (e.g., shared among multiple entities), specific to one entity, or otherwise shared. Examples of primary attributes include the sender's display name, the sender's username, the Sender Policy Framework (SPF) status, the DomainKeys Identified Mail (DKIM) status, the number of attachments, the number of links in the email body, spam or phishing metrics (e.g., the continent or country of origin), whether data in two fields that should match does not match, header information, or any other suitable communication data. The primary attributes may optionally include metadata attributes (e.g., company identifier (ID), message ID, session ID, personal ID, etc.).
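
To make the extraction step concrete, the following is a minimal sketch, not the patented implementation, of a primary attribute extractor operating on a raw message using Python's standard library; the returned field names and the Authentication-Results parsing are illustrative assumptions.

```python
# Minimal sketch of a primary attribute extractor (illustrative only).
from email import message_from_string
from email.utils import parseaddr

def extract_primary_attributes(raw_email: str) -> dict:
    msg = message_from_string(raw_email)
    display_name, address = parseaddr(msg.get("From", ""))
    auth = msg.get("Authentication-Results", "")
    body = msg.get_payload() if not msg.is_multipart() else ""
    attachments = [p for p in msg.walk()
                   if p.get_content_disposition() == "attachment"]
    return {
        "sender_display_name": display_name,
        "sender_username": address.split("@")[0],
        "sender_domain": address.split("@")[-1],
        "spf_status": "pass" if "spf=pass" in auth else "unknown",
        "dkim_status": "pass" if "dkim=pass" in auth else "unknown",
        "attachment_count": len(attachments),
        "link_count": str(body).count("http"),  # crude link count
        "reply_to_mismatch": parseaddr(msg.get("Reply-To", ""))[1]
                             not in ("", address),
    }
```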

The secondary attributes are preferably attributes that are determined from the primary attributes and/or customer data (e.g., from threat detection data store 310), but may be determined otherwise. Secondary attributes may be extracted, inferred, calculated, or otherwise determined. The secondary attributes may be determined by one or more secondary attribute extractors, each of which extracts one or more secondary attributes from the primary attributes of a given communication or entity, as shown in FIG. 5, but may be determined in other ways. A secondary attribute extractor may be global (e.g., shared across multiple entities), specific to one entity, or otherwise shared. The secondary attributes may be determined from a time series of primary attribute values (e.g., where each primary attribute value may be associated with a timestamp, such as the transmission or reception timestamp of an email), from a single primary attribute value, from the values of multiple primary attributes, or from any other suitable data set. Examples of secondary attributes may include: a frequency, such as a sender frequency (e.g., a sender fully qualified domain name (FQDN) frequency, a sender email frequency, etc.) or a domain frequency (e.g., the SPF status frequency for a given domain, the DKIM status frequency for a given domain, the frequency with which the system receives the same or a similar email body from a given domain, the frequency with which emails are received from the domain, the frequency with which emails are sent to the domain, etc.); a mismatch between primary attributes that should match; employee attributes (e.g., name, job title, whether the entity is employed, whether the entity is at high risk of attack, whether the entity is suspicious, whether the entity has been attacked, etc.); vendor attributes (e.g., vendor name, whether the vendor exactly matches a known vendor, whether a similar Unicode-lookalike vendor exists, etc.); whether the communication includes one of a set of high-risk words, phrases, sentiments, or other content (e.g., whether the communication includes financial vocabulary, credential-theft vocabulary, contract vocabulary, non-ASCII content, attachments, links, etc.); domain information (e.g., the age of the domain, whether the domain is on a blacklist or whitelist, whether the domain is internal or external, etc.); heuristics (e.g., whether the FQDN, domain name, etc. has been seen before, globally or by the entity); the deviation of a primary attribute value (e.g., extracted from the communication) from a corresponding reference value (e.g., the deviation magnitude, or whether the value deviates beyond a predetermined variance or difference threshold); or any other suitable attribute, characteristic, or variable. In some embodiments, a secondary attribute is determined as a function of one or more primary attributes. One example of a primary attribute is the sender email address, while one example of a secondary attribute is statistics on the communication patterns from that sender address to recipients, departments, organizations, and clusters of customers.
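
As one hedged illustration of how a secondary attribute extractor might combine primary attributes with historical customer data, the sketch below computes a sender-domain frequency and a few of the other example attributes from a list of previously extracted primary-attribute dictionaries; the structures are assumptions, not the patent's actual schema.

```python
# Sketch: deriving secondary attributes from primary attributes plus
# customer data; all structures are illustrative assumptions.
from collections import Counter

def derive_secondary_attributes(attrs: dict, history: list,
                                known_vendors: set) -> dict:
    domains = Counter(h["sender_domain"] for h in history)
    total = max(len(history), 1)
    return {
        # frequency with which emails are received from this domain
        "sender_domain_frequency": domains[attrs["sender_domain"]] / total,
        # mismatch between fields that should match
        "reply_to_mismatch": attrs["reply_to_mismatch"],
        # vendor attribute: exact match against a known-vendor list
        "is_known_vendor": attrs["sender_domain"] in known_vendors,
        # heuristic: has this domain been seen by the entity before?
        "domain_seen_before": domains[attrs["sender_domain"]] > 0,
    }
```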

The entity description may additionally or alternatively include a number of customer behavioral characteristics or typical email content associated with the customer. In some embodiments, description generator 304 receives email usage data from customer network 316 or threat detection data store 310. The email usage data may include, but is not limited to, information about: email addresses of employees and contacts, email content (e.g., email message bodies), email frequency, email volume at given times of day, the use of HTML/fonts/styles in email, confidential topics and the members explicitly or implicitly authorized to discuss them, spam and its characteristics, and so on.

The entity description may be generated from: historical email data for the entity (e.g., retrieved using an API of the entity's email environment, retrieved from an email data store, etc.); newly received emails (e.g., emails received after the system connects to the entity's email environment); user input; other entities (e.g., entities sharing a common feature or characteristic with the entity); or any other suitable data. In some embodiments, description generator 304 may collect, generate, or infer one or more pieces of email usage data based on received pieces of customer data, on monitoring of customer network 316 (given authentication and access granted by the customer), or in some other manner.

The entity description may be generated using the same system used for typical email analysis (e.g., using the attribute extractors that extract attributes for real-time or near-real-time threat detection), but may alternatively or additionally be generated using other suitable systems.

In one variation, description generator 304 generates a customer description by building a deep description of corporate email usage, member roles and/or seniority, day-to-day normalcy, behavioral traits, and the like, to build a model of what is "normal" or "typical" for the customer in terms of email usage and behavior, and, by extension and inference, what may constitute "abnormal" or "atypical" emails and/or activities, in order to identify possible threats.

In some embodiments, a customer description is generated based on received, collected, and/or inferred customer data, email usage data, and other relevant information. The customer description may seek to model answers to questions about the customer, including but not limited to: What are the normal email addresses of each member of the organization? What topics are normally discussed by each person, pair, and/or department (e.g., Joe and Samantha normally discuss product release plans, but never accounting or billing topics)? What are the normal login or email-sending times for each user? From which Internet Protocol (IP) addresses? From which geographic locations does a user typically log in? Has the user set suspicious mail filter rules (e.g., an attacker hijacking an email account sometimes sets a mail filter to automatically delete emails containing certain keywords so as to hide the illegitimate activity from the true owner of the account)? What is the normal tone or style used by each user? What tone is used between each pair of users? What is the normal signature used by each employee (e.g., "Cheers" or "Thank you")? What types of words are used frequently in one department but rarely in another? With which suppliers/partners does the customer normally communicate, and/or which suppliers/partners does it pay? How verbose is a given user, typically? What is a person's, pair's, or entity's typical email authentication status (e.g., SPF, DKIM, or Domain-based Message Authentication, Reporting and Conformance (DMARC))? When a user receives or sends a link/attachment, does the description derived from the link/attachment match the given description of the link/attachment? What are the typical characteristics of attachments (e.g., name, extension, type, size) when an employee receives an email with an attachment?
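
A minimal sketch of what a per-employee slice of such a description could look like in code is shown below; it tracks which sender addresses each display name uses and when each user typically sends email. The keys are assumptions made for illustration.

```python
# Sketch of a per-employee entity description capturing "normal" behavior;
# field names are illustrative assumptions.
from collections import Counter, defaultdict

def build_employee_description(past_emails: list) -> dict:
    """past_emails: dicts with 'sender_display_name', 'sender_address',
    and 'hour_sent' keys (e.g., produced by extractors like those above)."""
    address_usage = defaultdict(Counter)
    send_hours = Counter()
    for e in past_emails:
        address_usage[e["sender_display_name"]][e["sender_address"]] += 1
        send_hours[e["hour_sent"]] += 1
    return {
        # P(address | display name), used later to score deviations
        "address_usage": {
            name: {addr: n / sum(counts.values())
                   for addr, n in counts.items()}
            for name, counts in address_usage.items()},
        "typical_send_hours": dict(send_hours),
    }
```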

The monitoring module 308 is used to monitor incoming emails on the network maintained by the customer. In some embodiments, the monitoring module 308 monitors incoming emails in real time or substantially real time. In some embodiments, monitoring module 308 is only authorized to monitor incoming emails once system 300 and/or threat detection platform 302 has been authenticated and granted permission and access by customer network 316. In some embodiments, system 300 and/or threat detection platform 302 is integrated into an office suite or email suite via an API.

The analysis module 312 is used to analyze the threats/attacks that may be present in each incoming email. The analysis module 312 preferably detects attacks based on the secondary attributes (e.g., for one or more communications of the entity), but may alternatively or additionally detect attacks based on the primary attributes or any other suitable data. In one variation, the analysis module 312 is separate from the primary and secondary attribute extractors. In another variation, the analysis module 312 may include the primary and/or secondary attribute extractors. However, the analysis module 312 may be otherwise configured.

The system 300 may include one or more analysis modules 312 operating in parallel, in series, or in another suitable order. An example of multiple analysis modules 312 working in conjunction with one another is shown in FIG. 6. The set of analysis modules 312 for a given entity or communication may be: predetermined, manually determined, selected based on historical communications, selected based on operational context (e.g., fiscal quarter), or otherwise determined. In a first variation, the system 300 includes one or more analysis modules 312 of the same type or different types for each known attack type. For example, each attack type may be associated with a different analysis module 312. In a second variation, system 300 includes a single analysis module 312 for all attack types. In a third variation, the system 300 includes a group of analysis modules for each attack type (e.g., a first group for phishing attacks, a second group for spoofing attacks, etc.). In a fourth variation, the system 300 includes a cascade/tree of analysis modules 312, wherein a first layer of analysis modules classifies incoming email by potential attack type, and subsequent layers of analysis modules analyze whether the email has the characteristics of that attack type. FIG. 5 depicts a hierarchical diagram of possible attack types generated by the ML models for a particular customer, as described above with respect to training module 306. In this example, the high-level classifications include impersonation technique, attack vector, impersonated party, attacked party, and attack goal. Within the impersonation technique classification, the attack types may include a spoofed display name, a spoofed email address, a compromised account, or null due to an unknown sender. By attack goal, the attack types may include payroll fraud, credential theft, inducing the user to wire money, bitcoin extortion, wire transfer extortion, and the like.

However, system 300 may include any number of analysis modules 312 for detecting any number of attack types. Notably, because the modeling approach employed by the system 300 learns the normal communication behavior of employees, suppliers, and the organization, it can identify never-before-seen attacks, including zero-day phishing attacks.

The analysis module 312 may include or use one or more of: heuristics, neural networks, rules, decision trees (e.g., gradient-boosted decision trees), ML-trained algorithms (e.g., decision trees, logistic regression, linear regression, etc.), or any other suitable analysis method/algorithm. The analysis module 312 may produce a discrete or continuous output, such as a likelihood (e.g., attack likelihood), a binary output (e.g., attack/not attack), an attack classification (e.g., classification into one of a plurality of possible attack types), or any other suitable output. The analysis module 312 may be: received from a database (e.g., a database of known attack patterns or fingerprints), received from a user, learned (e.g., based on data shared among multiple entities, based on the entity's communication data, etc.), or otherwise determined.
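
For example, an ML-based analysis module built around a gradient-boosted decision tree might look like the sketch below; scikit-learn is shown as one plausible library choice, not the patent's actual stack.

```python
# Sketch of an ML-based analysis module for a single attack type.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

class MLAnalysisModule:
    """Outputs an attack likelihood (continuous output) for one attack type."""
    def __init__(self, attack_type: str):
        self.attack_type = attack_type
        self.model = GradientBoostingClassifier()

    def train(self, features: np.ndarray, labels: np.ndarray) -> None:
        # labels: 1 for emails labeled as this attack type, else 0
        self.model.fit(features, labels)

    def analyze(self, feature_row: np.ndarray) -> float:
        # likelihood that the email is an attack of this type
        return float(self.model.predict_proba(
            feature_row.reshape(1, -1))[0, 1])
```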

Each analysis module may be specific to an attack, an attack type/class, or any other suitable set of attacks. System 300 may include one or more analysis modules 312 for each attack set. In one variation, an attack set may be associated with multiple analysis modules 312, where the system 300 dynamically selects which analysis module to use (and/or which output to use) based on the performance metrics of each analysis module for a given attack set. For example, the system 300 may include a heuristic-based analysis module and an ML-based analysis module for a given attack, executed in parallel for each communication; monitor the recall and/or precision of both analysis modules (e.g., as determined from entity feedback on email classifications); and select the analysis module with the higher performance value for subsequent communication analysis. The outputs of all modules except the highest-performing analysis module may be hidden from the user and/or not used for email attack classification; alternatively, the outputs of the lower-performing analysis modules may be used to validate the output of the highest-performing analysis module, or used in other ways.

One or more of the analysis modules 312 may be entity-specific (e.g., specific to an organization, business unit, job title, individual, email address, etc.), shared among multiple entities (e.g., a global analysis module), or otherwise customized or generic.

In one example, first, for each incoming email, the analysis module 312 (e.g., a secondary attribute extractor) determines the deviation of the email from each of a plurality of customer behavioral characteristics or content normality baselines. In some embodiments, the deviation is a numerical value or percentage representing the delta between the customer behavioral characteristic and the corresponding characteristic determined from the incoming email. For example, if the customer behavioral characteristic is "Joe Smith sends mail almost exclusively from js@customeridentity.com" and an incoming email purporting to be from Joe Smith has the email address joesmith@genericmail.com, the deviation will be assigned a high value. If, instead, Joe Smith sends from the genericmail.com account roughly 20% of the time, the deviation will still be relatively high, but lower than in the previous example. Second, the analysis module 312 feeds the measured deviations as input into one or more attack detectors (e.g., a rule-based engine, a heuristic engine, an ML model, etc.), each of which may generate an output. Third, if an indication is received from one or more ML models that exceeds a deviation threshold for an email attack type, the analysis module 312 identifies the email as a possible attack of that attack type. Analysis module 312 may instruct the ML model to classify the deviations in the incoming email as representing a likely malicious or likely non-malicious email, and to classify the email according to the likely attack type. In some embodiments, the ML model "trips" when its output for the email exceeds a threshold deviation from the customer behavioral characteristics and content normality, and the email is then marked as a possible attack.
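
Using the hypothetical description structure sketched earlier, the deviation in the Joe Smith example could be computed as simply as the following (illustrative only):

```python
# Sketch of the deviation computation described above.
def address_deviation(description: dict, display_name: str,
                      sender_address: str) -> float:
    """Returns a value in [0, 1]; higher means more deviant."""
    usage = description["address_usage"].get(display_name)
    if usage is None:
        return 1.0  # display name never seen before
    # An address never used by this display name scores 1.0; an address
    # used roughly 20% of the time scores 0.8 (high, but lower).
    return 1.0 - usage.get(sender_address, 0.0)
```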

As shown in FIG. 6, the outputs produced by the analysis modules 312 may optionally be fed into a primary detector that analyzes them to produce a final classification of the communication as an attack or not an attack. The primary detector may optionally output the factors, rules, weights, variables, decision tree nodes, or other attack detector parameters that contributed to the attack classification.

Remediation engine 314 optionally operates to perform one or more remediation processes. Remediation engine 314 is preferably invoked in response to a communication being classified as an attack (e.g., by one or more analysis modules 312, by the primary detector, etc.), but may alternatively or additionally be invoked at any other suitable time. In some embodiments, the remediation steps are based on, or associated with, a customer remediation policy. Customer remediation policies may be predefined and received by threat detection platform 302, generated based on inference, analysis, and customer data, or otherwise determined. In some embodiments, threat detection platform 302 may prompt the customer to provide one or more remediation steps or components of a remediation policy in a variety of circumstances. The remediation steps may include, for example, moving the email to a spam folder, moving the email to a hidden folder, permanently deleting the email, performing different actions depending on how the user operates, sending a notification to a user (e.g., an employee, administrator, or security team member), resetting the password of the affected employee, ending all active sessions, pushing signatures to a firewall or endpoint protection system, pushing signatures to an endpoint protection system to lock one or more computing devices, etc., as shown in FIG. 6. For example, upon discovering a compromised account, threat detection platform 302 may invoke one or more APIs to block the compromised account, reset connections with services/databases, or change the password through a workflow. Additionally or alternatively, the remediation steps may include moving an email from the spam folder back to the inbox (e.g., in response to determining that the email is not an attack).
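
A hedged sketch of how a remediation engine might map classifications to remediation steps under a customer-specified policy with a default fallback is shown below; the action names and the `dispatch` helper are hypothetical.

```python
# Sketch of policy-driven remediation; action names are assumptions.
DEFAULT_POLICY = {
    "bad": ["move_to_hidden_folder", "notify_security_team"],
    "suspicious": ["move_to_spam_folder"],
    "borderline": ["notify_user"],
}

def dispatch(action: str, email_id: str) -> None:
    # Stand-in for real API calls to the mail provider, firewall, etc.
    print(f"applying {action} to {email_id}")

def remediate(email_id: str, classification: str,
              customer_policy: dict = None) -> list:
    policy = customer_policy or DEFAULT_POLICY
    actions = policy.get(classification, [])
    for action in actions:
        dispatch(action, email_id)
    return actions
```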

In some embodiments, remediation engine 314 provides threat detection results and/or other output to the customer through, for example, customer device 318. Examples of customer devices 318 include mobile phones, laptop computers, and other computing devices. In some embodiments, remediation engine 314 sends the output in human-readable form to threat detection platform 302 for display on an interface.

System 300 optionally includes a training module 306 that operates to train the ML models employed by analysis module 312. Each ML model preferably detects a single attack type, but may alternatively detect multiple attack types. In some embodiments, training module 306 trains an ML model by feeding training data into the ML model. The training data may include: communications labeled by the entity (e.g., system-analyzed emails sent to security personnel and labeled as attacks or non-attacks, as shown in FIG. 6), communications labeled by a third party, or any other suitable set of communications. In some embodiments, the customer data, ML models, and/or thresholds are different for each customer, as a result of feeding unique customer behavioral characteristics into the ML models to generate a customized analysis. In some embodiments, the training data ingested by a model comprises a labeled data set of "bad" emails received or generated by one or more components of the threat detection platform 302. In some embodiments, the labeled data set of bad emails comprises manually labeled emails. With manual labeling from, for example, a customer administrator, network operator, employee, or security service representative, a sizable corpus of malicious emails can be assembled for a customer and used to train an ML model specific to that customer. In some embodiments, the training data includes received, collected, or inferred customer data and/or email usage data. In some embodiments, the training data may include historical threats previously identified in customer inboxes. In some embodiments, different ML models are developed for different known attack types. In some embodiments, emails are scored, weighted, or given percentages or values by these ML models. In some embodiments, if an email's score exceeds the threshold of any ML model, the email may be flagged, unless heuristics or other elements of the threat detection platform 302 indicate that it should not be flagged.

In some embodiments, the training data used to train an ML model may include manual input received from a customer. Organizations typically have a phishing mailbox through which employees can report emails to the security team, or through which the security team can automatically or manually reroute messages that meet certain conditions. The training data may include emails placed in these phishing mailboxes, treated as malicious emails. In some embodiments, the manual input may include end-user actions that can be fed into the ML model. For example, if an individual moves an email that the ML model could not definitively decide whether to discard, that user action may be included as training data to teach the ML model what action to take in similar contexts.

In different embodiments, examples of potential attack types that the ML models may be trained to detect include, but are not limited to, vendor spoofing and extortion attacks.

In some embodiments, a plurality of heuristics is used as an alternative to, or in conjunction with, the ML models to detect threats, train the ML models, infer behavioral characteristics or content normality of the customer based on customer data, select potential attack types relevant to the customer, or perform other threat detection tasks. In some embodiments, training one or more aspects of the ML models includes feeding a plurality of heuristics as training data into the one or more ML models. In some embodiments, the heuristics are used with a rules engine that decides which heuristics to apply under different circumstances. In some embodiments, the rules engine determines whether to apply machine learning or heuristics to a particular threat detection task. In some embodiments, the one or more rules may include a blacklist and/or a whitelist for certain email conditions.
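
One plausible (assumed) wiring of such a rules engine, short-circuiting on blacklists/whitelists and otherwise choosing between a heuristic and an ML score, is sketched below:

```python
# Sketch of a rules engine combining lists, heuristics, and an ML score.
def evaluate(domain: str, secondary: dict, ml_score: float,
             blacklist: set, whitelist: set) -> str:
    if domain in blacklist:
        return "bad"            # rule wins outright
    if domain in whitelist:
        return "non-malicious"  # rule wins outright
    # Heuristic for a well-understood pattern; otherwise defer to the model.
    if secondary["reply_to_mismatch"] and not secondary["domain_seen_before"]:
        return "suspicious"
    return "suspicious" if ml_score > 0.5 else "non-malicious"
```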

In some embodiments, system 300 may include any level of granularity with respect to the analysis modules 312, ML models, heuristics, rules, and/or human-labeled input. In some embodiments, "normal" and "abnormal" behavioral characteristics and content normality may be determined on a per-employee, per-pair, per-department, per-company, and/or per-industry basis.

In some embodiments, the ML models may optionally be refined in a number of ways during operation. In some embodiments, the monitoring module 308 monitors the customer's phishing mailbox to locate false negatives (i.e., emails that the ML model missed and that employees subsequently reported to the security team). In some embodiments, the customer may override a remediation decision made by the heuristics and/or the ML model, and the ML model may incorporate that feedback in response. In some embodiments, if a customer marks a particular feature of an email (e.g., sender email, display name, authentication status, etc.) as suspicious, this may be fed back into the ML model. In some embodiments, such feedback is weighted in the ML model based on the status or reputation of the individual providing it. For example, the ML model may trust a level-three employee's judgment on email more than a level-one employee's, and weight the former's feedback more heavily.

In some embodiments, different types of ML models may be used, including but not limited to gradient-boosted decision trees, logistic regression, linear regression, and the like. In some embodiments, an ML model is replaced with a purely rule-based engine.

FIG. 4 depicts a flow diagram of a process 400 for detecting email-based threats by monitoring incoming emails (step 404), determining email attributes (step 405), detecting attacks based on the determined attributes (step 406), and optionally performing remediation steps (step 408). In one example, process 400 may include collecting email usage data (step 401), generating an entity description based on the email usage data (step 402), monitoring incoming emails (step 404), determining deviations in the incoming emails, feeding the measured deviations into an ML model, flagging an email as a possible attack (step 407), and performing remediation steps (step 408). Process 400 may optionally include training the ML model to detect an email attack type (step 403).

Process 400 serves to provide email-based threat detection based on a generated customer description that models normal customer behavior and normal email content; deviations from these normal behavioral characteristics and from normal content are then fed as input into ML models trained to detect malicious email.

In some embodiments, process 400 is handled by a network-based platform (e.g., threat detection platform 302 of FIG. 3) operating on a computer server or distributed computing system. Additionally or alternatively, process 400 may be performed on any suitable computing device capable of ingesting, processing, and/or analyzing customer data and email usage data, executing ML techniques, and/or performing remediation actions.

Process 400 may be performed in parallel or in series with the delivery of email to an email inbox. In one variation, process 400 is performed in parallel with delivery of the email to the recipient's inbox, where the email is retroactively removed from the inbox in response to a determination that it is an attack (and/or has a high likelihood of being an attack). In a second variation, process 400 is performed inline with email delivery, where the email is delivered to the recipient's inbox only in response to a determination that it is not an attack. However, process 400 may be integrated into an email delivery paradigm in other ways. The method can analyze: all incoming emails; only emails marked as non-attacks by a prior security system; only emails marked as attacks by a prior security system; or any other suitable collection of emails.

As described above, monitoring of incoming emails (step 404) is preferably performed using a monitoring module (e.g., monitoring module 308 of FIG. 3), although emails may be ingested in other manners.

As described above, the email attributes are preferably determined by the extractor, but may be determined in other ways. In one example, the method includes: extracting primary attributes from an incoming email (e.g., using one or more dedicated primary attribute extractors executing in parallel), and determining secondary attributes of the email from the primary attributes and the customer data (e.g., using one or more dedicated secondary attribute extractors executing in parallel).

As mentioned above, attacks are preferably detected using one or more analysis modules, but may be detected in other ways. In one variation, the determined attributes (e.g., primary attributes or secondary attributes) may be fed into one or more analysis modules (e.g., executed in parallel or in series). In some embodiments, each analysis module is specific to an attack type, and the multiple outputs from the multiple analysis modules are further analyzed (e.g., by a primary detector) to determine whether the email is an attack. In other embodiments, an analysis module detects multiple attack types (e.g., outputs multiple values, each corresponding to a different attack type, where each output may be a likelihood and/or confidence for the corresponding attack type), and the email may be flagged as an attack when an output value exceeds the predetermined threshold for the corresponding attack type. However, attacks may be detected in other ways.

Step 408 optionally includes performing remediation steps, as described above with respect to remediation engine 314 of FIG. 3, although the email may be otherwise remediated.

Step 401 includes collecting or receiving email usage data, as described above with respect to description generator 304 of FIG. 3.

Step 402 includes generating a customer description based on the email usage data, as described above with respect to description generator 304 of FIG. 3.

Step 403 includes training the ML model to detect an email attack type, as described above with respect to training module 306 of FIG. 3.

Step 405 includes measuring the deviations in the incoming email, as described above with respect to analysis module 312 of FIG. 3.

Step 406 includes inputting the measured deviations into the ML model, as described above with respect to analysis module 312 of FIG. 3.

Step 407 optionally includes marking the email as a possible attack, as described above with respect to analysis module 312 of FIG. 3.

Integrated approach to detecting security threats

As described above, the conventional email filtering service is not suitable for identifying complex malicious emails, and thus may cause the complex malicious emails to erroneously reach the inbox of the employee. Described herein are threat detection platforms designed to detect security threats to an enterprise in an integrated approach.

Unlike traditional email filtering services, the threat detection platform may be fully integrated into the enterprise environment. For example, the threat detection platform may receive input indicating approval by an individual (e.g., an administrator associated with the enterprise, or an administrator of the email service employed by the enterprise) to access emails, an active directory, mail groups, identity information, security events, risk events, documents, and/or the like. The approval may be given through an interface generated by the threat detection platform. For example, an individual may access an interface generated by the threat detection platform and then grant access to these resources as part of a registration process.

Upon receiving the input, the threat detection platform may establish a connection with one or more storage media that include the resources via one or more Application Programming Interfaces (APIs). For example, the threat detection platform may establish a connection via the API with a computer server managed by the enterprise or some other entity on behalf of the enterprise.

The threat detection platform may then download resources from the one or more storage media to build the ML models used to identify email-based security threats. Thus, the threat detection platform can build ML models based on past information so as to better identify security threats in real time as emails are received. For example, the threat detection platform may ingest incoming and/or outgoing emails corresponding to the past six months, and then build an ML model that understands the normalcy of communications with internal contacts (e.g., other employees) and/or external contacts (e.g., suppliers) of the enterprise. Thus, real threats, rather than artificial ones, can be used to determine whether an incoming email represents a security threat.

This approach allows the threat detection platform to employ an effective ML model almost immediately upon receiving approval from the enterprise to deploy it. Most standard integration solutions, such as anti-spam filters, only have access going forward in time (i.e., from the moment approval is received). Here, however, the threat detection platform can employ a backward-looking approach to develop a personalized ML model that is effective immediately. Moreover, this approach enables the threat detection platform to identify security threats already residing in employee inboxes by examining the repository of past emails.

The API-based approach described above provides a consistent, standard way to view all email handled by the enterprise (or by another entity, such as an email service, on behalf of the enterprise). This includes internal-to-internal email, which is not visible to standard integration solutions. For example, a SEG integration, which operates through mail exchanger (MX) records, can only see incoming email arriving from external sources. The only way to make email from an internal source visible to a SEG integration is to reroute it through the gateway as if it came from outside.

The threat detection platform may design the ML models to classify emails determined to be potential security threats into a plurality of categories. FIG. 7 depicts how the vast majority of incoming messages are classified as non-malicious, while a small portion are classified as malicious. For example, here, nearly 99.99% of incoming messages have been classified as non-malicious and are therefore immediately forwarded to the appropriate inboxes. However, the threat detection platform has found three types of security threats: (1) email account compromise (EAC) attacks; (2) advanced attacks; and (3) spam attacks. In some embodiments, the threat detection platform employs a single ML model capable of classifying these different types of security threats. In other embodiments, the threat detection platform employs multiple ML models, each capable of classifying a different type of security threat.

FIG. 8A includes a high-level schematic diagram of the detection architecture of a threat detection platform, according to some embodiments. Initially, the threat detection platform determines that an event has occurred or is occurring. One example of an event is the receipt of an incoming email. As described above, the threat detection platform may be programmatically integrated with the email service employed by the enterprise such that all external emails (e.g., those received from and/or transmitted to external email addresses) and/or all internal emails (e.g., those sent from one employee to another) are routed through the threat detection platform for inspection.

The threat detection platform then executes an entity resolution procedure to identify the entities involved in the event. Typically, the entity resolution procedure is a multi-step process.

First, the threat detection platform obtains information about the event. For example, if the event is the receipt of an incoming email, the threat detection platform may examine the incoming email to identify the source, sender identity, sender email address, recipient identity, recipient email address, subject, headers, body content, and the like. The threat detection platform may also determine whether the incoming email includes any links, attachments, etc. FIG. 9 depicts an example of an incoming email that may be examined by the threat detection platform.

Second, the threat detection platform resolves the entities involved in the event by examining the acquired information. FIG. 10A depicts how information gleaned from an incoming email is used to determine the different entities (also referred to as "features" or "attributes" of the incoming email). Some information may correspond directly to an entity. Here, for example, the identity of the sender (or purported sender) may be determined based on the source or sender name. Other information may correspond indirectly to an entity. Here, for example, the identity of the sender (or purported sender) may be determined by applying a natural language processing (NLP) algorithm and/or a computer vision (CV) algorithm to the subject, body content, etc. Thus, entities may be determined based on the incoming email itself, information derived from the incoming email, and/or metadata accompanying the incoming email. FIG. 10B depicts an example of how the threat detection platform performs the entity resolution procedure to determine the identity of the sender of an incoming email. Here, the threat detection platform has established the sender's identity based on (1) the sender name ("Bob Roberts") derived from the incoming email and (2) the subject as processed by an NLP algorithm.

In some embodiments, the threat detection platform uses human-curated content to augment the acquired information. For example, characteristics of an entity may be extracted from human-curated data sets of well-known brands, domains, and the like. These human-curated data sets can be used to augment the information gleaned from the enterprise's own data sets. Additionally or alternatively, in some cases, a person may be responsible for labeling entities. For example, a person may be responsible for labeling the landing pages and/or Uniform Resource Locators (URLs) of links found in incoming emails. Human participation may be useful when quality control is a priority, when complete labeling of evaluation metrics is required, and the like. For example, a person may actively select which data/entities should be used to train the ML models employed by the threat detection platform.

The threat detection platform may then build, compile, and/or calculate corpus statistics for the entities determined to be involved in the event. These corpus statistics may be stored and/or visualized in the form of an entity risk graph. As shown in FIG. 11, an entity risk graph may contain historical combinations of these entities and the risk scores associated with those historical combinations. Thus, the entity risk graph represents one way of visualizing the types of corpus statistics that the threat detection platform has built, compiled, and/or computed. Each node in the entity risk graph corresponds to a real-world entity, IP address, browser, etc. Thus, the entity risk graph may include a risk score for a domain detected in an incoming email, a risk score for an IP address detected in the metadata accompanying the incoming email, a risk score for a sender ("Employee A") communicating with a recipient ("Employee B"), and so on. Meanwhile, each connection between a pair of nodes represents the risk determined in past events involving those nodes. FIG. 12 depicts an example of an entity risk graph.
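
A minimal sketch of the corpus statistics behind such a graph, with nodes as entities and edges carrying the risk observed in past events, might be as follows (illustrative, not the platform's actual data model):

```python
# Sketch of an entity risk graph backed by simple corpus statistics.
from collections import defaultdict
from itertools import combinations

class EntityRiskGraph:
    def __init__(self):
        # (entity_a, entity_b) -> counts of events and flagged events
        self.edges = defaultdict(lambda: {"events": 0, "flagged": 0})

    def record_event(self, entities: list, flagged: bool) -> None:
        # e.g., entities = ["sender:Employee A", "recipient:Employee B",
        #                   "ip:203.0.113.7"]
        for a, b in combinations(sorted(entities), 2):
            edge = self.edges[(a, b)]
            edge["events"] += 1
            edge["flagged"] += int(flagged)

    def risk(self, a: str, b: str) -> float:
        edge = self.edges.get(tuple(sorted((a, b))))
        if not edge or edge["events"] == 0:
            return 0.5  # no history: neither trusted nor risky
        return edge["flagged"] / edge["events"]
```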

FIG. 8B includes a more detailed example of a process by which the threat detection platform may acquire data related to past emails (here, from Microsoft Office 365®), extract primary attributes from the past emails, generate corpus statistics based on the primary attributes, derive secondary attributes based on the primary attributes and the corpus statistics, train ML models with the primary and/or secondary attributes, and then use the ML models to score incoming emails according to the risk they pose to the enterprise.

FIG. 13 provides an example matrix of the stages that may be performed by the threat detection platform in processing data, extracting features, determining whether an event represents an attack, and the like. During the first stage, the threat detection platform may download a variety of data related to the enterprise's communication activities. For example, the threat detection platform may establish a connection via an API to a storage medium that includes data related to past communication activities involving the enterprise's employees. The storage medium may be, for example, an email server that holds past emails sent/received by employees of the enterprise. Thus, the threat detection platform may download a variety of data into a local processing environment, such as raw emails, raw attachments, raw directory listings (e.g., the enterprise's Microsoft® directory), raw mail filters, raw risk events, etc.

In the second stage, the threat detection platform may extract text, metadata, and/or signals (collectively, "extracted items") from the downloaded data. For example, the threat detection platform may use learned model parameters for text extraction to identify attachment signals in an email. The term "extracted signal," as used herein, refers to any information, raw or derived, that is used as input by the algorithms employed by the threat detection platform. Examples of extracted signals include, but are not limited to, structured data such as IP addresses, third-party data or data sets, API-based integration information used by any third-party tool, or other enterprise data or data sets. The extracted items may be saved in a columnar format, where each column is updated separately. As shown in FIG. 14A, each column may be associated with one of three different cases: (1) an extractor (e.g., authentication extraction); (2) a model application (e.g., extracting spam-text model predictions); and (3) rules (e.g., extracting specific phrases defined via a rule interface). FIGS. 14B-C depict examples of data structures that may be populated by the threat detection platform using the extracted items.

In the third stage, the threat detection platform may identify the entities involved in the communication activity. For example, if the communication activity is the receipt of an incoming email, the threat detection platform may identify the sender identity, sender email address, or subject based on the text, metadata, and/or signals extracted during the second stage. As noted above, in some cases humans may assist with entity resolution. Thus, the third stage may be performed partly by one or more persons and partly by the threat detection platform, or entirely by the threat detection platform.

In the fourth stage, the threat detection platform may generate summaries for the entities identified in the third stage (also referred to as "attributes" of the email) based on past communication activities involving those entities. That is, the threat detection platform may generate corpus statistics representing the risk scores associated with historical combinations of the entities identified in the third stage. These corpus statistics may be stored and/or visualized in the form of an entity risk graph, as shown in FIG. 12. Additionally or alternatively, these corpus statistics may be stored in one or more databases. FIG. 14D depicts an example of a database including all corpus statistics, and FIG. 14E depicts an example of a database including corpus statistics related to a sender.

In the fifth stage, the threat detection platform may generate scores representing the risk to the enterprise. A score may be generated per communication, per attack type, or per entity. Thus, the threat detection platform may score each incoming email addressed to an enterprise employee to determine which incoming emails, if any, should be blocked from reaching the employee's inbox. Generally, incoming emails are scored for threat, while account compromise is scored based on the number/type of malicious emails received. For example, the threat detection platform may include a threat detection engine and an account compromise engine that each consider incoming emails. The output produced by each engine (e.g., a score, a suspicion level, etc.) may be used by the other engine for better detection. For example, if the account compromise engine determines that an account falls within a suspicious range, the threat detection engine may monitor all emails from that account more sensitively. This may prevent an unauthorized entity (also referred to as an "attacker") from taking over an account and then using the account to launch phishing attacks. The scoring of communication activity is discussed further below with respect to FIG. 21.

In some embodiments, the threat detection platform also "supplements" the entities identified in the third stage. As used herein, the term "supplement" refers to the act of appending additional signals to a communication (e.g., an email). These additional signals may be defined in three places: (1) code-defined extractors (e.g., secondary attributes); (2) model applications (e.g., a URL extraction model, an extortion model, an employee impersonation model); and (3) rules (e.g., a specific whitelist or blacklist). As shown in FIG. 14F, the email may be augmented via a directed acyclic graph (DAG) using databases, rules, and/or models to produce a final set of signals to be used for detection. FIG. 14G shows an example of a supplemented email (i.e., an email having primary, secondary, and/or scored attributes).

In the sixth stage, the threat detection platform may compare each score to a threshold to determine how the email should be classified. For example, the threat detection platform may determine whether to classify the email as borderline, suspicious, or bad. Real-time data and/or log replays may be used to control the thresholds that determine how each email is classified, so as to achieve an acceptable number of flagged messages. In some embodiments, the thresholds are continuously or periodically updated to maintain a target flagging rate. For example, the threat detection platform may adjust the thresholds such that a predetermined percentage (e.g., 0.1%, 0.5%, or 1.0%) of all incoming emails are flagged as borderline, suspicious, or bad. The threshold for a given model may be calibrated based on an internal target for the number of false positives and/or false negatives generated by that model. In general, raising the threshold yields fewer false positives at the cost of more false negatives, while lowering the threshold yields fewer false negatives at the cost of more false positives. FIG. 14H illustrates how each rule/model employed by the threat detection platform returns a score that may be gated by a threshold. These rules/models may correspond to a subset of the entities extracted in the second stage.
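
As a small illustration of maintaining a target flagging rate, the threshold can be recalibrated as a quantile of recently observed scores (a sketch, with numpy as an assumed dependency):

```python
# Sketch: pick the score threshold so that roughly `target_flag_rate`
# of recent traffic is flagged (e.g., 0.001 for 0.1%).
import numpy as np

def calibrate_threshold(recent_scores: np.ndarray,
                        target_flag_rate: float = 0.001) -> float:
    # Raising the threshold trades false positives for false negatives,
    # and vice versa.
    return float(np.quantile(recent_scores, 1.0 - target_flag_rate))
```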

Threat intelligence

A customer may want to receive threat intelligence regarding the attacks that the threat detection platform has discovered. Because the threat detection platform can monitor incoming email in real time, it can generate unique threat intelligence and is likely to detect abnormal communication activity faster than traditional email filtering services.

The threat detection platform may be designed to act as a centralized system that captures indicators of compromise (IOCs) collected from a variety of sources, including internal sources (e.g., enterprise databases) and external sources. Examples of IOCs include IP addresses, email addresses, URLs, domains, email attachments, cryptocurrency (e.g., Bitcoin) addresses, and the like. The IOC database may be used for a number of different purposes. While the paramount goal is to detect incoming emails representing security threats, the database may also be provided to the enterprise for ingestion by other security products, such as firewalls, security orchestration, automation and response (SOAR) tools, and the like. For example, an enterprise may find it useful to provide management tools (e.g., gateways) with IOCs deemed malicious, to help protect employees from future threats, unwise choices, and so on. As another example, an enterprise may surface the employee accounts associated with an IOC for further review (e.g., to determine whether those employee accounts have been compromised). Additionally or alternatively, the threat management platform may be programmed to infer a threat condition for each IOC. For example, the threat management platform may classify each IOC as representing phishing, malware, or account compromise.

Many businesses may consider it sufficient to review the malicious email campaigns, and the implicated employee accounts, surfaced by the threat detection platform. However, some enterprises have begun monitoring IOCs in order to handle security threats in real time. For example, an enterprise may monitor the IOCs that the threat detection platform collects from incoming emails to determine appropriate reactive and/or proactive measures that prevent these IOCs from re-entering its environment in the future.

At a high level, threat detection platforms may be designed to perform a variety of tasks, including:

Ingest threat intelligence from different types of sources, such as:

IOCs inferred from statistics about previously seen attacks (e.g., the number of good or bad emails sent from the same source IP address);

IOCs based on detected attacks (e.g., compromised domains and phishing links); and

IOCs supplied by an internal security analyst appointed by the enterprise; and

Export threat intelligence (e.g., as a database used inline when checking incoming emails, or as a feed to be ingested by other security products).

embodiments of the threat detection platform may also be designed to allow IOC to be enabled/disabled on a per-enterprise basis. For example, an enterprise may upload a list of IOCs that are specifically used when examining its email to a threat detection platform. Moreover, the threat detection platform may annotate IOCs with a likelihood that those IOCs that may be malicious are supported. Thus, the threat detection platform may be designed to flag those emails that are determined to be malicious, as well as those that are likely to be malicious. In some embodiments, the threat detection platform can set time limits on each IOC to avoid permanent blacklisting. For example, if a given website is found to be carrying a phishing website, the threat detection platform may capture the given website as an IOC for a specified period of time, after which it checks whether the given website is still carrying a phishing website.

FIG. 15A includes a high-level system diagram of a threat intelligence system of which the threat detection platform is a part. As shown in FIG. 15A, IOCs may be generated/identified by a number of different sources. These sources include incoming emails, URLs, domains, external feeds (e.g., from another security product), internal security analysts, and so on.

The threat detection platform may overlay the IOCs with discovered attacks (e.g., found by examining incoming emails). That is, the threat detection platform may attempt to match IOCs with corresponding attacks, so that the score computed for each attack can be attributed to the appropriate IOC. Thereafter, the threat detection platform may filter the IOCs (e.g., based on the scores attributed to them) and then use the filtered IOCs (and corresponding scores) to further improve its ability to detect security threats.

In some embodiments, the threat detection platform may leverage its ecosystem of multiple enterprises to provide federated capabilities. For example, the threat detection platform may build a central vendor database across its environment to determine a list of vendors and learn what constitutes normal behavior for each vendor. For example, the central vendor database may specify the email endpoints used by each vendor, the accountants responsible for sending invoices for each vendor, the invoicing software used by each vendor, the routing/bank account numbers of each vendor, the origination location of each vendor's invoices, and the like. As another example, the threat detection platform may build a central threat database across its environment to determine a list of the most notable entities (e.g., IP addresses, URLs, domains, email addresses) involved in sending attacks. A central threat database can be helpful because it allows the threat detection platform to apply knowledge harvested from one enterprise across the entire ecosystem. As another example, the threat detection platform may automatically monitor an inbox to which employees are instructed to forward suspicious emails. When the threat detection platform discovers a malicious email that was missed by its ML models, the threat detection platform may automatically pull the malicious email from all other inboxes in the enterprise in which it is discovered. In addition, the threat detection platform may use its federated ecosystem to pull malicious emails from the inboxes of other enterprises.

Generally, the threat detection platform is designed such that data sets can be computed, tracked, and added to a modeling pipeline in which ML models are developed, trained, and so on. Each data set can be easily regenerated, updated, and searched/viewed. As described above, a data set may be compiled through an interface generated by the threat detection platform. For example, to train an ML model, a person may label different elements included in the data set. Examples of databases accessible to the threat detection platform include:

A supplier database, which includes a set of common suppliers from which the enterprise receives email (e.g., well-known companies such as Lloyd's). In the supplier database, each supplier may be associated with a canonical name, a list of safe domains (e.g., domains from which the supplier sends email, domains that receive email, domains that otherwise work with the supplier), a list of aliases, a list of regular expressions (e.g., "employees served via a third party"), or other suitable signifiers. The threat detection platform may use the supplier database to whitelist known good/safe domains from which a supplier sends email, or to perform other types of email scoring or analysis.

A domain database, which includes a set of top-level domains. For each domain, the threat detection platform may track additional data. For example, the threat detection platform may determine whether the domain has been whitelisted as a safe domain, whether the domain corresponds to a hosting service, and whether the domain is a redirector. Moreover, the threat detection platform may determine the verdict (if any) returned by Google's Safe Browsing API for the domain, the frequency with which the domain appears in emails received by the enterprise, how much labeled data is available, which cached Whois data is available for the domain, and so forth.

A Whois registrant database, which includes information about each registrant derived from the Whois data stored in the domain database.

A URL database, comprising URL-level data derived from links included in emails received by the enterprise. For each URL, the threat detection platform may populate the entry using data obtained via Google's Safe Browsing API, or statistical information about how often the URL is seen by the enterprise.

An employee database, which includes information about the enterprise's employees. Generally, the threat detection platform maintains a separate employee database for each enterprise whose email is monitored. For each employee, the threat detection platform may populate the entry using: an enterprise identifier, name, employee identifier, aliases, known email addresses (e.g., verified business email addresses and personal email addresses), Lightweight Directory Access Protocol (LDAP) role, and the number of attacks observed against the employee.

A label database (also referred to as a "feedback database"), which includes labeled data used to build aggregate feedback for each enterprise, employee, and so forth. Entries may include aggregated feedback for email addresses, domains, links, normalized/hashed bodies, and the like. For example, an entry in the label database may specify that 15 of the 30 labels for emails from "A@explicit.com" have been marked as attack-positive, or that 10 of the 11 labels for emails containing a link to "http://xyz.com" have been marked as attack-positive.

As described above, the enterprise may monitor IOCs collected from incoming emails by the threat detection platform to identify appropriate responsive and/or proactive measures that prevent these IOCs from re-entering its environment in the future. By quickly surfacing IOCs, the threat detection platform can help the enterprise improve its security posture against such threats. FIG. 15B depicts an example of an interface through which an enterprise may examine IOCs discovered by the threat detection platform.

In some embodiments, the threat detection platform provides the ability to extract and/or export IOCs. For example, through the interface shown in FIG. 15B, an enterprise may export information related to these IOCs (also referred to as "threat intelligence") into an administrative tool to improve its ability to detect/handle these security threats in the future. The threat detection platform may format the information (e.g., into a machine-readable form) so that it is readable and shareable. For example, the information may be formatted according to the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII) specifications. In general, STIX indicates what type of threat intelligence is being formatted, while TAXII defines how the underlying information is relayed.

A schema may be employed to ensure that threat intelligence is interpreted in a consistent manner. For a given IOC, the schema may specify the following (a minimal sketch of such a record appears after this list):

the observable (e.g., the actual URL, IP address, domain, or account);

classification (e.g., whether the IOC is private or public);

type (e.g., whether the IOC is a URL, IP address, domain, or account);

severity (e.g., whether the IOC poses a low, medium, high, or very high threat);

confidence measures (e.g., scores on a scale of 0-100, representing the confidence that the IOC represents a security threat);

the time of observation; and/or

Traffic Light Protocol (TLP) metrics, indicating how widely the underlying information should be shared. As shown in FIG. 15B, some of this data may be presented on an interface for viewing by the enterprise. For example, the interface may allow the enterprise to rank IOCs by severity level so that those representing the greatest threats can be handled first.
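Below is a minimal Python sketch of a threat-intelligence record following the schema above, together with the kind of severity-ranked view the interface might expose. The field names and the severity ordering are illustrative assumptions.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ThreatIntelRecord:
        observable: str      # the actual URL, IP address, domain, or account
        classification: str  # "private" or "public"
        ioc_type: str        # "url", "ip", "domain", or "account"
        severity: str        # "low", "medium", "high", or "very_high"
        confidence: int      # 0-100 confidence that the IOC represents a threat
        observed_at: str     # time of observation (ISO-8601)
        tlp: str             # Traffic Light Protocol marking, e.g., "amber"

    SEVERITY_ORDER = {"very_high": 3, "high": 2, "medium": 1, "low": 0}

    def rank_by_severity(records: List[ThreatIntelRecord]) -> List[ThreatIntelRecord]:
        """Order IOCs so those posing the greatest threat can be handled first."""
        return sorted(records, key=lambda r: SEVERITY_ORDER[r.severity], reverse=True)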

FIG. 16 illustrates how a threat detection platform derives/infers attributes from data obtained from multiple sources, provides the attributes as inputs to ML models, and then examines the outputs produced by the ML models to determine whether a security threat is present. As shown in FIG. 16, these attributes may be provided as input to a variety of ML models associated with different types of attacks. For example, features related to the body (e.g., HTML, signature, phone number) of an incoming email can be fed into ML models designed to detect internal employee EAC attacks, system EAC attacks, external EAC attacks, employee impersonation attacks, vendor impersonation attacks, and partner impersonation attacks.

FIG. 17 includes a high-level architectural depiction of a threat detection platform capable of generating/updating, via batch execution, data used in processing incoming emails in real time. Batch processing may be particularly helpful in facilitating real-time processing, further enhancing the threat detection capabilities of the threat detection platform. This concept, which may be referred to as near real-time scoring, may be used for computationally intensive detection tasks, such as processing attachments to incoming emails.

Threat intelligence may represent a core backbone of a long-term strategy for handling email-based security threats. For example, an enterprise may employ a threat detection platform to better understand the threats to its security in a number of ways. First, the threat detection platform may examine corpus statistics to detect instances of employee account compromise (EAC); for example, given a series of login activities and email activities, how often good events and/or bad events are detected for a particular attribute (e.g., IP address, sender email address, sender location). Second, the threat detection platform may examine corpus statistics to determine what constitutes normal/abnormal communication activity based on the attributes of emails associated with the enterprise. Third, the threat detection platform may generate a set of "bad entities" or "malicious entities" that the enterprise may programmatically access to trigger actions in its environment. For example, the enterprise may configure its firewall based on the set of bad entities. Examples of entities include employees, brands, vendors, domains, locations, and the like. Fourth, the threat detection platform may generate and/or react to signatures deemed malicious in near real time (e.g., within minutes) of obtaining the necessary data. Fifth, given an attribute of a risk event, the threat detection platform may identify past risk events that contain this attribute. By analyzing these past risk events, the threat detection platform may better understand whether the attribute is associated with risk events ultimately determined to be safe or malicious. A particular module (also referred to as a "graph browser") may be responsible for visually displaying how these past risk events affect the risk determination.

At a high level, a threat detection platform may be described as analyzing risk events (or simply "events") to discover threats to an enterprise. An example of a risk event is the receipt or transmission of an email. Another example of a risk event is a login activity or some other communication with a cloud-based mail provider. Another example of a risk event is the creation of a mail filter. The maliciousness of a given risk event may be related to the maliciousness of the entities associated with the given risk event. For example, a mail filter corresponds to an enterprise employee, an email with an invoice is received from a supplier, and so on. All of these entities are connected to each other through various connections (e.g., the sender of an email may work for the supplier, and an employee may send emails to other employees of the enterprise). The term "signature," as used herein, refers to a combination of one or more attributes that categorize a risk event. Signatures may serve as keys for counting risk events having specific combinations of attributes.

Fig. 18A includes a high-level schematic diagram of a process by which a threat detection platform may compile threat intelligence. As shown in fig. 18A, data may be obtained from several different inputs (also referred to as "sources"). Here, the configuration data includes definitions of the risk-event attributes to be tracked by the threat detection platform. For example, the configuration data may include instructions/algorithms that prompt the threat detection platform to "listen" for risk events associated with a given display name and a given sender email address. Domain-specific raw data (e.g., incoming emails with attributes) may also be obtained by the threat detection platform. In some embodiments, users are allowed to provide functions that extract/map risk events to their attributes.

The event ingester module (or simply "event ingester") may be responsible for converting raw data into an internal schema for risk events. The schema may be designed to hold any risk event regardless of type (e.g., email, login activity, mail filter). The statistics builder module (or simply "statistics builder") may be responsible for mapping signatures, over attribute dimensions and date ranges, to counts of risk events.
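The following Python sketch shows one way the event ingester and statistics builder could fit together: raw data is normalized into a type-agnostic risk-event schema, and counts are keyed by (signature, date). All field names here are assumptions made for illustration, not the platform's actual schema.

    from collections import defaultdict
    from datetime import date
    from typing import Mapping, Tuple

    def ingest_event(raw: Mapping, tracked_attributes: Tuple[str, ...]) -> dict:
        """Convert raw data (an email, login, or mail-filter event) into the
        internal risk-event schema; field names are illustrative only."""
        return {
            "event_type": raw.get("type", "email"),
            "occurred_on": raw.get("date", date.today().isoformat()),
            "attributes": {a: raw.get(a) for a in tracked_attributes},
        }

    # (signature, date) -> count of risk events
    stats: dict = defaultdict(int)

    def build_stats(event: dict, dimensions: Tuple[str, ...]) -> None:
        """Map a signature over the chosen attribute dimensions to event counts,
        keyed by date so corpus statistics stay date-partitioned."""
        signature = tuple(event["attributes"].get(d) for d in dimensions)
        stats[(signature, event["occurred_on"])] += 1

    # Usage, mirroring the configuration example above (display name + sender):
    event = ingest_event(
        {"type": "email", "display_name": "CEO", "sender": "a@b.com"},
        ("display_name", "sender"),
    )
    build_stats(event, ("display_name", "sender"))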

FIG. 18B includes a high-level diagram of a process by which a threat detection platform may "produce" signatures for use in determining threats posed by incoming emails. Initially, a real-time scoring module (also referred to as a "RT scorer") may process raw data related to incoming emails. The processed data associated with each incoming email may be passed to a counting service (also referred to as a "counting system") that converts the processed data into processed risk events.

In addition, each incoming email labeled via the front end (e.g., via an interface generated by the threat detection platform) may be passed to the counting service, which converts the labeled email into a processed risk event. These labels may indicate whether the incoming email represents a security threat. Thus, processed risk events derived from labeled emails can be associated with a security risk metric.

The processed risk events created by the counting service may be stored in a database (e.g., a Redis distributed database). This database can be queried for signatures. For example, a query may be submitted for a whitelist of signatures determined not to represent security threats. As another example, a query may be submitted for a count of signatures having a given attribute or combination of attributes.
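A sketch of such a signature store using the redis-py client is shown below; the key names ("signature_counts", "signature_whitelist") and the JSON encoding of signatures are assumptions, not details from the document.

    import json
    import redis  # assumes the redis-py client is installed

    r = redis.Redis()  # connection parameters omitted for brevity

    def record_signature(signature: dict, is_threat: bool) -> None:
        """Store a processed risk event's signature; key layout is illustrative."""
        key = json.dumps(signature, sort_keys=True)
        r.hincrby("signature_counts", key, 1)
        if not is_threat:
            r.sadd("signature_whitelist", key)

    def signature_count(signature: dict) -> int:
        """Count of risk events having this combination of attributes."""
        value = r.hget("signature_counts", json.dumps(signature, sort_keys=True))
        return int(value) if value else 0

    def is_whitelisted(signature: dict) -> bool:
        """Whether this signature was determined not to represent a threat."""
        return bool(r.sismember("signature_whitelist",
                                json.dumps(signature, sort_keys=True)))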

Real-time signatures and corpus statistics

As described above, embodiments of the threat detection platform may be designed to detect security threats by examining behavior, identity, and content rather than metadata, links, domains, signatures, and the like. Even so, it may be advantageous to consider such information in order to detect security threats more accurately, consistently, and efficiently (in terms of both time and resources).

Several different components of the threat detection platform may extract values from this information. Examples of such components include:

a database that ingests signatures periodically and then uses those signatures to detect attacks;

ML models designed to ingest signatures periodically and then exploit them in a probabilistic way to detect attacks;

algorithms that can aggregate activities historically considered safe or normal to be provided as input to the ML model;

a trawling module (also referred to as a "trawler") capable of creating new signatures by examining raw data from the past;

an ML model designed to infer general rules for detecting URL-based attacks by examining past emails with insecure URLs; and

an ML model designed to periodically check the signatures (or the raw data from which those signatures were derived) to detect changes in the communication pattern (e.g., determined based on the subject line, sender address, etc.).

For example, suppose an email is received from a previously unseen address (attacker@badsite1.com) and contains an attachment with a link to an unknown website (badsite2.net). The threat detection platform may separately and immediately identify all emails received from that address as potentially representing security threats, and all emails including a link to that website as potentially representing security threats. This can be done without requiring a person to review the unknown website.

The keys to achieving this include (1) updating corpus statistics in a timely (i.e., non-batch) manner and (2) having corpus statistics indexed by date. FIG. 19A includes a high-level diagram of a process by which a threat detection platform may index corpus statistics to create a date-partitioned database of signatures and corpus statistics that can be used to identify unsafe entities more efficiently. Such a process allows unsafe entities to be identified via exact matches with signatures residing in the database and via probable matches with signatures as determined by an ML model. Furthermore, this two-pronged approach to identifying unsafe entities allows the threat detection platform to react faster both to attacks involving unsafe and potentially compromised domains and to attacks utilizing safe domains such as dropbox.com.

Fig. 19B depicts an example of a database including signatures and corpus statistics. Partitioning by date may be desirable so that the database can be used to handle a message using only the knowledge available at that point in time. For example, the database may be updated in near real time based on output generated by a real-time scoring module (e.g., the RT scorer of fig. 18B) and/or labels input via an interface (e.g., received by the front end of fig. 18B). As described above, the database may be populated/backfilled based on past emails associated with a given time interval (e.g., 3, 6, 9, or 12 months).

Conceptually, the threat detection platform may organize the data into one or more data structures. For example, in the case of corpus statistics, each enterprise may be given a single table. These tables may have N rows, where N is a relatively fixed integer. For example, the table for corpus statistics may include 270 rows if the threat detection platform is interested in tracking data values for 270 days, 365 rows if the threat detection platform is interested in tracking data values for 365 days, and so on. Similarly, the threat detection platform may assign each enterprise a single table for signatures. However, the number of rows in these tables typically grows as new signatures are found in incoming email.

Employee Account Compromise

FIG. 20 illustrates an example of how a threat detection platform detects employee account compromise (EAC). At a high level, the threat detection platform may learn about the enterprise by identifying its entry points (e.g., virtual private networks (VPNs) and IP addresses), determining which entry points are considered normal, and then employing personalized, enterprise-based learning to detect security threats. Here, for example, the threat detection platform examines raw data (e.g., in the form of mail filters, logins, risk events, and phishing messages) and aggregated data (e.g., in the form of corpus statistics, login corpus statistics, and secondary databases) to discover one or more user compromise signals.

The threat detection platform then employs a plurality of detectors to score the user compromise signals. Each score may represent the degree to which the user compromise signal corresponds to a likelihood that the employee account has been compromised. User compromise signals may be discovered on a per-user basis (e.g., for each employee of the enterprise).

The threat detection platform may detect an EAC instance by comparing user activity to the scored user compromise signals. For example, based on the location and/or frequency of logins, the threat detection platform may discover that a given user's account may have been compromised. However, the threat detection platform does not necessarily need to take immediate action. Instead, the threat detection platform may determine what action, if any, to take based on which user compromise signals represent abnormal behavior, the scores of those signals, and so on. For example, if the relevant user compromise signal has a high score, the threat detection platform may immediately take action to prevent further access to the account, but if the relevant user compromise signal has a low score, the threat detection platform may simply continue to monitor the account.
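The decision logic described above might look something like the following sketch, in which the signal names, scores, and thresholds are all hypothetical.

    def respond_to_compromise_signals(signal_scores: dict,
                                      block_threshold: float = 0.9,
                                      watch_threshold: float = 0.5) -> str:
        """Pick a response from scored user compromise signals.

        `signal_scores` maps signal names (e.g., "unusual_login_location",
        "new_mail_filter") to scores in [0, 1]; the names and thresholds are
        illustrative, not taken from the document.
        """
        top = max(signal_scores.values(), default=0.0)
        if top >= block_threshold:
            return "block_account_access"   # act immediately
        if top >= watch_threshold:
            return "monitor_account"        # keep watching the account
        return "no_action"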

Method for accurately scoring

The term "accurate scoring" encompasses a combination of several concepts discussed further above. FIG. 21 depicts a high level flow chart of a process 2100 for scoring threats posed by incoming emails.

First, the threat detection platform may employ one or more ML models (e.g., deep learning models) to consume the entirety of the features extracted from the primary and secondary attributes of incoming emails to identify potential security threats (step 2101). Collectively, these ML models may be referred to as "ML detectors." In some embodiments, a real-time proportional-integral-derivative (PID) controller is used to adjust the detection threshold for each entity whose email is being monitored, to account for variations in the attack landscape (attack types, email content, etc.). The threshold ensures that the ML detectors achieve high precision and continue to maintain high precision over time. To cover general attack scenarios, the threat detection platform may employ a combination of federated and enterprise-specific ML models able to capture the enterprise-specific nuances of sophisticated attacks (e.g., spear phishing attacks).
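A minimal PID controller for threshold adjustment is sketched below; the gains and the precision-based error signal are illustrative assumptions about how such a controller could be wired up, not the platform's actual tuning.

    class ThresholdPID:
        """Minimal PID controller that nudges a per-enterprise detection
        threshold so that measured precision tracks a target."""

        def __init__(self, target_precision: float,
                     kp: float = 0.5, ki: float = 0.05, kd: float = 0.1):
            self.target = target_precision
            self.kp, self.ki, self.kd = kp, ki, kd  # illustrative gains
            self.integral = 0.0
            self.prev_error = 0.0

        def update(self, threshold: float, measured_precision: float) -> float:
            # Precision below target -> positive error -> raise the threshold.
            error = self.target - measured_precision
            self.integral += error
            derivative = error - self.prev_error
            self.prev_error = error
            adjustment = (self.kp * error
                          + self.ki * self.integral
                          + self.kd * derivative)
            # Keep the threshold within a valid probability range.
            return min(1.0, max(0.0, threshold + adjustment))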

Second, the threat detection platform may collect signatures of IOCs in real time to determine the nature of any security threats identified by the ML detectors (step 2102). Examples of IOCs include IP addresses, email addresses, URLs, domains, and the like. For a zero-hour attack, IOCs can be extracted from the email as the email-based attack is identified by the ML detectors. These IOCs can be automatically ingested into a database in real time as "signatures." Thereafter, the signatures can be used in conjunction with the ML detectors to discover future attacks with the same characteristics.

Third, the threat detection platform may perform deep feature extraction to identify zero-hour attacks (step 2103). Identifying a zero-hour attack requires more intensive content analysis to understand the nuances of a possible attack. For example, deep learning sub-models may be applied to understand the text, content, sentiment, and/or tone of an email. As another example, to find a phishing page, the landing page behind a link in an email may be compared to a set of known landing pages using computer vision. As another example, a web crawl may be performed to extract information about deep links (e.g., links embedded in attachments, or links accessible from a linked website) to discover instances of deep phishing.
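As one possible realization of the landing-page comparison, the sketch below uses perceptual hashing (via the third-party ImageHash package) as a crude stand-in for a full computer-vision model; the function name and distance threshold are assumptions.

    from PIL import Image
    import imagehash  # assumes the ImageHash package; a CV model could be swapped in

    def resembles_known_login_page(screenshot_path: str,
                                   library_hashes,
                                   max_distance: int = 8) -> bool:
        """Compare a screenshot of a landing page against perceptual hashes of
        certified login pages; a small Hamming distance suggests a look-alike
        (potential phishing) page."""
        candidate = imagehash.phash(Image.open(screenshot_path))
        return any(candidate - known <= max_distance for known in library_hashes)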

Threat detection, characterization, and remediation

FIG. 22 depicts a flowchart of a process 2200 for applying a personalized Machine Learning (ML) model to emails received by employees of an enterprise to detect security threats. Initially, the threat detection platform may establish a connection with a storage medium that includes first data related to past emails received by an employee of the enterprise (step 2201). The first data may include the past emails themselves, or information related to the past emails, such as primary or secondary attributes. In some embodiments, the threat detection platform establishes the connection with the storage medium via an Application Programming Interface (API). In such embodiments, the threat detection platform may not establish the connection until input indicating approval to access the first data is received from an administrator associated with the enterprise.

The threat detection platform may download a first portion of the first data into a local processing environment (step 2202). For example, the threat detection platform may download all emails received by the employee during the past 3 months, 6 months, or 12 months. The threat detection platform may then build a personalized ML model for the employee based on the first portion of the first data (step 2203). For example, the threat detection platform may parse each email included in the first data to find one or more attributes, which the threat detection platform may then provide to the ML model as input for training. Examples of attributes include sender name, sender email address, and subject. Because the personalized ML model is trained using past emails received by the employee, it can establish the employee's normal communication habits immediately upon deployment.
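For illustration, the sketch below parses sender attributes from raw emails with Python's standard email package and uses a simple frequency profile as a toy stand-in for the personalized ML model; the class and its anomaly rule are assumptions, not the platform's actual model.

    from collections import Counter
    from email import message_from_bytes
    from email.utils import parseaddr

    def extract_attributes(raw_message: bytes) -> dict:
        """Parse one email into the attributes used as training input."""
        msg = message_from_bytes(raw_message)
        name, address = parseaddr(msg.get("From", ""))
        return {
            "sender_name": name,
            "sender_address": address,
            "subject": msg.get("Subject", ""),
        }

    class CommunicationProfile:
        """Toy stand-in for the personalized ML model: it simply learns which
        (sender name, sender address) pairs are normal for the employee."""

        def __init__(self):
            self.pair_counts = Counter()

        def train(self, messages):
            for raw in messages:
                attrs = extract_attributes(raw)
                self.pair_counts[(attrs["sender_name"], attrs["sender_address"])] += 1

        def is_anomalous(self, raw_message: bytes) -> bool:
            # A never-before-seen sender pairing deviates from past activity.
            attrs = extract_attributes(raw_message)
            return self.pair_counts[(attrs["sender_name"], attrs["sender_address"])] == 0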

Thereafter, the threat detection platform may receive second data related to an email received by the employee (step 2204). The threat detection platform may determine whether the email represents a security risk by applying the personalized ML model to the second data (step 2205). This causes the personalized ML model to produce an output representing whether the email constitutes an attack. For example, the output may specify that the email is suspicious, or that the email does not conform to the employee's past communication activity.

In some cases, based on the output, the threat detection platform will determine that the email represents an attack (step 2206). In this case, the threat detection platform may characterize the attack along multiple dimensions (also referred to as "facets") (step 2207). For example, the threat detection platform may determine the identity of the attacked party, the attack vector, the identity of the impersonated party, the impersonation strategy, and/or the attack goal.

Other steps may also be included in some embodiments. For example, the threat detection platform may download a second portion of the first data into the local processing environment. The second portion of the first data may correspond to a different time interval than the first portion of the first data. For example, the first portion of the first data may include all emails received by the employee within the past 6 months, and the second portion of the first data may include all emails received by the employee 6-12 months ago. The threat detection platform may then determine whether any emails included in the second portion of the first data represent a security risk by applying the personalized ML model to the second portion of the first data.

FIG. 23 depicts a flowchart of a process 2300 for detecting and characterizing email-based security threats in real time. Initially, the threat detection platform may receive an email addressed to an employee of the enterprise (step 2301). The threat detection platform may then apply a first model to the email to generate a first output representing whether the email is malicious (step 2302). The first model may be trained using past emails addressed to the employee (and possibly other employees) that have been verified as non-malicious. Thus, the first model may be referred to as a "positive security model." The first model serves as a first level of threat detection and thus may be tailored/designed to allow most email (e.g., more than 90%, 95%, or 99% of all incoming email) to reach the intended destination.
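This two-stage arrangement can be sketched as follows, assuming model objects exposing a hypothetical score() method that returns a maliciousness probability in [0, 1]; both thresholds are illustrative.

    from typing import Dict

    def score_email(features,
                    sure_safe_model,
                    attack_models: Dict[str, object],
                    first_stage_threshold: float = 0.01,
                    second_stage_threshold: float = 0.5) -> dict:
        """First stage passes the vast majority of mail; per-attack-type models
        run only on the small residue flagged as possibly malicious."""
        if sure_safe_model.score(features) < first_stage_threshold:
            return {"verdict": "deliver"}
        # Second stage: one model per type of malicious email.
        outputs = {name: model.score(features) for name, model in attack_models.items()}
        worst = max(outputs, key=outputs.get)
        verdict = "remediate" if outputs[worst] >= second_stage_threshold else "deliver"
        return {"verdict": verdict, "attack_type": worst, "scores": outputs}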

In some cases, based on the first output, the threat detection platform will determine that the email represents a malicious email (step 2303). In this case, the threat detection platform may apply a second model to the email to produce a second output that represents whether the email represents a given type of malicious email (step 2304). As described above, the second model may be one or more models that are applied to the email in response to determining that the email represents malicious email. Thus, the threat detection platform may apply a plurality of models to the email to produce a plurality of outputs, and each model of the plurality of models may correspond to a different type of malicious email.

The threat detection platform may then determine, based on the second output, whether to remediate the email (step 2305). That is, the threat detection platform may determine which action (if any) should be performed based on the second output. For example, if the second output indicates that the email includes a link to an HTML resource, the threat detection platform may follow the link to access the HTML resource using a virtual web browser, extract a Document Object Model (DOM) for the HTML resource through the virtual web browser, and analyze the DOM to determine whether the link represents a security risk. As another example, if the second output indicates that the email includes a primary link pointing to a resource carried by a network-accessible hosting service (e.g., a Google or Microsoft service), the threat detection platform may follow the primary link to access the resource using a virtual web browser, examine the content of the resource through the virtual web browser to discover whether there are any secondary links to secondary resources, follow each secondary link through the virtual web browser to analyze the content of the corresponding secondary resource, and determine whether the primary link represents a security threat based on whether any secondary link is determined to represent a security threat. As another example, if the second output indicates that the email includes a link to an HTML resource, the threat detection platform may follow the link to access the HTML resource using a virtual web browser, capture a screenshot of the HTML resource through the virtual web browser, apply a computer vision (CV) algorithm designed to identify similarities between the screenshot and a library of certified login websites, and determine whether the link represents a security threat based on the output generated by the CV algorithm. As another example, if the second output indicates that the email includes an attachment, the threat detection platform may open the attachment in a secure processing environment and then determine whether the attachment represents a security threat based on an analysis of the attachment's content. For example, the threat detection platform may use a headless browser instance running on a standalone computer server (also referred to as a "sandbox computer server") to review attachments (e.g., by generating screenshots of their content) rather than opening attachments directly on a computing device associated with the email recipient. Further, the threat detection platform may examine any links included in the attachment as described above.
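The virtual-web-browser steps can be approximated with a headless browser; the sketch below uses Selenium with headless Chrome to retrieve the rendered DOM, plus a deliberately crude credential-harvest heuristic. Both are illustrative only, and any real analysis would run on an isolated sandbox server as described above.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    def extract_dom(url: str) -> str:
        """Follow a link in a virtual (headless) browser and return the rendered
        DOM as HTML; running this inside a sandboxed server is assumed."""
        options = Options()
        options.add_argument("--headless=new")
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            return driver.page_source  # serialized DOM after rendering
        finally:
            driver.quit()

    def looks_like_credential_harvest(dom_html: str) -> bool:
        """Crude heuristic, for illustration only: a rendered page asking for a
        password is treated as suspicious pending deeper analysis."""
        lowered = dom_html.lower()
        return 'type="password"' in lowered or "type='password'" in lowered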

In some embodiments, the threat detection platform is further configured to apply a third model designed to produce an understandable visualization component based on the second output (step 2306). In embodiments where the second model is part of an ensemble of models applied by the threat detection platform, the third model may aggregate the outputs generated by the models in the ensemble, characterize the attack based on the aggregated outputs, and then convert the aggregated outputs into an understandable visualization component. For example, the third model may generate a notification that identifies the type of security threat posed by the email, whether remedial action is required, and the like. The understandable visualization component may be created so that a security professional responsible for handling/mitigating security threats can more easily understand why an incoming email was flagged as representing an attack.

Processing system

Fig. 24 is a block diagram illustrating an example of a processing system 2400 in which at least some of the operations described herein may be implemented. For example, some components of processing system 2400 may be hosted on a computing device that includes a threat detection platform (e.g., threat detection platform 214 of fig. 2). As another example, some components of processing system 2400 may be hosted on a computing device that is queried by the threat detection platform to obtain emails, data, and the like.

The processing system 2400 may include one or more central processing units ("processors") 2402, a main memory 2406, a non-volatile memory 2410, a network adapter 2412 (e.g., a network interface), a video display 2418, input/output devices 2420, control devices 2422 (e.g., a keyboard and pointing device), a drive unit 2424 including a storage medium 2426, and a signal generation device 2430 that are communicatively coupled to a bus 2416. The bus 2416 is shown as an abstraction representing one or more physical buses and/or point-to-point connections, connected by appropriate bridges, adapters, or controllers. Thus, the bus 2416 may include a system bus, a Peripheral Component Interconnect (PCI) bus or PCI-Express bus, a HyperTransport or Industry Standard Architecture (ISA) bus, a Small Computer System Interface (SCSI) bus, a Universal Serial Bus (USB), an IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus (also referred to as "FireWire").

The processing system 2400 may be part of a desktop computer, a tablet computer, a Personal Digital Assistant (PDA), a mobile phone, a game console, a music player, a wearable electronic device (e.g., a watch or fitness tracker), a network-connected ("smart") device (e.g., a television or home assistant device), a virtual/augmented reality system (e.g., a head-mounted display), or another electronic device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by the processing system 2400.

While the main memory 2406, non-volatile memory 2410, and storage medium 2426 (also referred to as "machine-readable medium") are illustrated as a single medium, the terms "machine-readable medium" and "storage medium" should be taken to include a single medium or multiple media (e.g., a centralized/distributed database, and/or associated caches and servers) that store the one or more sets of instructions 2428. The terms "machine-readable medium" and "storage medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by processing system 2400.

In general, the routines executed to implement the disclosed embodiments may be implemented as part of an operating system or a specific application, component, program, object, module, or sequence of instructions (collectively referred to as a "computer program"). The computer program typically comprises one or more instructions (e.g., instructions 2404, 2408, 2428) set at various times in various memory and storage devices of a computing device. When read and executed by the one or more processors 2402, the instructions cause the processing system 2400 to perform operations that execute elements relating to aspects of the present disclosure.

Moreover, while embodiments have been described in the context of a fully functional computing device, those skilled in the art will appreciate that the various embodiments are capable of being distributed as a program product in a variety of forms. The present disclosure applies regardless of the particular type of machine or computer readable medium used to actually carry out the distribution.

Machine-readable storage media, machine-readable media, and other examples of computer-readable media include recordable-type media, such as volatile and non-volatile memory devices 2410, floppy and other removable disks, hard disk drives, and optical disks (e.g., compact disk read-only memories (CD-ROMs) and digital versatile disks (DVDs)), as well as transmission-type media, such as digital and analog communication links.

Network adapter 2412 enables processing system 2400 to communicate data over a network 2414 with entities external to processing system 2400 via any communication protocol supported by processing system 2400 and the external entities. The network adapter 2412 may include a network adapter card, a wireless network interface card, a router, an access point, a wireless router, a switch, a multi-layer switch, a protocol converter, a gateway, a bridge, a bridge router, a hub, a digital media receiver, and/or a repeater.

The network adapter 2412 may include a firewall that governs and/or manages permission to access/proxy data in a computer network and tracks varying trust levels between different machines and/or applications. A firewall may be any number of modules having any combination of hardware and/or software components capable of enforcing a predetermined set of access rights between a particular set of machines and applications, machines and machines, and/or applications and applications (e.g., to regulate traffic and resource sharing between these entities). Additionally, the firewall may manage and/or have access to an access control list that specifies permissions, including the access and operation rights of individuals, machines, and/or applications to objects, and the circumstances under which those permission rights stand.

The techniques described herein may be implemented in programmable circuitry (e.g., one or more microprocessors), software and/or firmware, dedicated hardwired (i.e., non-programmable) circuitry, or a combination of such forms. The application specific circuitry may be in the form of one or more Application Specific Integrated Circuits (ASICs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), or the like.

Comments

The foregoing description of various embodiments of the claimed subject matter has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the claimed subject matter to the precise forms disclosed. Many modifications and variations will be apparent to practitioners skilled in the art. The embodiments were chosen and described in order to best describe the principles of the invention and its practical applications, thereby enabling others skilled in the relevant art to understand the claimed subject matter, the various embodiments, and the various modifications that are suited to the particular uses contemplated.

While the detailed description describes certain embodiments and the best mode contemplated, no matter how detailed the description appears, the technology can be practiced in many ways. Embodiments may vary considerably in their implementation details while still being encompassed by the present description. Particular terminology used when describing certain features or aspects of various embodiments should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the technology with which that terminology is associated. In general, the terms used in the following claims should not be construed to limit the technology to the specific embodiments disclosed in the specification, unless those terms are explicitly defined herein. Accordingly, the actual scope of the technology encompasses not only the disclosed embodiments, but also all equivalent ways of practicing or implementing the embodiments.

The language used in the specification has been principally selected for readability and instructional purposes. It may not have been selected to delineate or circumscribe the subject matter. It is therefore intended that the scope of the technology be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the various embodiments is intended to be illustrative, but not limiting, of the scope of the technology as set forth in the following claims.

Examples of the embodiments

The examples provided herein are for illustrative purposes only. One skilled in the art will recognize that each example may be combined with any other example unless contrary to physical possibility.

1. A computer-implemented method, comprising:

establishing a connection via an application programming interface with a storage medium comprising a series of past communications received by an employee of the enterprise;

downloading, via the application programming interface, a first portion of the series of past communications corresponding to a first time interval into a local processing environment;

constructing a Machine Learning (ML) model for the employee by providing the first portion of the series of past communications as training data to the ML model;

receiving, via the application programming interface, a communication addressed to the employee; and

determining whether the communication represents a security risk by applying the ML model to the communication.

2. The computer-implemented method of example 1, further comprising:

receiving input from an administrator associated with the enterprise indicating permission to access the series of past communications;

wherein the determining is performed in response to receiving the input.

3. The computer-implemented method of example 1, wherein the series of past communications includes a plurality of emails delivered to the employee.

4. The computer-implemented method of example 1, further comprising:

examining each past communication in the first portion of the series of past communications to determine an attribute; and

providing the attributes derived from the first portion of the series of past communications as training data to the ML model.

5. The computer-implemented method of example 1, further comprising:

determining that the communication represents a security risk based on an output generated by the ML model; and

characterizing the security risk from multiple dimensions.

6. The computer-implemented method of example 5, wherein the plurality of dimensions comprises:

the party to be attacked is,

the vector of the attack is then calculated,

the party to be counterfeited is provided with a display,

a counterfeit policy, and

and (5) attacking the target.

7. The computer-implemented method of example 1, wherein the storage medium is a computer server managed by an entity other than the enterprise.

8. The computer-implemented method of example 1, wherein the first portion of the series of past communications includes all emails received by the employee during the first time interval.

9. The computer-implemented method of example 1, further comprising:

downloading, via the application programming interface, a second portion of the series of past emails corresponding to a second time interval into the local processing environment, the second time interval preceding the first time interval; and

determining whether any emails received during the second time interval represent a security risk by applying the ML model to the second portion of the series of past emails.

10. The computer-implemented method of example 1, further comprising:

examining the communication to determine a plurality of attributes;

generating a statistical description of the communication,

wherein the statistical description comprises a risk score for each pair of attributes included in the plurality of attributes, each risk score based on a risk of historical communications involving the corresponding pair of attributes.

11. A non-transitory computer-readable medium having stored thereon instructions that, when executed by a processor, cause the processor to perform operations comprising:

receiving an e-mail addressed to an employee of the enterprise;

applying a first model to the email to produce a first output representing whether the email represents non-malicious email,

wherein the first model is trained using past emails addressed to the employee that have been certified as non-malicious emails;

determining that the email is likely to be a malicious email based on the first output;

applying a second model to the email to produce a second output representing whether the email represents a malicious email of a given type; and

performing an action with respect to the email based on the second output.

12. The non-transitory computer-readable medium of example 11,

wherein the second output indicates that the email is not the malicious email of the given type, and

wherein performing the action comprises:

forwarding the email to an inbox of the employee.

13. The non-transitory computer-readable medium of example 11, wherein the second model is one of a plurality of models that are applied to the email in response to determining that the email is likely to be a malicious email.

14. The non-transitory computer-readable medium of example 13, wherein each model of the plurality of models is associated with a different type of malicious email.

15. The non-transitory computer-readable medium of example 14, wherein the plurality of models, when applied to the email, produce a plurality of outputs, and wherein the operations further comprise:

applying a third model designed to aggregate the plurality of outputs produced by the plurality of models into an understandable visualization component.

16. The non-transitory computer-readable medium of example 11,

wherein the second output indicates that the email includes a link to a hypertext markup language (HTML) resource, and

wherein performing the action comprises:

following the link to access the HTML resource using a virtual web browser through which a Document Object Model (DOM) of the HTML resource is extracted, and

the DOM is analyzed to determine whether the link represents a security threat.

17. The non-transitory computer-readable medium of example 11,

wherein the second output indicates that the email includes a primary link pointing to a resource carried by a network-accessible hosting service, and

wherein performing the action comprises:

following the primary link to access the resource using a virtual web browser,

examining, by the virtual web browser, the content of the resource to discover whether there are any secondary links to secondary resources,

for each secondary link,

following the secondary link to access the corresponding secondary resource using the virtual web browser, and

analyzing the content of the corresponding secondary resource to determine whether the secondary link represents a security threat, and

determining whether the primary link represents a security threat based on whether any of the secondary links is determined to represent a security threat.

18. The non-transitory computer-readable medium of example 11,

wherein the second output indicates that the email includes a link to a hypertext markup language (HTML) resource, and

wherein performing the action comprises:

following the link to access the HTML resource using a virtual web browser, capturing a screenshot of the HTML resource through the virtual web browser,

applying a computer vision algorithm designed to identify similarities between the screenshot and a library of verified login websites, and

determining whether the link represents a security threat based on an output generated by the computer vision algorithm.

19. The non-transitory computer-readable medium of example 12,

wherein the second output indicates that the email includes an attachment, and

wherein performing the action comprises:

opening the attachment in a secure processing environment, and

determining whether the attachment represents a security threat based on an analysis of the attachment's content.

20. A computer-implemented method, comprising:

receiving input representing permission to access past email delivered to employees of the enterprise within a given time interval;

establishing a connection via an application programming interface with a storage medium that includes the past emails;

downloading the past emails into a local processing environment via the application programming interface; and

constructing a Machine Learning (ML) model for identifying abnormal communication activity by providing the past emails as training data to the ML model.

21. The computer-implemented method of example 20, further comprising:

examining each past email downloaded into the local processing environment to identify a sender identification and a sender email address; and

populating an entry in a database such that each sender identification is associated with the corresponding sender email address identified in the past emails.

22. The computer-implemented method of example 21, further comprising:

receiving an email addressed to the employee;

examining the email to determine a sender identification and a sender email address; and

determining whether the email represents a security threat based on whether the sender identification and the sender email address identified in the email match entries in a database.

23. The computer-implemented method of example 20, further comprising:

receiving an email addressed to the employee; and

determining whether the email represents abnormal communication activity by applying the ML model to the email.

24. The computer-implemented method of example 23, wherein the output of the ML model, when applied to the email, represents that the email constitutes abnormal communication activity due to the presence of a previously unknown sender identification, a previously unknown sender email address, or a previously unknown combination of sender identification and sender email address.

25. The computer-implemented method of example 23, further comprising:

in response to determining that the email represents anomalous communication activity,

uploading information related to the email to a federated database to protect a plurality of enterprises from security threats.

26. A non-transitory computer-readable medium having instructions stored thereon, which, when executed by a processor, cause the processor to perform operations comprising:

collecting data related to incoming and/or outgoing emails corresponding to customers of past time intervals;

generating a communication description for the customer based on the data;

receiving an incoming email addressed to the customer;

obtaining one or more attributes of the incoming email; and

determining whether the incoming email deviates from past email activity by comparing the one or more attributes to the communication description.

27. The non-transitory computer-readable medium of example 26, wherein the customer is an enterprise for which the communication description is generated.

28. The non-transitory computer-readable medium of example 26, wherein the customer is an employee of an enterprise for whom the communication description is generated.

29. The non-transitory computer-readable medium of example 26, wherein the generating comprises: obtaining at least one attribute from each email corresponding to the past time interval; and constructing the communication description based on the obtained attributes.

30. The non-transitory computer-readable medium of example 26, the operations further comprising:

providing the deviation in the incoming e-mail as input to a Machine Learning (ML) model; and

determining whether the incoming email represents a security risk based on output generated by the ML model.

31. The non-transitory computer-readable medium of example 30, the operations further comprising:

performing a remedial action in response to determining that the incoming email represents a security risk.

32. The non-transitory computer-readable medium of example 26, wherein the one or more attributes comprise a primary attribute and a secondary attribute.

33. The non-transitory computer-readable medium of example 32, wherein the obtaining comprises: extracting the primary attributes from the incoming email; and

determining the secondary attribute based on the primary attribute and additional information associated with the customer.

34. A computer-implemented method, comprising:

receiving input indicating permission to access email delivered to employees of the enterprise;

acquiring an incoming e-mail addressed to the employee;

extracting a primary attribute from the incoming email by parsing the content of the incoming email and/or metadata associated with the incoming email;

obtaining a secondary attribute based on the primary attribute; and

determining whether the incoming email deviates from past email activity by comparing the primary and secondary attributes to a communication description associated with the employee.

35. The computer-implemented method of example 34, further comprising:

a connection is established with an email system employed by the enterprise via an application programming interface.

36. The computer-implemented method of example 34, wherein the communication description comprises primary and secondary attributes of past emails delivered to the employee and determined to represent secure communications.

37. The computer-implemented method of example 36, wherein the determining comprises:

discovering that the primary attribute, the secondary attribute, or a combination of the primary attribute and the secondary attribute are not included in the communication description.

38. The computer-implemented method of example 34, wherein the primary attribute is a sender display name, a sender username, a Sender Policy Framework (SPF) state, a DomainKeys Identified Mail (DKIM) state, a number of attachments, a number of links in the body of the incoming email, a country of origin, information in a header of the incoming email, or an identifier embedded in metadata associated with the incoming email.

39. The computer-implemented method of example 37, further comprising:

determining that the incoming email does not represent a security risk; and

updating the communication description by creating an entry that programmatically associates the primary and secondary attributes.

40. A computer-implemented method, comprising:

determining, by the threat detection platform, that a communication event involving the transmission of the email is currently occurring;

obtaining, by the threat detection platform, information related to the email;

resolving, by the threat detection platform, entities involved in the communication event by examining the information; and

compiling, by the threat detection platform, corpus statistics for entities determined to be involved in the communication event.

41. The computer-implemented method of example 40, wherein the determining is accomplished by examining incoming emails received by an email system that is programmatically integrated with the threat detection platform.

42. The computer-implemented method of example 41, wherein programmatic integration of the threat detection platform with an email system ensures that all external and internal emails are routed through the threat detection platform for inspection.

43. The computer-implemented method of example 40, wherein the information is obtained from the email.

44. The computer-implemented method of example 40, further comprising:

augmenting, by the threat detection platform, the information with a manually monitored data set;

wherein the resolving is performed on the augmented information.

45. The computer-implemented method of example 40, wherein the resolving comprises:

determining an identity of a sender based on a source of the incoming email, content of the incoming email, or metadata accompanying the incoming email; and

determining an identity of a recipient based on a destination of the incoming email, content of the incoming email, or metadata accompanying the incoming email.

46. The computer-implemented method of example 40, further comprising:

causing the corpus statistics to be displayed in the form of an entity risk graph.

47. The computer-implemented method of example 46, wherein the entity comprises a sender of the email, a recipient of the email, a domain found in the email, a link found in the email, an Internet Protocol (IP) address found in metadata accompanying the email, a source of the email, a topic determined based on the email content, or any combination thereof.

48. The computer-implemented method of example 46, wherein the entity risk graph includes historical combinations of the entities and a respective risk score for each historical combination.

49. The computer-implemented method of example 46, wherein each entity is represented in the entity risk graph as a separate node, and wherein each connection between a pair of nodes represents a risk of an event involving a pair of entities associated with the pair of nodes based on a past communication event.

50. A non-transitory computer-readable medium having instructions stored thereon, which, when executed by a processor, cause the processor to perform operations comprising:

acquiring an incoming email addressed to an employee of an enterprise;

extracting features in the form of primary attributes and secondary attributes for the incoming email;

employing a Machine Learning (ML) model that consumes the extracted features to determine whether there are any indicators of compromise representing security threats;

generating a signature for each of the indicators of compromise; and

causing the database to ingest each signature for discovery of future attacks having the same characteristics.

51. The non-transitory computer-readable medium of example 50, wherein each indicator of compromise is an Internet Protocol (IP) address, an email address, a Uniform Resource Locator (URL), or a domain.
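
The signature generation and ingestion steps of examples 50 and 51 might look like the following sketch, which hashes each indicator of compromise into a deterministic signature and stores it for matching against future attacks; the table schema and the signature_for helper are hypothetical.

    import hashlib
    import sqlite3

    def signature_for(ioc_type: str, ioc_value: str) -> str:
        # A deterministic digest lets recurring indicators be matched later.
        return hashlib.sha256(f"{ioc_type}:{ioc_value}".encode()).hexdigest()

    # Indicators of compromise an ML model might surface for one email.
    indicators = [
        ("ip", "203.0.113.7"),
        ("url", "http://login-verify.example/reset"),
        ("domain", "login-verify.example"),
    ]

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE signatures (sig TEXT PRIMARY KEY, type TEXT, value TEXT)")
    for ioc_type, value in indicators:
        db.execute(
            "INSERT OR IGNORE INTO signatures VALUES (?, ?, ?)",
            (signature_for(ioc_type, value), ioc_type, value),
        )
    db.commit()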

52. The non-transitory computer-readable medium of example 50, the operations further comprising:

performing deep feature extraction to reduce the likelihood of harm from complex security threats.

53. The non-transitory computer-readable medium of example 52, wherein the performing comprises:

applying a deep learning model to understand the content, sentiment, and/or tone of the incoming email.

54. The non-transitory computer-readable medium of example 52, wherein the performing comprises:

accessing a landing page by interacting with a link embedded in the incoming email; and

comparing, using computer vision algorithms, the landing page with a set of known landing pages that are certified as authentic.

55. The non-transitory computer-readable medium of example 52, wherein the performing comprises:

employing a crawling algorithm to extract information about secondary links that are embedded in attachments to the incoming email or accessible via a website pointed to by a primary link in the incoming email.
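
A minimal sketch of the secondary-link extraction of example 55, using only the Python standard library; a production crawler would also fetch pages, respect robots.txt, and sandbox attachment parsing, none of which is shown here.

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        # Collect href targets from the page a primary link points to.
        def __init__(self, base_url: str):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def secondary_links(primary_url: str, page_html: str) -> list:
        collector = LinkCollector(primary_url)
        collector.feed(page_html)
        return collector.links

    print(secondary_links(
        "http://login-verify.example/",
        '<a href="/reset">Reset</a> <a href="http://evil.example/x">Next</a>',
    ))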

56. A computer-implemented method, comprising:

obtaining first data associated with a first batch of past emails received by an employee of an enterprise;

generating a first batch of events representing the first batch of past emails;

obtaining second data associated with a second batch of past emails that were tagged by one or more administrators,

wherein each past email in the second batch of past emails is associated with a tag that specifies a risk to the enterprise;

generating a second batch of events representing the second batch of past emails; and

storing the first and second batches of events in a database.

57. The computer-implemented method of example 56, wherein the generating comprises:

converting the first data associated with each past email in the first batch of past emails into a predefined schema that defines an event.
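
The conversion of per-email data into a predefined event schema (example 57) could be sketched as follows; the EmailEvent fields are assumptions chosen for illustration, not the claimed schema.

    from dataclasses import dataclass, asdict
    from typing import Optional

    @dataclass
    class EmailEvent:
        # Hypothetical predefined schema that defines an event.
        message_id: str
        sender: str
        recipient: str
        subject: str
        label: Optional[str] = None          # administrator tag, if any (example 56)
        risk_metric: Optional[float] = None  # attached later (example 60)

    def to_event(email_data: dict) -> EmailEvent:
        # Convert the raw data associated with one past email into the schema.
        return EmailEvent(
            message_id=email_data["id"],
            sender=email_data["from"],
            recipient=email_data["to"],
            subject=email_data.get("subject", ""),
        )

    event = to_event({"id": "m-1", "from": "a@x.example", "to": "b@corp.example"})
    print(asdict(event))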

58. The computer-implemented method of example 56, further comprising:

receiving an input representing a query for an event having a given attribute; and

checking the database to identify any events having the given attribute.

59. The computer-implemented method of example 58, further comprising:

determining a count of the identified events; and

causing the count to be displayed on the interface from which the query was submitted.
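
Examples 58 and 59 amount to an attribute query plus a count; a sketch against an assumed SQLite events table follows, with a column allow-list guarding against injection through the attribute name.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (id TEXT, sender TEXT, risk REAL, label TEXT)")
    db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
        ("m-1", "a@x.example", 0.10, None),
        ("m-2", "a@x.example", 0.85, "phishing"),
        ("m-3", "c@y.example", 0.02, None),
    ])

    def count_events(attribute: str, value) -> int:
        # Check the database for events having the given attribute (example 58)
        # and return the count for display on the querying interface (example 59).
        assert attribute in {"id", "sender", "risk", "label"}
        row = db.execute(
            f"SELECT COUNT(*) FROM events WHERE {attribute} = ?", (value,)
        ).fetchone()
        return row[0]

    print(count_events("sender", "a@x.example"))  # -> 2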

60. The computer-implemented method of example 57, further comprising:

calculating a risk metric for each past email in the first batch of past emails; and

attaching the calculated risk metric for each past email in the first batch of past emails to the respective predefined schema.

61. The computer-implemented method of example 60, further comprising:

receiving input representing a query to determine events that do not represent a threat to enterprise security;

examining the database to identify any events that do not represent a threat to enterprise security; and

causing the identified events to be displayed on the interface from which the query was submitted.

62. The computer-implemented method of example 61, wherein the examining comprises:

parsing the database to determine whether any past emails in the first batch of past emails are associated with a risk metric that is below a threshold; and

parsing the database to determine whether any past emails in the second batch of past emails are associated with tags that represent no risk.
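
The two-pronged parse of example 62 (a risk metric below a threshold for the first batch, a no-risk tag for the second) might look like the sketch below; the "no-risk" label value and the 0.5 threshold are illustrative assumptions.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (id TEXT, risk REAL, label TEXT)")
    db.executemany("INSERT INTO events VALUES (?, ?, ?)", [
        ("m-1", 0.10, None),       # first batch: scored by the platform
        ("m-2", 0.85, None),
        ("m-3", None, "no-risk"),  # second batch: tagged by an administrator
    ])

    def non_threat_events(threshold: float = 0.5) -> list:
        # Events whose risk metric falls below the threshold...
        below = db.execute(
            "SELECT id FROM events WHERE risk IS NOT NULL AND risk < ?", (threshold,)
        ).fetchall()
        # ...plus events tagged as representing no risk.
        tagged_safe = db.execute(
            "SELECT id FROM events WHERE label = 'no-risk'"
        ).fetchall()
        return [row[0] for row in below + tagged_safe]

    print(non_threat_events())  # -> ['m-1', 'm-3']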

63. The computer-implemented method of example 56, further comprising:

acquiring an incoming email addressed to an employee of the enterprise;

parsing the incoming email to identify an attribute of the email;

checking the database to identify any events having the attribute; and

evaluating a risk posed by the incoming email based on the identified event.

64. A computer-implemented method, comprising:

acquiring a series of emails sent to employees of an enterprise;

identifying the entities involved in the series of emails by examining each email;

creating a series of signatures for the series of emails,

wherein each signature of the series of signatures is associated with a respective email of the series of emails, and

wherein each signature identifies one or more entities involved in the respective email;

obtaining corpus statistics for the entities determined to be involved in the series of emails;

indexing the corpus statistics by date; and

storing the series of signatures and the indexed corpus statistics in a date-partitioned data structure.
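
One way to realize the date-partitioned data structure of example 64 is a per-day partition holding that day's signatures and corpus statistics, as sketched below; the signature function shown (a digest over the sorted entities) is an assumption for the sketch.

    import hashlib
    from collections import defaultdict
    from datetime import date

    # One partition per calendar day, keyed by ISO date.
    partitions = defaultdict(lambda: {"signatures": set(), "corpus_stats": {}})

    def signature(entities) -> str:
        # Each signature identifies the entities involved in one email.
        return hashlib.sha256("|".join(sorted(entities)).encode()).hexdigest()

    def store(day: date, entities, stats: dict) -> None:
        partition = partitions[day.isoformat()]
        partition["signatures"].add(signature(entities))
        for entity, count in stats.items():  # corpus statistics, indexed by date
            partition["corpus_stats"][entity] = (
                partition["corpus_stats"].get(entity, 0) + count
            )

    store(date(2019, 12, 18), {"a@x.example", "evil.example"}, {"evil.example": 3})
    print(partitions["2019-12-18"])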

65. The computer-implemented method of example 64, further comprising:

acquiring an incoming email addressed to an employee of the enterprise;

identifying at least one entity involved in the incoming email by examining the incoming email; and

comparing the at least one entity to the date-partitioned data structure to determine whether the at least one entity matches any signature in the series of signatures.

66. The computer-implemented method of example 65, further comprising:

determining that the at least one entity matches a signature in the series of signatures; and

evaluating a risk posed by the incoming email based on the signature.

67. The computer-implemented method of example 66, wherein the evaluating comprises:

determining any risks associated with past emails corresponding to the signature; and

calculating a risk metric for the incoming email based on the determined risks of the past emails.
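
A sketch of the risk calculation of example 67, assuming the risks of past emails are recorded per signature; averaging is one plausible aggregation, not necessarily the claimed one, and all names here are hypothetical.

    # Hypothetical risks observed for past emails, keyed by matched signature.
    past_risks_by_signature = {
        "sig-123": [0.8, 0.9],
        "sig-456": [0.1],
    }

    def risk_metric(matched_signature: str, default: float = 0.5) -> float:
        # Base the incoming email's risk on the risks of corresponding past emails.
        risks = past_risks_by_signature.get(matched_signature)
        if risks is None:
            return default  # unseen signature: fall back to a neutral prior
        return sum(risks) / len(risks)

    print(risk_metric("sig-123"))  # -> 0.85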

68. The computer-implemented method of example 65, further comprising:

determining a similarity between the at least one entity and the series of signatures by employing a Machine Learning (ML) algorithm that probabilistically compares the at least one entity with each signature in the series of signatures; and

evaluating a risk posed by the incoming email based on an output generated by the ML algorithm.
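
Example 68 calls for a probabilistic, rather than exact, comparison. As a simple stand-in for the claimed ML algorithm, the sketch below scores the overlap between the incoming email's entities and each stored signature with the Jaccard index and bases the risk evaluation on the strongest match.

    def jaccard(a: set, b: set) -> float:
        # Set overlap in [0, 1]; a simple proxy for a probabilistic match score.
        return len(a & b) / len(a | b) if a | b else 0.0

    stored_signatures = {
        "sig-123": {"evil.example", "203.0.113.7"},
        "sig-456": {"partner.example"},
    }

    def best_match(entities: set):
        # Score the incoming email's entities against every stored signature.
        return max(
            ((sig, jaccard(entities, members))
             for sig, members in stored_signatures.items()),
            key=lambda kv: kv[1],
        )

    print(best_match({"evil.example", "a@x.example"}))  # -> ('sig-123', 0.333...)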
