System and method for detecting and predicting faults in an industrial process automation system

Document No.: 214445    Publication date: 2021-11-05

Note: This technology, "System and method for detecting and predicting faults in an industrial process automation system," was designed and created by B. Sinha, A. Bhattacharya, and M. Seshadri on 2020-03-24. Abstract: Systems and methods for detecting and predicting faults in industrial process automation systems use trend data to predict alerts and allow actions to be taken before problems occur. The systems and methods provide improved fault/failure prediction over time as more empirical data is collected for a relevant set of system components. The systems and methods can identify relationships between components of a process automation system; identify and collect changes to the system configuration; identify and collect data to inform reliability and predictive models; develop domain-specific prediction models for one or more components that allow component-based failure or degradation prediction; develop a system prediction model that predicts the health of part or all of the process automation system using reliability and criticality relationships, component-based predictions, and operating parameters; provide a prioritized warning system; and identify a root cause of failure of a component.

1. A monitoring system for an industrial plant, comprising:

one or more processors;

a storage unit communicatively coupled to the one or more processors and having stored thereon processor-executable instructions that, when executed by the one or more processors, cause the monitoring system to:

perform a process of inputting data files for the industrial plant, the data files containing data relating to nodes in the industrial plant, the data in each data file being in a different data format;

perform a process of extracting the data from the data file, the extracted data including a timestamp, a device name, a device health, and message content;

perform a process of converting the timestamp, device name, device health, and message content from the data file to a homogeneous format;

perform a process of extracting features from the converted timestamps, device names, device health, and message content using machine learning to identify the features; and

perform a process of identifying a node in the industrial plant that is experiencing an alarm using machine learning to identify the node experiencing the alarm, the alarm indicating that the node has failed or will fail within a specified time.

2. The monitoring system of claim 1, wherein the processor-executable instructions further cause the monitoring system to perform a process of constructing a network topology for the nodes in the industrial plant, the network topology establishing a hierarchy for the nodes in the industrial plant.

3. The monitoring system of claim 1, wherein the processor-executable instructions further cause the monitoring system to perform a process of identifying a root cause of the alarm using machine learning to identify the root cause.

4. The monitoring system of claim 3, wherein the processor-executable instructions further cause the monitoring system to perform a process of estimating a probability of the root cause using machine learning to calculate the probability.

5. The monitoring system of claim 4, wherein the processor-executable instructions further cause the monitoring system to perform a process that displays a time to failure of the alarm based on the probability of the root cause.

6. The monitoring system of claim 5, wherein the processor-executable instructions further cause the monitoring system to perform a process of graphically displaying a severity level of the alarm based on the time to failure of the alarm and/or an impact of the alarm on plant operation.

7. The monitoring system of claim 1, wherein the processor-executable instructions further cause the monitoring system to perform a process of graphically displaying all data within a specified time period for the nodes in the industrial plant experiencing an alarm.

8. The monitoring system of claim 1, wherein the processor-executable instructions further cause the monitoring system to perform a process of identifying all nodes in the industrial plant that are experiencing an alarm using machine learning to graphically display the nodes.

9. The monitoring system of claim 1, wherein the processor-executable instructions further cause the monitoring system to perform a process of identifying a corrective action for the alarm based on captured knowledge to extract the corrective action using machine learning, the captured knowledge including a maintenance log for the industrial plant.

10. The monitoring system of claim 1, wherein the processor-executable instructions further cause the monitoring system to extract features by performing a process that applies feature extraction rules to the converted timestamps, device names, device health, and message content using machine learning.

11. A method for monitoring an industrial plant, comprising:

inputting data files for the industrial plant, the data files containing data relating to nodes in the industrial plant, the data in each data file being in a different data format;

extracting the data from the data file, the extracted data including a timestamp, a device name, a device health, and message content;

converting the timestamp, device name, device health, and message content from the data file to a homogeneous format;

extracting features from the converted timestamps, device names, device health, and message content using machine learning to identify the features; and

identifying a node in the industrial plant that is experiencing an alarm indicating that the node has failed or will fail within a specified time using machine learning to identify the node experiencing the alarm.

12. The method of claim 11, further comprising constructing a network topology for the nodes in the industrial plant, the network topology establishing a hierarchy for the nodes in the industrial plant.

13. The method of claim 11, further comprising identifying a root cause of the alarm using machine learning to identify the root cause.

14. The method of claim 13, further comprising estimating a probability of the root cause using machine learning to calculate the probability.

15. The method of claim 14, further comprising displaying a time-to-failure of the alarm based on the probability of the root cause.

16. The method of claim 15, further comprising graphically displaying a severity level of the alarm based on the time to failure of the alarm and/or an impact of the alarm on plant operation.

17. The method of claim 11, further comprising graphically displaying all data over a specified time period for the nodes in the industrial plant that are experiencing an alarm.

18. The method of claim 11, further comprising graphically displaying all nodes in the industrial plant experiencing an alarm using machine learning to identify the nodes.

19. The method of claim 11, further comprising identifying corrective actions for the alarm using machine learning to extract the corrective actions according to captured knowledge, the captured knowledge comprising a maintenance log for the industrial plant.

20. The method of claim 11, further comprising applying feature extraction rules to the converted timestamps, device names, device health, and message content using machine learning.

21. A computer-readable medium storing computer-readable instructions for causing one or more processors to perform the method of any one of claims 11 to 20.

Technical Field

Aspects of the present disclosure generally relate to industrial process automation and control systems. More particularly, aspects of the present disclosure relate to systems and methods for detecting and predicting faults in industrial process automation systems.

Background

A typical industrial plant uses a number of interrelated and interconnected process automation systems to control and operate the plant processes. Each system typically generates data in the form of log files specific to the operation of that system. These log files provide a record of events occurring within the system (including the date and time of occurrence), as well as of messages and communications between the different components of the system. Such log files allow personnel to monitor various system failures, track the root cause of any failure, and take appropriate corrective action.

Modern process automation systems generate data and error messages at extremely high rates, which results in large amounts of data being generated in a short time. This enormous amount of data can often overwhelm plant personnel attempting to monitor and interpret it. In addition, each system generates data in a system-specific format, typically different from that of other systems, making interpretation of the data and error messages difficult. Moreover, the data and error messages generated by each system tend to be highly technical, requiring plant personnel to have expertise in that particular system. To complicate matters further, each system maintains its data and error messages in separate locations that are typically not readily discoverable.

Accordingly, there is a need for improvements in the field of industrial process automation, in particular in monitoring and maintaining the health of industrial process automation systems.

Disclosure of Invention

Embodiments of the present disclosure provide systems and methods for detecting and predicting faults in industrial process automation systems. These embodiments are particularly useful in industrial process automation systems that employ distributed control systems. In some embodiments, the systems and methods use trend data to predict alerts and allow actions to be taken before a problem occurs. The systems and methods provide improved fault/failure prediction over time as more empirical data is collected for the relevant set of system components. The systems and methods can identify interrelationships between components of the process automation system; identify and collect changes to the system configuration; identify and collect data to inform reliability and predictive models; develop a domain-specific prediction model for one or more components, the prediction model allowing component-based failure or degradation prediction; develop a system prediction model that utilizes reliability and criticality relationships, component-based predictions, and operating parameters to predict the health of a portion of or the entire process automation system; provide a prioritized warning system; and identify the root cause of failure of a component.

A fault associated with a first one or more devices may be brought to the user's attention by displaying faults associated with other devices in the system. In some embodiments, the systems and methods herein for detecting faults in an industrial process automation system can determine and display the root cause of the displayed fault (i.e., display an indication that the fault associated with the first one or more devices is the root cause of the faults associated with the other devices). In some embodiments, the systems and methods may track solutions to past and/or current system health issues to verify the effectiveness of proposed future solutions.

In some embodiments, the systems and methods herein for detecting faults in process automation systems may encode and automate subject matter expertise in diagnosing and predicting system problems. This may reduce the need for specialized subject matter experts and increase the speed of analysis and response. In some embodiments, the systems and methods herein may place individual alerts in context based on reliability, system interaction, and criticality. The systems and methods may automate root-cause-of-failure detection and additionally identify a root cause of failure that originates from another component or from a configuration change of the system. In some embodiments, the systems and methods may map one or more log messages and/or alerts, system data, contexts, or relationships into a human-readable text excerpt.

In some embodiments, the systems and methods herein for detecting faults in process automation systems can generate customized system reliability and warning models for one or more process automation system components; integrating system reliability and warning models based on relationships between components; performing a trend-based warning; performing solution effect prediction based on historical actions and system impacts; and performing root cause identification at a system level.

In some embodiments, the systems and methods herein for detecting faults in process automation systems can use structural views of the process automation system and its components to identify one or more critical components and connections and generate a relational database; building a reliability/warning model for one or more components and connections based on subject matter expertise; identifying and capturing relevant data for one or more components; adjusting the model according to the specific characteristics of the component/system; identifying operational and trend data for one or more components; detecting when one or more entities have an abnormal condition or a predicted abnormal condition; evaluating the root cause of the abnormal condition (e.g., the entity itself, another related entity, or a configuration change) and evaluating the impact of the condition on the system; converting the identified condition to a human-readable text snippet; and recording the one or more corrective actions and correlating them to previous alerts, patterns, and corrective actions to predict the effect of the one or more actions.

In general, in one aspect, embodiments of the present disclosure relate to a monitoring system for an industrial plant. The monitoring system includes, among other things, one or more processors and a storage unit communicatively coupled to the one or more processors. The storage unit stores processor-executable instructions that, when executed by the one or more processors, cause the monitoring system to run a process that inputs data files for the industrial plant, the data files containing data related to nodes in the industrial plant, the data in each data file being in a different data format. The processor-executable instructions also cause the monitoring system to run a process that extracts the data from the data files, the extracted data including a timestamp, a device name, a device health, and message content, and cause the monitoring system to run a process that converts the timestamp, device name, device health, and message content from the data files into a homogeneous format. The processor-executable instructions also cause the monitoring system to run a process that extracts features from the converted timestamps, device names, device health, and message content using machine learning to identify the features, and cause the monitoring system to run a process that identifies a node in the industrial plant that is experiencing an alarm indicating that the node has failed or will fail within a specified time, using machine learning to identify the node that is experiencing the alarm.

In accordance with any one or more of the preceding embodiments, the processor-executable instructions further cause the monitoring system to run a process that constructs a network topology for the nodes in the industrial plant, the network topology establishing a hierarchy for the nodes in the industrial plant. In accordance with any one or more of the preceding embodiments, the processor-executable instructions further cause the monitoring system to run a process that identifies a root cause of the alarm using machine learning to identify the root cause, run a process that estimates a probability of the root cause using machine learning to calculate the probability, run a process that displays a time-to-failure of the alarm based on the probability of the root cause, and/or run a process that graphically displays a severity level of the alarm based on the time-to-failure of the alarm and/or an impact of the alarm on plant operation. In accordance with any one or more of the preceding embodiments, the processor-executable instructions further cause the monitoring system to run a process that graphically displays all data for a specified period of time for the nodes in the industrial plant that are experiencing an alarm and/or run a process that graphically displays all nodes in the industrial plant that are experiencing an alarm using machine learning to identify the nodes. In accordance with any one or more of the preceding embodiments, the processor-executable instructions further cause the monitoring system to run a process that uses machine learning to identify a corrective action for the alarm based on captured knowledge, the captured knowledge including a maintenance log for the industrial plant. In accordance with any one or more of the preceding embodiments, the processor-executable instructions further cause the monitoring system to extract features by running a process that applies feature extraction rules to the converted timestamps, device names, device health, and message content using machine learning.

In general, in another aspect, embodiments of the present disclosure relate to a method for monitoring an industrial plant. The method includes, among other things, inputting data files for the industrial plant, the data files containing data associated with nodes in the industrial plant, the data in each data file being in a different data format; and extracting the data from the data files, the extracted data including a timestamp, a device name, a device health, and message content. The method also includes converting the timestamp, device name, device health, and message content from the data files to a homogeneous format; and extracting features from the converted timestamps, device names, device health, and message content using machine learning to identify the features. The method further includes identifying a node in the industrial plant that is experiencing an alarm indicating that the node has failed or will fail within a specified time, using machine learning to identify the node that is experiencing the alarm.

In accordance with any one or more of the preceding embodiments, the method further includes constructing a network topology for the nodes in the industrial plant, the network topology establishing a hierarchy for the nodes in the industrial plant. In accordance with any one or more of the preceding embodiments, the method further comprises identifying a root cause of the alarm using machine learning to identify the root cause. In accordance with any one or more of the preceding embodiments, the method further comprises estimating a probability of the root cause using machine learning to calculate the probability; displaying a time to failure of the alarm based on the probability of the root cause; and/or graphically displaying a severity level of the alarm based on the time to failure of the alarm and/or an impact of the alarm on plant operation. In accordance with any one or more of the preceding embodiments, the method further includes graphically displaying all data for a specified period of time for the nodes in the industrial plant that are experiencing an alarm and/or graphically displaying all nodes in the industrial plant that are experiencing an alarm using machine learning to identify the nodes. In accordance with any one or more of the preceding embodiments, the method further includes identifying a corrective action for the alarm using machine learning to extract the corrective action from captured knowledge, the captured knowledge including a maintenance log for the industrial plant, and/or applying feature extraction rules to the converted timestamps, device names, device health, and message content using machine learning.

In general, in yet another aspect, embodiments of the disclosure are directed to a computer-readable medium storing computer-readable instructions for causing the one or more processors to perform a method according to any one or more of the preceding embodiments.

Drawings

A more particular description of the disclosure briefly summarized above may be had by reference to various embodiments, some of which are illustrated in the appended drawings. While the drawings represent selected embodiments of the present disclosure, these drawings should not be considered limiting of its scope, as the disclosure may admit to other equally effective embodiments.

FIG. 1 illustrates an example industrial plant monitoring system according to an embodiment of this disclosure;

FIG. 2 illustrates an exemplary machine learning process that may be used with embodiments of the present disclosure;

FIG. 3 illustrates an exemplary method of monitoring an industrial plant according to an embodiment of the present disclosure;

FIG. 4 illustrates an exemplary network topology according to an embodiment of the present disclosure;

FIGS. 5A-5B illustrate an exemplary failure analysis screen for an HMI of an exemplary industrial plant monitor according to an embodiment of the present disclosure;

FIG. 6 illustrates a device error trend screen for an HMI of an exemplary industrial plant monitor according to an embodiment of the present disclosure;

FIG. 7 illustrates an aggregated alarm screen for an HMI of an exemplary industrial plant monitor according to an embodiment of the present disclosure;

FIG. 8 illustrates a detailed alarm screen for an HMI of an exemplary industrial plant monitor according to an embodiment of the present disclosure; and

FIG. 9 illustrates an aggregated log message screen for an HMI of an exemplary industrial plant monitor according to an embodiment of the disclosure.

Like reference numerals have been used, where appropriate, to designate like elements that are common to the figures. However, elements disclosed in one embodiment may be beneficially utilized on other embodiments without specific recitation.

Detailed Description

The specification and drawings illustrate exemplary embodiments of the disclosure and are not to be considered limiting, with the claims defining the scope of the disclosure, including equivalents. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the scope of the description and claims, including equivalents. In some instances, well-known structures and techniques have not been shown or described in detail to avoid obscuring the disclosure. Additionally, elements and their associated aspects that are described in detail with reference to one embodiment may be included in other embodiments not specifically shown or described. For example, if an element is described in detail with reference to one embodiment but not with reference to a second embodiment, the element may still be claimed as being included in the second embodiment.

Referring now to FIG. 1, an industrial process automation system 100 is shown for performing an industrial process, such as an oil refinery process, a chemical treatment process, a manufacturing process, and the like. The particular industrial process automation system 100 described herein is commonly referred to as a Distributed Control System (DCS). The DCS 100 generally includes a plurality of field devices, one of which is indicated at 102, that perform sub-processes of the industrial process. These field devices 102 may include, for example, sensors, actuators, motors, valves, and the like. Several field devices 102 are connected to device buses, one of which is indicated at 104, that allow messages to be sent to and from the devices 102. Each device bus 104 is connected to I/O modules, one of which is indicated at 106, that link the devices 102 to a control processor 108. The I/O modules 106 are responsible for regulating messages to and from the devices 102 and may be, for example, Fieldbus modules (FBMs). The control processor 108 operates substantially autonomously to control the operation of the devices 102 and may be, for example, a Programmable Logic Controller (PLC), a Remote Terminal Unit (RTU), a Programmable Automation Controller (PAC), or the like. A control network 110 allows the control processors 108 to communicate with each other and with other systems on the network 110. The control network 110 may be implemented using Ethernet switches, gigabit interface converters (GBICs), fiber optics, cables, etc., and/or may be wireless.

One or more data servers 112 and one or more application workstation/workstation processors (AW/WP) 114, as well as other components, are also connected to the network 110. The application workstation/workstation processor 114 allows plant personnel to manually perform various tasks related to the industrial process, such as running tests, configuring hardware, installing software, modifying process parameters, and the like, typically through a graphical user interface. The data server 112 automatically provides data for executing the industrial process (or portions thereof) to the control processor 108 and also automatically obtains data for monitoring the industrial process from the control processor 108. Some of the data acquired by the data server 112 may include events occurring at the control processor 108, including the date and time of occurrence, as well as messages and communications sent and received by the control processor 108. These events and messages are typically acquired by a process monitoring application running on the data server 112, which typically records the data as log files and system health data.

The log file and system health data can then be transmitted by the data server 112 over the industrial network 116 to an industrial plant diagnostic system 120 that monitors the entire industrial process. The industrial plant diagnostic system 120 provides, among other things, high-level monitoring of the individual control processors 108, which allows plant personnel to oversee the industrial process and coordinate the operation of the various control processors 108. In the illustrated example, the exemplary industrial plant diagnostic system 120 has a typical system architecture that includes one or more processors 122, internal input and/or output ("I/O") interfaces 124, and memory 126, all of which are communicatively coupled and/or electrically connected to each other. The operation of these components of the system 120 is generally well known in the art and, therefore, is only briefly mentioned herein.

In general, the one or more processors 122 are adapted to execute processor-executable instructions stored in the memory 126. The I/O interface 124 allows the processor 122 to interact and communicate with external systems (and users). The communication may be accomplished through one or more communication networks, such as the industrial network 116, a Wide Area Network (WAN), a Local Area Network (LAN), etc. The memory 126 is adapted to store the processor-executable instructions and to provide them to the processor 122 upon request. Other computing components known to those skilled in the art may also be included in the industrial plant diagnostic system 120 within the scope of the present disclosure.

As described above, the data server 112, or more precisely, the process monitoring application running on it, may generate log files and system health data at a very high rate, thereby producing large amounts of data in a short amount of time. In addition, the data generated may be highly technical and system specific, requiring plant personnel to have expertise in that particular system. Examples of process monitoring applications that may run on the data server 112 include SMON (System Monitor), Wireshark, Multicast, Netsight, Syslog, System Audio, and data historian and archiving applications, counters from local observation databases, application workstation space reports, control processor load reports, CRC (cyclic redundancy check) and other error checking applications, as well as system and network traps and interrupt routines known to those skilled in the art. The enormous amount of data and its highly technical nature often overwhelm plant personnel who need to use the data to detect faults in the DCS 100 and determine their root causes.

Thus, in accordance with embodiments of the present disclosure, the memory 126 stores an industrial plant monitor 130 that may automatically process the data generated by the data servers 112 to detect and predict faults in the DCS 100 (and similar industrial process automation systems). The industrial plant monitor 130, among other things, may automatically aggregate the log files and system health data from the various data servers 112 and analyze the effects and interactions between different components to identify possible component failures, estimate times to failure, issue alerts for failures, and determine possible root causes of the failures. The industrial plant monitor 130 may also present the data in an intuitive, context-based format that allows plant personnel to quickly assess a potential fault condition, its criticality, and its root cause, and determine possible corrective actions. In short, the industrial plant monitor 130 may provide a system-level health monitor for the entire DCS 100, not just for the individual devices therein.

In some embodiments, the industrial plant monitor 130 includes a plurality of functional modules that work together to provide the system-level health monitoring described above. These functional modules may include, for example, a data extractor 132, a network topology builder 134, a feature extractor 136, a root cause identifier 138, a root cause probability estimator 140, a time-to-failure estimator 142, a pattern matching module 144, one or more ML algorithms 146, and an HMI application 148. While the functional modules are shown here as discrete blocks, those skilled in the art will appreciate that two or more modules may be combined into a single module and that any single module may be divided into several constituent modules without departing from the scope of the present disclosure. A more detailed description of the operation of the industrial plant monitor 130 is provided later herein.

Referring next to FIG. 2, an exemplary machine learning process 200 is shown that may be used with the industrial plant monitor 130 in some embodiments. The machine learning process 200 may be applied by any of the functional modules 132-144 that require some form of machine learning to analyze log files and system health data from the various data servers 112. In the illustrated example, the machine learning process 200 includes a data input component 202, one or more machine learning models or algorithms 204, an automatic feedback/correction component 206, a user application 208, a manual feedback/correction component 210, and an analyzer 212. Examples of algorithms that may be used with the machine learning process 200 include Long Short-Term Memory (LSTM), random forest, decision trees, natural language processing, and the like.

In general operation, the data input component 202 receives data (e.g., log files, system health data, etc.) and, after appropriate preprocessing, feeds the data to one or more machine learning models 204. The machine learning model 204 uses machine learning and neural network processing techniques to extract relevant features (e.g., events, time of day of occurrence, error messages, etc.) from the data. The automatic feedback/correction component 206 applies rules and algorithms configured to detect errors in the output received from the machine learning model 204. These errors are used to automatically correct the model output and are fed back to the machine learning model 204 via the analyzer 212 to update the processing of the machine learning model 204. The processed output from the automatic feedback/correction component 206 is then displayed to the user for verification via the user application 208. Corrections made by the user are captured by the manual feedback/correction component 210 and fed back into the machine learning model 204 via the analyzer 212. This allows the machine learning model 204 to continually improve the evaluation and extraction of relevant features from the input data.
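To make the feedback loop concrete, the following is a minimal sketch in Python of how the train, predict, correct, and retrain cycle of the machine learning process 200 could be wired together. The class name FeedbackLearningLoop, the choice of a random forest model, and the method names are illustrative assumptions, not part of the disclosure.

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

class FeedbackLearningLoop:
    """Sketch of the train -> predict -> correct -> retrain cycle of process 200."""

    def __init__(self):
        self.model = RandomForestClassifier(n_estimators=100, random_state=0)
        self.X_history = []   # accumulated feature rows (data input component 202)
        self.y_history = []   # accumulated, possibly corrected, labels

    def fit_initial(self, X, y):
        # Initial training of the machine learning model 204.
        self.X_history = list(X)
        self.y_history = list(y)
        self.model.fit(np.array(self.X_history), np.array(self.y_history))

    def predict(self, X):
        return self.model.predict(np.array(X))

    def apply_corrections(self, X, y_corrected):
        """Feed automatic (206) or manual (210) corrections back via the analyzer 212."""
        self.X_history.extend(X)
        self.y_history.extend(y_corrected)
        self.model.fit(np.array(self.X_history), np.array(self.y_history))
```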

FIG. 3 is an exemplary operational flow diagram 300 illustrating the operation of the industrial plant monitor 130 according to an embodiment of the present disclosure. The flow diagram 300 has two main stages: a configuration phase 302 and an application phase 304. In the configuration phase 302, the industrial plant monitor 130 builds a network topology for the DCS 100 and performs other software and hardware configuration tasks. In the application phase 304, the industrial plant monitor 130 performs analysis on the log files and system health data to detect faults, predict times to failure, and identify root causes. The industrial plant monitor 130 continuously executes the two phases 302 and 304 as needed, so that they run substantially in parallel with each other.

In the configuration phase 302, output from a network discovery application is obtained for the DCS 100 and stored in the component and module inventory database 306. Any suitable network discovery application that can search a network like the DCS 100 and discover network nodes, connectivity, routing protocols, and the like may be used to provide the information stored in the component and module inventory database 306. Then, at block 308, the industrial plant monitor 130 uses the inventory to build a network topology for the DCS 100. In some embodiments, building the network topology involves aggregating all of the nodes in the DCS 100 and positioning the nodes in a hierarchy based on their relationships to each other and their connectivity to each other. This information may include information that uniquely identifies each node, such as each node's IP address, MAC address, and identification code (letterbug) (e.g., an alphanumeric identifier), as well as the version numbers of any hardware, software, and firmware of each node and the network routing protocol used by each node. In the case of fault-tolerant devices, information relating to the primary device and the secondary or backup device may also be collected. The industrial plant monitor 130 can then use this information to build a network topology that details how the nodes are connected to each other and how data is transmitted between the nodes in the network.
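As one illustration of block 308, the following sketch builds a simple parent-to-children hierarchy from a discovered node inventory. The inventory record layout (letterbug, ip, mac, parent) is a hypothetical format assumed for the example; the actual network discovery output is not specified here.

```python
from collections import defaultdict

def build_topology(inventory):
    """Build a parent -> children hierarchy from a discovered node inventory.

    `inventory` is assumed to be a list of dicts such as
    {"letterbug": "TT2061", "ip": "10.0.0.5", "mac": "00:11:...", "parent": "ROOT01"}.
    """
    children = defaultdict(list)
    nodes = {}
    for node in inventory:
        nodes[node["letterbug"]] = node
        if node.get("parent"):
            children[node["parent"]].append(node["letterbug"])
    return nodes, dict(children)

def print_hierarchy(children, root, depth=0):
    """Walk the hierarchy, e.g. root bridge -> switches -> control processors -> FBMs."""
    print("  " * depth + root)
    for child in children.get(root, []):
        print_hierarchy(children, child, depth + 1)
```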

FIG. 4 shows an example of a network topology 400 of the DCS 100, built by the industrial plant monitor 130 from the network discovery (ferret) output in the configuration phase 302. The network topology 400 in this example includes a plurality of branches connected to one another to form the overall network topology 400. Each branch includes one or more root devices 402 (e.g., root bridges) and one or more switches 404 connected together by a mesh network 406. One or more workstations 408 are connected to one or more of the switches 404 through the mesh network 406, as are one or more control processors 410. In the case where a standby or secondary control processor (shadow control processor) is provided, the secondary control processor may also be connected to one or more of the switches 404 through the mesh network 406. The one or more control processors 410 are, in turn, connected to one or more fieldbuses 414 that link the one or more control processors 410 to one or more field devices 416 via one or more fieldbus modules 418 (FBMs), such as field device system integration modules (FDSIs). Such a network topology 400 may then be stored in the network database 310 for subsequent use in the application phase 304 to detect faults, predict times to failure, and determine root causes.

The application phase 304 (FIG. 3) generally begins with the input of data sources into the industrial plant monitor 130 at block 312. The data sources may include log files and system health data generated by the process monitoring applications (e.g., SMON, Wireshark, Syslog, System editor, CRC, etc.) running on the aforementioned data servers, as well as other time series data. At block 314, the industrial plant monitor 130 extracts relevant data from the various data sources. Such data extraction may involve reading time series data from the various data sources and loading the data into memory, and then converting the data from the various sources into a homogeneous format. An exemplary homogeneous format may include a timestamp field, a data source field, a field for the device that generated the data, a field for the message related to the data, a field for the name of the device that is the subject of the message, and so on. The extracted data can then be used to dynamically update the network topology with any devices identified from the data that are not already included in the network topology.
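A minimal sketch of blocks 312 and 314 follows, converting records from two hypothetical log layouts into the homogeneous format described above. The regular expression, the comma-separated SMON layout, and the field names are illustrative assumptions rather than the actual formats produced by those tools.

```python
import re
from datetime import datetime

# Hypothetical per-source parsers; each returns the common record schema:
# timestamp, source, reporting_device, subject_device, message.
def parse_syslog_line(line):
    m = re.match(r"(\S+ \S+) (\S+) (.*)", line)
    ts, device, message = m.groups()
    return {
        "timestamp": datetime.fromisoformat(ts.replace(" ", "T")),
        "source": "syslog",
        "reporting_device": device,
        "subject_device": device,
        "message": message,
    }

def parse_smon_line(line):
    # Assumed comma-separated export: time,device,health,message
    ts, device, health, message = line.split(",", 3)
    return {
        "timestamp": datetime.fromisoformat(ts),
        "source": "smon",
        "reporting_device": device,
        "subject_device": device,
        "message": f"health={health} {message}",
    }

PARSERS = {"syslog": parse_syslog_line, "smon": parse_smon_line}

def extract(files):
    """files: iterable of (source_name, list_of_lines). Returns homogeneous records sorted by time."""
    records = []
    for source, lines in files:
        for line in lines:
            records.append(PARSERS[source](line))
    return sorted(records, key=lambda r: r["timestamp"])
```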

At block 316, the industrial plant monitor 130 extracts relevant features from the extracted data for use with the aforementioned machine learning process 200 (FIG. 2). The feature extraction may involve applying various rules to the extracted data to identify relevant features in the data. The data may then be resampled periodically (e.g., every 10 minutes, 20 minutes, 30 minutes, etc.) and the relevant features extracted from the data again. Examples of rules that may be applied to the extracted data are shown in Table 1 below.

Table 1: feature extractor rules

The above-described rules may be applied to the data extracted (block 314) from the various data sources to identify relevant features for machine learning purposes. Exemplary types of features that may be extracted include the following: device type, daily ARP count, ARP search device, total GBIC errors per day, intermittent GBIC errors per day, GBIC trend count, ReadLM error count, percentage of control processors showing errors, equipment failure, topology changes per day, intermittent topology count, bus errors per day, intermittent bus errors, module error count, intermittent module error count, marry-remarry intermittent pattern, and the like.
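A minimal sketch of the rule-based feature extraction and periodic resampling in block 316, using pandas, is shown below. The regular-expression rules stand in for the entries of Table 1 and are illustrative only.

```python
import pandas as pd

# Illustrative extraction rules: each maps a message pattern to a feature column.
FEATURE_RULES = {
    "gbic_error_count": r"GBIC.*error",
    "arp_count": r"ARP",
    "topology_change_count": r"topology change",
    "bus_error_count": r"bus.*error",
}

def extract_features(records, window="30min"):
    """records: DataFrame with 'timestamp', 'subject_device', and 'message' columns.
    Returns per-device feature counts resampled over the given window."""
    df = records.copy()
    for feature, pattern in FEATURE_RULES.items():
        df[feature] = df["message"].str.contains(pattern, case=False, regex=True).astype(int)
    return (
        df.set_index("timestamp")
          .groupby("subject_device")[list(FEATURE_RULES)]
          .resample(window)
          .sum()
    )
```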

At block 318, the industrial plant monitor 130 identifies a potential failure and a root cause of the failure from the extracted features (e.g., using the machine learning process 200). Such failure/root cause identification may involve training a machine learning model, such as a random forest or decision tree, to identify the root cause using historical log file data. Network topology information for root cause identification is provided by the network database 310. In one example, nine months of historical log file data from an industrial process automation system (such as the DCS 100) is used. From this data, six weeks of data are selected, with four weeks of data used for training and two weeks of data used to validate the training.

Training involves creating a feature matrix using the actual data (e.g., messages) and feature labels. The matrix has dimensions of N × M, where N is the number of features and M is the number of feature extraction rules (e.g., Table 1). The feature labels are derived from plant maintenance logs and input from subject matter experts. An intermediate label is created for each logically related feature set. Thus, features related to topology changes, ARP mode addition, ReadLM errors, and the like are given an intermediate label such as "switch hardware problem." Similarly, features related to a topology change increase, GBIC errors, an ARP search increase, and the like are labeled, for example, "switch GBIC problem." Bus access errors are labeled, for example, "bus access error," and A-to-D failures are labeled, for example, "A-to-D device failure," while control processor remarry failures and module reset errors are labeled, for example, "control processor hardware error." Errors such as intermittent ReadLM errors and intermittent ARP messages are labeled, for example, "dirty fiber between switch and control processor," while intermittent ARP messages from all connected devices are labeled, for example, "slow response time."
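A minimal sketch of training the root-cause classifier described in block 318 with scikit-learn follows. The feature matrix and intermediate label strings are supplied by the caller; the hold-out split only loosely mirrors the four-week training / two-week validation scheme, and the hyperparameters are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_root_cause_model(feature_matrix, labels):
    """feature_matrix: 2-D array whose columns correspond to the feature extraction rules.
    labels: intermediate label strings (e.g. "switch hardware problem") for each sample."""
    # Hold out the last third of the samples for validation (roughly 4 weeks train / 2 weeks validate).
    X_train, X_val, y_train, y_val = train_test_split(
        feature_matrix, labels, test_size=1 / 3, shuffle=False
    )
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print("validation accuracy:", model.score(X_val, y_val))
    return model
```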

At block 320, the industrial plant monitor 130 estimates a probability for the identified failure/root cause (e.g., using the machine learning process 200). The probability estimation may involve training a machine learning model (e.g., random forest, decision tree, etc.) using historical log file data, similar to the process described above with respect to block 318. In some embodiments, the device associated with the identified root cause may also be provided as an input for training the machine learning model.

At block 322, the industrial plant monitor 130 predicts a time-to-failure (e.g., using the machine learning process 200) based on the failure/root cause probability determined in block 320. Such a time-to-failure prediction may involve assigning a predefined time interval to a given failure/root cause based on the probability of that failure/root cause. The duration of the pre-failure time interval may be based on, for example, historical log files, system health data, and error data. For example, if the probability of a given failure is greater than 99%, the failure has already occurred and a time-to-failure of zero days may be assigned by the industrial plant monitor 130. If the probability of a given failure is 90% or higher, the failure has occurred or is imminent, and a 24-hour time-to-failure may be assigned.

If the probability of a given failure is greater than 30% but less than 90%, the industrial plant monitor 130 can project the features extracted for that failure forward by one day (i.e., extrapolate the data one day ahead). The industrial plant monitor 130 can then rerun the root cause and probability identification with these projected features to see whether the probability has reached 90%. If so, a one-day time-to-failure is assigned to the failure. If not, the industrial plant monitor 130 increases the projection by one more day and repeats the process until a probability of 90% is reached. If the number of projected days exceeds five, no time-to-failure is assigned.
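The threshold-and-project logic of blocks 320 and 322 could be sketched as follows. The helper project_features, which extrapolates the feature vector forward by a given number of days, is a hypothetical stand-in for the projection step described above.

```python
def assign_time_to_failure(model, features, project_features, max_days=5):
    """Return an estimated time-to-failure in days, or None if none is assigned.

    model: trained classifier exposing predict_proba.
    features: current feature vector for the device.
    project_features: callable(features, days) -> projected feature vector (hypothetical helper).
    """
    prob = max(model.predict_proba([features])[0])   # probability of the most likely failure/root cause
    if prob > 0.99:
        return 0            # failure has already occurred
    if prob >= 0.90:
        return 1            # failure has occurred or is imminent (~24 hours)
    if prob <= 0.30:
        return None         # too unlikely to assign a time to failure
    for days in range(1, max_days + 1):
        projected = project_features(features, days)
        prob = max(model.predict_proba([projected])[0])
        if prob >= 0.90:
            return days     # probability reached 90% after projecting `days` days ahead
    return None             # probability never reached 90% within max_days
```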

In some embodiments, the industrial plant monitor may use a Random Forest Regression (RFR) model to find the time interval before failure (while Random Forest Classification (RFC) may be used to find the root cause). Constructing an RFR involves setting the actual day of failure (e.g., as reported by a field engineer) to be day 0 of the failure in the training data, and then looking at the features in the data going back, for example, over the previous five days. Table 2 shows an exemplary training data set for the RFR model. This data relates to a component issue that caused problems in communication between a control processor and a fieldbus module. Two trends can be seen, in the PIO bus access errors and the fault-tolerant MAC reset counts.

PIO bus access errors    Fault-tolerant MAC reset errors    Number of remaining hours
214                      595                                0
162                      400                                24
162                      200                                48
100                      95                                 72
35                       22                                 96

Table 2: exemplary RFR training data

At block 324, the industrial plant monitor 130 performs pattern matching against the data from the knowledge capture database 326 to determine whether the same or a similar failure/root cause has occurred before and which corrective actions were taken to address the failure. The data stored in the knowledge capture database 326 typically includes maintenance logs and records of actions previously taken by plant personnel to correct various errors that have occurred over time in the DCS 100. These maintenance logs and records, which may include text documents, spreadsheets, and the like, are typically maintained by plant personnel using common words and phrases. Thus, the industrial plant monitor 130 uses natural language processing (NLP) via the machine learning process 200 to extract relevant information from the maintenance logs and records. Natural language processing allows the industrial plant monitor 130 to quickly filter out extraneous words and phrases and focus on key information. Thus, for example, if multiple different corrective actions A, B, and C were taken to address a particular failure because each preceding action was ineffective, the industrial plant monitor 130 can point directly to the final corrective action (action C) that repaired the failure.
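A minimal sketch of the pattern matching in block 324 follows, using TF-IDF similarity to find the most similar past failure in the knowledge capture database and return its final corrective action. The maintenance-log record layout (description, final_action) is an assumed format for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_corrective_action(new_failure_text, maintenance_log):
    """maintenance_log: list of dicts like
    {"description": "CP to FBM comms lost ...", "final_action": "replaced HDLC bus cable"}.
    Returns the final corrective action of the most similar past failure and its similarity."""
    corpus = [entry["description"] for entry in maintenance_log]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(corpus + [new_failure_text])
    similarities = cosine_similarity(matrix[-1], matrix[:-1])[0]
    best = similarities.argmax()
    return maintenance_log[best]["final_action"], similarities[best]
```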

At block 328, the industrial plant monitor 130 provides the above analysis to plant personnel in the form of an HMI (referred to herein as a dashboard). The dashboard is essentially a collection of screens that the industrial plant monitor 130 can generate and display to a user, providing data in an intuitive, context-based format that allows the user to quickly assess potential fault conditions, their criticality, and their root causes, and to determine possible corrective actions. The dashboard graphically visualizes the content of a large number (possibly millions) of log file and system health data entries that have been aggregated and transformed into usable, actionable information. The user may be able to quickly see from the HMI/dashboard that, for example, a switch (e.g., switch TT2061) has a problem in a component (e.g., GBIC17) that may cause the switch to fail very soon (e.g., within the next five days).

FIGS. 5A-5B illustrate an exemplary failure analysis screen 500 of a dashboard of an exemplary industrial plant monitor. Screen 500 shows a switch 502 that is connected to a control processor 506 through an optical fiber 504 via one of several ports 508 of the control processor 506. The control processor 506, in turn, is connected to a fieldbus module 512 via one of several ports 510 by a bus 514 (e.g., HDLC). In the example of FIG. 5A, the industrial plant monitor has determined, based on the log messages from the control processor 506, that a fault condition exists between one of the ports 510 of the control processor 506 and the fieldbus module 512. Additionally, based on these log messages, the industrial plant monitor has determined that the fault condition is likely caused by the bus 514. In contrast, in the example of FIG. 5B, the industrial plant monitor has determined, based on log messages from both the control processor 506 and the switch 502, that the fault condition is likely caused by the fiber optic cable 504.

FIG. 6 illustrates an exemplary device error trend screen 600 of a dashboard that may be generated and displayed by the industrial plant monitor. This screen provides a device error trend, indicated at 602, which shows a graph of error counts for each log source associated with a device. The device in this example is TT2061, and the log sources from which the device data is obtained are indicated at 608. In some embodiments, the log sources may be color coded and/or symbol coded to distinguish between the different log sources. From this screen, the user can quickly see that the error count for the device has begun to increase over a time interval, alerting the user that a potential problem exists with the device during that interval. Date and time information is provided at 604, and a zoom option (e.g., 1 hour, 3 hours, 6 hours, 1 day, 3 days, 1 week, etc.) is provided at 606.

FIG. 7 illustrates an exemplary aggregated alarm screen 700 of a dashboard that may be generated and displayed by the industrial plant monitor. The primary purpose of this screen is to provide a list, indicated at 702, of all alarms and potential fault conditions in the DCS. In the illustrated embodiment, the list 702 includes a date, a device name, a severity indicator, and a message field containing usable and actionable information for each alarm in the list. In this example, the message field contains the root cause identification for the alarm along with a probability estimate for the alarm. Based on the probability estimate (e.g., a 39% probability for the first alarm), the industrial plant monitor predicts the time to failure (e.g., within three days) of the device in question. Thus, the severity indicator may show a severity of "high" for the device. In some embodiments, the severity indicator may be color-based (e.g., red for critical, yellow for high, orange for low, etc.), symbol-based (e.g., exclamation mark for critical, question mark for high, dash for low, etc.), or a combination of both. Additional and/or alternative information, such as the date and time of the current analysis, indicated at 704, may be included in some embodiments. A search box 706 may be included in some embodiments to allow a user to search for alarms and fault conditions using natural language queries. Selecting one of the alarms in the list 702 (e.g., by tapping, double-clicking, etc.), such as the alarm for device 01CP21, brings the user to a detailed alarm screen for that alarm.

In some embodiments, the industrial plant monitor assigns a severity level (e.g., critical, high, low, etc.) to an alarm based on the impact the alarm will have on the continuity of plant and/or business operations. Alarms with more significant impact (e.g., a potential process shutdown) are assigned a higher severity relative to alarms with less impact (e.g., reduced throughput). Thus, for example, switches and control processors may be designated as more critical devices relative to application workstations, FBMs, field devices, and the like. Similarly, an area controlled by a control processor may be assigned a high priority or a medium/low priority based on the functions performed by that area and its impact on business operations. In some cases, the severity assignment may be made manually by an operator during system configuration, and/or the severity assignment may be made continuously by the system using a machine learning algorithm trained with historical alarm training data. In either case, the ability to assign different severity levels to various alarms allows the industrial plant monitor to provide the operator with context for the alarms, so that higher priority can be given to processes/areas in the plant that are affected by critical equipment.
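A minimal sketch of a rule-based severity assignment of the kind an operator might configure during system setup is shown below; the device classes, impact labels, and thresholds are illustrative assumptions.

```python
# Illustrative severity rules: device class and operational impact drive the level.
DEVICE_CRITICALITY = {"switch": 3, "control_processor": 3, "workstation": 2,
                      "fbm": 1, "field_device": 1}

def assign_severity(device_class, impact):
    """impact: 'process_shutdown', 'reduced_throughput', or 'none'."""
    score = DEVICE_CRITICALITY.get(device_class, 1)
    if impact == "process_shutdown":
        score += 2
    elif impact == "reduced_throughput":
        score += 1
    if score >= 5:
        return "critical"
    if score >= 3:
        return "high"
    return "low"
```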

FIG. 8 illustrates an exemplary detailed alarm screen 800 of a dashboard that may be generated and displayed by the industrial plant monitor. As the name implies, this screen displays detailed information about a selected alarm, including error details, such as the date, severity, and analysis details, indicated at 802, and device details, such as the device identification code (letterbug), device description, and software/hardware/firmware versions, indicated at 804. The screen may also provide, at 806, a device error trend similar to that in FIG. 6, which shows a graph 808 of error counts for each log source associated with the device. To provide context, the screen also shows the relevant network portion 810 of the network topology in which the device resides, so that the user can see where the device sits within the DCS. In some embodiments, the screen also provides the option to zoom in and out on the network portion 810 as desired.

FIG. 9 illustrates an exemplary aggregated log message screen of a dashboard that may be generated and displayed by the industrial plant monitor. This screen aggregates log error messages and groups them by log file or system health data source and by time. Thus, the error messages indicated at 902 are obtained from one log file, while the error messages indicated at 904 are obtained from a different log file, and so on. In some embodiments, the error messages may be color coded for easy viewing. From this screen, the user can quickly view error messages for all devices in the DCS that are currently experiencing a fault condition.

Accordingly, as described herein, embodiments of the present disclosure provide systems and methods for detecting and predicting faults in industrial process automation systems. Such embodiments may comprise a special purpose computer including a variety of computer hardware as described in greater detail below.

Embodiments within the scope of the present disclosure also include computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a special purpose computer and includes both computer storage media and communication media. By way of example, and not limitation, computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media is non-transitory and includes, but is not limited to, Random Access Memory (RAM), Read Only Memory (ROM), electrically erasable programmable ROM (EEPROM), compact disc ROM (CD-ROM), Digital Versatile Disks (DVD) or other optical disk storage, Solid State Drives (SSD), magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired non-transitory information in the form of computer-executable instructions or data structures and which can be accessed by a computer. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a computer-readable medium. Thus, any such connection is properly termed a computer-readable medium. Combinations of the above should also be included within the scope of computer-readable media. Computer-executable instructions comprise, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.

The following discussion is intended to provide a brief, general description of a suitable computing environment in which the various aspects of the disclosure may be implemented. Although not required, aspects of the disclosure will be described in the general context of computer-executable instructions, such as program modules, being executed by computers in network environments. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.

Those skilled in the art will appreciate that aspects of the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Aspects of the disclosure may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination of hardwired or wireless links) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

An exemplary system for implementing aspects of the disclosure includes a special purpose computing device in the form of a conventional computer, including a processing unit, a system memory, and a system bus that couples various system components including the system memory to the processing unit. The system bus can be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory includes computer storage media including non-volatile and volatile memory types. A basic input/output system (BIOS), containing the basic routines that help to transfer information between elements within the computer, such as during start-up, may be stored in ROM. Further, the computer may include any device (e.g., a computer, laptop, tablet, PDA, cell phone, mobile phone, smart television, etc.) capable of wirelessly receiving and transmitting an IP address from/to the internet.

The computer may also include a magnetic hard disk drive for reading from and writing to a magnetic hard disk, a magnetic disk drive for reading from or writing to a removable magnetic disk, and an optical disk drive for reading from or writing to a removable optical disk such as a CD-ROM or other optical media. The magnetic hard disk drive, magnetic disk drive, and optical disk drive are connected to the system bus by a hard disk drive interface, a magnetic disk drive interface, and an optical drive interface, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-executable instructions, data structures, program modules and other data for the computer. Although the exemplary environment described herein employs a magnetic hard disk, a removable magnetic disk and a removable optical disk, other types of computer readable media for storing data can be used, including magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, RAMs, ROMs, SSDs, and the like.

Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

Program code means including one or more program modules may be stored on the hard disk, magnetic disk, optical disk, ROM, and/or RAM, including an operating system, one or more application programs, other program modules, and program data. A user may enter commands and information into the computer through a keyboard, pointing device, or other input devices, such as a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit through a serial port interface that is coupled to the system bus. Alternatively, the input devices may be connected by other interfaces, such as a parallel port, game port, or a Universal Serial Bus (USB). A monitor or another display device is also connected to the system bus via an interface, such as a video adapter. In addition to the monitor, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.

One or more aspects of the present disclosure may be embodied in computer-executable instructions (i.e., software), routines, or functions stored as application programs, program modules, and/or program data in system memory or non-volatile memory. Alternatively, the software may be stored remotely, such as on a remote computer having remote application programs. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device. The computer-executable instructions may be stored on one or more tangible, non-transitory computer-readable media (e.g., hard disks, optical disks, removable storage media, solid state memory, RAM, etc.) and executed by one or more processors or other devices. As will be appreciated by one skilled in the art, the functionality of the program modules may be combined or distributed as desired in various embodiments. Additionally, the functionality may be embodied in whole or in part in firmware or hardware equivalents such as integrated circuits, application specific integrated circuits, Field Programmable Gate Arrays (FPGAs), etc.
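
By way of a non-limiting illustration of how such computer-executable instructions might be organized into a program module, the following Python sketch defines a small data structure and routines that a processor could execute. The module layout, names, and fields are hypothetical assumptions introduced solely for illustration and do not represent the disclosed implementation; as noted above, equivalent functionality may be combined, distributed, or realized in firmware or hardware.

```python
# Non-limiting illustrative sketch of a program module: a data structure
# (Record), routines that perform particular tasks (load_records, summarize),
# and an entry point a processor could execute. All names and fields are
# hypothetical and are not the disclosed implementation.
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Record:
    """A simple abstract data type implemented by the module."""
    source: str
    status: str

def load_records(path: Path) -> list:
    """Routine: read previously stored records from a JSON file."""
    with path.open() as fh:
        return [Record(**row) for row in json.load(fh)]

def summarize(records: list) -> dict:
    """Routine: count records per reported status."""
    counts = {}
    for rec in records:
        counts[rec.status] = counts.get(rec.status, 0) + 1
    return counts

if __name__ == "__main__":
    demo = [Record("device-A", "ok"), Record("device-B", "ok")]
    print(summarize(demo))  # e.g., {'ok': 2}
```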

The computer may operate in a networked environment using logical connections to one or more remote computers. The remote computers may each be another personal computer, a tablet computer, a PDA, a server, a router, a network PC, a peer device or other common network node, and typically include many or all of the elements described above relative to the computer. Logical connections include a Local Area Network (LAN) and a Wide Area Network (WAN) that are presented here by way of example and not limitation. Such networking environments are commonplace in office-wide or enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer is connected to the local network through a network interface or adapter. When used in a WAN networking environment, the computer may include a modem, a wireless link, or other means for establishing communications over the wide area network, such as the Internet. A modem, which may be internal or external, is connected to the system bus via the serial port interface. In a networked environment, program modules depicted relative to the computer, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing communications over the wide area network may be used.

Preferably, the computer executable instructions are stored in a memory, such as a hard disk drive, and executed by the computer. Advantageously, the computer processor has the capability to perform all operations (e.g., execute computer-executable instructions) in real time.

The order of execution/performance of the operations in the embodiments of the present disclosure shown and described herein is not essential, unless otherwise specified. That is, the operations may be performed in any order, unless otherwise specified, and embodiments of the disclosure may include additional or fewer operations than those disclosed herein. For example, it is contemplated that executing a particular operation before, contemporaneously with, or after another operation is within the scope of aspects of the disclosure.

Embodiments of the present disclosure may be implemented with computer-executable instructions. The computer-executable instructions may be organized into one or more computer-executable components or modules. Aspects of the disclosure may be implemented with any number and organization of such components or modules. For example, aspects of the disclosure are not limited to the specific computer-executable instructions or the specific components or modules illustrated in the figures and described herein. Other embodiments of the disclosure may include different computer-executable instructions or components having more or less functionality than illustrated and described herein.

When introducing elements of aspects of the present disclosure or embodiments thereof, the articles "a," "an," "the," and "said" are intended to mean that there are one or more of the elements. The terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements other than the listed elements.

Having described aspects of the present disclosure in detail, it will be apparent that many modifications and variations are possible without departing from the scope of aspects of the disclosure as defined in the appended claims. As various changes could be made in the above constructions, products, and methods without departing from the scope of aspects of the disclosure, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.
