Method and device for processing machine room abnormity based on temperature monitoring

Document No.: 1413393. Publication date: 2020-03-10.

Reading note: this technology, "Method and device for processing machine room abnormity based on temperature monitoring" (一种基于温度监测的机房异常的处理方法和装置), was designed by 刘刚 on 2018-09-04. Abstract: The invention provides a method and a device for processing machine room abnormity based on temperature monitoring. The method comprises: acquiring temperature-related CPU indicators of each server in a machine room and the ambient temperature in the machine room; judging whether the acquired ambient temperature is within a first preset temperature range and, if so, determining that the machine room is operating normally; if not, judging from the acquired temperature-related CPU indicators whether any server is operating abnormally; if such a server exists, performing exception handling on it; and if no such server exists, determining that the machine room itself is operating abnormally and performing exception handling on the machine room. Because abnormal conditions of the machine room and the servers are checked and handled directly from their temperature information, response time is shortened, which reduces losses, and manual intervention is reduced, which lowers labor costs.

1. A method for processing machine room abnormity based on temperature monitoring, comprising:

acquiring temperature-related CPU indicators of each server in a machine room and the ambient temperature in the machine room;

judging whether the acquired ambient temperature of the machine room is within a first preset temperature range, and if so, determining that the machine room is operating normally;

if not, judging, according to the acquired temperature-related CPU indicators of each server, whether any server is operating abnormally;

if a server is operating abnormally, performing exception handling on that server;

and if no server is operating abnormally, determining that the machine room is operating abnormally, and performing exception handling on the machine room.

2. The method of claim 1, wherein the temperature-related CPU indicator for each server comprises a CPU temperature for each server;

wherein judging whether a server with abnormal operation exists according to the acquired temperature-related CPU indicators of each server comprises:

judging whether the acquired CPU temperature of each server is within a second preset temperature range or not;

if the acquired CPU temperature of a certain server is not in the second preset temperature range, determining that the server with the CPU temperature not in the second preset temperature range is abnormal in operation;

and if the acquired CPU temperature of each server is within the second preset temperature range, determining that no server with abnormal operation exists.

3. The method of claim 2, wherein the temperature-related CPU metric for each server further comprises a CPU idle time percentage for each server;

wherein, if the acquired CPU temperature of a certain server is not within the second preset temperature range, determining that the server is operating abnormally comprises:

if the acquired CPU temperature of a certain server is not within the second preset temperature range, judging whether the CPU idle time percentage of that server is higher than a preset threshold;

and if so, determining that the server whose CPU temperature is not within the second preset temperature range is operating abnormally.

4. The method according to claim 2 or 3, wherein the first preset temperature range is a temperature range set manually or a temperature range calculated according to historical data of the ambient temperature of the machine room;

the second preset temperature range is a temperature range set manually or a temperature range calculated according to historical data of the CPU temperature of the server.

5. The method according to any one of claims 2-4, further comprising:

saving the acquired environmental temperature in the machine room as historical data of the environmental temperature of the machine room;

drawing a machine room environment temperature historical change curve according to the stored historical data of the machine room environment temperature, and recording abnormal events corresponding to abnormal change sections in the machine room environment temperature historical change curve and the characteristic attributes of the abnormal events;

in this case, performing exception handling on the machine room comprises:

comparing whether the change trend of the currently acquired machine room ambient temperature is the same as the change trend of the abnormal change section in the historical change curve of the machine room ambient temperature;

and if so, performing exception handling on the machine room according to the exception event corresponding to the exception change section and the characteristic attribute of the exception event.

6. The method according to any one of claims 2-5, further comprising:

saving the acquired CPU temperature of each server as historical data of the CPU temperature of each server;

drawing a CPU temperature historical change curve of each server according to the stored historical data of the CPU temperature of each server, and recording abnormal events corresponding to abnormal change sections in the CPU temperature historical change curve and characteristic attributes of the abnormal events;

in this case, performing exception handling on the server with abnormal operation comprises:

comparing whether the change trend of the CPU temperature of the server with the abnormal operation obtained currently is the same as the change trend of an abnormal change section in the historical change curve of the CPU temperature of the server;

and if so, carrying out exception handling on the server with the abnormal operation according to the abnormal event corresponding to the abnormal change section and the characteristic attribute of the abnormal event.

7. The method of claim 5 or 6, wherein the characteristic attribute of the exception event comprises a processing priority of the exception event.

8. A processing apparatus for machine room abnormity based on temperature monitoring comprises:

a temperature acquisition module, adapted to acquire the temperature-related CPU indicators of each server in a machine room and the ambient temperature in the machine room;

a machine room judgment module, adapted to judge whether the acquired ambient temperature of the machine room is within a first preset temperature range, and if so, to determine that the machine room is operating normally;

a server judgment module, adapted to judge, if the acquired ambient temperature of the machine room is not within the first preset temperature range, whether any server is operating abnormally according to the acquired temperature-related CPU indicators of each server;

a server processing module, adapted to perform exception handling on a server that is operating abnormally, if such a server exists; and

a machine room processing module, adapted to determine, if no server is operating abnormally, that the machine room is operating abnormally, and to perform exception handling on the machine room.

9. A computer storage medium having stored thereon computer program code which, when run on a computing device, causes the computing device to execute a method of handling a room anomaly based on temperature monitoring according to any one of claims 1-7.

10. A computing device, comprising:

a processor; and

a memory storing computer program code;

the computer program code, when executed by the processor, causes the computing device to perform a method of handling a room anomaly based on temperature monitoring according to any one of claims 1-7.

Technical Field

The invention relates to the technical field of computer networks, in particular to a method for processing machine room abnormity based on temperature monitoring, a device for processing machine room abnormity based on temperature monitoring, a computer storage medium and a computing device.

Background

With the development of computer information systems, the computer room, as the place that houses core equipment such as network devices and host servers, is becoming increasingly important. A computer room generally refers to a place where telecommunications, internet, mobile, dual-line, and power operators, as well as governments or enterprises, house servers and provide IT services to users and employees. Large equipment rooms, such as IDC (Internet Data Center) equipment rooms, usually hold thousands of cabinets or more, in which various servers and minicomputers are placed. To keep the equipment in a machine room operating normally, the machine room must be operated and maintained, so that when an accident causes a hardware fault that affects normal operation, maintenance and technical support from the equipment supplier or the machine room service personnel can be obtained in time and the fault can be resolved quickly.

In the prior art, an operator usually only places thermometers in different areas of the computer room to monitor its temperature. When the temperature of an area is found to be abnormal, the owner of that area is notified that servers in the area may be down. The owner then informs an operation and maintenance engineer, who inspects the site manually, troubleshoots, and handles the fault accordingly. Moreover, troubleshooting a server conventionally starts with logging in to the server; if login fails, the network is checked; and if the network is normal but login still fails, the CPU indicators, operation logs, and so on are checked in turn. This way of handling faults makes the response time too long to deal with emergencies quickly. In addition, the manual troubleshooting steps are cumbersome and increase labor costs.

Disclosure of Invention

In view of the above, the present invention has been made in order to provide a method for processing a machine room abnormality based on temperature monitoring, a device for processing a machine room abnormality based on temperature monitoring, a computer storage medium, and a computing device that overcome or at least partially solve the above problems.

According to an aspect of the embodiments of the present invention, a method for processing machine room abnormality based on temperature monitoring is provided, including:

acquiring temperature-related CPU indicators of each server in a machine room and the ambient temperature in the machine room;

judging whether the acquired ambient temperature of the machine room is within a first preset temperature range, and if so, determining that the machine room is operating normally;

if not, judging, according to the acquired temperature-related CPU indicators of each server, whether any server is operating abnormally;

if a server is operating abnormally, performing exception handling on that server;

and if no server is operating abnormally, determining that the machine room is operating abnormally, and performing exception handling on the machine room.

Optionally, the CPU index of each server related to temperature includes a CPU temperature of each server;

wherein judging whether a server with abnormal operation exists according to the acquired temperature-related CPU indicators of each server comprises:

judging whether the acquired CPU temperature of each server is within a second preset temperature range or not;

if the acquired CPU temperature of a certain server is not in the second preset temperature range, determining that the server with the CPU temperature not in the second preset temperature range is abnormal in operation;

and if the acquired CPU temperature of each server is within the second preset temperature range, determining that no server with abnormal operation exists.
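The per-server check described above can be sketched as follows. This is a minimal Python illustration rather than the implementation of the invention; the function name `find_abnormal_servers`, the dictionary layout, the server names, and the example range of 30 to 75 degrees Celsius are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def find_abnormal_servers(
    cpu_temps: Dict[str, float],
    second_range: Tuple[float, float],
) -> List[str]:
    """Return the servers whose CPU temperature lies outside second_range."""
    low, high = second_range
    return [
        name
        for name, temp in cpu_temps.items()
        if not (low <= temp <= high)
    ]


# Hypothetical readings in degrees Celsius (illustrative values only).
readings = {"srv-01": 52.0, "srv-02": 91.5, "srv-03": 48.0}
abnormal = find_abnormal_servers(readings, second_range=(30.0, 75.0))
```

An empty result corresponds to the case in which no server with abnormal operation exists.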

Optionally, the CPU index of each server related to temperature further includes a CPU idle time percentage of each server;

wherein, if the acquired CPU temperature of a certain server is not within the second preset temperature range, determining that the server is operating abnormally comprises:

if the acquired CPU temperature of a certain server is not within the second preset temperature range, judging whether the CPU idle time percentage of that server is higher than a preset threshold;

and if so, determining that the server whose CPU temperature is not within the second preset temperature range is operating abnormally.
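The combined check can be sketched as below, assuming a small record type for the two indicators. The names `CpuMetrics` and `is_server_abnormal` and the threshold values are hypothetical; the text only states that an out-of-range CPU temperature together with an idle percentage above a preset threshold marks the server as abnormal.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CpuMetrics:
    temperature: float   # CPU temperature in degrees Celsius
    idle_percent: float  # CPU idle time percentage, 0 to 100


def is_server_abnormal(
    metrics: CpuMetrics,
    second_range: Tuple[float, float],
    idle_threshold: float,
) -> bool:
    """Abnormal only if the temperature is out of range AND the CPU is mostly idle."""
    low, high = second_range
    if low <= metrics.temperature <= high:
        return False
    # Out-of-range temperature: confirm with the idle time percentage.
    return metrics.idle_percent > idle_threshold


hot_and_idle = is_server_abnormal(CpuMetrics(92.0, 85.0), (30.0, 75.0), 60.0)
hot_but_busy = is_server_abnormal(CpuMetrics(92.0, 5.0), (30.0, 75.0), 60.0)
```

One plausible reading of this design is that a hot CPU under heavy load may simply be busy, whereas a hot CPU that is mostly idle more likely points to a fault.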

Optionally, the first preset temperature range is a temperature range set manually or a temperature range calculated according to historical data of the environmental temperature of the machine room;

the second preset temperature range is a temperature range set manually or a temperature range calculated according to historical data of the CPU temperature of the server.
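The text does not say how a range is calculated from historical data. One common choice, shown here purely as an assumption, is the historical mean plus or minus `k` standard deviations; the function name `range_from_history` and the sample values are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def range_from_history(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Derive a (low, high) temperature range as mean +/- k population standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    return (mean - k * spread, mean + k * spread)


# Hypothetical ambient temperature history in degrees Celsius.
ambient_history = [21.5, 22.0, 22.4, 21.8, 22.1, 22.3]
first_range = range_from_history(ambient_history)
```

The same helper could be applied to a server's CPU temperature history to obtain the second preset temperature range.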

Optionally, the method further comprises:

saving the acquired environmental temperature in the machine room as historical data of the environmental temperature of the machine room;

drawing a machine room environment temperature historical change curve according to the stored historical data of the machine room environment temperature, and recording abnormal events corresponding to abnormal change sections in the machine room environment temperature historical change curve and the characteristic attributes of the abnormal events;

in this case, performing exception handling on the machine room comprises:

comparing whether the change trend of the currently acquired machine room ambient temperature is the same as the change trend of the abnormal change section in the historical change curve of the machine room ambient temperature;

and if so, performing exception handling on the machine room according to the exception event corresponding to the exception change section and the characteristic attribute of the exception event.
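The trend comparison can be illustrated with the sketch below. How a "change trend" is represented is not specified in the text, so this example assumes a simple encoding: each step of a temperature series is classified as rising, flat, or falling, and two series have the same trend when their encodings are identical (which also requires the same number of samples). The names `AbnormalSegment`, `trend`, and `match_abnormal_segment`, the tolerance, and the convention that a lower priority number is more urgent are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class AbnormalSegment:
    samples: List[float]  # temperatures from an abnormal section of the historical curve
    event: str            # abnormal event recorded for that section
    priority: int         # characteristic attribute (assumed: lower means more urgent)


def trend(samples: Sequence[float], tolerance: float = 0.5) -> List[int]:
    """Encode each step as +1 (rising), -1 (falling), or 0 (within tolerance)."""
    signs = []
    for prev, cur in zip(samples, samples[1:]):
        delta = cur - prev
        if delta > tolerance:
            signs.append(1)
        elif delta < -tolerance:
            signs.append(-1)
        else:
            signs.append(0)
    return signs


def match_abnormal_segment(
    current: Sequence[float],
    segments: Sequence[AbnormalSegment],
) -> Optional[AbnormalSegment]:
    """Return the first recorded segment whose trend equals the current trend, if any."""
    current_trend = trend(current)
    for segment in segments:
        if trend(segment.samples) == current_trend:
            return segment
    return None


recorded = [AbnormalSegment([22.0, 25.0, 29.0, 34.0], "air conditioner failure", 1)]
matched = match_abnormal_segment([23.0, 26.0, 30.0, 36.0], recorded)
```

When a segment matches, its recorded event and attributes can guide the handling; a real system would likely need a more tolerant similarity measure than exact equality.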

Optionally, the method further comprises:

saving the acquired CPU temperature of each server as historical data of the CPU temperature of each server;

drawing a CPU temperature historical change curve of each server according to the stored historical data of the CPU temperature of each server, and recording abnormal events corresponding to abnormal change sections in the CPU temperature historical change curve and characteristic attributes of the abnormal events;

in this case, performing exception handling on the server with abnormal operation comprises:

comparing whether the change trend of the CPU temperature of the server with the abnormal operation obtained currently is the same as the change trend of an abnormal change section in the historical change curve of the CPU temperature of the server;

and if so, carrying out exception handling on the server with the abnormal operation according to the abnormal event corresponding to the abnormal change section and the characteristic attribute of the abnormal event.

Optionally, the characteristic attribute of the exception event comprises a processing priority of the exception event.
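As a small illustration of how a processing priority might be used, the sketch below orders pending abnormal events with a heap. The event names and the convention that a lower number means higher urgency are assumptions for the example, not part of the described method.

```python
import heapq
from typing import List, Sequence, Tuple


def handle_in_priority_order(events: Sequence[Tuple[str, int]]) -> List[str]:
    """Return event names in handling order (assumed: lower priority number first)."""
    # The index breaks ties so that equal-priority events keep their arrival order.
    heap = [(priority, index, name) for index, (name, priority) in enumerate(events)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


pending = [("fan degraded", 3), ("fire", 1), ("air conditioner failure", 2)]
order = handle_in_priority_order(pending)
```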

Optionally, the method further comprises:

determining the distribution of the servers with abnormal operation in the machine room;

and if two or more adjacent servers are operating abnormally, performing exception handling on those adjacent servers preferentially.
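A possible way to give adjacent abnormal servers precedence is sketched below. It assumes, for illustration only, that each server's position can be expressed as an integer slot number and that "adjacent" means consecutive slots; the function name `prioritize_adjacent` and the sample layout are hypothetical.

```python
from typing import Dict, List


def prioritize_adjacent(abnormal_slots: Dict[str, int]) -> List[str]:
    """Order abnormal servers so that those with an abnormal neighbor come first."""
    occupied = set(abnormal_slots.values())

    def has_abnormal_neighbor(slot: int) -> bool:
        return (slot - 1) in occupied or (slot + 1) in occupied

    clustered = sorted(
        name for name, slot in abnormal_slots.items() if has_abnormal_neighbor(slot)
    )
    isolated = sorted(
        name for name, slot in abnormal_slots.items() if not has_abnormal_neighbor(slot)
    )
    return clustered + isolated


# Hypothetical slot numbers of the servers currently judged abnormal.
handling_order = prioritize_adjacent({"srv-a": 3, "srv-b": 9, "srv-c": 4})
```

A cluster of neighboring abnormal servers may indicate a localized problem, such as a hot spot, which is presumably why such servers are handled first.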

Optionally, the exception handling for the server with the abnormal operation includes at least one of:

switching the server with abnormal operation to a standby server;

alarming;

adjusting the temperature of an air conditioner in the machine room;

and shutting down the abnormally operating server so that it can cool.

Optionally, the exception handling for the machine room includes at least one of:

switching the machine room to a standby machine room;

alarming;

automatically carrying out physical fire extinguishing;

and clearing faults in the air conditioning equipment.

According to another aspect of the embodiments of the present invention, there is also provided a device for processing machine room abnormality based on temperature monitoring, including:

a temperature acquisition module, adapted to acquire the temperature-related CPU indicators of each server in a machine room and the ambient temperature in the machine room;

a machine room judgment module, adapted to judge whether the acquired ambient temperature of the machine room is within a first preset temperature range, and if so, to determine that the machine room is operating normally;

a server judgment module, adapted to judge, if the acquired ambient temperature of the machine room is not within the first preset temperature range, whether any server is operating abnormally according to the acquired temperature-related CPU indicators of each server;

a server processing module, adapted to perform exception handling on a server that is operating abnormally, if such a server exists; and

a machine room processing module, adapted to determine, if no server is operating abnormally, that the machine room is operating abnormally, and to perform exception handling on the machine room.

Optionally, the CPU index of each server related to temperature includes a CPU temperature of each server;

the server determination module is further adapted to:

judging whether the acquired CPU temperature of each server is within a second preset temperature range or not;

if the acquired CPU temperature of a certain server is not in the second preset temperature range, determining that the server with the CPU temperature not in the second preset temperature range is abnormal in operation;

and if the acquired CPU temperature of each server is within the second preset temperature range, determining that no server with abnormal operation exists.

Optionally, the CPU index of each server related to temperature further includes a CPU idle time percentage of each server;

the server determination module is further adapted to:

if the acquired CPU temperature of a certain server is not in the second preset temperature range, judging whether the CPU idle time percentage of the server of which the CPU temperature is not in the second preset temperature range is higher than a preset threshold value or not;

and if so, determining that the server of which the CPU temperature is not in the second preset temperature range is abnormal in operation.

Optionally, the first preset temperature range is a temperature range set manually or a temperature range calculated according to historical data of the environmental temperature of the machine room;

the second preset temperature range is a temperature range set manually or a temperature range calculated according to historical data of the CPU temperature of the server.

Optionally, the apparatus further comprises:

the first data storage module is suitable for storing the acquired environmental temperature in the machine room as historical data of the environmental temperature of the machine room;

the first change curve drawing module is suitable for drawing a machine room environment temperature historical change curve according to the stored historical data of the machine room environment temperature, and recording abnormal events corresponding to abnormal change sections in the machine room environment temperature historical change curve and the characteristic attributes of the abnormal events;

in this case, the machine room processing module is further adapted to:

comparing whether the change trend of the currently acquired machine room ambient temperature is the same as the change trend of the abnormal change section in the historical change curve of the machine room ambient temperature;

and if so, performing exception handling on the machine room according to the exception event corresponding to the exception change section and the characteristic attribute of the exception event.

Optionally, the apparatus further comprises:

the second data storage module is suitable for storing the acquired CPU temperature of each server as historical data of the CPU temperature of each server;

the second change curve drawing module is suitable for drawing the CPU temperature historical change curve of each server according to the stored historical data of the CPU temperature of each server, and recording abnormal events corresponding to abnormal change sections in the CPU temperature historical change curve and the characteristic attributes of the abnormal events;

in this case, the server processing module is further adapted to:

comparing whether the change trend of the CPU temperature of the server with the abnormal operation obtained currently is the same as the change trend of an abnormal change section in the historical change curve of the CPU temperature of the server;

and if so, carrying out exception handling on the server with the abnormal operation according to the abnormal event corresponding to the abnormal change section and the characteristic attribute of the abnormal event.

Optionally, the characteristic attribute of the exception event comprises a processing priority of the exception event.

Optionally, the apparatus further comprises:

an abnormal distribution determining module, adapted to determine the distribution, within the machine room, of the servers that are operating abnormally, and, if two or more adjacent servers are operating abnormally, to trigger the server processing module to handle those adjacent servers preferentially.

Optionally, the exception handling for the server with the abnormal operation includes at least one of:

switching the server with abnormal operation to a standby server;

alarming;

adjusting the temperature of an air conditioner in the machine room;

and shutting down the abnormally operating server so that it can cool.

Optionally, the exception handling for the machine room includes at least one of:

switching the machine room to a standby machine room;

alarming;

automatically carrying out physical fire extinguishing;

and clearing faults in the air conditioning equipment.

According to a further aspect of the embodiments of the present invention, there is also provided a computer storage medium storing computer program code, which, when run on a computing device, causes the computing device to execute the method for processing machine room abnormality based on temperature monitoring according to any one of the above.

According to still another aspect of the embodiments of the present invention, there is also provided a computing device including:

a processor; and

a memory storing computer program code;

when executed by the processor, the computer program code causes the computing device to perform the method for processing machine room abnormity based on temperature monitoring according to any one of the above.

The method and device for processing machine room abnormity based on temperature monitoring provided by the embodiments of the present invention first acquire the temperature-related CPU indicators of each server in the machine room and the ambient temperature in the machine room; then judge from the acquired ambient temperature whether the machine room is operating normally; and, if not, further judge from the acquired temperature-related CPU indicators whether any server is operating abnormally. If such a server exists, exception handling is performed on it; if not, exception handling is performed on the machine room. Because abnormal conditions of the machine room and the servers are checked and handled directly from their temperature information, response time is shortened and losses are reduced, while manual intervention and labor costs are also reduced.

Furthermore, when judging whether a server is abnormal, the CPU temperature and the CPU idle time percentage of the server are considered together, so that server abnormalities can be identified more accurately.

Further, the acquired machine room ambient temperature and server CPU temperatures are saved as historical data, from which a historical change curve of the machine room ambient temperature and a historical change curve of each server's CPU temperature are drawn. For each curve, the abnormal events corresponding to its abnormal change sections, together with the characteristic attributes of those events, are recorded. When exception handling is performed on the machine room and/or a server, the change trend of the currently acquired machine room ambient temperature is compared with that of the abnormal change sections in the machine room curve, and/or the change trend of the currently acquired CPU temperature of the abnormally operating server is compared with that of the abnormal change sections in that server's curve. The comparison determines the processing priority of the machine room and/or server abnormal events, and exception handling is then performed according to that priority. This prevents the large losses caused by emergencies not being handled in time and improves the disaster tolerance of the machine room.

The foregoing is only an overview of the technical solutions of the present invention. To make the technical means of the invention clearer, and to make the above and other objects, features, and advantages more readily understandable, specific embodiments of the invention are described below.

The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

Fig. 1 is a flowchart illustrating a method for processing machine room abnormity based on temperature monitoring according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating a method for processing machine room abnormity based on temperature monitoring according to another embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a device for processing machine room abnormity based on temperature monitoring according to an embodiment of the present invention; and

Fig. 4 is a schematic structural diagram of a device for processing machine room abnormity based on temperature monitoring according to another embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The inventor found that when the CPU of a server fails, the CPU temperature is usually abnormal. In current machine room operation and maintenance, however, personnel are only notified that the temperature of an area of the machine room is abnormal and must then troubleshoot the fault manually, step by step. For example, troubleshooting a server conventionally starts with logging in to the server; if login fails, the network is checked; and if the network is normal but login still fails, the CPU indicators, operation logs, and so on are checked in turn. As a result, the response time is too long to cope with emergencies quickly. In addition, the manual troubleshooting steps are cumbersome and increase labor costs.

In order to solve the technical problem, an embodiment of the present invention provides a method for processing a machine room exception based on temperature monitoring. Fig. 1 shows a flowchart of a method for processing a machine room exception based on temperature monitoring according to an embodiment of the present invention. Referring to fig. 1, the method may include at least the steps of:

Step S102: acquiring the temperature-related CPU indicators of each server in the machine room and the ambient temperature in the machine room.

Step S104: judging whether the acquired ambient temperature of the machine room is within a first preset temperature range, and if so, determining that the machine room is operating normally.

Step S106: if not, judging, according to the acquired temperature-related CPU indicators of each server, whether any server is operating abnormally.

Step S108: if a server is operating abnormally, performing exception handling on that server.

Step S110: if no server is operating abnormally, determining that the machine room is operating abnormally, and performing exception handling on the machine room.

The method for processing machine room abnormity based on temperature monitoring provided by the embodiment of the invention may be executed by a designated monitoring management platform. Preferably, the designated monitoring management platform is an IPMI (Intelligent Platform Management Interface) platform. IPMI is an industry standard for managing peripheral devices used in Intel-based enterprise systems. It is also an open, free standard that users can adopt without paying additional fees. IPMI works across different operating systems, firmware and hardware platforms, and can intelligently monitor, control and automatically report the operating conditions of a large number of servers, such as temperature, voltage, fan status and power status, thereby reducing the cost of the server system.

The method for processing machine room abnormity based on temperature monitoring provided by the embodiment of the invention first acquires the temperature-related CPU indexes of each server in the machine room and the ambient temperature in the machine room. It then judges, from the acquired machine room ambient temperature, whether the machine room is operating normally; if not, it further judges, from the acquired temperature-related CPU indexes of each server, whether an abnormally operating server exists. If an abnormally operating server exists, exception handling is performed on that server; if none exists, exception handling is performed on the machine room. Because abnormal conditions of the machine room and the servers are checked and handled directly from their temperature information, the response time is shortened and losses are reduced, while manual intervention and therefore labor cost are also reduced.
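The decision flow of steps S104 to S110 can be sketched as follows. This is only a minimal illustration under stated assumptions: the names `Diagnosis`, `in_range` and `diagnose` are hypothetical, the range limits are treated as inclusive, and the CPU index is reduced to the CPU temperature alone.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # (lower limit, upper limit) in degrees Celsius


@dataclass
class Diagnosis:
    room_normal: bool            # outcome of S104
    abnormal_servers: List[str]  # servers to handle in S108
    room_abnormal: bool          # True when S110 applies


def in_range(value: float, limits: Range) -> bool:
    # Assumption: both limits are inclusive.
    low, high = limits
    return low <= value <= high


def diagnose(room_temp: float, cpu_temps: Dict[str, float],
             room_range: Range, cpu_range: Range) -> Diagnosis:
    # S104: ambient temperature inside the first preset range -> room is normal.
    if in_range(room_temp, room_range):
        return Diagnosis(True, [], False)
    # S106: otherwise look for servers whose CPU temperature is out of range.
    abnormal = [name for name, temp in cpu_temps.items()
                if not in_range(temp, cpu_range)]
    # S108 / S110: handle the servers if any are abnormal, else the room itself.
    return Diagnosis(False, abnormal, room_abnormal=not abnormal)
```

With a room range of 19.4-25.4 ℃ and a CPU range of 44.5-64.5 ℃, an ambient reading of 30 ℃ together with one server at 90 ℃ yields that server in `abnormal_servers`, while the same ambient reading with all servers in range marks the room itself as abnormal.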

In step S102 above, the temperature-related CPU indexes of each server in the machine room and the ambient temperature in the machine room may be acquired by the designated monitoring management platform, in particular the IPMI platform, via sensors (e.g., temperature sensors).
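As one possible way of acquiring such readings, the sketch below parses the text printed by the `ipmitool sdr type Temperature` command. It is an illustration rather than the platform described here: the function names are hypothetical, and the sensor names and exact column layout are assumptions, since they differ between vendors and firmware versions.

```python
import re
import subprocess
from typing import Dict

# Matches lines such as "CPU1 Temp | 01h | ok | 3.1 | 45 degrees C".
# Assumption: the reading is the last column and ends in "degrees C".
_LINE = re.compile(
    r"^(?P<name>[^|]+)\|.*\|\s*(?P<value>-?\d+(?:\.\d+)?) degrees C\s*$")


def parse_temperatures(sdr_output: str) -> Dict[str, float]:
    """Extract sensor name -> temperature in degrees Celsius."""
    readings = {}
    for line in sdr_output.splitlines():
        match = _LINE.match(line)
        if match:  # lines without a numeric reading are skipped
            readings[match.group("name").strip()] = float(match.group("value"))
    return readings


def read_temperatures() -> Dict[str, float]:
    """Query the local BMC; needs ipmitool installed and enough privileges."""
    result = subprocess.run(["ipmitool", "sdr", "type", "Temperature"],
                            capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)
```

Keeping the parser separate from the command invocation allows the parsing to be checked against captured output without access to real hardware.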

In step S104, it is determined whether the room ambient temperature is abnormal by comparing the acquired room ambient temperature with the upper limit value and the lower limit value of the first preset temperature range. The first preset temperature range mentioned here may be a temperature range in which the machine room normally operates, which is set manually, or a temperature range calculated according to historical data of the environmental temperature of the machine room.

In one embodiment, the highest and lowest values in the historical ambient temperature data of a normally operating machine room in each designated time interval (e.g., every minute, every hour, every day, etc.) may each be summed and averaged to obtain an average of the highest values and an average of the lowest values, and the first preset temperature range may then be obtained from these two averages. For example, if in five days of historical data the daily maximum values of the normal machine room ambient temperature are 24 ℃, 25 ℃, 24 ℃ and the daily minimum values are 20 ℃, 21 ℃, 20 ℃, 22 ℃ and 19 ℃, then summing and averaging the maximum values and the minimum values separately gives a maximum average of 24.4 ℃ and a minimum average of 20.4 ℃, and hence a first preset temperature range of 20.4-24.4 ℃.

In another alternative embodiment, the historical ambient temperature data of a normally operating machine room may be summed and averaged to obtain an average machine room ambient temperature, and the first preset temperature range may then be obtained from this average and the allowable fluctuation range of the machine room ambient temperature. For example, if the average machine room ambient temperature is calculated to be 22.4 ℃ and the allowable fluctuation range is ± 3 ℃, the first preset temperature range is 19.4-25.4 ℃. It should be noted that the above ways of calculating the first preset temperature range are only exemplary, and the present invention is not limited thereto.
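The average-plus-fluctuation calculation can be sketched as below. The function name `range_from_mean` and its parameters are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import fmean
from typing import Sequence, Tuple


def range_from_mean(history: Sequence[float],
                    tolerance: float) -> Tuple[float, float]:
    """Return (mean - tolerance, mean + tolerance) for normal-operation readings."""
    if not history:
        raise ValueError("history must contain at least one reading")
    mean = fmean(history)
    return mean - tolerance, mean + tolerance
```

For an average of 22.4 ℃ and a tolerance of 3 ℃ this gives approximately 19.4-25.4 ℃, up to floating-point rounding.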

In step S106, if the acquired room environment temperature is not within the first preset temperature range, it is determined whether there is a server with abnormal operation according to the acquired CPU index of each server related to the temperature.

In an optional embodiment of the invention, the obtained temperature-related CPU index for each server comprises a CPU temperature for each server. Accordingly, the step of determining whether there is a server with abnormal operation according to the acquired CPU index of each server related to the temperature may be implemented as follows:

judging whether the acquired CPU temperature of each server is within a second preset temperature range or not;

if the acquired CPU temperature of a certain server is not in a second preset temperature range, determining that the server with the CPU temperature not in the second preset temperature range is abnormal in operation;

and if the acquired CPU temperature of each server is within the second preset temperature range, determining that no server with abnormal operation exists.

The CPU temperature of each server is preferably obtained by an IPMI platform using a temperature sensor installed in each server.

The second preset temperature range mentioned here may be a CPU temperature range in which the server normally operates, which is set manually, or a temperature range calculated from historical data of the CPU temperature of the server.

In one embodiment, the highest and lowest values in the CPU temperature history data of a normally operating server in each designated time interval (e.g., every minute, every hour, every day, etc.) may each be summed and averaged to obtain an average of the highest values and an average of the lowest values, and the second preset temperature range may then be obtained from these two averages. For example, if in five days of history data the daily maximum values of the normal CPU temperature of the server are 63 ℃, 65 ℃, 64 ℃, 65 ℃ and 62 ℃ and the daily minimum values are 45 ℃, 44 ℃, 46 ℃, 45 ℃ and 46 ℃, then summing and averaging the maximum values and the minimum values separately gives a maximum average of 63.8 ℃ and a minimum average of 45.2 ℃, and hence a second preset temperature range of 45.2-63.8 ℃.
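The averaging of per-interval extremes can be sketched as follows, using the five-day CPU temperature figures from the example above. The function name `range_from_extremes` is hypothetical; since only the daily lowest and highest values are given, each day is represented by just those two readings.

```python
from statistics import fmean
from typing import Sequence, Tuple


def range_from_extremes(
        intervals: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Average each interval's minimum and maximum to form (lower, upper)."""
    if not intervals or any(len(readings) == 0 for readings in intervals):
        raise ValueError("every interval needs at least one reading")
    lower = fmean(min(readings) for readings in intervals)
    upper = fmean(max(readings) for readings in intervals)
    return lower, upper


# Daily (lowest, highest) CPU temperatures from the five-day example.
days = [(45, 63), (44, 65), (46, 64), (45, 65), (46, 62)]
cpu_range = range_from_extremes(days)  # (45.2, 63.8)
```

The same function applies to the machine room ambient temperature when each interval holds the readings taken during that interval.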

In another alternative embodiment, the CPU temperature average value may also be obtained by adding and averaging CPU temperature history data of a server that normally operates, and then the second preset temperature range may be obtained according to the calculated CPU temperature average value and the allowable fluctuation range of the CPU temperature. For example, the average value of the CPU temperature of the server is calculated to be 54.5 ℃, and the allowable fluctuation range of the CPU temperature is ± 10 ℃, so that the second preset temperature range is 44.5-64.5 ℃. It should be noted that the above calculation manner of the second preset temperature range is only exemplary, and the present invention is not limited thereto.

Further, in another optional embodiment of the present invention, the obtained temperature-related CPU index for each server includes a CPU idle time percentage for each server in addition to the CPU temperature for each server. Correspondingly, if the acquired CPU temperature of a certain server is not within the second preset temperature range, the step of determining that the server whose CPU temperature is not within the second preset temperature range is abnormal may further be implemented as:

if the acquired CPU temperature of a certain server is not in a second preset temperature range, judging whether the CPU idle time percentage of the server of which the CPU temperature is not in the second preset temperature range is higher than a preset threshold value or not;

and if so, determining that the server of which the CPU temperature is not in the second preset temperature range is abnormal in operation.

The CPU idle time percentage, i.e., the idle value, is one of the important indicators of the operating state of the CPU. The higher the idle value, the lower the CPU occupancy and the more idle the CPU. When a server performs intensive computation, its CPU temperature rises and the corresponding CPU index shows a low idle value. Therefore, by combining the CPU temperature of the server with the CPU idle time percentage, it can be determined that the CPU is in a normal intensive operation state when the CPU temperature is high and the CPU occupancy is also high (i.e., the idle value is low), and that the CPU is operating abnormally when the CPU temperature is high while the CPU is idle (i.e., the idle value is high), so that server abnormalities can be checked more accurately.
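The combined check of CPU temperature and idle value might look like the sketch below. The function name and the three labels are hypothetical; in particular, the "busy" label for an out-of-range temperature with a low idle value is an assumed name for what the text describes as normal intensive operation.

```python
from typing import Tuple


def classify_server(cpu_temp: float, idle_percent: float,
                    cpu_range: Tuple[float, float],
                    idle_threshold: float) -> str:
    """Return 'normal', 'busy' or 'abnormal' for one server."""
    low, high = cpu_range
    if low <= cpu_temp <= high:
        return "normal"
    # Out-of-range temperature while mostly idle is not explained by load.
    if idle_percent > idle_threshold:
        return "abnormal"
    # Out-of-range temperature with a low idle value: treated as intensive work.
    return "busy"
```

Only servers classified as "abnormal" would then be passed on to exception handling.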

In step S108, if it is determined that there is a server having an abnormal operation, the server having an abnormal operation is subjected to an abnormality processing.

Optionally, according to different abnormal server conditions, performing exception handling on the server with abnormal operation may include at least one of: switching the server with abnormal operation to a standby server; alarming; adjusting the temperature of an air conditioner in the machine room; and closing the abnormally operated server for cooling.

The alarm modes include but are not limited to short messages, mails, APP (application) message notification and the like.

The following is a specific example: when the server with abnormal operation is found, the temperature of the air conditioner nearest to the cabinet where the server with abnormal operation is located can be adjusted at first until the temperature of the whole machine room is reduced to a normal range. And then, if the temperature of the whole machine room cannot be reduced to a normal range only by adjusting the air conditioner, switching the abnormally operated server to a standby server, closing the abnormally operated server to cool, alarming and recording a log.
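The escalation order in this example can be sketched as follows. Everything here is illustrative: the callbacks `lower_ac_setpoint` and `room_temp_normal`, the retry limit `max_adjustments` and the action labels are hypothetical, and the function only returns the list of actions that would be taken rather than performing them.

```python
from typing import Callable, List


def handle_abnormal_server(server: str,
                           lower_ac_setpoint: Callable[[], None],
                           room_temp_normal: Callable[[], bool],
                           max_adjustments: int = 3) -> List[str]:
    """Try cooling first; escalate to failover, shutdown and alarm if it fails."""
    actions = []
    for _ in range(max_adjustments):
        lower_ac_setpoint()       # adjust the air conditioner nearest the cabinet
        actions.append("adjust_ac")
        if room_temp_normal():    # room is back in the normal range: stop here
            return actions
    # Air conditioning alone was not enough.
    actions += [f"failover:{server}", f"shutdown:{server}", "alarm", "log"]
    return actions
```

Bounding the number of air conditioner adjustments is a design choice of this sketch, so that the escalation to the standby server is not delayed indefinitely.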

In step S110, if it is determined that there is no server with abnormal operation, it is determined that the machine room is abnormal in operation, and the machine room is subjected to exception handling.

Optionally, according to different abnormal situations of the machine room, performing exception handling on the machine room may include at least one of:

switching the machine room to a standby machine room;

alarming;

automatically carrying out physical fire extinguishing;

and removing the fault of the air conditioning equipment.

Currently, most internet services adopt a same-city multi-active architecture (i.e., one service is deployed in multiple machine rooms in one city) or a remote multi-active architecture (i.e., one service is deployed in multiple machine rooms in multiple cities). In a multi-active architecture, every machine room is active and carries traffic in real time; any machine room that develops a problem can be cut off directly and its traffic taken over by another machine room, which improves the disaster tolerance of the machine rooms. Therefore, once a machine room is determined to be abnormal, normal operation of the service is ensured by switching to a standby machine room.
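Cutting off an abnormal machine room in a multi-active deployment can be illustrated as below. This is a sketch under assumptions: the function name `cut_off_room` is hypothetical, and redistributing the traffic in proportion to the existing shares is one possible policy, not one stated in the text.

```python
from typing import Dict


def cut_off_room(weights: Dict[str, float], failed: str) -> Dict[str, float]:
    """Remove `failed` and rescale the remaining traffic shares to sum to 1."""
    remaining = {room: w for room, w in weights.items() if room != failed}
    total = sum(remaining.values())
    if total <= 0:
        raise RuntimeError("no active machine room left to take over traffic")
    return {room: w / total for room, w in remaining.items()}
```

For example, with shares of 0.5, 0.25 and 0.25 across three rooms, cutting off the first room leaves the other two with 0.5 each.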

The alarm modes include but are not limited to short messages, mails, APP (application) message notification and the like.

If the machine room is on fire, the physical fire extinguishing can be automatically carried out.

If the air conditioning equipment breaks down, the machine room operation and maintenance personnel can be informed to carry out the fault removal work of the air duct, the compressor, the humidifier and the like.

In an alternative embodiment of the invention, the method may further comprise the steps of:

saving the acquired environmental temperature in the machine room as historical data of the environmental temperature of the machine room;

drawing a historical change curve of the environmental temperature of the machine room according to the stored historical data of the environmental temperature of the machine room, and recording abnormal events corresponding to abnormal change sections in the historical change curve of the environmental temperature of the machine room and the characteristic attributes of the abnormal events.

At this time, correspondingly, the step of performing exception handling on the machine room may include:

comparing whether the change trend of the currently acquired machine room ambient temperature is the same as the change trend of the abnormal change section in the historical change curve of the machine room ambient temperature;

and if so, performing exception handling on the machine room according to the exception event corresponding to the exception change section and the characteristic attribute of the exception event.

The abnormal event mentioned herein may include a fire in a machine room, a malfunction of an air conditioner, and the like.

Further, the characteristic attribute of the exception event may include a processing priority of the exception event. In practical applications, the priority of processing the abnormal event may be set according to the urgency of the abnormal event or the size of the loss it may cause.

In an alternative embodiment, the change trend of the currently acquired machine room ambient temperature may be obtained by combining the currently acquired machine room ambient temperature data with the machine room ambient temperature data acquired during a period before the current time point and analyzing the change in temperature. In another alternative embodiment, multiple machine room ambient temperature readings may be acquired continuously within a specified time period, and the change trend obtained by analyzing the changes in these readings. For example, the machine room ambient temperature is acquired every second for 1 minute, yielding 60 readings, and the changes in these 60 readings are then analyzed to obtain the change trend of the currently acquired machine room ambient temperature.
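The text does not specify how the change of the readings is analyzed. One simple possibility, shown here purely as an assumption, is to summarize the trend by the least-squares slope of equally spaced samples; the function name `trend_slope` is hypothetical.

```python
from typing import Sequence


def trend_slope(samples: Sequence[float], interval_s: float = 1.0) -> float:
    """Least-squares slope of equally spaced samples, in degrees per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2                  # mean of the indices 0..n-1
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den / interval_s
```

Applied to 60 readings taken once per second, a positive result indicates rising temperature and a value near zero indicates a stable temperature.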

In an alternative embodiment of the invention, the method may further comprise the steps of:

saving the acquired CPU temperature of each server as historical data of the CPU temperature of each server;

and drawing a CPU temperature historical change curve of each server according to the stored historical data of the CPU temperature of each server, and recording abnormal events corresponding to abnormal change sections in the CPU temperature historical change curve of each server and the characteristic attributes of the abnormal events.

At this time, correspondingly, the step of performing exception handling on the server with the abnormal operation may include:

comparing whether the change trend of the CPU temperature of the server with abnormal operation obtained currently is the same as the change trend of an abnormal change section in the CPU temperature historical change curve of the server;

and if so, carrying out exception handling on the server with abnormal operation according to the abnormal event corresponding to the abnormal change section and the characteristic attribute of the abnormal event.

In an alternative embodiment, the change trend of the currently acquired CPU temperature of the abnormally operating server may be obtained by combining the currently acquired CPU temperature data of the server with the CPU temperature data of the server acquired during a period before the current time point and analyzing the change in temperature. In another alternative embodiment, multiple CPU temperature readings of the server may be acquired continuously within a specified time period, and the change trend obtained by analyzing the changes in these readings. For example, the CPU temperature of the server is acquired every second for 1 minute, yielding 60 readings, and the changes in these readings are then analyzed to obtain the change trend of the currently acquired CPU temperature of the server.

Further, the characteristic attribute of the exception event may include a processing priority of the exception event. In practical applications, the priority of processing the abnormal event may be set according to the urgency of the abnormal event or the size of the loss it may cause.

When exception handling is performed on the machine room and/or a server, the processing priority of the machine room abnormal event and/or the server abnormal event is determined by comparing the change trend of the currently acquired machine room ambient temperature with that of the abnormal change sections in the machine room ambient temperature historical change curve, and/or by comparing the change trend of the currently acquired CPU temperature of the abnormally operating server with that of the abnormal change sections in that server's CPU temperature historical change curve. Exception handling can then be performed on the machine room and/or the server according to the processing priority, which prevents the large losses that result when urgent abnormal events are not handled in time and improves the disaster tolerance of the machine room.
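Matching the current trend against recorded abnormal change sections could be sketched as follows. The text does not define when two trends count as "the same"; this sketch assumes each recorded section is summarized by a slope and that a match means the slopes differ by no more than a tolerance. The class `AbnormalSegment`, the function `match_segment`, the convention that a lower number means higher priority, and the tolerance value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AbnormalSegment:
    event: str      # e.g. "fire" or "air_conditioner_fault"
    slope: float    # recorded trend of the section, degrees per second
    priority: int   # processing priority; lower number = handled first


def match_segment(current_slope: float, history: List[AbnormalSegment],
                  tolerance: float = 0.05) -> Optional[AbnormalSegment]:
    """Return the highest-priority recorded section whose slope matches."""
    candidates = [s for s in history
                  if abs(s.slope - current_slope) <= tolerance]
    return min(candidates, key=lambda s: s.priority, default=None)
```

When no recorded section matches, the function returns `None`, which corresponds to handling the abnormality without a known historical event.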

In an optional embodiment of the present invention, after determining that there is a server with an abnormal operation, the method may further include the following steps:

determining the distribution of the servers with abnormal operation in the machine room;

and if the operation of two or more adjacent servers is abnormal, the two or more adjacent servers are preferentially subjected to abnormal processing.

A cluster of adjacent abnormally operating servers often signifies a more serious server failure, which also leads to more service failures and greater losses. Therefore, by analyzing the distribution of the abnormally operating servers in the machine room, clustered abnormal servers can be handled preferentially and promptly according to the analysis result, avoiding larger losses.
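Finding adjacent abnormal servers can be illustrated as below. The layout model is an assumption: each server is identified by a rack and a slot number, and two servers count as adjacent when they occupy consecutive slots in the same rack. The names `Position` and `adjacent_groups` are hypothetical.

```python
from typing import Dict, List, Tuple

Position = Tuple[str, int]  # (rack id, slot number): an assumed layout model


def adjacent_groups(abnormal: Dict[str, Position]) -> List[List[str]]:
    """Return groups of two or more abnormal servers in consecutive slots."""
    ordered = sorted(abnormal.items(), key=lambda item: item[1])
    groups: List[List[str]] = []
    current: List[str] = []
    for name, (rack, slot) in ordered:
        # Extend the run if this server sits directly after the previous one.
        if current and (rack, slot - 1) == abnormal[current[-1]]:
            current.append(name)
        else:
            if len(current) >= 2:
                groups.append(current)
            current = [name]
    if len(current) >= 2:
        groups.append(current)
    return groups
```

Servers returned in a group would be handled before isolated abnormal servers.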

Various implementations of each step of the embodiment shown in fig. 1 have been introduced above. The implementation process of the method for processing machine room abnormity based on temperature monitoring according to the present invention is described in detail below through a specific embodiment. Fig. 2 is a flowchart illustrating a method for processing machine room abnormity based on temperature monitoring according to another embodiment of the present invention. Referring to fig. 2, the method may include at least steps S202 to S222.

Step S202, the CPU temperature and the CPU idle time percentage of each server in the machine room and the ambient temperature in the machine room are obtained.

Step S204, judging whether the acquired environment temperature of the machine room is in a first preset temperature range, and if so, determining that the machine room normally operates.

And step S206, if not, judging whether the acquired CPU temperature of each server is within a second preset temperature range.

In step S208, if the acquired CPU temperature of a certain server is not within the second preset temperature range, it is determined whether the CPU idle time percentage of the server whose CPU temperature is not within the second preset temperature range is higher than a preset threshold.

And step S210, if yes, determining that the server of which the CPU temperature is not in the second preset temperature range is abnormal in operation.

Step S212, comparing whether the variation trend of the CPU temperature of the server with abnormal operation obtained currently is the same as the variation trend of the abnormal variation section in the CPU temperature historical variation curve of the server, wherein the CPU temperature historical variation curve of the server is drawn according to the stored historical data of the CPU temperature of the server, and the abnormal event corresponding to the abnormal variation section in the CPU temperature historical variation curve of the server and the processing priority of the abnormal event are recorded.

Step S214, if the change trends are the same, performing exception handling on the abnormally operating server according to the abnormal event corresponding to the abnormal change section in the CPU temperature historical change curve and the processing priority of that abnormal event; if they are not the same, performing exception handling on the abnormally operating server in the conventional manner.

Step S216, if the acquired CPU temperatures of the servers are all within the second preset temperature range, it is determined that the machine room is abnormal in operation.

Step S218, comparing whether the variation trend of the currently acquired room ambient temperature is the same as the variation trend of the abnormal variation section in a room ambient temperature historical variation curve, where the room ambient temperature historical variation curve is drawn according to the stored historical data of the room ambient temperature, and the abnormal event corresponding to the abnormal variation section in the room ambient temperature historical variation curve and the processing priority of the abnormal event are recorded.

Step S220, if the change trends are the same, performing exception handling on the machine room according to the abnormal event corresponding to the abnormal change section in the machine room ambient temperature historical change curve and the processing priority of that abnormal event; if they are not the same, performing exception handling on the machine room in the conventional manner.

Step S222, saving the acquired environmental temperature in the machine room and the CPU temperature of each server as historical data of the environmental temperature of the machine room and the CPU temperature of each server, respectively.

It should be noted that, in this embodiment, step S222 is executed after step S220; in other alternative embodiments, step S222 may also be executed at any point after step S202.
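Saving the acquired temperatures as historical data, as in step S222, might be sketched as follows. The class `TemperatureHistory`, its in-memory storage and the default retention limit are assumptions made for illustration; a real deployment would more likely persist the data.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class TemperatureHistory:
    """Keeps the most recent readings per source (the room or a server)."""

    def __init__(self, max_samples: int = 86_400) -> None:
        # Assumption: one bounded series per source; oldest readings drop out.
        self._series: Dict[str, Deque[float]] = defaultdict(
            lambda: deque(maxlen=max_samples))

    def save(self, source: str, temperature: float) -> None:
        self._series[source].append(temperature)

    def recent(self, source: str, count: int) -> List[float]:
        """Return up to `count` latest readings, oldest first."""
        if count <= 0:
            return []
        return list(self._series.get(source, ()))[-count:]
```

The readings returned by `recent` are the kind of input from which a change trend or a preset temperature range could later be computed.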

It should be noted that, in practical applications, all of the above optional embodiments may be combined in any manner to form further optional embodiments of the present invention, which are not described in detail here.

Based on the same inventive concept, the embodiment of the invention further provides a device for processing the machine room abnormity based on temperature monitoring, which is used for supporting the method for processing the machine room abnormity based on temperature monitoring provided by any one of the embodiments or the combination thereof. Fig. 3 is a schematic structural diagram of a device for processing machine room abnormality based on temperature monitoring according to an embodiment of the present invention. Referring to fig. 3, the apparatus may include at least: the temperature acquisition module 310, the machine room judgment module 320, the server judgment module 330, the server processing module 340, and the machine room processing module 350.

The device for processing machine room abnormity based on temperature monitoring provided by the embodiment of the invention may be implemented by a designated monitoring management platform. Preferably, the designated monitoring management platform is an IPMI platform.

The functions of the components or devices of the device for processing machine room abnormality based on temperature monitoring and the connection relationship between the components are described as follows:

The temperature obtaining module 310 is adapted to obtain the temperature-related CPU indexes of each server in the machine room and the ambient temperature in the machine room.

The machine room determining module 320 is connected to the temperature obtaining module 310, and is adapted to determine whether the obtained environmental temperature of the machine room is within a first preset temperature range, and if so, determine that the machine room operates normally.

The server determining module 330 is connected to the temperature obtaining module 310 and the machine room determining module 320, and is adapted to determine whether there is a server with abnormal operation according to the obtained CPU index of each server related to the temperature if the obtained environmental temperature of the machine room is not within the first preset temperature range.

The server processing module 340 is connected to the server determining module 330, and is adapted to perform exception handling on the server with an abnormal operation if the server with an abnormal operation exists.

The machine room processing module 350 is connected to the server determining module 330, and is adapted to determine that the machine room operates abnormally if there is no server that operates abnormally, and perform exception handling on the machine room.

In an alternative embodiment, the temperature-related CPU indicator for each server includes the CPU temperature for each server; at this time, correspondingly, the server determining module 330 is further adapted to:

judging whether the acquired CPU temperature of each server is within a second preset temperature range or not;

if the acquired CPU temperature of a certain server is not in a second preset temperature range, determining that the server with the CPU temperature not in the second preset temperature range is abnormal in operation;

and if the acquired CPU temperature of each server is within the second preset temperature range, determining that no server with abnormal operation exists.

In an alternative embodiment, the temperature-related CPU indicator for each server further comprises a CPU idle time percentage for each server; at this time, correspondingly, the server determining module 330 is further adapted to:

if the acquired CPU temperature of a certain server is not in a second preset temperature range, judging whether the CPU idle time percentage of the server of which the CPU temperature is not in the second preset temperature range is higher than a preset threshold value or not;

and if so, determining that the server of which the CPU temperature is not in the second preset temperature range is abnormal in operation.

In an optional embodiment, the first preset temperature range is a manually set temperature range or a temperature range calculated from historical data of the machine room ambient temperature. The second preset temperature range is a manually set temperature range or a temperature range calculated from historical data of the CPU temperature of the server.

In an alternative embodiment, as shown in fig. 4, the apparatus for processing machine room abnormality based on temperature monitoring shown in fig. 3 may further include a first data saving module 460 and a first variation curve drawing module 470. The first data saving module 460 is connected to the temperature obtaining module 310 and is adapted to save the obtained ambient temperature in the machine room as historical data of the ambient temperature of the machine room. The first change curve drawing module 470 is connected to the first data saving module 460, and is adapted to draw a machine room ambient temperature historical change curve according to the saved historical data of the machine room ambient temperature, and record an abnormal event corresponding to an abnormal change section in the machine room ambient temperature historical change curve and a characteristic attribute of the abnormal event.

At this time, correspondingly, the room processing module 350 is further adapted to:

comparing whether the change trend of the currently acquired machine room ambient temperature is the same as the change trend of the abnormal change section in the historical change curve of the machine room ambient temperature;

and if so, performing exception handling on the machine room according to the exception event corresponding to the exception change section in the historical change curve of the environmental temperature of the machine room and the characteristic attribute of the exception event.

In an alternative embodiment, as shown in fig. 4, the apparatus for processing machine room abnormality based on temperature monitoring shown in fig. 3 may further include a second data saving module 480 and a second variation curve drawing module 490. The second data saving module 480 is connected to the temperature obtaining module 310, and is adapted to save the obtained CPU temperature of each server as the history data of the CPU temperature of each server. The second change curve drawing module 490 is connected to the second data storage module 480, and is adapted to draw the CPU temperature history change curve of each server according to the stored CPU temperature history data of each server, and record an abnormal event corresponding to an abnormal change segment in the CPU temperature history change curve and the characteristic attribute of the abnormal event.

At this time, correspondingly, the server processing module 340 is further adapted to:

comparing whether the change trend of the CPU temperature of the server with abnormal operation obtained currently is the same as the change trend of an abnormal change section in the CPU temperature historical change curve of the server;

and if so, carrying out exception handling on the server with abnormal operation according to the exception event corresponding to the exception change section in the CPU temperature historical change curve and the characteristic attribute of the exception event.

Further, the above-mentioned characteristic attribute of the abnormal event includes a processing priority of the abnormal event.

In an alternative embodiment, as shown in fig. 4, the apparatus for processing machine room abnormality based on temperature monitoring shown in fig. 3 may further include an abnormality distribution determining module 500. The abnormal distribution determining module 500 is connected to the server determining module 330 and the server processing module 340, and is adapted to determine the distribution of the server with abnormal operation in the computer room. If two or more adjacent servers are abnormal in operation, the abnormal distribution determining module 500 triggers the server processing module 340 to preferentially perform abnormal processing on the two or more adjacent servers.

In an optional embodiment, exception handling for the abnormally operating server comprises at least one of: switching the abnormally operating server to a standby server; raising an alarm; adjusting the temperature of an air conditioner in the machine room; and shutting down the abnormally operating server to cool it.

In an optional embodiment, exception handling for the machine room includes at least one of: switching the machine room to a standby machine room; raising an alarm; automatically performing physical fire extinguishing; and troubleshooting the air conditioning equipment.

Based on the same inventive concept, the embodiment of the invention also provides a computer storage medium. The computer storage medium stores computer program code, which, when run on a computing device, causes the computing device to execute the method for handling a room anomaly based on temperature monitoring according to any one or a combination of the above embodiments.

Based on the same inventive concept, the embodiment of the invention also provides a computing device. The computing device may include:

a processor; and

a memory storing computer program code;

when executed by a processor, the computer program code causes the computing device to perform a method for handling a room anomaly based on temperature monitoring according to any one or a combination of the above embodiments.

According to any one or a combination of multiple optional embodiments, the embodiment of the present invention can achieve the following advantages:

The method and the device for processing machine room abnormality based on temperature monitoring provided by the embodiments of the present invention first obtain the temperature-related CPU indexes of each server in the machine room and the ambient temperature in the machine room. Whether the machine room operates normally is then judged according to the obtained ambient temperature; if it does not, whether an abnormally operating server exists is further judged according to the obtained temperature-related CPU indexes of each server. If an abnormally operating server exists, exception handling is performed on that server; if none exists, exception handling is performed on the machine room. Because abnormal conditions of the machine room and the servers are checked and handled directly from their temperature information, response time is shortened and losses are reduced; manual intervention is also reduced, which lowers labor cost.
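The decision flow summarized above can be sketched in a few lines. This is an illustrative sketch only: the function and type names (`diagnose`, `Diagnosis`, `in_range`) and the numeric ranges in the usage example are assumptions introduced here, and the server check uses CPU temperature alone as the temperature-related CPU index.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # inclusive (low, high) bounds in degrees Celsius


@dataclass
class Diagnosis:
    room_in_range: bool          # ambient temperature inside the first preset range
    abnormal_servers: List[str]  # servers whose CPU temperature is out of range
    target: str                  # what to handle: "none", "servers" or "room"


def in_range(value: float, bounds: Range) -> bool:
    low, high = bounds
    return low <= value <= high


def diagnose(ambient: float, cpu_temps: Dict[str, float],
             room_range: Range, cpu_range: Range) -> Diagnosis:
    # Step 1: ambient temperature inside the first preset range means the
    # machine room is operating normally and nothing else is checked.
    if in_range(ambient, room_range):
        return Diagnosis(True, [], "none")
    # Step 2: otherwise look for servers whose CPU temperature lies outside
    # the second preset range.
    abnormal = sorted(name for name, temp in cpu_temps.items()
                      if not in_range(temp, cpu_range))
    # Step 3: abnormal servers are handled if any exist; if none exist, the
    # machine room itself is treated as abnormal.
    return Diagnosis(False, abnormal, "servers" if abnormal else "room")


# Hypothetical readings and ranges, chosen only for illustration.
result = diagnose(31.0, {"srv-a": 92.0, "srv-b": 55.0}, (18.0, 27.0), (30.0, 80.0))
print(result.target, result.abnormal_servers)  # servers ['srv-a']
```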

Furthermore, when judging whether a server is abnormal, the CPU temperature of the server is combined with its CPU idle time percentage, so that server abnormalities can be identified more accurately.
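A minimal sketch of this combined check is given below. It follows the order stated in the embodiments: the CPU temperature is tested against the second preset temperature range first, and only when it falls outside that range is the CPU idle time percentage compared with the preset threshold. The default range and threshold values are hypothetical placeholders, and `server_is_abnormal` is an illustrative name.

```python
from typing import Tuple


def server_is_abnormal(cpu_temp: float, idle_pct: float,
                       cpu_range: Tuple[float, float] = (30.0, 80.0),
                       idle_threshold: float = 50.0) -> bool:
    """Flag a server only when its CPU temperature is outside the second
    preset range AND its CPU idle time percentage exceeds the threshold."""
    low, high = cpu_range
    if low <= cpu_temp <= high:
        return False                  # temperature in range: not flagged
    return idle_pct > idle_threshold  # out of range: idle share decides


print(server_is_abnormal(92.0, 85.0))  # True: hot although mostly idle
print(server_is_abnormal(92.0, 5.0))   # False: hot, but the CPU is busy
print(server_is_abnormal(60.0, 85.0))  # False: temperature within range
```

One plausible reading of the idle-time condition is that a CPU which is hot while mostly idle cannot be explained by workload, so the heat more likely indicates a fault; the text itself does not spell out this rationale.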

Further, the acquired machine room ambient temperature and the CPU temperature of each server are saved as historical data, and a machine room ambient temperature historical change curve and a CPU temperature historical change curve of each server are drawn from that data. For each curve, the abnormal event corresponding to each abnormal change section and the characteristic attribute of that abnormal event are recorded. When exception handling is performed on the machine room and/or a server, the change trend of the currently acquired machine room ambient temperature can therefore be compared with the change trend of an abnormal change section in the machine room ambient temperature historical change curve, and/or the change trend of the currently acquired CPU temperature of the abnormally operating server can be compared with the change trend of an abnormal change section in that server's CPU temperature historical change curve. If the trends are the same, the processing priority of the machine room and/or server abnormal event is determined, and exception handling is performed according to that priority. This prevents the heavy losses caused by emergencies not being handled in time and improves the disaster tolerance of the machine room.
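Handling abnormal events in order of processing priority can be illustrated with a small priority queue. This is a sketch under assumptions: the text does not specify how priorities are encoded, so lower numbers are assumed here to mean more urgent, and events with equal priority are handled in arrival order. The class name `AnomalyQueue` and the example events are hypothetical.

```python
import heapq
import itertools
from typing import List, Tuple


class AnomalyQueue:
    """Pending abnormal events, popped in order of processing priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker keeps arrival order

    def push(self, priority: int, description: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), description))

    def pop(self) -> str:
        """Remove and return the most urgent pending event."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = AnomalyQueue()
queue.push(3, "server srv-b: CPU temperature drifting upward")
queue.push(1, "machine room: ambient temperature rising sharply")
queue.push(2, "server srv-a: CPU temperature out of range")
while queue:
    print(queue.pop())
```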

It is clear to those skilled in the art that the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, further description is omitted here.

In addition, the functional units in the embodiments of the present invention may be physically independent of each other, two or more functional units may be integrated together, or all the functional units may be integrated in one processing unit. The integrated functional units may be implemented in the form of hardware, or in the form of software or firmware.

Those of ordinary skill in the art will understand that the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions that, when executed, cause a computing device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

Alternatively, all or part of the steps of the foregoing method embodiments may be implemented by hardware (such as a computing device, e.g., a personal computer, a server, or a network device) under the control of program instructions. The program instructions may be stored in a computer-readable storage medium, and when they are executed by a processor of the computing device, the computing device executes all or part of the steps of the method according to the embodiments of the present invention.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present invention; such modifications or substitutions do not depart from the scope of the present invention.

According to an aspect of the embodiments of the present invention, there is provided A1. A method for processing machine room abnormality based on temperature monitoring, comprising:

acquiring CPU indexes related to temperature of each server in a machine room and ambient temperature in the machine room;

judging whether the acquired environment temperature of the machine room is within a first preset temperature range, and if so, determining that the machine room normally operates;

if not, judging whether a server with abnormal operation exists according to the acquired CPU indexes of the servers, which are related to the temperature;

if the server with abnormal operation exists, performing exception handling on the server with abnormal operation;

and if the server with abnormal operation does not exist, determining that the machine room has abnormal operation, and performing abnormal processing on the machine room.

A2. The method of A1, wherein the temperature-related CPU indicator for each server includes a CPU temperature for each server;

judging whether a server with abnormal operation exists according to the acquired temperature-related CPU indicator of each server comprises:

judging whether the acquired CPU temperature of each server is within a second preset temperature range or not;

if the acquired CPU temperature of a certain server is not in the second preset temperature range, determining that the server with the CPU temperature not in the second preset temperature range is abnormal in operation;

and if the acquired CPU temperature of each server is within the second preset temperature range, determining that no server with abnormal operation exists.

A3. The method of A2, wherein the temperature-related CPU indicator for each server further comprises a CPU idle time percentage for each server;

if the acquired CPU temperature of a certain server is not in the second preset temperature range, determining that the server with the CPU temperature not in the second preset temperature range is abnormal in operation comprises:

if the acquired CPU temperature of a certain server is not in the second preset temperature range, judging whether the CPU idle time percentage of the server of which the CPU temperature is not in the second preset temperature range is higher than a preset threshold value or not;

and if so, determining that the server of which the CPU temperature is not in the second preset temperature range is abnormal in operation.

A4. The method according to A2 or A3, wherein the first preset temperature range is a temperature range set manually or a temperature range calculated according to historical data of the environmental temperature of the machine room;

the second preset temperature range is a temperature range set manually or a temperature range calculated according to historical data of the CPU temperature of the server.
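The text leaves open how a preset temperature range would be calculated from historical data. As one hedged illustration, the sketch below assumes a simple statistical rule: the range is the historical mean plus or minus `k` population standard deviations. Both the rule and the names `range_from_history` and `k` are assumptions made for this example, not something stated in the embodiments.

```python
import statistics
from typing import Sequence, Tuple


def range_from_history(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Derive an inclusive (low, high) range as mean +/- k standard deviations.

    The mean-and-deviation rule is an assumed example of 'calculated
    according to historical data'; other rules (percentiles, seasonal
    baselines) would fit the wording equally well.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical readings")
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    return (mean - k * spread, mean + k * spread)


# Hypothetical ambient temperature readings in degrees Celsius.
low, high = range_from_history([20.0, 22.0, 24.0, 22.0], k=2.0)
print(round(low, 2), round(high, 2))
```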

A5. The method of any one of A2-A4, further comprising:

saving the acquired environmental temperature in the machine room as historical data of the environmental temperature of the machine room;

drawing a machine room environment temperature historical change curve according to the stored historical data of the machine room environment temperature, and recording abnormal events corresponding to abnormal change sections in the machine room environment temperature historical change curve and the characteristic attributes of the abnormal events;

in this case, performing exception handling on the machine room comprises:

comparing whether the change trend of the currently acquired machine room ambient temperature is the same as the change trend of the abnormal change section in the historical change curve of the machine room ambient temperature;

and if so, performing exception handling on the machine room according to the exception event corresponding to the exception change section and the characteristic attribute of the exception event.

A6. The method of any one of A2-A5, further comprising:

saving the acquired CPU temperature of each server as historical data of the CPU temperature of each server;

drawing a CPU temperature historical change curve of each server according to the stored historical data of the CPU temperature of each server, and recording abnormal events corresponding to abnormal change sections in the CPU temperature historical change curve and characteristic attributes of the abnormal events;

in this case, performing exception handling on the server with abnormal operation comprises:

comparing whether the change trend of the CPU temperature of the server with the abnormal operation obtained currently is the same as the change trend of an abnormal change section in the historical change curve of the CPU temperature of the server;

and if so, carrying out exception handling on the server with the abnormal operation according to the abnormal event corresponding to the abnormal change section and the characteristic attribute of the abnormal event.

A7. The method of A5 or A6, wherein the characteristic attribute of the exception event includes a processing priority of the exception event.

A8. The method of any one of A1-A7, further comprising:

determining the distribution of the servers with abnormal operation in the machine room;

and if two or more adjacent servers operate abnormally, performing exception handling on the two or more adjacent servers preferentially.

A9. The method of any one of A1-A8, wherein exception handling of the server with abnormal operation includes at least one of:

switching the server with abnormal operation to a standby server;

alarming;

adjusting the temperature of an air conditioner in the machine room;

and shutting down the abnormally operating server to cool it.

A10. The method of any one of A1-A9, wherein exception handling of the machine room includes at least one of:

switching the machine room to a standby machine room;

alarming;

automatically carrying out physical fire extinguishing;

and troubleshooting the air conditioning equipment.

According to another aspect of the embodiments of the present invention, there is also provided B11. A device for processing machine room abnormality based on temperature monitoring, including:

the temperature acquisition module is suitable for acquiring the CPU index related to the temperature of each server in the machine room and the ambient temperature in the machine room;

the computer room judgment module is suitable for judging whether the acquired computer room environment temperature is within a first preset temperature range, and if so, determining that the computer room operates normally;

the server judgment module is suitable for judging whether a server with abnormal operation exists according to the acquired CPU indexes of the servers, which are related to the temperature, if the acquired environment temperature of the machine room is not within the first preset temperature range;

the server processing module is suitable for performing exception handling on the server with abnormal operation if the server with abnormal operation exists; and

and the machine room processing module is suitable for determining that the machine room runs abnormally and processing the abnormality of the machine room if the server which runs abnormally does not exist.

B12. The apparatus of B11, wherein the temperature-related CPU indicator for each server includes a CPU temperature for each server;

the server determination module is further adapted to:

judging whether the acquired CPU temperature of each server is within a second preset temperature range or not;

if the acquired CPU temperature of a certain server is not in the second preset temperature range, determining that the server with the CPU temperature not in the second preset temperature range is abnormal in operation;

and if the acquired CPU temperature of each server is within the second preset temperature range, determining that no server with abnormal operation exists.

B13. The apparatus of B12, wherein the temperature-related CPU indicator for each server further comprises a CPU idle time percentage for each server;

the server determination module is further adapted to:

if the acquired CPU temperature of a certain server is not in the second preset temperature range, judging whether the CPU idle time percentage of the server of which the CPU temperature is not in the second preset temperature range is higher than a preset threshold value or not;

and if so, determining that the server of which the CPU temperature is not in the second preset temperature range is abnormal in operation.

B14. The device according to B12 or B13, wherein the first preset temperature range is a temperature range set by people or a temperature range calculated according to historical data of the environmental temperature of the machine room;

the second preset temperature range is a temperature range set manually or a temperature range calculated according to historical data of the CPU temperature of the server.

B15. The apparatus of any one of B12-B14, further comprising:

the first data storage module is suitable for storing the acquired environmental temperature in the machine room as historical data of the environmental temperature of the machine room;

the first change curve drawing module is suitable for drawing a machine room environment temperature historical change curve according to the stored historical data of the machine room environment temperature, and recording abnormal events corresponding to abnormal change sections in the machine room environment temperature historical change curve and the characteristic attributes of the abnormal events;

in this case, the machine room processing module is further adapted to:

comparing whether the change trend of the currently acquired machine room ambient temperature is the same as the change trend of the abnormal change section in the historical change curve of the machine room ambient temperature;

and if so, performing exception handling on the machine room according to the exception event corresponding to the exception change section and the characteristic attribute of the exception event.

B16. The apparatus of any one of B12-B15, further comprising:

the second data storage module is suitable for storing the acquired CPU temperature of each server as historical data of the CPU temperature of each server;

the second change curve drawing module is suitable for drawing the CPU temperature historical change curve of each server according to the stored historical data of the CPU temperature of each server, and recording abnormal events corresponding to abnormal change sections in the CPU temperature historical change curve and the characteristic attributes of the abnormal events;

in this case, the server processing module is further adapted to:

comparing whether the change trend of the CPU temperature of the server with the abnormal operation obtained currently is the same as the change trend of an abnormal change section in the historical change curve of the CPU temperature of the server;

and if so, carrying out exception handling on the server with the abnormal operation according to the abnormal event corresponding to the abnormal change section and the characteristic attribute of the abnormal event.

B17. The apparatus of B15 or B16, wherein the characteristic attribute of the exception event comprises a processing priority of the exception event.

B18. The apparatus of any one of B11-B17, further comprising:

the abnormal distribution determining module is suitable for determining the distribution of the servers with abnormal operation in the machine room;

and if two or more adjacent servers run abnormally, triggering the server processing module to preferentially process the abnormal operation of the two or more adjacent servers.

B19. The apparatus of any one of B11-B18, wherein exception handling of the server with abnormal operation comprises at least one of:

switching the server with abnormal operation to a standby server;

alarming;

adjusting the temperature of an air conditioner in the machine room;

and shutting down the abnormally operating server to cool it.

B20. The apparatus of any one of B11-B19, wherein exception handling of the machine room comprises at least one of:

switching the machine room to a standby machine room;

alarming;

automatically carrying out physical fire extinguishing;

and troubleshooting the air conditioning equipment.

According to yet another aspect of the embodiments of the present invention, there is also provided C21. A computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the method for processing machine room abnormality based on temperature monitoring according to any one of A1-A10.

According to yet another aspect of the embodiments of the present invention, there is also provided a computing device, including:

a processor; and

a memory storing computer program code;

the computer program code, when executed by the processor, causes the computing device to perform the method for processing machine room abnormality based on temperature monitoring according to any one of A1-A10.
