Correlation analysis-based virus diffusion and climate factor relationship analysis method

Document No.: 1906620. Publication date: 2021-11-30.

Reading note: this technology, "Correlation analysis-based virus diffusion and climate factor relationship analysis method", was designed by 林绍福 (Lin Shaofu), 付钰 (Fu Yu) and 赵俊杰 (Zhao Junjie) on 2021-08-01. Main content: The invention discloses a correlation analysis-based method for analyzing the relationship between virus diffusion and climate factors. A multiple linear regression method is used to carry out a series of verifications and to establish a multiple regression equation. The Pearson correlation coefficient is used to assess the relative importance of each climate factor's influence on the number of newly confirmed cases and the correlation between each independent variable and the dependent variable, to look for linear relationships, and to judge the correlation between the observed variables. The adjusted coefficient of determination is used to evaluate the performance of the virus-climate factor relationship model and to determine how well each country's multiple linear regression model fits the real data. The number of newly confirmed cases predicted by the model can guide countries in adopting prevention and control measures of different strictness. In addition, climate-related prevention and control suggestions can be offered to countries around the world, so that active measures can be taken with respect to factors such as temperature and humidity that favor virus survival.

1. A correlation analysis-based method for analyzing the relationship between virus diffusion and climate factors, characterized in that the method comprises the following steps:

step 1, data source and experimental object:

the virus-related data are derived from public data sets, and the climate data are the daily records of meteorological stations around the world collected by a meteorological data network;

step 2, data collection and pretreatment:

the collected virus data are the daily confirmed case counts of each country, and the number of newly confirmed novel coronavirus cases in each country is obtained by subtracting the previous day's confirmed count from the current day's confirmed count; the monthly average high temperature, monthly average low temperature, sea level pressure, altitude, wind speed, rainfall, dew point temperature and relative humidity of each country are selected as the climate factor data; climate factor data missing on a single day are filled with the average of the data of the previous day and the next day; climate factor data missing on consecutive dates are filled with 0 so as not to affect the experimental result; the data are divided into a training set and a test set at a ratio of 7:3;

step 3, constructing a multiple linear regression model:

the number of newly confirmed cases New is taken as the dependent variable y, and the climate factors, namely average high temperature t_max, average low temperature t_min, sea level pressure S_P, wind speed W_S, altitude EI, rainfall RF, dew point temperature DP and relative humidity Humidity, are the independent variables x1, x2, x3, x4, x5, x6, x7, x8, respectively; β0, β1, β2, β3, β4, β5, β6, β7, β8 are the unknown parameters, β0 being the intercept and β1 to β8 corresponding to the independent variables; ε is the error term; the multiple linear regression model is given by formula 1:

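$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \varepsilon \qquad (1)$$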

formula 1 expresses the number of newly confirmed cases as a weighted sum of the climate factors, and the weight of each climate factor is estimated by linear regression;

step 4, training a multiple linear regression model:

70% of the observation data are selected as the training set, that is, the daily number of newly confirmed cases and the eight types of climate factor data are used as training data; the training set is input into the written program, and the multiple linear regression coefficients are computed to obtain the trained model;

step 5, model checking:

the climate factor data of the test set are input into the constructed multiple linear regression model to obtain the predicted daily number of newly confirmed cases, and the adjusted coefficient of determination $\bar{R}^2$ is used to judge model performance; the influence of the number of variables on the coefficient of determination is suppressed by dividing the residual sum of squares by its degrees of freedom and dividing the total sum of squared deviations by its degrees of freedom; the calculation is shown in formula 2:

$$\bar{R}^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \,/\, (n-p-1)}{\sum_{i=1}^{n}(y_i-\bar{y})^2 \,/\, (n-1)} \qquad (2)$$

in formula 2, n is the number of observations, p is the number of independent variables, $\hat{y}_i$ is the predicted value and $\bar{y}$ is the mean of the observed values; the closer the adjusted coefficient of determination $\bar{R}^2$ is to 1, the better the model fits the relationship between the variables and the more accurate the model; a model whose adjusted coefficient of determination is greater than 0.5 is considered to fit well;

step 6, calculating the correlation coefficient of the independent variable and the dependent variable:

the correlation between two sets of data is described by the Pearson correlation coefficient R; when the two sets of data are weakly correlated, 0 ≤ |R| < 0.3; when they are moderately correlated, 0.3 ≤ |R| < 0.6; when they are highly correlated, 0.6 ≤ |R| ≤ 1; the calculation is shown in formula 3:

$$R = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \qquad (3)$$

the correlation coefficient between each climate factor and the number of newly confirmed cases in different regions is calculated with the Pearson correlation coefficient formula, providing data support for analyzing the strength of the correlation between the number of newly confirmed cases and each climate factor parameter.

2. The correlation analysis-based method for analyzing the relationship between virus diffusion and climate factors according to claim 1, characterized in that: Python is selected as the programming language for the method; in the data processing stage, Pandas is used for data set cleaning and data set division, and Sklearn is used to build and train the model.

3. The correlation analysis-based method for analyzing the relationship between virus diffusion and climate factors according to claim 1, characterized in that: the adjusted coefficient of determination is used to judge the performance of the virus-climate factor relationship model and to determine how well each country's multiple linear regression model fits the real data.

Technical Field

The invention belongs to the field of data processing, relates to a multiple linear regression model technology, and particularly relates to a multiple linear regression model technology for analyzing the relationship between virus diffusion and climate factors.

Background

Virus outbreaks affect the lives of people all over the world, so analyzing how a virus spreads and supporting the implementation of prevention and control measures is both urgent and important. The invention analyzes the relationship between virus spread and climate factors using a multiple linear regression model. The correlation analysis is based on the "Novel Coronavirus 2019 time series data on cases" data set published by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University and on weather data from an air network and the China weather data network.

The multiple linear regression model is suited to situations in which several variables influence a single variable. It can measure the degree of correlation among the variables and the goodness of fit of the regression, and can improve the effect of the prediction model. In this research, meteorological factors influence virus diffusion in several ways, and the degree of correlation between the two needs to be analyzed, so a multiple linear regression model is chosen for the analysis.

Some research has already been done on virus diffusion and climate factors. For example, Zhu et al. collected the number of new daily cases and the corresponding climate factor data in eight regions seriously affected by the virus in four South American countries, and used a multiple linear regression model to verify that absolute humidity is highly significantly correlated with the number of new daily cases. David et al. proposed using a Generalized Additive Model (GAM) to explore the linear and nonlinear relationships between annual average temperature compensation and confirmed cases in prefecture-level cities in Brazil, and found that the cumulative number of confirmed cases per day decreased by 4.8951% for each 1 °C increase in temperature. Kuldeep et al. used Sen's slope, the Mann-Kendall test and a Generalized Additive Model (GAM) to examine the effect of daily temperature and relative humidity on morbidity in India. Lowen, Barreca and several other studies have shown that environmental temperature plays an important role in the survival and spread of viruses.

Although a great deal of research supports the effects of environmental temperature and humidity on transmission and infection, the samples selected are limited to local regions. This prompted the present research to explore the influence of environmental factors on the virus on a global scale and to study the common characteristics of the virus using global data, which should be closer to its real characteristics.

Disclosure of Invention

Based on the above analysis, the invention mainly uses multiple linear regression analysis to analyze the relationship between the daily number of newly confirmed cases in each region and the climate factors of that region. The method consists of two main parts: model construction and correlation coefficient analysis. The aim is for the correlation coefficient analysis to help characterize the virus so that its transmission can be controlled in time.

To achieve this purpose, the invention adopts the following technical scheme. Python is chosen as the programming language for the method. In the data processing stage, Pandas is used for data set cleaning and data set division, and Sklearn is mainly used to build and train the model. First, a series of verifications is carried out with the multiple linear regression method and a multiple regression equation is established. The Pearson correlation coefficient is then used to assess the relative importance of each climate factor's influence on the number of newly confirmed cases and the correlation between each independent variable and the dependent variable, to look for linear relationships, and to judge the correlation between the observed variables. Finally, the adjusted coefficient of determination is used to judge the performance of the virus-climate factor relationship model and to determine how well each country's multiple linear regression model fits the real data.

A correlation analysis-based virus diffusion and climate factor relationship analysis method mainly comprises the following steps:

step 1, data source and experimental object:

The virus-related data come from the public data set of confirmed case counts published by the Center for Systems Science and Engineering at Johns Hopkins University, and the climate data are the daily records of weather stations around the world collected by the China weather data network. The 65 countries with more than 10,000 confirmed cases between March 22 and June 22 are selected as the research objects.

Step 2, data collection and pretreatment:

The collected virus data are the daily confirmed case counts of each country, and the number of newly confirmed novel coronavirus cases in each country is obtained by subtracting the previous day's confirmed count from the current day's confirmed count. The monthly average high temperature, monthly average low temperature, sea level pressure, altitude, wind speed, rainfall, dew point temperature and relative humidity of each country are selected as the climate factor data. Climate factor data missing on a single day are filled with the average of the data of the previous day and the next day. Climate factor data missing on consecutive dates are filled with 0 so as not to affect the experimental results. The data are divided into a training set and a test set at a ratio of 7:3.
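The preprocessing rules above can be sketched with Pandas roughly as follows. This is a minimal sketch, not the inventors' program: the function names `daily_new_cases` and `fill_climate_gaps` are hypothetical, and dropping the first day (which has no previous day to subtract) is an assumption of the example.

```python
import numpy as np
import pandas as pd


def daily_new_cases(cumulative: pd.Series) -> pd.Series:
    """Turn a series of cumulative confirmed counts into daily new counts."""
    # diff() subtracts the previous day's value from each day's value.
    # The first day has no predecessor, so it is dropped (assumption).
    return cumulative.diff().dropna()


def fill_climate_gaps(values: pd.Series) -> pd.Series:
    """Fill single-day gaps with the neighbour mean and longer gaps with 0."""
    prev_day = values.shift(1)
    next_day = values.shift(-1)
    # A single-day gap is a missing value whose two neighbours both exist.
    isolated = values.isna() & prev_day.notna() & next_day.notna()
    filled = values.where(~isolated, (prev_day + next_day) / 2)
    # Anything still missing lies in a run of consecutive missing days
    # (or at the edge of the series) and is filled with 0.
    return filled.fillna(0)


# Small illustrative series (made-up numbers, not data from the study).
cumulative = pd.Series([10, 15, 15, 30])
temperature = pd.Series([20.0, np.nan, 24.0, np.nan, np.nan, 30.0])
print(daily_new_cases(cumulative).tolist())       # [5.0, 0.0, 15.0]
print(fill_climate_gaps(temperature).tolist())    # [20.0, 22.0, 24.0, 0.0, 0.0, 30.0]
```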

Step 3, constructing a multiple linear regression model:

The number of newly confirmed cases (New) is taken as the dependent variable y, and the climate factors, namely the monthly average high temperature (t_max), the monthly average low temperature (t_min), the sea level pressure (S_P), the wind speed (W_S), the altitude (EI), the rainfall (RF), the dew point temperature (DP) and the relative humidity (Humidity), are the independent variables x1, x2, x3, x4, x5, x6, x7, x8, respectively. β0, β1, β2, β3, β4, β5, β6, β7, β8 are the unknown parameters, β0 being the intercept and β1 to β8 corresponding to the independent variables; ε is the error term. The multiple linear regression model is given by formula 1:

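$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \varepsilon \qquad (1)$$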

Formula 1 expresses the number of newly confirmed cases as a weighted sum of the climate factors, and the weight of each climate factor is estimated by linear regression.
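The text does not spell out how the weights are estimated. Assuming ordinary least squares, which is what Sklearn's `LinearRegression` fits, the estimate can be written in matrix form as below. The design matrix $X$ and the observation index $i$ are notation introduced here for illustration only.

```latex
% X is the n-by-9 design matrix whose i-th row is (1, x_{i1}, ..., x_{i8}),
% y is the vector of the n observed daily new-case counts.
\hat{\boldsymbol{\beta}}
  = \arg\min_{\boldsymbol{\beta}}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{8} \beta_j x_{ij} \Bigr)^{2}
  = (X^{\top} X)^{-1} X^{\top} \mathbf{y},
\qquad \text{provided } X^{\top} X \text{ is invertible.}
```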

Step 4, training a multiple linear regression model:

Because the number of data samples is small, and in order to reserve some data for testing while keeping enough data for model training, 70% of the observation data are selected as the training set in the experiment; that is, the daily number of newly confirmed cases and the eight types of climate factor data from March 22 to May 8 are used as training data. The training set is input into the written program, and the multiple linear regression coefficients of the 65 countries are computed to obtain the trained models.
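A minimal sketch of this training step for one country, using Sklearn's `LinearRegression`. The helper name `train_country_model`, the `FEATURES` list and the synthetic data are assumptions made for illustration; the real program and data are not shown in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Column order follows x1..x8 in formula 1 (names are illustrative).
FEATURES = ["t_max", "t_min", "S_P", "W_S", "EI", "RF", "DP", "Humidity"]


def train_country_model(X_train: np.ndarray, y_train: np.ndarray) -> LinearRegression:
    """Fit one multiple linear regression model for a single country."""
    model = LinearRegression()  # ordinary least squares with an intercept
    model.fit(X_train, y_train)
    return model


# Synthetic, noise-free stand-in for one country's training data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(48, len(FEATURES)))
beta_demo = np.arange(1.0, 9.0)        # made-up "true" weights 1..8
y_demo = 3.0 + X_demo @ beta_demo      # made-up intercept of 3

demo_model = train_country_model(X_demo, y_demo)
print("intercept:", demo_model.intercept_)
print("coefficients:", dict(zip(FEATURES, demo_model.coef_)))
```

In the setting described above, one such model would be fitted per country, giving 65 sets of coefficients.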

Step 5, model checking:

The climate factor data of the test set are input into the constructed multiple linear regression model to obtain the predicted daily number of newly confirmed cases, and the adjusted coefficient of determination $\bar{R}^2$ is used to judge model performance. The influence of the number of variables on the coefficient of determination is suppressed by dividing the residual sum of squares by its degrees of freedom and dividing the total sum of squared deviations by its degrees of freedom. The calculation is shown in formula 2:

$$\bar{R}^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \,/\, (n-p-1)}{\sum_{i=1}^{n}(y_i-\bar{y})^2 \,/\, (n-1)} \qquad (2)$$

In formula 2, n is the number of observations, p is the number of independent variables, $\hat{y}_i$ is the predicted value and $\bar{y}$ is the mean of the observed values. The closer the adjusted coefficient of determination $\bar{R}^2$ is to 1, the better the model fits the relationship between the variables and the more accurate the model. A model whose adjusted coefficient of determination is greater than 0.5 is considered to fit well.
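The description of formula 2 (residual sum of squares and total sum of squares, each divided by its degrees of freedom) can be written directly in NumPy. This is a sketch only; the function name `adjusted_r2` and its arguments are hypothetical.

```python
import numpy as np


def adjusted_r2(y_true, y_pred, n_features: int) -> float:
    """Adjusted coefficient of determination for a linear model with an intercept."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squared deviations
    # Each sum of squares is divided by its own degrees of freedom.
    return float(1.0 - (ss_res / (n - n_features - 1)) / (ss_tot / (n - 1)))


# Made-up numbers: five observations, one predictor, one prediction off by 1.
print(adjusted_r2([1, 2, 3, 4, 5], [1, 2, 3, 4, 6], n_features=1))
```

With eight climate factors, `n_features` would be 8, so the number of test observations must exceed 9 for the denominator to be positive.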

Step 6, calculating the correlation coefficient of the independent variable and the dependent variable:

The correlation between two sets of data is described by the Pearson correlation coefficient R. When the two sets of data are weakly correlated, 0 ≤ |R| < 0.3; when they are moderately correlated, 0.3 ≤ |R| < 0.6; when they are highly correlated, 0.6 ≤ |R| ≤ 1. The calculation is shown in formula 3:

$$R = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \qquad (3)$$

The correlation coefficient between each climate factor and the daily number of newly confirmed cases in different regions is calculated with the Pearson correlation coefficient formula, providing data support for analyzing the strength of the correlation between the number of newly confirmed cases in each country and each climate factor parameter.
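As an illustration of this step, the sketch below computes the Pearson coefficient from its textbook definition and maps it to the three strength bands given above. The function names are hypothetical, and applying the bands to the absolute value of R is an assumption of the example.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))


def correlation_strength(r: float) -> str:
    """Classify |R| into the weak / medium / high bands used in the text."""
    magnitude = abs(r)
    if magnitude < 0.3:
        return "weak"
    if magnitude < 0.6:
        return "medium"
    return "high"


# Made-up values standing in for one climate factor and daily new cases.
r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(r, correlation_strength(r))
```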

The invention is mainly characterized in that:

Current research on virus transmission and climate factors is limited to analyzing individual climate factors, and the influence of multiple types of climate factors on virus transmission remains unclear. The invention uses a multiple linear regression model to analyze the relationship between virus spread in 65 countries and 8 climate factors. The climate factors most strongly correlated with the number of newly confirmed cases are identified through the Pearson correlation coefficient. The adjusted coefficient of determination is used to verify model performance: for two thirds of the countries in the sample, the adjusted coefficient of determination of the multiple linear regression model is greater than 0.5, indicating a good fit. When the climate factor parameters in the test set are input into the model, the predicted number of newly confirmed cases is consistent with the actual data. In addition, the prediction of the number of new cases depends only on the parameters of the current day, so, compared with direct long-sequence prediction, error propagation is avoided.

The invention finds a relatively high correlation between the virus and temperature and humidity, and can provide data to support decision making on virus prevention and control in countries around the world. The number of newly confirmed cases predicted by the model can guide each country in adopting prevention and control measures of different strictness. In addition, climate-related prevention and control suggestions can be offered to countries around the world, so that active measures can be taken with respect to factors such as temperature and humidity that favor virus survival.

Drawings

FIG. 1 is a general structure diagram of a correlation analysis-based method for studying the relationship between virus spread and climate factors according to the present invention.

FIG. 2 is a graph comparing portions of test data and model prediction data for the present invention.

FIG. 3 is a diagram showing the correlation between some climatic factors and the number of newly added confirmed persons.

Detailed Description

The present invention will be described in further detail below with reference to specific embodiments and with reference to the attached drawings.

The invention provides a correlation analysis-based model of virus diffusion and climate factor relationship, which specifically comprises the following steps:

the hardware used by the invention comprises 1 PC and 1 NVIDIA GTX1650 graphics card;

step 1, data collection:

The public data set of confirmed case counts for the virus published by the Center for Systems Science and Engineering at Johns Hopkins University is downloaded and saved. The daily records of meteorological stations around the world are collected and downloaded from the China meteorological data network.

Step 2, data preprocessing:

The collected virus data are the cumulative confirmed case counts per day in each country, and the number of newly confirmed cases in each country is obtained by subtracting the previous day's cumulative count from the current day's cumulative count. The monthly average high temperature, monthly average low temperature, sea level pressure, altitude, wind speed, rainfall, dew point temperature and relative humidity of each country are selected as the climate factor data. Climate factor data missing on a single day are filled with the average of the data of the previous day and the next day. Climate factor data missing on consecutive dates are filled with 0 so as not to affect the experimental results.

Step 3, data set division and training model:

The data set is divided into a training set and a test set at a ratio of 7:3. That is, for each country, the data from March 22 to May 8 form the training set, and the data from May 9 to June 22 form the test set.
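A date-based split like the one described can be sketched with Pandas as follows. The function name `split_by_date` is hypothetical, and the year 2020 in the example is an assumption, since the text gives only the month and day.

```python
import pandas as pd


def split_by_date(df: pd.DataFrame, last_train_day: str):
    """Split a date-indexed frame into training rows (up to and including
    last_train_day) and test rows (all later days)."""
    df = df.sort_index()
    cutoff = pd.Timestamp(last_train_day)
    train = df.loc[df.index <= cutoff]
    test = df.loc[df.index > cutoff]
    return train, test


# One row per day from March 22 to June 22 (year assumed for illustration).
days = pd.date_range("2020-03-22", "2020-06-22", freq="D")
frame = pd.DataFrame({"New": range(len(days))}, index=days)

train, test = split_by_date(frame, "2020-05-08")
print(len(train), len(test))  # 48 training days and 45 test days
```

Splitting by date rather than at random keeps the test period strictly after the training period.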

A linear regression model is built in Python and the training set is input into the model; the intercept and linear regression coefficients obtained after training complete the multiple linear regression model. The test set is then used to verify how well the current model represents the data.

Step 4, model checking:

The climate factor data of the test set are input into the constructed multiple linear regression model to obtain the predicted daily number of newly confirmed cases. The score method of the multiple linear regression model is called to obtain the adjusted coefficients of determination of the 65 models. Models whose adjusted coefficient of determination is greater than 0.5 are judged to fit well, and these well-fitting models are selected for calculating the correlation coefficients between the independent variables and the dependent variable.
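In scikit-learn, `LinearRegression.score` returns the ordinary coefficient of determination R² rather than the adjusted value, so one way to arrive at the adjusted coefficient is to convert the score afterwards. The sketch below shows that conversion and the 0.5 selection rule; the helper names `adjusted_score` and `select_well_fitted` are hypothetical, and whether the original program performed this conversion is not stated in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def adjusted_score(model: LinearRegression, X_test: np.ndarray, y_test: np.ndarray) -> float:
    """Convert scikit-learn's plain R^2 score into the adjusted R^2."""
    n_samples, n_features = X_test.shape
    r2 = model.score(X_test, y_test)  # ordinary R^2 on the test data
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def select_well_fitted(scores: dict, threshold: float = 0.5) -> list:
    """Return the names whose adjusted R^2 is strictly above the threshold."""
    return sorted(name for name, score in scores.items() if score > threshold)


# Illustrative scores for four placeholder countries (made-up values).
print(select_well_fitted({"A": 0.7, "B": 0.5, "C": 0.2, "D": 0.9}))  # ['A', 'D']
```

The conversion is algebraically the same as dividing each sum of squares by its degrees of freedom, because R² equals one minus the ratio of the residual to the total sum of squares.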

Step 5, calculating the correlation coefficient of the independent variable and the dependent variable:

The correlation coefficient between each climate factor and the daily number of newly confirmed cases in different countries is calculated with the Pearson correlation coefficient formula. For each country, the correlation coefficients between the number of newly confirmed cases and the eight climate factor parameters over the period from March 22 to June 22 are calculated, giving a correlation coefficient matrix between any two observed variables in each country.
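With the data for one country held in a Pandas DataFrame, the correlation coefficient matrix can be obtained with `DataFrame.corr`. The sketch below assumes one column per observed variable and a target column named `New`; the helper names and the small example frame are hypothetical.

```python
import pandas as pd


def correlation_matrix(country_df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between every pair of columns for one country."""
    return country_df.corr(method="pearson")


def climate_vs_new_cases(country_df: pd.DataFrame, target: str = "New") -> pd.Series:
    """Correlation of each climate factor with the daily new-case count."""
    return correlation_matrix(country_df)[target].drop(target)


# Made-up values for two climate factors and the new-case count.
demo = pd.DataFrame({"t_max": [10.0, 12.0, 15.0, 19.0]})
demo["t_min"] = -demo["t_max"]
demo["New"] = 2 * demo["t_max"] + 1

print(correlation_matrix(demo))
print(climate_vs_new_cases(demo))
```

A column that never changes, such as a fixed altitude, has zero variance, so its correlations come out as NaN in this matrix.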

The above embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and the scope of the present invention is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present invention, and such modifications and equivalents should also be considered as falling within the scope of the present invention.
