Oil fingerprint identification method for selecting biomarkers based on principal component difference degree

Document No.: 648477. Publication date: 2021-05-14.

Summary: This technology, an oil fingerprint identification method for selecting biomarkers based on principal component difference degree, was created by 张鲁筠, 王春艳, 黄小东, and 王岩 on 2020-11-30. The invention provides an oil fingerprint identification method for selecting biomarkers based on principal component difference degree. The method comprises obtaining information on the full set of biomarkers, calculating the original principal component matrix, calculating a new principal component matrix after removing each biomarker in turn, calculating the degree of difference, and selecting the important biomarkers. With the small number of biomarkers selected by the method, the reliability and accuracy of classifying and identifying oil spill samples are fully comparable to, and can even exceed, those obtained with the full set of biomarkers. The method allows faster elution procedures to be developed, simplifies chemical interpretations that were previously cumbersome and possibly contradictory, and helps to obtain more accurate identification results. In addition, the selected biomarker set can be compared with the results of knowledge- and experience-based chemical separation methods, which opens the possibility of finding new useful biomarkers and exploring their chemical and geological significance.

1. An oil fingerprint identification method for selecting biomarkers based on principal component difference degree, characterized in that the method comprises the steps of obtaining information on a full set of biomarkers, calculating an original principal component matrix, calculating a new principal component matrix after removing the biomarkers one by one, calculating the degree of difference, and selecting the important biomarkers.

2. The oil fingerprint identification method for selecting biomarkers based on principal component difference degree according to claim 1, wherein calculating the original principal component matrix comprises: forming a matrix $X_{m \times n}$ from the detection values of the full set of m biomarkers for n samples, used as observations; performing PCA on it; calculating the principal component contribution rates; and selecting the first p principal components, whose cumulative contribution rate exceeds 95%, to obtain the original principal component matrix $PC_{p \times n}$.

3. The oil fingerprint identification method for selecting biomarkers based on principal component difference degree according to claim 1, wherein calculating the new principal component matrix comprises: starting from the first biomarker, removing the biomarkers one at a time to obtain a new matrix $X^{k}_{(m-1) \times n}$, and performing PCA again to obtain a new principal component matrix $PC^{k}_{p \times n}$.

4. The oil fingerprint identification method for selecting biomarkers based on principal component difference degree according to claim 1, wherein calculating the degree of difference comprises calculating the degree of difference $\mathrm{Difference}^{k}$ between the new $PC^{k}_{p \times n}$ and the original $PC_{p \times n}$.

5. The oil fingerprint identification method for selecting biomarkers based on principal component difference degree according to claim 1, wherein selecting the important biomarkers comprises comparing all $\mathrm{Difference}^{k}$ values and selecting the p biomarkers with the largest degrees of difference as the important biomarkers.

6. The oil fingerprint identification method for selecting biomarkers based on principal component difference degree according to any one of claims 3 to 5, wherein k is the biomarker index, ranging from 1 to m.

7. The oil fingerprint identification method for selecting biomarkers based on principal component difference degree according to claim 1, wherein the detection distribution results for the full set of biomarkers of the petroleum oil spill samples are obtained by GC-MS analysis.

Technical Field

The invention relates to an oil fingerprint identification method that selects gas chromatography-mass spectrometry (GC-MS) biomarkers based on principal component difference degree, and belongs to the technical field of oil fingerprint identification.

Background

The frequent occurrence of marine oil spill accidents, and the serious harm they cause to the marine environment and to human health, have made marine oil spill research one of the focuses of global environmental concern. Because oil spills are frequent and high-risk, it is necessary to determine the exact source of a spill and to monitor the chemical changes of crude oil during weathering and migration. Establishing an oil fingerprint identification technology that is fast, economical, simple, and easy to popularize therefore has important practical value for China, the world's largest developing country, which faces increasingly severe environmental pressure.

Oil fingerprinting involves a series of analytical and statistical techniques that objectively identify the most likely source of a hydrocarbon leak by matching the hydrocarbons in the spilled oil against a set of candidate sources. Gas chromatography-mass spectrometry (GC-MS) is recognized as the cornerstone of modern oil spill fingerprinting. Because of its strong and effective capabilities for separating and identifying compounds, GC-MS has rapidly been applied to the monitoring of water, air, soil, and marine environments, to agricultural supervision, to food safety, and to the discovery and production of medical products. In recent decades a great deal of research applying GC-MS to oil spill identification has appeared, which on the one hand demonstrates the effectiveness of GC-MS. On the other hand, GC-MS oil spill studies continue to be explored and updated, which suggests that GC-MS has not completely solved the oil spill identification problem. This is consistent with the widely accepted view that, owing to the complex nature of crude oil and the low concentration of biomarkers, no single approach can solve the problem.

Several bottlenecks in gas chromatography-mass spectrometry are responsible for this. One of the major bottlenecks is that almost all oil fingerprinting studies use the full set of biomarkers measured in the chromatogram, typically including terpanes, regular and rearranged steranes, monoaromatic and triaromatic steroids, bicyclic sesquiterpanes, and adamantanes. Detecting and analyzing such a large number of compounds requires highly skilled personnel and careful examination, and is therefore time-consuming and costly. In addition, the variables may behave ambiguously or even contradict one another, which complicates the interpretation of the results and makes decisions difficult. Researchers have come to recognize that the chromatographic ratios or abundances of certain biomarkers may be more informative than others, and that certain variables, if retained in the identification dataset, may even lead to incorrect identification results. Proper variable selection is therefore critical for minimizing uncertainty and producing reliable results.

At present, most research on selecting biomarkers relies on the researchers' knowledge and experience of petroleum analysis, and directly uses a chemical separation method to extract and analyze a specific biomarker set. However, this approach is generally effective for only one or a few types of oil sample and is likely to fail for new oil samples. It may also introduce subjective bias.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides an oil fingerprint identification method for selecting biomarkers based on principal component difference degree. Using chemometric and data analysis methods, a reduced group of biomarker parameters is found within the full set of GC-MS biomarker parameters and used to represent the main information in the whole data set, so that a classification and identification capability almost identical to that of the full set is obtained without losing important information.

In order to solve the technical problems, the invention adopts the following technical scheme:

An oil fingerprint identification method for selecting biomarkers based on principal component difference degree comprises the steps of obtaining information on the full set of biomarkers, calculating the original principal component matrix, calculating a new principal component matrix after removing the biomarkers one by one, calculating the degree of difference, and selecting the important biomarkers.

The following is a further improvement of the above technical solution:

Step 1: Obtain the detection distribution results for the full set of m biomarkers of the petroleum oil spill samples by GC-MS analysis.

Step 2: Form a matrix $X_{m \times n}$ from the detection values of the full set of biomarkers for all n samples, used as observations. Perform PCA on it and, according to the principal component contribution rates, select the first p principal components whose cumulative contribution rate exceeds 95% (these represent the most important information) to obtain the principal component matrix $PC_{p \times n}$.

Step 3: Starting from the first biomarker, remove the biomarkers one at a time to obtain a new matrix $X^{k}_{(m-1) \times n}$ (k denotes the index of the removed biomarker, ranging from 1 to m), and perform PCA again to obtain a new principal component matrix $PC^{k}_{p \times n}$.

Step 4: Calculate the degree of difference $\mathrm{Difference}^{k}$ between the new $PC^{k}_{p \times n}$ and the original $PC_{p \times n}$.

Step 5: Compare all $\mathrm{Difference}^{k}$ (k from 1 to m) and select the p biomarkers with the largest degrees of difference. These are the most important and most informative biomarkers, and they serve as the basis for oil fingerprint identification.

Principal component analysis (PCA) extracts the principal components of the original variables, which enables dimensionality reduction and classification based on those components. PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables (in the present invention, the original full set of biomarker parameters) into a set of values of linearly uncorrelated variables called principal components. The transformation is defined so that the first principal component has the largest variance and the variance of each subsequent principal component decreases in turn, while each component is orthogonal to the preceding ones. The first few principal components, which have the larger variances (larger contribution rates), carry the main information of the original variables. This can be expressed as $PC_{p \times n} = \mathrm{Loading}_{p \times m} X_{m \times n}$, where n is the number of samples, m is the number of observations per sample, and X is the matrix formed by all observations of the samples; p is the number of principal components and PC is the principal component matrix, in which p principal components are retained for each sample; Loading is the weighting coefficient matrix, and $\mathrm{Loading}_{i,j}$ is the weighting coefficient of the j-th observation of a sample for the i-th principal component, that is, the weight of the j-th observation in the i-th principal component.
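The properties described above (uncorrelated components, decreasing variance, and the product form PC = Loading · X) can be checked numerically. The sketch below is illustrative only: the helper name `principal_components` is hypothetical, and it assumes that each observation is mean-centred across samples and that the loadings are the eigenvectors of the sample covariance matrix.

```python
import numpy as np


def principal_components(X, p):
    """PCA via eigendecomposition for X of shape (m observations, n samples).

    Returns (PC of shape (p, n), Loading of shape (p, m), contribution rates).
    """
    Xc = X - X.mean(axis=1, keepdims=True)       # centre each observation (assumption)
    cov = Xc @ Xc.T / (X.shape[1] - 1)           # (m, m) sample covariance
    eigval, eigvec = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]             # reorder to descending variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    loading = eigvec[:, :p].T                    # rows are orthonormal weight vectors
    contribution = eigval[:p] / eigval.sum()     # contribution rate of each component
    return loading @ Xc, loading, contribution


rng = np.random.default_rng(1)
X_demo = rng.normal(size=(6, 20))                # synthetic data, not patent data
pc_demo, loading_demo, contribution_demo = principal_components(X_demo, p=3)
score_cov = np.cov(pc_demo)                      # off-diagonal entries are ~0
```

The diagonal of `score_cov` holds the component variances in decreasing order, and `loading_demo @ loading_demo.T` is close to the identity matrix, which is the orthogonality the text refers to.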

As the formula shows, every observation contributes to the principal component result, but some contributions are small and others large. If one observation (one biomarker) is removed from the full set of observations (the full set of biomarkers) and the principal components are recalculated, the new result, denoted $PC^{k}_{p \times n}$, no longer contains the contribution of that observation. The degree of difference between $PC^{k}_{p \times n}$ and the original $PC_{p \times n}$ can then be calculated to judge whether the contribution of that observation to the principal components is important. If the degree of difference is small, the observation has little effect on the principal components and can be discarded; otherwise, the information in that observation plays an important role and must be retained. The observations can therefore be removed one at a time in a cross-validation manner and all degrees of difference compared; the p observations with the largest degrees of difference (the number of selected observations equals the number of principal components) are the most important biomarkers to be selected. The degree of difference is calculated in the classical mean-square-error form.
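A mean-square-error form consistent with this description can be written as follows. The normalisation over all p × n entries is an assumption made here, and $PC^{k}_{i,j}$ denotes the entry for the i-th principal component and j-th sample after biomarker k has been removed.

```latex
\mathrm{Difference}^{k}
  = \frac{1}{p\,n} \sum_{i=1}^{p} \sum_{j=1}^{n}
    \left( PC^{k}_{i,j} - PC_{i,j} \right)^{2},
  \qquad k = 1, \dots, m
```

Because a different constant factor would scale every $\mathrm{Difference}^{k}$ equally, the ranking of the biomarkers does not depend on the assumed normalisation.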

Compared with the prior art, the invention has the following technical effects:

With the oil fingerprint identification method of the invention, the small number of selected biomarkers classifies and identifies oil spill samples with high reliability and accuracy. Experiments on the example show that, for both PCA spatial clustering (three-dimensional and two-dimensional) and hierarchical clustering, the results obtained with the selected biomarkers are fully comparable to, or even better than, those obtained with the full set of biomarkers. When the selected biomarkers are used as the classification basis and the oil samples are classified with a generalized regression neural network (GRNN), the correct recognition rate is higher than with the original full set of biomarkers.

Because the method significantly reduces the number of key variables (biomarkers) used to identify a sample, only some compounds have to be analyzed carefully. This allows faster elution procedures to be developed and reduces the pre-treatment time accordingly. The reduced set of key variables also simplifies chemical interpretations that were previously cumbersome and possibly contradictory, and helps to obtain more accurate identification results.

In addition, because the biomarker set is selected by data analysis, the method is fully objective and independent of subjective experience.

The biomarker set selected by the method can be compared with the results of knowledge- and experience-based chemical separation methods. This opens the possibility of finding new useful biomarkers and exploring their chemical and geological significance, and offers new ideas and prospects for petrochemical and geological analysis.

Drawings

FIG. 1 is the GC-MS detection profile of the 61 biomarkers of oil sample LD1;

FIG. 2 is the GC-MS detection profile of the 61 biomarkers of oil sample LD2;

FIG. 3 is the GC-MS detection profile of the 61 biomarkers of oil sample LD3;

FIG. 4 is the GC-MS detection profile of the 61 biomarkers of oil sample BZ1;

FIG. 5 is the GC-MS detection profile of the 61 biomarkers of oil sample BZ2;

FIG. 6 is the GC-MS detection profile of the 61 biomarkers of oil sample NH;

FIG. 7 is the GC-MS detection profile of the 61 biomarkers of oil sample WC;

FIG. 8 is the GC-MS detection profile of the 61 biomarkers of oil sample NB;

FIG. 9 is the GC-MS detection profile of the 61 biomarkers of oil sample CB;

FIG. 10 is the GC-MS detection profile of the 61 biomarkers of oil sample SZ;

FIG. 11 is a plot of the principal component degrees of difference of the 61 biomarkers;

FIG. 12 shows the distribution of the 5 selected biomarkers in oil sample LD1;

FIG. 13 shows the distribution of the 5 selected biomarkers in oil sample LD2;

FIG. 14 shows the distribution of the 5 selected biomarkers in oil sample LD3;

FIG. 15 shows the distribution of the 5 selected biomarkers in oil sample BZ1;

FIG. 16 shows the distribution of the 5 selected biomarkers in oil sample BZ2;

FIG. 17 shows the distribution of the 5 selected biomarkers in oil sample NH;

FIG. 18 shows the distribution of the 5 selected biomarkers in oil sample WC;

FIG. 19 shows the distribution of the 5 selected biomarkers in oil sample NB;

FIG. 20 shows the distribution of the 5 selected biomarkers in oil sample CB;

FIG. 21 shows the distribution of the 5 selected biomarkers in oil sample SZ;

FIG. 22 is a three-dimensional PCA spatial clustering plot (PC1-PC2-PC3) based on the full set of biomarkers;

FIG. 23 is a three-dimensional PCA spatial clustering plot (PC1-PC2-PC3) based on the selected biomarkers;

FIG. 24 is a two-dimensional PCA spatial clustering plot (PC1-PC2) based on the full set of biomarkers;

FIG. 25 is a two-dimensional PCA spatial clustering plot (PC1-PC2) based on the selected biomarkers;

FIG. 26 is a hierarchical clustering dendrogram based on the full set of biomarkers;

FIG. 27 is a hierarchical clustering dendrogram based on the selected biomarkers;

FIG. 28 shows the GRNN classification results based on the full set of biomarkers;

FIG. 29 shows the GRNN classification results based on the selected biomarkers.

Detailed Description

Example:

1. Petroleum samples and treatment

Ten crude oil samples of four types were selected for GC-MS testing: type A, comprising Lüda LD-A11# (LD1), LD-A16# (LD2), and LD-A12# (LD3); type B, comprising Bozhong BZ26-2 (BZ1) and BZ28-1 (BZ2); type C, comprising Nanhai (South China Sea) oil (NH) and Wenchang oil (WC); and type D, comprising A12# (NB), Suibei north 306# (CB), and 36-1# (SZ) of the Bohai oil field.

Sample treatment for GC-MS: 800 mg of crude oil was dissolved in 10 mL of n-hexane to prepare a crude oil stock solution of 80 mg/mL. A 200 µL aliquot of the stock solution was loaded onto a 10 mm column packed with 3 g of activated silica gel (topped with 1.0 cm of anhydrous sodium sulfate), the saturated hydrocarbon fraction F1 was eluted with 12 mL of n-hexane, and the eluate F1 was concentrated to about 0.9 mL under a nitrogen blower. Then 100 µL of internal standard (containing d₁₈-decahydronaphthalene, d₁₆-adamantane, and C₃₀ 17β(H),21β(H)-hopane) was added to give 1.0 mL of concentrate for GC-MS analysis.
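The quantities in this preparation can be checked with a short worked calculation. It assumes that the 200 µL loaded onto the column is drawn from the 80 mg/mL stock solution.

```latex
\begin{aligned}
c_{\text{stock}} &= \frac{800\ \mathrm{mg}}{10\ \mathrm{mL}} = 80\ \mathrm{mg/mL} \\
m_{\text{loaded}} &= 0.200\ \mathrm{mL} \times 80\ \mathrm{mg/mL} = 16\ \mathrm{mg\ of\ crude\ oil} \\
V_{\text{final}} &= 0.9\ \mathrm{mL} + 0.100\ \mathrm{mL} = 1.0\ \mathrm{mL}
\end{aligned}
```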

2. GC-MS experiment

GC-MS measurements were performed on an Agilent HP 6890 instrument (Agilent Technologies, Palo Alto, Calif., USA) equipped with a pulsed splitless injector, an HP 5973 mass spectrometer, and an HP-5MS fused silica capillary column (J & W Scientific, Folsom, Calif., USA). The operating conditions were as follows. Oven program: initial temperature 50 °C, held for 2 minutes, then raised to 300 °C at 6 °C/min and held for 16 minutes. Carrier gas: helium. Injection: pulsed splitless mode at a rate of 1.0 mL/min, with injector and detector temperatures of 290 °C and 300 °C, respectively. Ionization voltage: 70 eV. Ion source temperature: 230 °C. The m/z range of the MS analysis was 40-400.
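The length of one chromatographic run follows directly from the oven program. The helper below is a hypothetical convenience function written for this illustration, not part of any instrument software.

```python
def oven_program_minutes(start_c, end_c, ramp_c_per_min, initial_hold_min, final_hold_min):
    """Total oven time for a single-ramp program: hold, linear ramp, hold."""
    ramp_min = (end_c - start_c) / ramp_c_per_min
    return initial_hold_min + ramp_min + final_hold_min


# 50 °C for 2 min, 6 °C/min up to 300 °C, then 16 min at 300 °C
total_minutes = oven_program_minutes(50, 300, 6, 2, 16)
print(f"{total_minutes:.1f} min")
```

The ramp from 50 °C to 300 °C takes 250 / 6 ≈ 41.7 minutes, so each run lasts about 59.7 minutes excluding cool-down.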

3. Selecting the GC-MS biomarkers according to the specific scheme of the invention:

Step 1: For the 10 petroleum oil spill samples, the detection values of the full set of 61 biomarkers were obtained by GC-MS detection and analysis. These are the biomarkers commonly measured in oil fingerprint chromatography, including terpanes, regular and rearranged steranes, monoaromatic and triaromatic steroids, bicyclic sesquiterpanes, and adamantanes. The detection profiles are shown in FIGS. 1-10, and detailed information on the 61 biomarkers is given in Table 1.

Table 1. The 61 biomarkers detected by GC-MS

Step 2: Form a matrix $X_{61 \times 10}$ from the detection values of the full set of biomarkers for all samples, used as observations. Perform PCA on it and select the first 5 principal components, for which the cumulative contribution rate reaches 98%, to obtain the principal component matrix $PC_{5 \times 10}$.

Step 3: Starting from the first biomarker, remove the biomarkers one at a time to obtain a new matrix $X^{k}_{60 \times 10}$ (k denotes the index of the removed biomarker, ranging from 1 to 61), and perform PCA again to obtain a new principal component matrix $PC^{k}_{5 \times 10}$.

Step 4: Calculate the degree of difference $\mathrm{Difference}^{k}$ between the new $PC^{k}_{5 \times 10}$ and the original $PC_{5 \times 10}$; see FIG. 11.

Step 5: Compare all $\mathrm{Difference}^{k}$ (k from 1 to 61) and select the 5 biomarkers with the largest degrees of difference. These are the most important and most informative biomarkers, and they serve as the basis for oil fingerprint identification.

4. Analyzing and verifying the classification and identification capability of the selected biomarkers

FIG. 11 shows the distribution of the principal component degree of difference $\mathrm{Difference}^{k}$. By comparison, C₂₉ (k = 6), C₃₀ (k = 7), G (k = 20), SQT3 (k = 54), and SQT4 (k = 55) give the largest degrees of difference and were therefore selected as the reduced biomarker combination.

FIGS. 12-21 show the distributions of these five selected biomarkers in the 10 samples. With only these 5 biomarker parameters, petroleum samples of different classes already differ significantly, so the parameters are entirely feasible as a basis for classification and identification.

To further verify the reliability and accuracy of classifying and identifying oil spill samples with the selected biomarkers, PCA spatial clustering (spatial clustering on the first three principal components, $PC_{3 \times 10}$, and on the first two principal components, $PC_{2 \times 10}$), hierarchical clustering, and a generalized regression neural network (GRNN) were each used to compare the classification obtained with the original full set of biomarkers and with the selected biomarkers, as shown in FIGS. 22-29.

For both PCA spatial clustering (three-dimensional and two-dimensional) and hierarchical clustering, the results obtained with the selected biomarkers are fully comparable to, or even better than, those obtained with the full set of biomarkers.
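The hierarchical clustering check can be illustrated with a small agglomerative routine. The text does not state which distance or linkage was used, so Euclidean distance and average linkage are assumptions here, the function name `average_linkage` is hypothetical, and the points are synthetic rather than the patent's data.

```python
import numpy as np


def average_linkage(points, n_clusters):
    """Naive agglomerative clustering with average linkage.

    points: array of shape (n_samples, n_features), for example samples
    described by their selected biomarkers or by their PC scores.
    Returns an integer label per sample.
    """
    clusters = [[i] for i in range(len(points))]
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    labels = np.empty(len(points), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Two well-separated synthetic groups of three samples each.
demo_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                        [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
demo_labels = average_linkage(demo_points, n_clusters=2)
```

Running the same routine once on the full biomarker set and once on the selected set, and comparing the resulting labels with the known oil types, mirrors the comparison shown in FIGS. 26 and 27.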

When the selected biomarkers are used as the classification basis, the correct recognition rate reaches 100% when the oil samples are classified with the GRNN, whereas with the full set of biomarkers it reaches only 90%, with the BZ1 sample misidentified as an LD sample. This indicates that the original full set of biomarkers contains ambiguous or even misleading information, whereas the information carried by the selected biomarkers is more definite. These verification experiments all show that the biomarkers selected by the method of the invention are reliable and accurate as a classification basis. The method makes it possible to develop faster elution procedures, simplifies chemical interpretations that were cumbersome and possibly contradictory, and yields more efficient and more accurate oil fingerprint identification results.
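A GRNN can be viewed as a Gaussian-kernel-weighted average of the training targets, which makes a classification sketch short. The text does not give the network settings, so the smoothing parameter `sigma`, the one-hot class encoding, and the function name `grnn_predict` are assumptions, and the data below are synthetic.

```python
import numpy as np


def grnn_predict(train_x, train_y, query_x, sigma=1.0):
    """Generalized regression neural network prediction.

    train_x: (n, d) training features; train_y: (n, c) one-hot class targets;
    query_x: (q, d) samples to classify. Returns (q, c) class scores.
    """
    # Squared Euclidean distance from every query to every training sample.
    d2 = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-d2 / (2.0 * sigma**2))          # pattern-layer activations
    # Summation and output layers: weighted average of the training targets.
    return (weights @ train_y) / weights.sum(axis=1, keepdims=True)


train_x = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
train_y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
query_x = np.array([[0.2, 0.5], [4.1, 4.4]])
scores = grnn_predict(train_x, train_y, query_x)
predicted_class = scores.argmax(axis=1)               # index of the highest score
```

With only 10 samples, a recognition rate such as those reported here would typically be estimated by holding each sample out in turn, but the text does not say which validation scheme was used.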
