Method for geolocating an unknown sample through its microbial metagenome

Document No.: 1467594 Publication date: 2020-02-21

Abstract: This technology, a method for geolocating an unknown sample through its microbial metagenome, was created by 许灿强, 黄丽红, 杨文娴 and 俞容山 on 2019-10-09. The invention relates to a method for geolocating an unknown sample through its microbial metagenome. The method predicts the source city of the unknown sample from its microbial metagenome. During data processing of both the training samples and the unknown sample, the abundance of each strain in a sample is graded: the double-precision strain abundance is converted into a discrete multi-level value by means of a plurality of thresholds. Compared with existing geolocation methods, the invention achieves high prediction accuracy.

1. A method for geolocating an unknown sample through a microbial metagenome, characterized by comprising the following steps:

Step 1, training a prediction model

Microbial samples of known origin are taken as training samples and used to train a prediction model; before training, the training samples undergo data preprocessing and feature selection;

Step 1.1, data preprocessing

The metagenomic sequencing data of the training samples are preprocessed, the preprocessing including quality control, abundance quantification, and abundance grading;

the abundance of each strain in the training samples is graded, wherein the grading converts the double-precision abundance value of each strain into a discrete multi-level value by means of a plurality of thresholds;

Step 1.2, feature selection

From the set of all strains in the samples, strains with discriminative power are selected as characteristic strains;

Step 1.3, prediction model training

The graded multi-level abundance values of all characteristic strains retained by feature selection, together with the source cities of the training samples, are used as input, and a machine learning method is trained on them to obtain the prediction model;

Step 2, geolocating the unknown sample

Step 2.1, data preprocessing of the unknown sample

The metagenomic sequencing data of the unknown sample are preprocessed, the preprocessing including quality control, abundance quantification, and abundance grading;

Step 2.2, feature selection

From the set of all strains in the unknown sample, a subset of strains with discriminative power is selected as characteristic strains;

Step 2.3, predicting the probability of the unknown sample for each training-set city

The graded multi-level abundance values of all characteristic strains of the unknown sample are input into the prediction model to obtain the probabilities $y_i$ ($i = 1, 2, \dots, n$) that the sample comes from each of the n cities;

Step 2.4, geolocation

If the unknown sample comes from one of the n cities, the city with the highest predicted probability is taken as the source city of the unknown sample.

2. The method of claim 1, wherein step 2.4 further comprises:

if the unknown sample is not from the n cities of the training set, let the coordinates of the n training-set cities in a designated coordinate system be $(x_i, y_i)$ ($i = 1, 2, \dots, n$) and the probabilities of the unknown sample for these cities be $z_i$ ($i = 1, 2, \dots, n$), respectively; probabilities are then computed by interpolation for all cities in the designated coordinate system, and the city with the highest probability is the source city of the unknown sample.

3. The method of claim 2, wherein the designated coordinate system is a geographic coordinate system, and the geographic coordinates of a city in the geographic coordinate system are the longitude and latitude of the city.

4. The method of claim 2, wherein the designated coordinate system is a biological coordinate system, the biological coordinates of a city in the biological coordinate system being obtained from its geographic coordinates by affine transformation, as follows:

the graded multi-level abundance values of all characteristic strains of the training samples are taken as input and reduced in dimension by the manifold learning method t-SNE (t-distributed stochastic neighbor embedding), thereby obtaining two-dimensional coordinates for each sample in the training set; for each city in the training set, the coordinates of the center point of the two-dimensional coordinates of all training samples from that city are computed and taken as the biological coordinates of the city; the longitude and latitude of each city are taken as its geographic coordinates, and the geographic coordinates of the training-set cities are converted into the corresponding biological coordinates through an affine transformation; the geographic coordinates of cities outside the training set are converted into biological coordinates through the affine transformation;

in the biological coordinate system, once the biological coordinates of the point with the maximum probability have been obtained by interpolation, those biological coordinates are converted by affine transformation into geographic coordinates, and the city corresponding to those geographic coordinates is the source city of the unknown sample.

5. The method of claim 1, wherein in step 1.2 and step 2.2 feature selection is performed by an ensemble learning method that combines recursive feature elimination with the random forest algorithm.

6. The method of claim 1, wherein in the data preprocessing the abundance of each strain is graded as follows: the double-precision abundance value is converted into a ternary value of -1, 0, or 1; for each strain contained in a sample, the abundance value is converted to -1 when it is below 25%, to 0 when it is between 25% and 75%, and to 1 when it is above 75%.

Technical Field

The invention relates to the technical field of microbiology, and in particular to a method for geolocating an unknown sample through its microbial metagenome.

Background

Microorganisms are the most abundant, most diverse, and most widely distributed group of organisms on Earth. Metagenomic technology based on high-throughput sequencing does not require culturing microorganisms and can directly analyze microbial samples taken from the environment. In the overall process, DNA is extracted from a sample and sequenced, and the sequencing results are then analyzed with algorithms and computer software. With current metagenomics, the genome sequences of environmental microorganisms can be obtained quickly and accurately from many different environmental samples. Metagenomic sequencing data can be used to detect microbial communities, quantify their abundance, and analyze the species composition and functional composition of a sample. Metagenomic technology brings new methods and ideas to the identification of disease sources, traceability analysis, and similar tasks, and has great potential in areas such as food safety and the prevention and control of infectious diseases.

Geolocating an unknown sample means determining the geographic origin of an unknown microbial sample by analyzing its sequencing data. Most existing methods predict the source city of microorganisms from 16S rRNA sequencing, and their prediction accuracy is unsatisfactory when the number of samples is small.

Disclosure of Invention

In view of the above problems, the present invention aims to provide a highly accurate method for geolocating an unknown sample through its microbial metagenome.

To achieve this purpose, the invention adopts the following technical scheme:

A method for geolocating an unknown sample through a microbial metagenome, comprising the following steps:

Step 1, training a prediction model

Microbial samples of known origin are taken as training samples and used to train a prediction model; before training, the training samples undergo data preprocessing and feature selection;

Step 1.1, data preprocessing

The metagenomic sequencing data of the training samples are preprocessed, the preprocessing including quality control, abundance quantification, and abundance grading;

the abundance of each strain in the training samples is graded, wherein the grading converts the double-precision abundance value of each strain into a discrete multi-level value by means of a plurality of thresholds;

Step 1.2, feature selection

From the set of all strains in the samples, strains with discriminative power are selected as characteristic strains;

Step 1.3, prediction model training

The graded multi-level abundance values of all characteristic strains retained by feature selection, together with the source cities of the training samples, are used as input, and a machine learning method is trained on them to obtain the prediction model;

Step 2, geolocating the unknown sample

Step 2.1, data preprocessing of the unknown sample

The metagenomic sequencing data of the unknown sample are preprocessed, the preprocessing including quality control, abundance quantification, and abundance grading;

Step 2.2, feature selection

From the set of all strains in the unknown sample, a subset of strains with discriminative power is selected as characteristic strains;

Step 2.3, predicting the probability of the unknown sample for each training-set city

The graded multi-level abundance values of all characteristic strains of the unknown sample are input into the prediction model to obtain the probabilities $y_i$ ($i = 1, 2, \dots, n$) that the sample comes from each of the n cities;

Step 2.4, geolocation

If the unknown sample comes from one of the n cities, the city with the highest predicted probability is taken as the source city of the unknown sample.

Step 2.4 further comprises:

if the unknown sample is not from the n cities of the training set, let the coordinates of the n training-set cities in a designated coordinate system be $(x_i, y_i)$ ($i = 1, 2, \dots, n$) and the probabilities of the unknown sample for these cities be $z_i$ ($i = 1, 2, \dots, n$), respectively; probabilities are then computed by interpolation for all cities in the designated coordinate system, and the city with the highest probability is the source city of the unknown sample.

The designated coordinate system may be a geographic coordinate system, in which the geographic coordinates of a city are its longitude and latitude.

Alternatively, the designated coordinate system is a biological coordinate system, the biological coordinates of a city in the biological coordinate system being obtained from its geographic coordinates by affine transformation, as follows:

the graded multi-level abundance values of all characteristic strains of the training samples are taken as input and reduced in dimension by the manifold learning method t-SNE (t-distributed stochastic neighbor embedding), thereby obtaining two-dimensional coordinates for each sample in the training set; for each city in the training set, the coordinates of the center point of the two-dimensional coordinates of all training samples from that city are computed and taken as the biological coordinates of the city; the longitude and latitude of each city are taken as its geographic coordinates, and the geographic coordinates of the training-set cities are converted into the corresponding biological coordinates through an affine transformation; the geographic coordinates of cities outside the training set are converted into biological coordinates through the affine transformation;

in the biological coordinate system, once the biological coordinates of the point with the maximum probability have been obtained by interpolation, those biological coordinates are converted by affine transformation into geographic coordinates, and the city corresponding to those geographic coordinates is the source city of the unknown sample.

In step 1.2 and step 2.2, feature selection is performed by an ensemble learning method that combines recursive feature elimination with the random forest algorithm.

In the data preprocessing, the abundance of each strain is graded as follows: the double-precision abundance value is converted into a ternary value of -1, 0, or 1; for each strain contained in a sample, the abundance value is converted to -1 when it is below 25%, to 0 when it is between 25% and 75%, and to 1 when it is above 75%.

With the above scheme, the invention predicts the source city of an unknown sample from its microbial metagenome. During data processing of both the training samples and the unknown sample, the abundance of each strain is graded: the double-precision abundance values are converted into discrete multi-level values by means of a plurality of thresholds. Grading is a quantization method that converts continuous values into discrete values, retaining significant differences between the abundance values of different strains while ignoring minor ones. Grading thus acts as denoising and improves the stability and robustness of the algorithm. Compared with existing geolocation methods, the invention achieves high prediction accuracy.

In addition, a designated coordinate system is defined in which both the training-set cities and the cities outside the training set are represented by coordinates. Probabilities are then computed by interpolation for all cities in the designated coordinate system, and the city with the highest probability is taken as the source city of the unknown sample; this city need not belong to the set of source cities of the training samples. In other words, the method can predict not only unknown samples that come from the source cities of the training samples but also unknown samples that come from other cities, which further improves the accuracy of geographic prediction for unknown samples.

Drawings

FIG. 1 is a data processing flow diagram of the present invention;

FIG. 2 is a flow chart of predictive model training in accordance with the present invention;

FIG. 3 is a flow chart of unknown sample location in accordance with the present invention;

FIG. 4 is a schematic representation of the affine transformation of the geographic coordinates and biological coordinates of the present invention.

Detailed Description

As shown in fig. 1 to 3, the present invention discloses a method for geolocating an unknown sample by means of a microbial metagenome, which specifically comprises the following steps:

Step 1, training a prediction model

Microbial samples of known origin are used as training samples to train a prediction model. Before training, the training samples must undergo data preprocessing and feature selection.

Step 1.1, data preprocessing

The metagenomic sequencing data of the training samples are stored as short reads in a FASTQ file. Each short read is represented by 4 lines of text: 1) the header information of the read; 2) the read itself, i.e. the base sequence; 3) a line reserved for additional information; 4) the sequence of quality values corresponding to the base sequence.
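The four-line record layout described above can be illustrated with a minimal Python sketch. It is not part of the disclosed method: the names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes the common FASTQ convention in which line 1 starts with `@` and line 3 starts with `+`.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # line 1, without the leading '@'
    sequence: str  # line 2, the base sequence
    extra: str     # line 3, without the leading '+'
    quality: str   # line 4, one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a stream that stores each read as four lines."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        extra = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not extra.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, extra[1:], quality)


# Two toy reads held in memory instead of a real sequencing file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(read_fastq(example))
```

Each yielded record keeps the base sequence and its quality string together, which is what the later quality-control step needs.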

The metagenomic sequencing data of the training samples are preprocessed, including quality control, abundance quantification, and abundance grading.

First, quality control is performed on the short reads in the metagenomic sequencing data.

During library construction and sequencing, various physicochemical factors, contamination, and limitations of the sequencing technology and the sequencer can cause the base quality in the sequencing results to be too low, or introduce contaminating sequences from other sources. To ensure the reliability of subsequent analysis, quality control trims or removes such reads, filtering out sequencing data that do not meet the quality standard.
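The text does not say which quality criterion or tool is used. As one possible illustration only, the sketch below drops reads whose mean base quality falls below a threshold. The Phred+33 encoding, the threshold of 20, and the function names are all assumptions made for this example.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into per-base scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality]


def passes_quality(quality: str, min_mean: float = 20.0, offset: int = 33) -> bool:
    """Keep a read only if its mean base quality reaches min_mean (assumed threshold)."""
    scores = phred_scores(quality, offset)
    return bool(scores) and sum(scores) / len(scores) >= min_mean


# Toy (sequence, quality) pairs: 'I' decodes to 40, '5' to 20, '!' to 0.
reads = [("ACGT", "IIII"), ("GGCA", "!!!!"), ("TTAG", "5555")]
kept = [seq for seq, qual in reads if passes_quality(qual)]
```

Real pipelines usually also trim low-quality read ends and remove contaminating sequences, which this sketch does not attempt.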

Next, the quality-controlled metagenomic sequencing data are aligned against reference genome sequences to detect the microorganisms present in the training sample and to quantify the abundance of each strain.

Finally, the abundance of each strain in the training sample is graded.

After the abundance of the microbial strains in each sample has been quantified, the proportion of each strain in the sample is known. Using these values directly for subsequent machine learning may cause the model to overfit, especially when the number of samples is insufficient. The abundance of each strain therefore needs to be graded, converting the double-precision abundance value into a discrete multi-level value by means of a plurality of thresholds. In this embodiment, the double-precision abundance value is converted into a ternary value of -1, 0, or 1: for each strain contained in a sample, the abundance value is converted to -1 when it is below 25%, to 0 when it is between 25% and 75%, and to 1 when it is above 75%. Each sample can then be represented by a vector whose elements take values in {-1, 0, 1}; each element represents the feature of one particular strain in the sample, and the set of strains is the union of the strains found in all samples.
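The grading rule can be sketched as follows. The text does not state what the 25% and 75% thresholds refer to; this sketch assumes they are the lower and upper quartiles of each strain's abundance across the training samples, which is only one possible reading. All function names and the toy data are hypothetical.

```python
from statistics import quantiles
from typing import Dict, List, Tuple

Thresholds = Dict[str, Tuple[float, float]]


def fit_thresholds(training: Dict[str, List[float]]) -> Thresholds:
    """Per strain, return (lower quartile, upper quartile) over the training samples."""
    thresholds = {}
    for strain, values in training.items():
        q1, _median, q3 = quantiles(values, n=4, method="inclusive")
        thresholds[strain] = (q1, q3)
    return thresholds


def grade(value: float, low: float, high: float) -> int:
    """Map one abundance to -1 (below low), 1 (above high), or 0 (in between)."""
    if value < low:
        return -1
    if value > high:
        return 1
    return 0


def grade_sample(sample: Dict[str, float], thresholds: Thresholds) -> List[int]:
    """Grade one sample in sorted strain order; absent strains count as abundance 0."""
    return [grade(sample.get(s, 0.0), *thresholds[s]) for s in sorted(thresholds)]


# Toy abundances (percent) of two strains across five training samples.
training = {
    "strain_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "strain_b": [2.0, 4.0, 10.0, 20.0, 40.0],
}
thresholds = fit_thresholds(training)
# strain_a is abundant in this sample and strain_b is absent from it.
vector = grade_sample({"strain_a": 5.0}, thresholds)
```

The thresholds are fitted once on the training samples and then reused unchanged for the unknown sample, so that both are graded on the same scale.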

Step 1.2, feature selection

Because the training-set samples contain a large number of strains, using the abundances of all strains directly as input features of the prediction model would make the feature vector very high-dimensional, which is unfavorable for subsequent analysis. Moreover, many strains contribute nothing to the geolocation of a sample. We therefore use a feature selection method to select, as features, a subset of strains with discriminative power.

Many machine learning algorithms can perform feature selection. This example uses an ensemble learning method that combines recursive feature elimination with the random forest [4] algorithm, taking the union of the features selected by the two algorithms as the input of the prediction model.

In recursive feature elimination, each feature of the initial feature set is evaluated by its weight in a logistic regression model, the features with the lowest weights are removed from the feature set, and the reduced feature set is then used as input and re-evaluated, until only the specified number of highest-weight features remains. In our application example, 50 characteristic strains were selected by recursive feature elimination.
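The elimination loop itself can be sketched independently of the model that supplies the weights. In the sketch below, `fit_weights` is a hypothetical callback standing in for refitting the logistic regression on the surviving features; the toy scorer `mean_gap_weights` is only a placeholder, not logistic regression, and the data are invented.

```python
from typing import Callable, Dict, List, Sequence


def recursive_feature_elimination(
    features: Sequence[str],
    fit_weights: Callable[[List[str]], Dict[str, float]],
    n_keep: int,
    step: int = 1,
) -> List[str]:
    """Repeatedly refit, rank features by |weight|, and drop the weakest ones."""
    remaining = list(features)
    while len(remaining) > n_keep:
        weights = fit_weights(remaining)  # refit on the surviving features
        n_drop = min(step, len(remaining) - n_keep)
        weakest = sorted(remaining, key=lambda f: abs(weights[f]))[:n_drop]
        remaining = [f for f in remaining if f not in weakest]
    return remaining


# Toy graded abundances for three strains and a binary city label.
rows = [
    {"a": 1, "b": 0, "c": 1},
    {"a": 1, "b": 1, "c": -1},
    {"a": -1, "b": 0, "c": 1},
    {"a": -1, "b": 1, "c": 0},
]
labels = [1, 1, 0, 0]


def mean_gap_weights(selected: List[str]) -> Dict[str, float]:
    """Placeholder scorer: difference between the class means of each feature."""
    weights = {}
    for f in selected:
        pos = [r[f] for r, y in zip(rows, labels) if y == 1]
        neg = [r[f] for r, y in zip(rows, labels) if y == 0]
        weights[f] = sum(pos) / len(pos) - sum(neg) / len(neg)
    return weights


selected = recursive_feature_elimination(["a", "b", "c"], mean_gap_weights, n_keep=2)
```

With a real model the weights change after every refit, which is why features are re-ranked on each pass rather than ranked once.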

In random-forest feature selection, each node of each decision tree in the forest is a condition on one feature, and the decision trees split the data set according to the labels. For each node of each decision tree, its Gini impurity can be calculated. The Gini impurity of a node is the probability that a sample randomly drawn from the node is misclassified when it is labeled at random according to the label distribution of the samples in the node. While training the random forest, the mean decrease in impurity of each feature can be calculated and used as the feature-selection weight. In our application example, 243 characteristic strains were selected by the random forest.
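The Gini impurity of a node and the impurity decrease produced by one split can be computed as in the following sketch. The function names and the toy labels are illustrative; a random forest would average such decreases, per feature, over all nodes and trees.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini_impurity(labels: Sequence[Hashable]) -> float:
    """Probability of mislabeling a random sample drawn from the node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(
    parent: Sequence[Hashable],
    left: Sequence[Hashable],
    right: Sequence[Hashable],
) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# A node with two samples from each of two cities, split perfectly by one strain.
parent = ["city_A", "city_A", "city_B", "city_B"]
decrease = impurity_decrease(parent, ["city_A", "city_A"], ["city_B", "city_B"])
```

A split that separates the cities perfectly removes all of the parent's impurity, so a strain that produces such splits receives a high weight.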

The union of the two feature sets obtained by recursive feature elimination and by the random forest is taken as the input of the subsequent prediction model.

Step 1.3, predictive model training

In this step, the graded ternary abundance values of all characteristic strains retained by feature selection, together with the source cities of the training samples, are used as input, and a machine learning method is trained on them to obtain the prediction model.

In this example, a multi-class classifier is built from logistic regression combined with the one-vs-rest (OvR) strategy to predict the probability that each sample comes from each city in the training set. Assuming the training samples come from n cities, during training the samples of one city are treated as one class and the samples of all remaining cities as the other class, yielding n binary classification problems. Each of the n binary problems is modeled with logistic regression, and the outputs of the n binary classifiers are combined by majority voting to obtain the probability that each input sample comes from each city.
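A minimal one-vs-rest sketch in plain Python is given below. It is an illustration under stated assumptions, not the disclosed implementation: each binary classifier is fitted by batch gradient descent with an assumed learning rate and epoch count, and the n binary outputs are combined by normalizing them to sum to 1, which is a common choice assumed here in place of the combination rule named in the text. All names and data are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias) of one binary classifier


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def score(model: Model, x: Sequence[float]) -> float:
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_binary(X: List[List[float]], y: List[float],
                 epochs: int = 2000, lr: float = 0.5) -> Model:
    """Fit one logistic regression by batch gradient descent (assumed settings)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errors = [score((w, b), xi) - yi for xi, yi in zip(X, y)]
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errors, X)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errors) / n
    return w, b


def train_ovr(X: List[List[float]], cities: List[str]) -> Dict[str, Model]:
    """One binary problem per city: that city versus all remaining cities."""
    return {c: train_binary(X, [1.0 if ci == c else 0.0 for ci in cities])
            for c in sorted(set(cities))}


def predict_proba(models: Dict[str, Model], x: Sequence[float]) -> Dict[str, float]:
    """Normalize the n binary outputs so that they sum to 1 (assumed rule)."""
    raw = {c: score(m, x) for c, m in models.items()}
    total = sum(raw.values())
    return {c: p / total for c, p in raw.items()}


# Toy graded abundance vectors for three characteristic strains.
X = [[1, -1, -1], [1, -1, 0], [-1, 1, -1], [0, 1, -1], [-1, -1, 1], [-1, 0, 1]]
cities = ["A", "A", "B", "B", "C", "C"]
models = train_ovr(X, cities)
probs = predict_proba(models, [1, -1, -1])
best = max(probs, key=probs.get)
```

The dictionary `probs` plays the role of the per-city probabilities that the later geolocation step consumes.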

Step 2, geolocating the unknown sample

Step 2.1, data preprocessing of the unknown sample

The metagenomic sequencing data of the unknown sample are preprocessed, including quality control, abundance quantification, and abundance grading, to obtain the graded ternary abundance values of the unknown sample for the selected characteristic strains. The procedure is the same as in step 1.1 and is not repeated here.

Step 2.2, feature selection

From the set of all strains in the unknown sample, a subset of strains with discriminative power is selected as characteristic strains. The procedure is the same as in step 1.2 and is not repeated here.

Step 2.3, predicting the probability of the unknown sample for each training-set city

The graded ternary abundance values of all characteristic strains of the unknown sample are input into the prediction model to obtain the probabilities $z_i$ ($i = 1, 2, \dots, n$) that the sample comes from each of the n cities. The prediction for the unknown sample on city $i$ is:

$$z_i = \frac{1}{1 + e^{-(w^{T} x + \theta)}}$$

where $z_i$ is the model-predicted probability of the unknown sample for city $i$, $x$ is the input feature vector of the model, and $w$ and $\theta$ are the parameters learned by the model during training.

Step 2.4, geolocation

If it has been established by other technical means that the unknown sample comes from one of the n cities, the city with the highest predicted probability can simply be taken as the source city of the unknown sample.

Since the predictor can only output the probabilities of the unknown sample for the n cities in the training set, it cannot predict cities that have no training samples. Therefore, to estimate the probability of the unknown sample for cities outside the training set, a numerical interpolation method is used: the probabilities for the training-set cities are interpolated over the map to obtain the probabilities for cities outside the training set.

If it cannot be determined whether the unknown sample comes from one of the n training-set cities, then every city on the map is a candidate target city. For a target city outside the training set, the probability of the unknown sample is computed as follows. Suppose the coordinates of the n training-set cities in the designated coordinate system are $(x_i, y_i)$ ($i = 1, 2, \dots, n$), and the probabilities of the unknown sample for these cities are $z_i$ ($i = 1, 2, \dots, n$), respectively.

Let the coordinates of the target city in the same designated coordinate system be $(x, y)$.

The probability that the unknown sample comes from the target city is obtained by Kriging interpolation (reference: Jean-Paul Chiles and Nicolas Desassis, 'Fifty Years of Kriging', Handbook of Mathematical Geosciences, pp 589-):

$$\hat{z}(x, y) = \sum_{i=1}^{n} \lambda_i z_i$$

Here $\hat{z}(x, y)$ is the estimated probability that the unknown sample comes from the target city $(x, y)$, $\lambda_i$ is the weight coefficient of city $i$, and $z_i$ is the probability that the sample comes from city $i$. In Kriging interpolation, the weight coefficients are the optimal coefficients that minimize the difference between the estimate $\hat{z}$ at the point $(x, y)$ and the true value $z$, i.e.

$$\min_{\lambda_1, \dots, \lambda_n} \operatorname{Var}(\hat{z} - z)$$

The constraint is:

$$\sum_{i=1}^{n} \lambda_i = 1$$

so that the condition of unbiased estimation is satisfied, i.e.

$$E(\hat{z} - z) = 0$$

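The specific method is briefly described as follows. The optimization objective of the Kriging method is:

$$J = \operatorname{Var}(\hat{z} - z)$$

namely:

$$J = \operatorname{Var}\Big(\sum_{i=1}^{n} \lambda_i z_i - z\Big) = \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j \operatorname{Cov}(z_i, z_j) - 2 \sum_{i=1}^{n} \lambda_i \operatorname{Cov}(z_i, z) + \operatorname{Cov}(z, z)$$

To simplify the formula, we define

$$C_{ij} = \operatorname{Cov}(z_i, z_j), \qquad C_{i0} = \operatorname{Cov}(z_i, z), \qquad C_{00} = \operatorname{Cov}(z, z)$$

where the index 0 denotes the target point $(x, y)$, thereby simplifying the optimization objective to:

$$J = \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j C_{ij} - 2 \sum_{i=1}^{n} \lambda_i C_{i0} + C_{00}$$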

Define the semivariance function $r_{ij} = \sigma^2 - C_{ij}$, where $\sigma^2$ denotes the variance. Under the Kriging assumption, $z$ is spatially homogeneous, i.e. $z$ has the same expectation $e$ and variance $\sigma^2$ at any point $(x, y)$ in space. Using the semivariance function, the optimal solution of the optimization objective can be written in the following form:

$$\sum_{j=1}^{n} r_{ij} \lambda_j - \phi = r_{i0} \quad (i = 1, 2, \dots, n), \qquad \sum_{j=1}^{n} \lambda_j = 1$$

where $\phi$ is the Lagrange multiplier and $r_{i0}$ is the semivariance between city $i$ and the target point $(x, y)$. Converting the above system of equations into matrix form:

$$\begin{pmatrix} r_{11} & \cdots & r_{1n} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ r_{n1} & \cdots & r_{nn} & 1 \\ 1 & \cdots & 1 & 0 \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_n \\ -\phi \end{pmatrix} = \begin{pmatrix} r_{10} \\ \vdots \\ r_{n0} \\ 1 \end{pmatrix}$$

For $r_{ij}$, Kriging interpolation assumes that two points that are spatially close have similar properties, i.e. there is a functional relationship between $r_{ij}$ and the distance $d_{ij}$ between the two points $(i, j)$. We fit an optimal curve to the distances $d_{ij}$ and semivariances $r_{ij}$ of the point pairs $i, j$ to describe the relation between $d$ and $r$, obtaining the function:

$$r = r(d)$$

With this fitted function, the semivariance $r_{ij}$ of any two points can be computed from their distance, and inverting the matrix then yields the optimal Kriging interpolation coefficients. From the computed Kriging interpolation coefficients, the probability of the sample for a city outside the training set can be calculated.
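The procedure above (build the semivariance matrix, solve the linear system, then form the weighted sum) can be sketched in plain Python. The sketch assumes a linear semivariance model r(d) = d as a placeholder for the fitted curve, treats the last unknown as the Lagrange multiplier up to sign convention, and uses hypothetical function names.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular Kriging system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def kriging_estimate(
    points: Sequence[Point],
    values: Sequence[float],
    target: Point,
    variogram: Callable[[float], float] = lambda d: d,  # assumed placeholder model
) -> Tuple[float, List[float]]:
    """Return the ordinary-Kriging estimate at target and the weights used."""
    n = len(points)
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    # Semivariances between known points, bordered by the sum-to-one constraint.
    matrix = [[variogram(dist(points[i], points[j])) for j in range(n)] + [1.0]
              for i in range(n)]
    matrix.append([1.0] * n + [0.0])
    rhs = [variogram(dist(p, target)) for p in points] + [1.0]
    weights = solve_linear(matrix, rhs)[:n]  # drop the multiplier
    return sum(w * z for w, z in zip(weights, values)), weights


# Two training cities with known probabilities; the target lies midway between them.
estimate, weights = kriging_estimate([(0.0, 0.0), (2.0, 0.0)], [0.0, 1.0], (1.0, 0.0))
```

Because the weights are only constrained to sum to 1, an interpolated value is not guaranteed to lie in [0, 1]; it is best read as a relative score when ranking candidate cities.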

The designated coordinate system may be the geographic coordinate system. However, although cities that are spatially close tend to have more similar microbial communities, spatial distance does not fully reflect the community similarity between cities. For example, city Ce on the east coast and city Cw on the west coast are separated by an entire continent in geographic distance. Yet both lie on the coastline and may have similar geographic environments, so the similarity between these two cities may be higher than that between either of them and its neighboring inland cities. Therefore, we convert geographic coordinates into biological coordinates, replace geographic distance with biological distance, and use the biological coordinate system as the designated coordinate system for the Kriging-based probability prediction described above.

The graded ternary abundance values of all characteristic strains of the training samples are taken as input and reduced in dimension by the manifold learning method t-SNE (t-distributed stochastic neighbor embedding), thereby obtaining two-dimensional coordinates for each sample in the training set. For each city in the training set, the coordinates of the center point of the two-dimensional coordinates of all training samples from that city are computed and taken as the biological coordinates of the city. With the longitude and latitude of each city as its geographic coordinates, the geographic coordinates of the training-set cities are converted into the corresponding biological coordinates through an affine transformation. The geographic coordinates of cities outside the training set are likewise converted into biological coordinates through the affine transformation, as shown in FIG. 4, so that the biological coordinates of all cities are obtained.
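How the affine transformation might be obtained is sketched below: a least-squares fit from the geographic coordinates of the training-set cities to their biological coordinates, which is then applied to any other city. The text does not say how the transformation is estimated, so the least-squares fit, the treatment of longitude and latitude as planar coordinates, and all names are assumptions; the t-SNE step is not reproduced and the coordinates are toy values.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Center point of the 2-D coordinates of one city's samples."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("cities are collinear; the affine fit is undetermined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_affine(geo: Sequence[Point], bio: Sequence[Point]) -> List[List[float]]:
    """Least-squares fit of u = a*x + b*y + c and v = d*x + e*y + f."""
    rows = [(x, y, 1.0) for x, y in geo]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    coeffs = []
    for k in range(2):  # k = 0 fits u, k = 1 fits v
        rhs = [sum(r[i] * t[k] for r, t in zip(rows, bio)) for i in range(3)]
        coeffs.append(solve_linear(gram, rhs))
    return coeffs


def apply_affine(coeffs: List[List[float]], p: Point) -> Point:
    x, y = p
    return (coeffs[0][0] * x + coeffs[0][1] * y + coeffs[0][2],
            coeffs[1][0] * x + coeffs[1][1] * y + coeffs[1][2])


# Toy data: three training cities with geographic and biological coordinates.
geo_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
bio_train = [(1.0, 2.0), (3.0, 2.0), (1.0, 5.0)]
transform = fit_affine(geo_train, bio_train)
bio_new_city = apply_affine(transform, (2.0, 2.0))  # a city outside the training set
```

With more than three training cities the fit is no longer exact, so a city's transformed geographic coordinates will generally differ somewhat from its sample centroid.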

Finally, starting from the probability values of the sample for the source cities of the training samples in the biological coordinate system, probabilities are estimated for all cities by the Kriging interpolation method described above, and the city at the point where the sample's probability is highest is taken as the predicted city of the unknown sample.

The probability values for the source cities of the training samples are obtained by the prediction in step 2, and the probability values for the candidate cities to be examined are obtained by Kriging interpolation. The biological coordinates of the point with the highest probability are thereby obtained, which gives the city of the unknown sample. In this example, we downloaded the longitude and latitude of all cities from an online database (https://simplemaps.com/data/world-cities) and treated every city on the map as a candidate city to be examined. In practical applications, a list of plausible source cities can instead be used as the candidate cities.

The method predicts the source city of an unknown sample from its microbial metagenome. During data processing of both the training samples and the unknown sample, the abundance of each strain is graded: the double-precision abundance values are converted into discrete multi-level values by means of a plurality of thresholds. Grading is a quantization method that converts continuous values into discrete values, retaining significant differences between the abundance values of different strains while ignoring minor ones. Grading thus acts as denoising and improves the stability and robustness of the algorithm. Compared with existing geolocation methods, the invention achieves high prediction accuracy.

In addition, a designated coordinate system is defined in which both the training-set cities and the cities outside the training set are represented by coordinates. Probabilities are then computed by interpolation for all cities in the designated coordinate system, and the city with the highest probability is taken as the source city of the unknown sample; this city need not belong to the set of source cities of the training samples. In other words, the method can predict not only unknown samples that come from the source cities of the training samples but also unknown samples that come from other cities, which further improves the accuracy of geographic prediction for unknown samples.

The above description is only exemplary of the present invention and is not intended to limit the technical scope of the present invention, so that any minor modifications, equivalent changes and modifications made to the above exemplary embodiments according to the technical spirit of the present invention are within the technical scope of the present invention.
