Analysis method of spatial transcriptome sequencing data

文档序号:872150 发布日期:2021-03-19 浏览:3次 中文

阅读说明:本技术 一种空间转录组测序数据的分析方法 (Analysis method of spatial transcriptome sequencing data ) 是由 周煌凯 高川 陈飞钦 艾鹏 张秋雪 于 2020-12-21 设计创作,主要内容包括:本发明属于生物技术和生物信息技术领域,具体涉及一种空间转录组测序数据的分析方法。本发明提供的测序数据的分析方法,主要包括样本检测建库和测序数据生物信息分析等过程。本发明提供了全面、系统的空间转录组测序建库和数据生物信息分析方法,利用Spot捕获技术,将基因表达信息映射到组织切片中,真实还原基因的空间分布状态,让异质性研究更为清晰,有效降低了新兴组学空间转录组数据处理的复杂度。(The invention belongs to the technical field of biotechnology and biological information, and particularly relates to an analysis method of space transcriptome sequencing data. The method for analyzing the sequencing data mainly comprises the processes of sample detection and library establishment, sequencing data biological information analysis and the like. The invention provides a comprehensive and systematic space transcriptome sequencing database building and data biological information analysis method, gene expression information is mapped into tissue slices by using a Spot capture technology, the space distribution state of genes is really reduced, the heterogeneity research is clearer, and the complexity of the processing of the space transcriptome data of emerging omics is effectively reduced.)

1. A method for analyzing spatial transcriptome sequencing data, which is characterized by comprising the following processes:

s1, sample detection and library building;

and S2, biological information analysis of the sequencing data.

2. The method for analyzing spatial transcriptome data bioinformation according to claim 1, wherein the step S1 comprises the following processes:

(1) sample preparation: quickly freezing tissues, embedding the tissues by using an OCT embedding medium, cutting a section with the length and the width of 10mm and the thickness of 10 mu m from a tissue block by using a frozen section technology under a low-temperature environment, respectively attaching the section to a tissue optimization chip or a library preparation chip, then fixing the section by using precooled methanol, dyeing the section by using a hematoxylin-eosin dyeing method, and recording the tissue morphology and the spot position of a tissue dyeing result by using a bright field microscope;

(2) organization optimization: optimizing the tissue permeabilization time by using a tissue optimization chip, firstly, setting gradient permeabilization time and adding tissue permeabilization liquid to the chip which finishes bright field imaging; then, carrying out fluorescence reverse transcription on the chip after the permeabilization is finished, and carrying out reverse transcription on the mRNA successfully captured into cDNA with a fluorescence label; after the reverse transcription is finished, removing tissues on the chip; finally, imaging the chip by using a fluorescence microscope, and selecting the optimal tissue permeabilization time for library construction according to the brightness, the diffusion degree and the tissue permeabilization time of the fluorescence imaging result;

(3) library preparation: preparing a space transcriptome library by using a library preparation chip, performing tissue permeabilization on the chip which finishes brightfield imaging according to the optimal tissue permeabilization time, releasing tissue mRNA to be combined with a capture sequence on the chip, performing reverse transcription to synthesize a cDNA fragment and marking a space position, performing PCR amplification to synthesize a cDNA double strand, performing cDNA denaturation in an alkaline environment to release the cDNA double strand to a liquid free environment, further amplifying and synthesizing a large amount of cDNA by using the cDNA double strand as a template, cutting the cDNA enzyme digestion into 200-plus-300 bp fragments, and adding a P5 joint, an i5 sample index, a read2, an i7 sample index and a P7 joint to construct a standard sequencing library;

(4) library sequencing: performing high-throughput sequencing on the built library by using a double-end sequencing mode of an Illumina sequencing platform, wherein at a Read 1 end, 16bp of barcode information and 12bp of UMI information are contained for determining the position of a transcript and quantifying the expression quantity; at the Read2 end, a cDNA fragment is included for reference genomic alignment to determine the gene corresponding to the mRNA.

3. The method for analyzing spatial transcriptome data bioinformation according to claim 1, wherein the step S2 comprises the following processes:

(1) sequencing data quality control and gene expression quantity quantification;

(2) clustering and clustering analysis of spots;

(3) analyzing subgroup characteristic genes;

(4) and (4) analyzing spatial characteristic genes.

4. The biological information analysis method according to claim 3, wherein the process (1) in step S2 is specifically operated as follows:

basic quality control of data sequencing: inputting sequencing original data into a space anger, comparing the positions of spots to obtain the spots covered under the tissue slices as effective spots, obtaining barcode information corresponding to the effective spots by combining the serial numbers of the chips, and evaluating the reads quality, sequencing saturation and effective barcode of the sequencing data based on the effective barcode information;

comparing reference genome and quantifying genome expression: comparing reads with a reference genome in a space anger, performing statistical evaluation on parameters such as the number of genes, the number of effective reads and the like of all spots, and obtaining the expression abundance information of different genes in different spots according to the genome comparison result, the UMI count and the barcode sequence information.

5. The biological information analysis method according to claim 3, wherein the process (2) in step S2 is specifically operated as follows:

data normalization: carrying out homogenization treatment on the gene expression quantity by using an SCT ransform algorithm;

analyzing main components: PCA analysis is carried out by utilizing the normalized expression quantity, so that the influence of background noise on clustering and dimension reduction is reduced;

③ spot clustering: clustering and clustering spots by using a graph theory-based clustering method of Seurat;

and tSNE visualization: based on the classification result of the spot subgroup, further using a tSNE nonlinear dimension reduction method to project all spots into a two-dimensional plane for visualization.

6. The biological information analysis method according to claim 3, wherein the specific operation of the process (3) in step S2 is as follows:

analysis of differential expression of subgroups: performing subgroup up-regulation gene analysis on different spot subgroups respectively by using MAST (mask assay) of Seurat, and screening significant up-regulation genes;

subgroup difference gene function analysis: performing KEGG enrichment analysis, DO enrichment analysis, GO enrichment analysis, Reactome enrichment analysis and other analyses on the screened subset up-regulated genes;

thirdly, subgroup characteristic gene screening and analysis: further selecting 20 genes with the highest up-regulation multiple of the expression quantity of each subgroup as subgroup characteristic genes, and displaying the gene distribution characteristics by using a heat map, a tissue map and a bubble map;

fourthly, analyzing the protein interaction network of the characteristic genes of the subgroup: and obtaining protein interaction information through a string database, establishing a protein interaction network among the characteristic genes of the sub-groups through the information, and further knowing the potential regulation and control relation among different spot sub-groups.

7. The biological information analysis method according to claim 3, wherein the specific operation of the process (4) in step S2 is as follows:

spatial difference gene analysis: detecting a high-mutation gene by a Markvariogram algorithm of Seurat, detecting the gene significance by a mark-segregation hypothesis test of trendseek, performing multiple detection and correction on a significance P value by a BH method, and screening a space difference gene by specific conditions;

functional analysis of spatially differentiated genes: performing KEGG enrichment analysis, DO enrichment analysis, GO enrichment analysis, Reactome enrichment analysis and other analyses on the screened spatial difference genes;

screening and analyzing spatial characteristic genes: further selecting 200 genes with the minimum P value as space characteristic genes, and displaying the gene distribution characteristics by using a heat map, a tissue map and a bubble map;

fourthly, the protein interaction network analysis of the spatial characteristic gene: and obtaining protein interaction information through a string database, establishing a protein interaction network among spatial characteristic genes through the information, and further understanding potential regulation and control relations among different tissue regions.

Technical Field

The invention belongs to the technical field of biotechnology and biological information, and particularly relates to an analysis method of space transcriptome sequencing data.

Background

With the technical progress of the biological industry, the high-throughput sequencing technology plays an important role in research in various fields, but with the research depth of various fields, some problems cannot be solved by the traditional high-throughput sequencing technology. The method is characterized in that the ex-vivo tissues or cells are collected by the traditional omics technology for high-throughput sequencing, and the technology naturally causes the irreducibility of the cells in the original positions of the tissues in the process of library construction, so that the microenvironment and the cell relation of the tissues cannot be restored.

For tissue sequencing, (1) the microenvironment of cells is difficult to capture, and unlike the study of large environment, the volume of cells is very small, and the microenvironment of cells is an organic whole constructed by the correlation among cells. Although single cell sequencing technology matures rapidly in recent years, the single cell transcriptome still cannot record cell-cell association information (the preparation of single cell suspension destroys the original position information of cells), and the ordered integrity of microenvironment is lost. Therefore, how to effectively obtain the effective whole is always one of the difficulties of relevant research; (2) the research means of the cell microenvironment is limited, the method of the single cell transcriptome space sequence analysis appears, and the research of the cell microenvironment is really promoted to a great extent by combining the FISH verification. However, the data obtained by such spatial sequence analysis is relatively inefficient.

The cellular microenvironment plays a crucial role in promoting the research of molecular mechanisms, for example, the tumor microenvironment has important influence on the occurrence, the process and the drug resistance of tumors, the gradually freezing disease also considers that the interaction between motor nerves and glia is blocked as an important disease cause, and the crop root system development is also influenced by the microenvironment constructed by endogenous symbiotic bacteria, so that an effective means for quickly excavating the molecular characteristics of the cellular microenvironment is urgently needed.

For the analysis of spatial transcriptome data, the complexity of tissue sequencing, the large amount of data and the existence of spatial heterogeneity make the data analysis difficult. The space transcriptome is used as a new technology, the technology covers multiple fields of tumors, development, immunity, diseases and the like, relates to multiple tissues of tumors, lymph, brains, hearts and the like, can be applied to animals and plants, and has extremely wide application range and value. In view of the requirements of the space transcriptome library construction and data analysis, no technical disclosure of methods for space transcriptome sequencing and data biological information analysis exists at present, so that the difficulties are solved, and the breakthrough of scientific research problems in related fields and the popularization of application are limited.

Disclosure of Invention

Aiming at the defects generally existing in the prior art, the invention creatively provides a method for analyzing spatial transcriptome sequencing data. The method provided by the invention truly restores the spatial distribution state of the genes, makes the heterogeneity research clearer, and effectively reduces the complexity of processing the spatial transcriptome data of the emerging omics.

In order to achieve the purpose, the invention adopts the technical scheme that:

a method for analyzing spatial transcriptome sequencing data, comprising the steps of:

s1, sample detection and library building;

and S2, biological information analysis of the sequencing data.

Preferably, step S1 includes the following process:

(1) sample preparation: quickly freezing tissues, embedding the tissues by using an OCT embedding medium, cutting a section with the length and width of 10mm and the thickness of 10 mu m from a tissue block by using a frozen section technology under a low-temperature environment (generally, the temperature is-10 ℃, the tissues are different, and the upper part and the lower part of the tissue slightly float), respectively attaching the section to a tissue optimization chip or a library preparation chip, then fixing the section by using precooled methanol, dyeing the section by using a hematoxylin-eosin dyeing method, and recording the tissue morphology and the spot position of a tissue dyeing result by using a bright field microscope;

(2) organization optimization: optimizing the tissue permeabilization time by using a tissue optimization chip, firstly, setting gradient permeabilization time and adding tissue permeabilization liquid to the chip which finishes bright field imaging; then, carrying out fluorescence reverse transcription on the chip after the permeabilization is finished, and carrying out reverse transcription on the mRNA successfully captured into cDNA with a fluorescence label; after the reverse transcription is finished, removing tissues on the chip; finally, imaging the chip by using a fluorescence microscope, and selecting the optimal tissue permeabilization time for library construction according to the brightness, the diffusion degree and the tissue permeabilization time of the fluorescence imaging result;

(3) library preparation: preparing a space transcriptome library by using a library preparation chip, performing tissue permeabilization on the chip which finishes brightfield imaging according to the optimal tissue permeabilization time, releasing tissue mRNA to be combined with a capture sequence on the chip, performing reverse transcription to synthesize a cDNA fragment and marking a space position, performing PCR amplification to synthesize a cDNA double strand, performing cDNA denaturation in an alkaline environment to release the cDNA double strand to a liquid free environment, further amplifying and synthesizing a large amount of cDNA by using the cDNA double strand as a template, cutting the cDNA enzyme digestion into 200-plus-300 bp fragments, and adding a P5 joint, an i5 sample index, a read2, an i7 sample index and a P7 joint to construct a standard sequencing library;

(4) library sequencing: performing high-throughput sequencing on the built library by using a double-end sequencing mode of an Illumina sequencing platform, wherein at a Read 1 end, 16bp of barcode information and 12bp of UMI information are contained for determining the position of a transcript and quantifying the expression quantity; at the Read2 end, a cDNA fragment is included for reference genomic alignment to determine the gene corresponding to the mRNA.

Preferably, step S2 includes the following process:

(1) sequencing data quality control and gene expression quantity quantification;

(2) clustering and clustering analysis of spots;

(3) analyzing subgroup characteristic genes;

(4) and (4) analyzing spatial characteristic genes.

Preferably, the specific operation of the process (1) in step S2 is as follows:

basic quality control of data sequencing: inputting sequencing original data into a space anger, comparing the positions of spots to obtain the spots covered under the tissue slices as effective spots, obtaining barcode information corresponding to the effective spots by combining the serial numbers of the chips, and evaluating the reads quality, sequencing saturation and effective barcode of the sequencing data based on the effective barcode information;

comparing reference genome and quantifying genome expression: comparing reads with a reference genome in a space anger, performing statistical evaluation on parameters such as the number of genes, the number of effective reads and the like of all spots, and obtaining the expression abundance information of different genes in different spots according to the genome comparison result, the UMI count and the barcode sequence information.

Preferably, the specific operation of the process (2) in step S2 is as follows:

data normalization: carrying out homogenization treatment on the gene expression quantity by using an SCT ransform algorithm;

analyzing main components: PCA analysis is carried out by utilizing the normalized expression quantity, so that the influence of background noise on clustering and dimension reduction is reduced;

③ spot clustering: clustering and clustering spots by using a graph theory-based clustering method of Seurat;

and tSNE visualization: based on the classification result of the spot subgroup, further using a tSNE nonlinear dimension reduction method to project all spots into a two-dimensional plane for visualization.

Preferably, the specific operation procedure of the process (3) in step S2 is as follows:

analysis of differential expression of subgroups: performing subgroup up-regulation gene analysis on different spot subgroups respectively by using MAST (mask assay) of Seurat, and screening significant up-regulation genes;

subgroup difference gene function analysis: performing KEGG enrichment analysis, DO enrichment analysis, GO enrichment analysis, Reactome enrichment analysis and other analyses on the screened subset up-regulated genes;

thirdly, subgroup characteristic gene screening and analysis: further selecting 20 genes with the highest up-regulation multiple of the expression quantity of each subgroup as subgroup characteristic genes, and displaying the gene distribution characteristics by using a heat map, a tissue map and a bubble map;

fourthly, analyzing the protein interaction network of the characteristic genes of the subgroup: and obtaining protein interaction information through a string database, establishing a protein interaction network among the characteristic genes of the sub-groups through the information, and further knowing the potential regulation and control relation among different spot sub-groups.

Preferably, the specific operation procedure of the process (4) in step S2 is as follows:

spatial difference gene analysis: detecting a high-mutation gene by a Markvariogram algorithm of Seurat, detecting the gene significance by a mark-segregation hypothesis test of trendseek, performing multiple detection and correction on a significance P value by a BH method, and screening a space difference gene by specific conditions;

functional analysis of spatially differentiated genes: performing KEGG enrichment analysis, DO enrichment analysis, GO enrichment analysis, Reactome enrichment analysis and other analyses on the screened spatial difference genes;

screening and analyzing spatial characteristic genes: further selecting 200 genes with the minimum P value as space characteristic genes, and displaying the gene distribution characteristics by using a heat map, a tissue map and a bubble map;

fourthly, the protein interaction network analysis of the spatial characteristic gene: and obtaining protein interaction information through a string database, establishing a protein interaction network among spatial characteristic genes through the information, and further understanding potential regulation and control relations among different tissue regions.

Compared with the prior art, the biological information analysis method provided by the invention has the following advantages:

(1) according to the method provided by the invention, the gene expression information is mapped into the tissue slice by using the Spot capture technology, the spatial distribution state of the gene is really reduced, the heterogeneity research is clearer, and the method is comprehensive and systematic method for establishing the spatial transcriptome sequencing library and analyzing the data biological information;

(2) the method provided by the invention can effectively connect multiple information such as genes, cells, tissue space positions, sample types and the like in series, the analysis result is more comprehensive, and the complexity of processing the emerging omics space transcriptome data is reduced.

(3) The method creatively increases the analysis of the spatial characteristic genes in the aspect of data analysis and increases the excavation depth of the key genes.

Description of the drawings:

FIG. 1 is a flow chart of a spatial transcriptome experiment;

FIG. 2 is a flow chart of experimental data analysis.

Detailed Description

The present invention is further explained with reference to the following specific examples, but it should be noted that the following examples are only illustrative of the present invention and should not be construed as limiting the present invention, and all technical solutions similar or equivalent to the present invention are within the scope of the present invention. Unless otherwise specified, the technical means used in the examples are conventional means well known to those skilled in the art, and the raw materials used are commercially available products.

The Tissue permeabilization solution and the fluorescent label are both selected from a kit, the kit name is Visium Spatial Gene Expression Reagent Kits-Tissue Optimization from 10X genetics, available from Shanghai Kernel Biotech Co., Ltd, cat # PN-1000193; the PCR amplification process during library construction adopts kit amplification, the kit name is Visium Spatial Gene Expression Reagent kit from 10X genomics and can be purchased from Shanghai Kernel Biotechnology Co., Ltd, the product number is PN-1000184, and the amplification program is 98 ℃ for 3min, (98 ℃ for 5s, 63 ℃ for 30s) for 25 cycles.

Example a method for analyzing spatial transcriptome sequencing data

The method for analyzing the biological information of the spatial transcriptome data comprises the following steps of:

s1, sample detection and library building:

(1) sample preparation

After the tissue was snap frozen, the tissue was embedded using OCT embedding medium. In a low temperature environment, a frozen section technique is used to cut a section of 10mm long and wide, 10 μm thick from a tissue block and attach the section to a tissue-optimized chip or a library preparation chip, respectively. The sections were then fixed using pre-chilled methanol and stained by hematoxylin-eosin (HE) staining. Histological staining results tissue morphology and spot location were recorded by bright field microscopy.

Sample preparation is the starting point of the whole technique and the difficulty is freezing the sections. The single spot diameter of the chip of the space transcriptome is 55 μm, and does not reach a real single cell level, and the real cell number captured by a single point is determined by the slice thickness, the slice flatness and the cell diameter.

The cell sizes of different tissues are different. In general tissue slices, a slice thickness of 2 to 3 μm can be selected for an extremely single layer of cells. However, in the spatial transcriptome, 2-3 μm means a layer of disrupted cells, which is not suitable for downstream experiments. Therefore, the thickness of the slice should also be adjusted to the cell size.

The lipid and protein contents of different tissues are different, resulting in differences in hardness and ductility at the same freezing temperature. For a piece of tissue, the temperature is too low, the brittleness of the tissue can be increased, and the obtained section can have larger or smaller cracks; when the temperature is too high, the ductility of the tissue is reduced, and the section is easy to wrinkle. Therefore, the slice ambient temperature may become a tissue-related personalization point.

(2) Tissue optimization

Tissue permeabilization time was optimized using a tissue optimization chip. Firstly, setting gradient permeabilization time (generally 5min, 10min, 15min, 20min, 25min and 30min time gradient) and adding tissue permeabilization liquid to a chip which finishes bright field imaging; then, carrying out fluorescence reverse transcription on the chip after the permeabilization is finished, and carrying out reverse transcription on the mRNA successfully captured into cDNA with a fluorescence label; after the reverse transcription is finished, removing tissues on the chip to avoid the influence of the section on a fluorescence signal; and finally, imaging the chip by using a fluorescence microscope, and selecting the optimal tissue permeabilization time for library construction according to the brightness, the diffusion degree and the tissue permeabilization time of the fluorescence imaging result.

Preferably, the optimal tissue permeabilization time is determined by three conditions: (1) the fluorescence signal is strongest, (2) the fluorescence signal is weakest in dispersion, and (3) a longer tissue permeabilization time is selected under the condition that the former two conditions are consistent.

(3) Library preparation

Spatial transcriptome library preparation was performed using a library preparation chip. And carrying out tissue permeabilization on the chip which finishes the bright field imaging according to the optimal tissue permeabilization time, releasing the combination of tissue mRNA and a capture sequence on the chip, carrying out reverse transcription to synthesize a cDNA fragment, marking a space position, and carrying out PCR amplification to synthesize a cDNA double chain. The cDNA is denatured by alkaline environment to release the cDNA double strand to the liquid free environment, and the cDNA double strand is used as template for further amplification to synthesize a large amount of cDNA. The cDNA cleavage was broken into 200-300bp fragments, and the P5 adaptor, i5 sample index, read2, i7 sample index and P7 adaptor were added to construct a standard sequencing library.

(4) Library sequencing

And performing high-throughput sequencing on the built library by using a double-ended sequencing mode of an Illumina sequencing platform. At the Read 1 end, 16bp of barcode information and 12bp of UMI information are contained for determining the position of a transcript and quantifying the expression quantity; at the Read2 end, a cDNA fragment is included for reference genomic alignment to determine the gene corresponding to the mRNA.

S2, biological information analysis of sequencing data

(1) Sequencing data quality control and gene expression quantity quantification: original data obtained by sequencing, a tissue staining picture and a chip serial number are input into a space anger together for data quality control and sequence comparison so as to obtain high-quality sequencing data, the spatial distribution position of the spot in the tissue and the expression quantity condition of the gene in each spot. The method specifically comprises the following steps:

basic quality control of data sequencing

Inputting sequencing original data into a space anger, comparing the positions of the spots to obtain the spots covered under the tissue slices as effective spots, and obtaining barcode information corresponding to the effective spots by combining the serial number of the chip. Evaluating the reads quality, sequencing saturation and effective barcode of the sequencing data based on the effective barcode information.

Space range relies on image processing algorithms to solve two key problems related to slide images: deciding where the tissue is placed and aligning the pattern of calibration points. Which capture points are covered by tissue and thus which barcode will be used for analysis is determined by tissue detection. Alignment of the calibration points is necessary and the Space range knows where each spot is in the image. Alignment of the calibration points first extracts features that "look" like the calibration points and then attempts to align these candidate calibration points with a known pattern of calibration points. The spot extracted from the image will necessarily contain some missing places, e.g. where the tissue covers, and some false positives, e.g. where a piece or tissue feature on the slide might look like a spot. The output of this process is a spot coordinate value table that associates the Visium spot locations with the tissue images. The tissue detection uses a gray-scale tissue image. A plurality of estimates of tissue slice positions are calculated and compared. These estimates are used to train a statistical classifier to label each pixel within the captured region as tissue or background, respectively, to determine which regions belong to tissue.

② reference genome alignment and genome expression quantification

Comparing reads with a reference genome in the space anger, and performing statistical evaluation on parameters such as the number of genes and the number of effective reads of all spots. And obtaining the expression abundance information of different genes in different spots according to the genome comparison result, the UMI count and the barcode sequence information. The matrix can be used for subsequent analysis.

Space Range calls STAR (threaded transitions Alignment to a reference) Alignment software to align Read2 to the reference genome. According to the results of GTF annotation, reads on alignment were classified as reads of exons, introns, or intergenic regions: if at least 50% of a reads are located in an exon region, then the reads belong to exon reads, and so on. If reads align to a single exon region and one/more non-exon regions, then it is preferentially classified as an exon and its MAPQ is set to 255. Space range further aligned exon reads to annotated transcripts: 1. if the reads are compared with the exons of a certain transcript and the directions of the reads are the same, the reads are considered to be compared with the transcripts and are marked as transcript reads; 2. in transcriptome reads, reads are considered to be the only aligned reads if they align to only one gene. Only unique alignment reads were used for UMI counting (quantification of expression).

Space range will then correct UMI sequencing errors. Reads with the same barcode, UMI and gene annotation would be grouped together. If two sets of reads have the same barcode sequence and gene annotation but the UMIs differ by one base, then one of the UMI sequences may have sequencing errors. Then the UMI with the lower number of reads will be modified to the sequence of the set of UMIs with the higher number of reads. Reads with the same barcode, UMI and gene annotation are again grouped into the same group based on the modified UMI sequence. If two or more groups of reads have the same barcode sequence and UMI sequence but different gene annotations, one group of reads with the highest number of reads is used for UMI counting, and the rest groups are discarded; when the reads numbers of the multiple groups of reads are all high, all the reads groups are discharged out of the UMI counting range. After the correction and filtering are completed, the UMI corresponding to each gene of each barcode is removed, and the number of unique UMI is calculated as the expression amount of the gene in the spot.

(2) spots clustering analysis

Each spot in the space transcriptome comprises 1-10 cells, clustering and grouping are carried out on the spots by utilizing the similarity of spot molecular characteristics, and the obtained clustering result of the spots can form correspondence with the tissue type corresponding to the spots so as to assist in space heterogeneity analysis. Subsequently, the space anger is transferred to the search from the obtained expression quantity information and the spot space position information for subsequent analysis.

Carrying out SCTransform homogenization treatment on the expression quantity data; then, PCA (principal components analysis) analysis is carried out by utilizing the homogenized expression quantity, so that the influence of background noise on clustering and dimension reduction is reduced; and finally, clustering and clustering the spots by adopting a clustering algorithm based on graph theory. Based on the classification result of the spot subgroup, all spots are projected into a two-dimensional plane for visualization by further using a tSNE (t-Distributed stored geometrical Neighbor Embedding) nonlinear dimension reduction method. The specific process is as follows:

data normalization

Before spots clustering, the gene expression level is first normalized. And (3) carrying out homogenization treatment on the gene expression quantity by using an SCT ransform algorithm, wherein the algorithm puts the logarithm value of the gene expression quantity into a regression model conforming to negative binomial distribution. In the regression model, the expression level of gene i should conform to the following formula: log (E (x)i))=β01lg10m, wherein xiIs the expression level of gene i in each spot, and m is the sum of UMIs in each spot. In this model, the dispersion coefficient θ and the expression level mean μ of each gene expression level are calculated accordingly. After the model parameters are corrected by using the negative binomial distribution, the original expression quantity is replaced by the mathematical expected value of the expression quantity. The mathematical expectation formula of the expression quantity is as follows: mu.sij=exp(β0i1ilog10mj) The expression quantity normalization calculation formula is as follows:

in both equations, xijM is the actual expression level of gene i in spot jjIs the sum of all UMI numbers in spot j,. mu.ijFor mathematical expectation of the expression level of gene i in spot j, θijIs the coefficient of variation of gene i in spot j.

The purpose of data homogenization is mainly to reduce the influence of the sequencing depth on the gene expression level, and if the data is not normalized, the gene expression level increases with the sequencing depth, and cells with high sequencing depth contribute to the variation degree of the expression level. The traditional conventional log homogenization has strong correction capability on low-abundance genes and poor correction capability on high-abundance genes, so that the expression quantity of the high-abundance genes of log homogenization data is improved along with the improvement of sequencing depth, and the contribution of cells with low sequencing depth to the variation degree of the high-abundance genes is overestimated. The method adopts the data of the SCTrasform normalization, the expression abundances of other genes except a group of genes with the highest abundance cannot change along with the sequencing depth, and the contribution of cells with different sequencing depths to the variation degree of the gene abundance is basically consistent. SCTransform normalization therefore shows more realistic differential gene test results relative to log normalization.

② Principal Component Analysis (PCA)

In order to overcome the noise interference of single gene expression on spatial transcriptome data, Seurat clusters the spots by PCA, and each principal component represents a gene set with similar expression patterns. Determining how many principal components are used by the downstream analysis relates to the background noise and accuracy of the subsequent analysis. To determine how many principal components to use for downstream analysis, an oversampling test was used to examine the extent to which each principal component had an effect on the spatial transcriptome data. And (4) taking the main component of the low P value gene most enriched in the test result as the main component of the spatial transcriptome dataset which is obviously influenced, and using the main component for downstream analysis.

③ Spot clustering

Seurat clusters the spots using a graph theory based approach. Firstly, calculating Euclidean distance between spots based on the previously determined principal components; then, embedding the spot into a KNN (K-nearest neighbor graph) graph based on the Euclidean distance between the spots; next, dividing the KNN graph into highly interconnected quasi-groups according to the Jaccard distances of two spots in local neighborhood; finally, the spot clusters are clustered using the Louvain algorithm. Therefore, all spot clusters in the spatial transcriptome are divided into a plurality of spot subgroups, so that subsequent analysis is facilitated.

tSNE visualization

Seurat uses a nonlinear dimensionality reduction method t-SNE to map high-dimensional space transcription group data into a two-dimensional space, spots with similar expression modes are gathered together, the spots with different expression modes are separated farther, and the difference between the spots is displayed more visually.

(3) Differential gene analysis

Sub-group differential expression analysis

Separately, subpopulation up-regulated gene Analysis was performed on different spot subpopulations using the MAST test (Model-based Analysis of Single cell transversimics) by senuat, and significant up-regulated genes were screened according to the following thresholds:

a. the natural logarithm of the gene up-regulation factor logFC is more than or equal to 0.25, namely the gene up-regulation factor is more than or equal to e0.25≈1.28;

b.P value is less than or equal to 0.01;

c. in the target subpopulation, more than 25% of the spots express the target gene;

MAST test has the advantages of being more suitable for complex experimental design, fitting the cell proportion of the expressed gene and reducing the influence of sequencing errors.

② subgroup differential gene function analysis

And performing KEGG enrichment analysis, DO enrichment analysis, GO enrichment analysis, Reactome enrichment analysis and other analysis on the screened subset up-regulated genes.

③ subgroup characteristic gene screening and analyzing

Further selecting 20 genes with the highest up-regulation fold of the expression level of each subgroup as subgroup characteristic genes, and displaying the gene distribution characteristics by using a heat map, a tissue map and a bubble map.

Heat map: the gene expression level was normalized by the SCTransform method, and then the normalized gene expression level was used for heat map mapping. Thus, the appearance of extreme values in the heatmap is avoided, so that the expression characteristics of the genes in different subgroups are more obvious.

Organizing a map: processing the gene expression quantity by using an SCTransform homogenization method to reduce the expression difference of the gene among different spots; the normalized gene expression level is then used for drawing a tissue map. Therefore, the influence of the extreme value on data distribution is reduced under the condition of keeping a certain original expression quantity characteristic, and the expression characteristic of the gene in different tissue regions is convenient to observe.

Bubble diagram: averaging the normalized expression quantity of each gene in each cell subset without taking log values, carrying out SCTransform normalization on the average expression quantity of the gene among different subsets based on the average expression quantity, and drawing the color shade of the bubbles in a bubble graph by using the normalized gene expression quantity. In addition, the proportion of cells expressing the gene of interest in the subpopulation was counted and used to plot the size of the bubbles in the bubble map. The bubble chart shows both the average expression level of the gene in a cell subpopulation and the proportion of cells expressing the gene of interest.

Protein interaction network analysis of subgroup characteristic gene

Protein-protein interaction information is obtained through string databases (https:// string-db.org /). Through the information, a protein interaction network among the characteristic genes of the sub-groups is established, and the potential regulation and control relation existing among different spot sub-groups is further understood.

As described in the background, the data information of spatial transcriptome is enormous, and each spot carries spatial position information in addition to gene abundance information. In the basic analysis, the analysis method of clustering the spots does not fully utilize the spatial position information. Therefore, it is necessary to further perform an analysis method to sufficiently mine characteristic gene information by using spatial position information as much as possible.

(4) Spatial signature gene analysis

(ii) analysis of spatially differentiated genes

In order to deeply explore the space position-dependent gene-space difference genes of the expression quantity, firstly, a high-mutation gene is tested by a markvariogram algorithm of Seurat, then the gene significance is tested by a mark-segregation hypothesis test of trendseek, and finally, the significance P value is subjected to multiple test correction by a BH method. Spatially distinct genes were screened by the following conditions:

a.r.metric.5≤0.8

b.P value is less than or equal to 0.01

The markvariogram algorithm adopted by the invention has the advantages that the spot clustering result and the tissue region identification result are not depended on, the variant genes related to the space position can be more fully excavated, and meanwhile, the significance test can be carried out.

Analysis of spatially differentiated Gene function

And carrying out KEGG enrichment analysis, DO enrichment analysis, GO enrichment analysis, Reactome enrichment analysis and other analyses on the screened spatial difference genes.

Screening and analyzing spatial characteristic gene

Further selecting 200 genes with the minimum P value as spatial characteristic genes, and displaying the gene distribution characteristics by using a heat map, a tissue map and a bubble map. Specifically, the method comprises the following steps:

heat map: the gene expression level was normalized by the SCTransform method, and then the normalized gene expression level was used for heat map mapping. Thus, the appearance of extreme values in the heatmap is avoided, so that the expression characteristics of the genes in different subgroups are more obvious. However, for spatially characterized genes, the distribution does not depend entirely on the spot subpopulations, so the heat map shows only the distribution characteristics of a portion of the genes.

Organizing a map: processing the gene expression quantity by using an SCTransform homogenization method to reduce the expression difference of the gene among different spots; the normalized gene expression level is then used for drawing a tissue map. Therefore, the influence of the extreme value on data distribution is reduced under the condition of keeping a certain original expression quantity characteristic, and the expression characteristic of the gene in different tissue regions is convenient to observe. Since the spatial signature gene is a gene analyzed in spatial distribution, the tissue map can most intuitively represent the distribution characteristics of the spatial signature gene.

Bubble diagram: the bubble chart shows both the average expression level of the gene in a cell subpopulation and the proportion of cells expressing the gene of interest. For drawing the bubble map, the mean value of the normalized expression quantity of each gene in each cell sub-group without taking the log value is obtained, then SCTransform normalization is carried out on the mean expression quantity of the gene among different sub-groups based on the mean value of the expression quantity, and the color shade of the bubbles in the bubble map is drawn by using the normalized gene expression quantity. In addition, the proportion of cells expressing the gene of interest in the subpopulation was counted and used to plot the size of the bubbles in the bubble map. Similar to the heat map, the bubble map also showed the expression characteristics of only a portion of the spatial signature genes.

Protein interaction network analysis of space characteristic gene

And obtaining the protein-protein interaction information through the string database. Through the information, a protein interaction network among the spatial characteristic genes is established, and potential regulation and control relations among different tissue regions are further known.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting the protection scope of the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

13页详细技术资料下载
上一篇:一种医用注射器针头装配设备
下一篇:一种优化环介导等温扩增反应的方法

网友询问留言

已有0条留言

还没有人留言评论。精彩留言会获得点赞!

精彩留言,会给你点赞!