Method for analyzing cells

Document No.: 914645. Published: 2021-02-26.

Reading notes: This technology, Method for analyzing cells, was created by A·卡维吉安, N·M·普拉吉斯, M·R·雷钦, F·A·沃尔夫, and P·侯赛尼 on 2019-07-16. Abstract: A single-cell transition signature representing differential cellular component expression between a first cellular state and an altered cellular state is accessed. The altered cellular state arises through a cellular transition from the first cellular state to the altered cellular state. The transition signature includes an identification of a plurality of cellular components and, for each such cellular component, a corresponding first significance score that quantifies the association between a change in expression of the respective cellular component and the change in cellular state between the first cellular state and the altered cellular state. The transition signature is compared to a perturbation signature representing differential cellular component expression between undisturbed cells and disturbed cells exposed to a perturbation. The perturbation signature includes, for each respective cellular component, a corresponding second significance score that quantifies the association between (i) the change in expression of the cellular component between the undisturbed cells and the disturbed cells and (ii) the change in cellular state between the undisturbed cells and the disturbed cells.

1. A method for predicting whether a perturbation will affect a cellular transition, the method comprising:

on a computer system comprising memory and one or more processors:

electronically accessing a single-cell transition signature representing a measure of differential cellular component expression between a first cellular state and an altered cellular state, wherein the altered cellular state occurs by the cellular transition from the first cellular state to the altered cellular state, and wherein the single-cell transition signature comprises an identification of a plurality of cellular components and, for each respective cellular component of the plurality of cellular components, a corresponding first significance score quantifying an association between a change in expression of the respective cellular component and a change in cellular state between the first cellular state and the altered cellular state;

accessing, in electronic form, a perturbation signature representing a measure of differential cellular component expression between a plurality of undisturbed cells and a plurality of disturbed cells exposed to the perturbation, wherein the perturbation signature comprises an identification of all or a portion of the plurality of cellular components and, for each respective cellular component in the all or the portion of the plurality of cellular components, a respective second significance score quantifying a correlation between (i) a change in expression of the respective cellular component between the plurality of undisturbed cells and the plurality of disturbed cells and (ii) a change in cellular state between the plurality of undisturbed cells and the plurality of disturbed cells; and

comparing the single-cell transition signature and the perturbation signature to determine whether the perturbation will affect the cellular transition.

2. The method of claim 1, wherein accessing the single-cell transition signature comprises:

determining the single-cell transition signature based on (i) a first plurality of first single-cell cellular component expression datasets and (ii) a second plurality of second single-cell cellular component expression datasets, wherein:

each respective first single-cell cellular component expression dataset of the first plurality of first single-cell cellular component expression datasets is obtained from a corresponding single cell of a first plurality of cells in the first cellular state, and

each respective second single-cell cellular component expression dataset of the second plurality of second single-cell cellular component expression datasets is obtained from a corresponding single cell of a second plurality of cells in the altered cellular state.

3. The method of claim 2, wherein:

each respective dataset of the first plurality of single-cell cellular component expression datasets comprises a corresponding cellular component vector of a first plurality of cellular component vectors,

each respective dataset of the second plurality of single-cell cellular component expression datasets comprises a corresponding cellular component vector of a second plurality of cellular component vectors, and

each respective cellular component vector of the first and second pluralities of cellular component vectors includes a plurality of elements, each respective element of the respective cellular component vector being associated with a respective cellular component of the plurality of cellular components and including a respective value representative of an amount of the respective cellular component in the respective single cell represented by a respective dataset of the first and second pluralities of single-cell cellular component expression datasets.

4. The method of claim 3, further comprising:

performing dimension reduction on the first plurality of single-cell cellular component expression datasets and/or the second plurality of single-cell cellular component expression datasets to generate a plurality of dimension reduced components;

for each respective cell component vector of the first and second pluralities of cell component vectors, applying the plurality of dimension-reduced components to the respective cell component vector to form a corresponding dimension-reduced vector that includes a dimension-reduced component value for each respective dimension-reduced component of the plurality of dimension-reduced components, thereby forming a corresponding first and second plurality of dimension-reduced vectors; and

performing clustering to generate a set of clusters C_j, each cluster containing a plurality of points corresponding to a subset of the first plurality of dimension-reduced vectors and the second plurality of dimension-reduced vectors;

identifying the first plurality of cells from a first cluster of the set of clusters C_j; and

identifying the second plurality of cells from a second cluster of the set of clusters C_j,

the method optionally further comprising performing manifold learning with the corresponding first and second pluralities of dimension-reduced vectors to identify a relative cellular state of each cell of the first and second pluralities of cells relative to each other cell.

5. The method of claim 1, wherein the plurality of undisturbed cells are control cells that have not been exposed to the perturbation, or wherein the undisturbed cells are an average taken over unrelated disturbed cells that have been exposed to the perturbation.

6. The method of claim 1, further comprising:

pruning the single-cell transition signature and the perturbation signature to limit the plurality of cellular components to transcription factors, optionally measured at the RNA level.

7. The method of claim 2, wherein determining the single-cell transition signature comprises:

determining differences in cellular component amounts of the plurality of cellular components between (i) the first plurality of first single-cell cellular component expression datasets and (ii) the second plurality of second single-cell cellular component expression datasets using one of a mean difference test, a Wilcoxon rank sum test, a t test, logistic regression, and a generalized linear model.

8. The method of claim 1, wherein the measure of differential cellular component expression quantifies a difference in cellular component amount between (i) a third plurality of third single-cell cellular component expression datasets and (ii) a fourth plurality of fourth single-cell cellular component expression datasets using one of a Wilcoxon rank sum test, a t test, logistic regression, and a generalized linear model, wherein:

each respective third single-cell cellular component expression dataset of the third plurality of third single-cell cellular component expression datasets is obtained from a corresponding single cell of the plurality of undisturbed cells, and

each respective fourth single-cell cellular component expression dataset of the fourth plurality of fourth single-cell cellular component expression datasets is obtained from a corresponding single cell of a fourth plurality of cells of the plurality of disturbed cells exposed to the perturbation.

9. The method of claim 1, further comprising:

filtering the single-cell transition signature and the perturbation signature to reduce a number of cellular components included in the single-cell transition signature and the perturbation signature,

optionally wherein the filtering comprises reducing the number of cellular components included in the single-cell transition signature and the perturbation signature according to a threshold p-value or according to a threshold number of cellular components.

10. The method of claim 1, wherein determining the corresponding second significance score for a respective cellular component comprises:

for each respective cellular component of the plurality of cellular components, replacing the significance score of the respective cellular component with a corresponding match score of the respective cellular component;

combining the match scores of the plurality of cellular components to generate a match score of the perturbation; and

determining whether the respective perturbation is associated with the transition of cells between the first cell state and the altered cell state based on the match score of the perturbation,

optionally wherein the corresponding match score comprises a discrete or continuous score.

11. The method of claim 10, wherein replacing the significance score comprises:

replacing the significance score with a first score if the cellular component amount of the respective cellular component from the single-cell transition signature and the cellular component amount of the respective cellular component from the perturbation signature are both up-regulated;

replacing the significance score with a second score if the cellular component amount of the respective cellular component from the single-cell transition signature is up-regulated and the cellular component amount of the respective cellular component from the perturbation signature is down-regulated; and

replacing the significance score with a third score if the cellular component amount of the respective cellular component from the perturbation signature is not significantly up-regulated or down-regulated.

12. The method of claim 10, wherein replacing the significance score comprises:

replacing the significance score with a first score if both the cellular component amount of the respective cellular component from the single-cell transition signature and the cellular component amount of the respective cellular component from the perturbation signature are down-regulated;

replacing the significance score with a second score if the cellular component amount of the respective cellular component from the single-cell transition signature is down-regulated and the cellular component amount of the respective cellular component from the perturbation signature is up-regulated; and

replacing the significance score with a third score if the cellular component amount of the respective cellular component from the perturbation signature is not significantly up-regulated or down-regulated.

13. The method of claim 1, wherein the plurality of cellular components comprise a plurality of genes, optionally measured at the RNA level.

14. The method of claim 2, wherein each single-cell cellular component expression dataset of the first plurality of first single-cell cellular component expression datasets and the second plurality of second single-cell cellular component expression datasets is generated using a method selected from the group consisting of: single-cell ribonucleic acid (RNA) sequencing (scRNA-seq), scTag-seq, single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), CyTOF/SCoP, E-MS/Abseq, miRNA-seq, CITE-seq, and any combination thereof, as well as summaries thereof, including combinations, such as linear combinations, representing activation pathways in the single-cell cellular component expression dataset.

15. The method of claim 1, further comprising:

identifying the perturbation as a perturbation that promotes the altered cellular state based on the comparison, or

identifying the perturbation as a perturbation that inhibits the altered cellular state based on the comparison.

16. The method of claim 1, wherein the single-cell transition signature and the perturbation signature are generated using different types of cellular components.

17. The method of claim 1, wherein the single-cell transition signature and the perturbation signature are generated using the same type of cellular component.

18. The method of claim 1, wherein:

the accessing in electronic form is performed for each respective perturbation of a plurality of perturbations to obtain a plurality of perturbation signatures; and

the comparing compares the single-cell transition signature to each respective perturbation signature of the plurality of perturbation signatures, thereby

determining a subset of the plurality of perturbations associated with a transition of a cell between the first cellular state and the altered cellular state.

19. A computer system comprising one or more processors and memory storing instructions for performing a method for predicting whether a perturbation will affect a cell transition, the method comprising:

accessing, in electronic form, a perturbation signature representing a measure of differential cellular component expression between a plurality of undisturbed cells and a plurality of disturbed cells exposed to the perturbation, wherein the perturbation signature comprises an identification of all or a portion of the plurality of cellular components and, for each respective cellular component in the all or the portion of the plurality of cellular components, a respective second significance score quantifying a correlation between (i) a change in expression of the respective cellular component between the plurality of undisturbed cells and the plurality of disturbed cells and (ii) a change in cellular state between the plurality of undisturbed cells and the plurality of disturbed cells; and

comparing the single-cell transition signature and the perturbation signature to determine whether the perturbation will affect the cell transition.

20. A non-transitory computer readable medium storing one or more computer programs executable by a computer for predicting whether a perturbation will affect a cell transition, the computer comprising one or more processors and memory, the one or more computer programs collectively encoding computer executable instructions for performing a method comprising:

electronically accessing a single-cell transition signature representing a measure of differential cellular component expression between a first cellular state and an altered cellular state, wherein the altered cellular state occurs by the cellular transition from the first cellular state to the altered cellular state, and wherein the single-cell transition signature comprises an identification of a plurality of cellular components and, for each respective cellular component of the plurality of cellular components, a corresponding first significance score quantifying an association between a change in expression of the respective cellular component and a change in cellular state between the first cellular state and the altered cellular state;

comparing the single-cell transition signature and the perturbation signature to determine whether the perturbation will affect the cell transition.

Technical Field

The present invention generally relates to systems and methods for analyzing cells. More particularly, the invention relates to predicting whether a perturbation will affect a cellular transition.

Background

The study of cellular mechanisms is important for understanding disease.

Tissues are complex ecosystems of individual cells, and dysregulation of cell states underlies disease. Existing drug discovery efforts have attempted to characterize the molecular mechanisms that drive the transition of cells from a healthy state to a disease state, and to identify pharmacological approaches that reverse or inhibit these transitions. Past work has also attempted to identify molecular signatures that characterize these transitions, and to identify pharmacological approaches that reverse these signatures.

Molecular data from bulk collections of cells in a tissue, or from cells enriched by surface markers, mask the phenotypic and molecular diversity of individual cells in a population. Because of the heterogeneity of cells in these bulk collections, the results of current work aimed at elucidating disease-driving mechanisms can be misleading or even entirely incorrect. Newer methods, such as single-cell RNA sequencing, can characterize individual cells at the molecular level. These data provide a basis for understanding different cell states at higher resolution and reveal the rich and significant diversity of states that cells can occupy.

There are significant challenges in interpreting single-cell data, namely the sparsity of these data, the failure to detect molecules that are present in the cell, and noise, all of which create uncertainty about the accuracy of these molecular measurements. Therefore, new approaches are needed to gain insight into pharmacological approaches for controlling individual cell states and, in turn, addressing disease.

Computational positioning and repositioning of chemical entities (including small molecules, extracellular ligands, mRNA, siRNA, and others) has great potential to accelerate drug discovery. Past approaches have mapped differential expression signatures, derived from bulk cell populations perturbed by small molecules, onto the differences in cellular expression between healthy and disease states. This approach has potential, but in its current form its applicability is limited by the heterogeneity of bulk cell populations and by the substantial cell-type differences between the molecularly perturbed cells and the diseased cells.

In view of the foregoing background, there is a need in the art for systems and methods that enable enhanced cellular analysis. In particular, it is desirable to be able to predict whether a perturbation will affect a cell transition.

Disclosure of Invention

The present disclosure addresses the above-identified shortcomings, at least in part, by using single-cell data and molecular perturbation data as its primary data foundation and by using machine learning to improve the understanding of naturally diverse cellular states, to uncover key transition states as cells adopt alternative states, to facilitate understanding of the molecular mechanisms underlying cellular state changes, and to discover pharmacological approaches for controlling these state changes.

One aspect of the present disclosure provides a method for predicting whether a perturbation will affect a cellular transition (e.g., whether the transition is promoted or inhibited). The method includes electronically accessing a single-cell transition signature. The transition signature represents a measure of differential cellular component expression between a first cellular state and an altered cellular state. The altered cellular state occurs by a cellular transition from the first cellular state to the altered cellular state. The single-cell transition signature includes an identification of a plurality of cellular components. For each respective cellular component of the plurality of cellular components, a corresponding first significance score quantifies an association between a change in expression of the respective cellular component and a change in cellular state between the first cellular state and the altered cellular state. Indeed, any number of single-cell transition signatures may be obtained in this manner, each representing a measure of differential cellular component expression between the first cellular state and a different altered cellular state. Thus, any number of different altered cellular states can be analyzed in parallel using the disclosure of the present application.

The method also includes accessing a perturbation signature in electronic form. In some embodiments, the perturbation signature represents a measure of differential cellular component expression between one or more undisturbed cells and one or more disturbed cells exposed to the perturbation. In addition, the perturbation signature includes an identification of all or a portion of the plurality of cellular components. For each respective cellular component in all or a portion of the plurality of cellular components, a respective second significance score quantifies a correlation between a change in expression of the respective cellular component between the one or more undisturbed cells and the one or more disturbed cells and a change in cellular state between the one or more undisturbed cells and the one or more disturbed cells. In fact, any number of perturbation signatures can be obtained in this manner, each perturbation signature representing a measure of differential cellular component expression between one or more undisturbed cells and one or more disturbed cells exposed to a different perturbation of a plurality of perturbations. Further, the method includes comparing the one or more single-cell transition signatures to the one or more perturbation signatures, thereby determining whether the one or more perturbations will affect the transition of the cell to the one or more altered cellular states. In some embodiments, two, three, four, ten, or more (e.g., 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, or 100 or more) altered cellular states are analyzed in parallel in this manner. In some embodiments, two, three, four, ten, or more (e.g., 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, or 100 or more) perturbations are analyzed in parallel in this manner.
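As a minimal sketch of how the two signatures described above might be held in memory, each signature can be represented as an identification of cellular components together with one signed significance score per component. The `Signature` type, the `common_components` helper, the gene names, and all numeric values below are hypothetical and serve only as illustration; they are not taken from the disclosure.

```python
from typing import Dict, List, NamedTuple


class Signature(NamedTuple):
    """An identification of components plus one signed score per component."""
    label: str
    scores: Dict[str, float]  # component id -> signed significance score


def common_components(transition: Signature, perturbation: Signature) -> List[str]:
    """Components present in both signatures, i.e. the ones that can be compared."""
    return sorted(set(transition.scores) & set(perturbation.scores))


# Hypothetical values, for illustration only.
transition = Signature("state_1_to_altered", {"GENE_A": 3.2, "GENE_B": -2.5, "GENE_C": 1.1})
# A perturbation signature may cover only a portion of the components.
perturbation = Signature("compound_X", {"GENE_A": 2.0, "GENE_B": 1.7})

shared = common_components(transition, perturbation)
print(shared)
```

Restricting the comparison to the shared components reflects the statement that the perturbation signature may identify all or only a portion of the plurality of cellular components.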

In some embodiments, accessing the single-cell transition signature comprises determining the single-cell transition signature based on a first plurality of first single-cell cellular component expression datasets and a second plurality of second single-cell cellular component expression datasets. Each respective first single-cell cellular component expression dataset of the first plurality of first single-cell cellular component expression datasets is obtained from a corresponding single cell of a first plurality of cells in the first cellular state. Further, each respective second single-cell cellular component expression dataset of the second plurality of second single-cell cellular component expression datasets is obtained from a corresponding single cell of a second plurality of cells that are in the altered cellular state.

In some embodiments, each respective dataset of the first plurality of single cell component expression datasets comprises a corresponding cell component vector of a first plurality of cell component vectors. Further, each respective data set of the second plurality of single-cell cellular component expression data sets includes a corresponding cellular component vector of a second plurality of cellular component vectors. Additionally, each respective cellular component vector of the first plurality of cellular component vectors and the second plurality of cellular component vectors includes a plurality of elements. Each respective element in the respective cellular component vector is associated with a respective cellular component of the plurality of cellular components and includes a respective value representative of an amount of the respective cellular component of a respective single cell represented by a respective dataset of the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets.
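The cellular component vectors described above can be sketched as follows, assuming each single-cell dataset arrives as a mapping from component identifier to a measured amount. The component names, the counts, and the `to_vector` helper are hypothetical; treating unmeasured components as 0.0 is likewise an assumption of this sketch.

```python
# Hypothetical fixed ordering of cellular components shared by all vectors.
components = ["GENE_A", "GENE_B", "GENE_C"]


def to_vector(counts, components):
    """Return one element per component; unmeasured components get 0.0."""
    return [float(counts.get(c, 0)) for c in components]


# Made-up single-cell datasets for two cells in the first cellular state.
first_state_cells = [{"GENE_A": 5, "GENE_C": 1}, {"GENE_A": 7, "GENE_B": 2}]
first_vectors = [to_vector(cell, components) for cell in first_state_cells]
print(first_vectors)  # [[5.0, 0.0, 1.0], [7.0, 2.0, 0.0]]
```

Using one shared component ordering means that element k of every vector refers to the same cellular component, which is what allows vectors from the first and second pluralities to be compared element by element.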

In some embodiments, dimension reduction is performed on the first plurality of single-cell cellular component expression datasets and/or the second plurality of single-cell cellular component expression datasets to generate a plurality of dimension-reduced components. Then, for each respective cellular component vector of the first plurality of cellular component vectors and the second plurality of cellular component vectors, the plurality of dimension-reduced components are applied to the respective cellular component vector to form a corresponding dimension-reduced vector that includes a dimension-reduced component value for each respective dimension-reduced component of the plurality of dimension-reduced components. This forms a corresponding first plurality of dimension-reduced vectors and second plurality of dimension-reduced vectors. The method includes performing clustering to generate a set of clusters C_j. Each cluster includes a plurality of points corresponding to a subset of the first plurality of dimension-reduced vectors and the second plurality of dimension-reduced vectors. The first plurality of cells is identified from a first cluster of the set of clusters C_j, and the second plurality of cells is identified from a second cluster of the set of clusters C_j.
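The two steps above, applying dimension-reduced components to each cellular component vector and then clustering the reduced vectors, can be sketched as below. This is an illustration under stated assumptions: the loading vectors are hypothetical and hand-picked (in practice they might come from a method such as principal component analysis), the clustering is plain k-means chosen here only for brevity, and the initial centers are simply the first and last points.

```python
def project(vector, loadings):
    """Apply each dimension-reduced component (a loading vector) to one cell vector."""
    return [sum(v * w for v, w in zip(vector, load)) for load in loadings]


def kmeans(points, centers, iterations=10):
    """Plain Lloyd's algorithm; returns a cluster index for each point."""
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))
            for point in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Made-up cellular component vectors (3 components) for four cells.
cells = [[9, 1, 0], [8, 2, 1], [1, 9, 8], [0, 8, 9]]
loadings = [[1, -1, 0], [0, 0, 1]]  # hypothetical dimension-reduced components

reduced = [project(cell, loadings) for cell in cells]
labels = kmeans(reduced, [list(reduced[0]), list(reduced[-1])])
print(labels)  # [0, 0, 1, 1]
```

In this toy run the cells labelled 0 would play the role of the first plurality of cells and the cells labelled 1 the second plurality; which cluster corresponds to which cellular state has to be decided separately.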

In some embodiments, manifold learning is performed with the corresponding first and second pluralities of dimension-reduced vectors to identify a relative cellular state of each cell of the first and second pluralities of cells relative to each other cell.

In some embodiments, the plurality of undisturbed cells are control cells that have not been exposed to the perturbation, or the undisturbed cells are an average taken over unrelated disturbed cells that have been exposed to the perturbation.

In some embodiments, the method further comprises pruning the single-cell transition signature and the perturbation signature to limit the plurality of cellular components to transcription factors.

In some embodiments, determining the single-cell transition signature comprises determining a difference in the amount of the cellular components in the plurality of cellular components between the first plurality of first single-cell cellular component expression datasets and the second plurality of second single-cell cellular component expression datasets using one of a mean difference test, a Wilcoxon rank sum test, a t test, logistic regression, and a generalized linear model.
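As one illustration of the tests listed above, the following sketch computes a Wilcoxon rank sum statistic for a single cellular component, using the normal approximation and average ranks for ties (without a tie correction to the variance). The function names and the example amounts are hypothetical, and using the signed z statistic as the significance score is an assumption of this sketch rather than something the text specifies.

```python
import math


def rank_sum_z(first, second):
    """Wilcoxon rank sum z statistic (normal approximation, no tie correction).

    Positive z means values in `second` tend to be larger than in `first`.
    """
    pooled = sorted((v, g) for g, values in enumerate((first, second)) for v in values)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        i = j + 1
    n1, n2 = len(first), len(second)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 1)  # rank sum of `second`
    mean = n2 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd


def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up amounts of one component in cells of each state.
first_state = [1, 2, 3, 4]
altered_state = [5, 6, 7, 8]
z = rank_sum_z(first_state, altered_state)
p = two_sided_p(z)
print(round(z, 4))
```

Repeating this per component would yield one signed score and one p-value for each cellular component, which together could form a signature; for small samples an exact test would be more appropriate than the normal approximation used here.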

In some embodiments, the measure of differential cellular constituent expression quantifies a difference in cellular constituent amount between the third plurality of third single-cell cellular constituent expression data sets and the fourth plurality of fourth single-cell cellular constituent expression data sets using one of a mean difference test, a Wilcoxon rank sum test, a t test, logistic regression, and a generalized linear model. Each respective third single-cell cellular component expression dataset of the third plurality of third single-cell cellular component expression datasets is obtained from a corresponding single cell of the plurality of undisturbed cells, and each respective fourth single-cell cellular component expression dataset of the fourth plurality of fourth single-cell cellular component expression datasets is obtained from a corresponding single cell of a fourth plurality of cells of the plurality of disturbed cells exposed to a disturbance.

In some embodiments, the single-cell transition signature and the perturbation signature are filtered to reduce the number of cellular components included in each signature. In some embodiments, filtering the single-cell transition signature and the perturbation signature comprises reducing the number of cellular components included in each signature according to a threshold p-value or according to a threshold number of cellular components.
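The two filtering criteria just described can be sketched as follows, assuming a signature stores a signed score and a p-value for each component. The `filter_signature` helper, its parameter names, and the example values are hypothetical; the same filter would be applied to the transition signature and to the perturbation signature in turn.

```python
def filter_signature(signature, p_threshold=None, max_components=None):
    """signature: component -> (signed score, p-value). Returns a reduced dict."""
    kept = dict(signature)
    if p_threshold is not None:
        # Keep only components at or below the threshold p-value.
        kept = {c: sp for c, sp in kept.items() if sp[1] <= p_threshold}
    if max_components is not None:
        # Keep only the most significant components, up to the threshold number.
        top = sorted(kept, key=lambda c: kept[c][1])[:max_components]
        kept = {c: kept[c] for c in top}
    return kept


# Made-up signature values, for illustration only.
signature = {"GENE_A": (3.2, 0.001), "GENE_B": (-2.5, 0.01), "GENE_C": (0.4, 0.6)}

by_p_value = filter_signature(signature, p_threshold=0.05)
by_count = filter_signature(signature, max_components=1)
print(sorted(by_p_value), sorted(by_count))  # ['GENE_A', 'GENE_B'] ['GENE_A']
```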

In some embodiments, the corresponding match scores comprise discrete or continuous scores.

In some embodiments, replacing the significance score comprises replacing the significance score with a first score if both the amount of the respective cellular component from the single-cell transition signature and the amount of the respective cellular component from the perturbation signature are up-regulated. The significance score is replaced with a second score if the amount of the respective cellular component from the single-cell transition signature is up-regulated and the amount of the respective cellular component from the perturbation signature is down-regulated. Further, if the amount of the respective cellular component from the perturbation signature is not significantly up-regulated or down-regulated, the significance score is replaced with a third score.

In some embodiments, replacing the significance score comprises replacing the significance score with the first score if both the amount of the respective cellular component from the single-cell transition signature and the amount of the respective cellular component from the perturbation signature are down-regulated compared to the counterpart (e.g., the first cellular state and the undisturbed state, respectively). The significance score is replaced with the second score if the amount of the respective cellular component from the single-cell transition signature is down-regulated and the amount of the respective cellular component from the perturbation signature is up-regulated compared to the counterpart (e.g., the first cellular state and the undisturbed state, respectively). Further, if the amount of the respective cellular component from the perturbation signature is not significantly up-regulated or down-regulated compared to the counterpart (e.g., the first cellular state and the undisturbed state, respectively), the significance score is replaced with the third score.
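The replacement rules in the two preceding paragraphs can be sketched as a single function. The concrete values of the first, second, and third scores (+1, -1, 0), the significance cutoff `alpha`, and the choice to combine match scores by summation are all assumptions made for illustration; the text only states that the scores may be discrete or continuous. Transition scores are assumed to be nonzero.

```python
def match_score(t_score, p_score, p_value, alpha=0.05,
                first=1.0, second=-1.0, third=0.0):
    """Replace a significance score with a match score for one component."""
    if p_value > alpha or p_score == 0:
        return third  # perturbation effect not significantly up- or down-regulated
    same_direction = (t_score > 0) == (p_score > 0)
    return first if same_direction else second


def perturbation_match(transition, perturbation):
    """transition: component -> signed score; perturbation: component -> (score, p)."""
    return sum(match_score(transition[c], s, p)
               for c, (s, p) in perturbation.items() if c in transition)


# Made-up signatures, for illustration only.
transition = {"GENE_A": 3.2, "GENE_B": -2.5, "GENE_C": 1.1, "GENE_D": -1.5}
perturbation = {
    "GENE_A": (2.0, 0.01),    # both up-regulated      -> first score
    "GENE_B": (1.7, 0.02),    # down vs. up            -> second score
    "GENE_C": (0.3, 0.7),     # not significant        -> third score
    "GENE_D": (-2.2, 0.001),  # both down-regulated    -> first score
}

total = perturbation_match(transition, perturbation)
print(total)  # 1.0
```

With these assumed values, a positive total indicates that the perturbation mostly moves components in the same direction as the transition, and a negative total indicates the opposite.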

In some embodiments, the plurality of cellular components comprises a plurality of genes.

In some embodiments, each single-cell cellular component expression dataset of the first plurality of first single-cell cellular component expression datasets and the second plurality of second single-cell cellular component expression datasets is generated using a method comprising single-cell ribonucleic acid (RNA) sequencing (scRNA-seq), scTag-seq, single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), CyTOF/SCoP, E-MS/Abseq, miRNA-seq, CITE-seq, or a combination or summary thereof.

In some embodiments, the method further comprises identifying the perturbation as a perturbation that promotes an altered cellular state based on the comparison.

In some embodiments, the single-cell transition signature and the perturbation signature are generated using different types of cellular components. In some embodiments, the single-cell transition signature and the perturbation signature are generated using the same type of cellular component.

In some embodiments, the accessing in electronic form is performed for each respective perturbation of a plurality of perturbations to obtain a plurality of perturbation signatures. Further, the comparing compares the single-cell transition signature to each respective perturbation signature of the plurality of perturbation signatures, thereby determining a subset of the plurality of perturbations that are associated with a transition of a cell between the first cellular state and the altered cellular state.
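Screening several perturbations against one transition signature, as described above, might look like the following sketch. The scoring rule (a count of same-direction minus opposite-direction components), the `min_abs_score` cutoff, the "promotes" and "inhibits" labels, and the compound names are hypothetical choices for illustration, not the method of the disclosure.

```python
def select_perturbations(transition, perturbations, min_abs_score=2):
    """Return the subset of perturbations whose agreement score passes the cutoff.

    transition: component -> signed score
    perturbations: name -> {component -> signed score}
    """
    results = {}
    for name, signature in perturbations.items():
        score = sum(
            (1 if (transition[c] > 0) == (s > 0) else -1)
            for c, s in signature.items()
            if c in transition and s != 0
        )
        if abs(score) >= min_abs_score:
            # Assumption: moving with the transition is read as promoting it.
            results[name] = "promotes" if score > 0 else "inhibits"
    return results


# Made-up signatures, for illustration only.
transition = {"GENE_A": 3.2, "GENE_B": -2.5, "GENE_C": 1.1}
perturbations = {
    "compound_1": {"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": 0.5},
    "compound_2": {"GENE_A": -2.0, "GENE_B": 1.5},
    "compound_3": {"GENE_A": 1.0, "GENE_B": 2.0},
}

subset = select_perturbations(transition, perturbations)
print(subset)  # {'compound_1': 'promotes', 'compound_2': 'inhibits'}
```

Here `compound_3` is left out because its agreeing and disagreeing components cancel, so it falls below the assumed cutoff.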

Another aspect of the disclosure provides a method comprising accessing a plurality of single-cell cellular component expression datasets. Each single-cell cellular component expression dataset is obtained from one cell of a plurality of cells that have been transformed from the same "progenitor" cell type. Each dataset comprises a cellular component vector r_i. Each entry in the cellular component vector r_i is associated with one of a plurality of cellular components, and the value of each entry represents the amount of that cellular component in the cell. The method further includes performing dimensionality reduction on the datasets to generate a matrix M (e.g., a plurality of dimension-reduced components, such as those of dimension-reduced component store 146-1 of FIG. 1). The matrix M includes rows in a first dimension and columns in a second dimension. Each row corresponds to one of the plurality of cells. The values of the matrix M include values generated from the amounts of cellular components located at points in the first and second dimensions. The method also includes performing clustering to generate a set of clusters C_j. Each cluster comprises a plurality of points corresponding to a subset of the rows in the matrix M, and their corresponding cells. The method further comprises using the set of clusters C_j to determine a set of differentially expressed cellular components E_k for the cells.

In certain embodiments, the method further comprises performing manifold learning with the matrix M under a relative similarity approximation of points to create the matrix N. The matrix N includes a plurality of rows (the same rows as the matrix M) and two columns. Each row corresponds to one of the plurality of cells, and each of the two columns corresponds to one of two dimensions in a two-dimensional space. Based on the data set, the values of matrix N indicate the relative cell type of each cell relative to each other cell.

The plurality of cells from which the datasets are obtained may vary across embodiments. In certain embodiments, when the single-cell cellular component expression datasets are obtained, the plurality of cells is a heterogeneous population of cells having various cell types. In additional embodiments, the plurality of cells is a homogeneous population of cells having a "progenitor" cell type, and the single-cell cellular component expression datasets are obtained at each of a plurality of time points as the cells transition from the "progenitor" cell type, such that a different dataset of the plurality of datasets is collected for each unique combination of cell and time point. In such embodiments, the plurality of time points may include at least three time points. In further embodiments, the plurality of time points may comprise a "progenitor" time point at which a substantial portion of the plurality of cells have not yet transitioned from the "progenitor" cell type. In some additional embodiments, the plurality of time points may include a transition time point at which a substantial portion of the plurality of cells have transitioned from the "progenitor" cell type. In some further embodiments, the plurality of time points may include at least one intermediate time point at which a substantial portion of the cells have at least partially transitioned from the "progenitor" cell type.

The plurality of cellular components may also vary. For example, in some embodiments of the methods disclosed herein, the plurality of cellular components is selected from the group consisting of: nucleic acids, proteins, lipids, carbohydrates, nucleotides, and any combination thereof. In such embodiments, the nucleic acid may be selected from the group consisting of DNA and RNA. In further embodiments, the RNA may be selected from the group consisting of coding RNA and non-coding RNA. In certain embodiments, the plurality of single-cell cellular component expression datasets are generated using a method selected from the group consisting of: single-cell ribonucleic acid (RNA) sequencing (scRNA-seq), scTag-seq, single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), CyTOF/SCoP, E-MS/Abseq, miRNA-seq, CITE-seq, and any combination or summary thereof.

Dimensionality reduction may be performed on the dataset to generate the reduced-dimensional components in a number of ways (e.g., in the form of the matrix M described above). In certain embodiments, performing the dimension reduction comprises performing Principal Component Analysis (PCA) on the single-cell cellular component expression dataset to generate the dimension reduction component. In further embodiments, the dimensionality reduction may be performed on the dataset using a diffusion map and/or a neural network autoencoder to generate the reduced-dimensional components.
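As a non-limiting illustration of the PCA option, the following minimal sketch computes only the leading principal component by power iteration, using the Python standard library. The function name `first_principal_component`, the iteration count, and the toy amounts are hypothetical and are not taken from the disclosure; a practical implementation would typically retain many components to form the matrix M.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (unit direction, per-cell scores) of the leading principal component.

    rows: one list of cellular component amounts per cell (cells x components).
    """
    n_cells, n_comp = len(rows), len(rows[0])
    # Center each cellular component (column) at zero mean.
    means = [sum(r[j] for r in rows) / n_cells for j in range(n_comp)]
    centered = [[r[j] - means[j] for j in range(n_comp)] for r in rows]
    # Sample covariance matrix of the cellular components.
    cov = [[sum(c[a] * c[b] for c in centered) / (n_cells - 1)
            for b in range(n_comp)] for a in range(n_comp)]
    # Power iteration: repeated multiplication by the covariance matrix converges
    # to the eigenvector with the largest eigenvalue, provided the starting
    # vector is not orthogonal to it.
    vec = [1.0 / math.sqrt(n_comp)] * n_comp
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(n_comp)) for a in range(n_comp)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance along the current direction; keep the last vector
        vec = [x / norm for x in nxt]
    # Project every centered cell onto the component to get one score per cell.
    scores = [sum(c[j] * vec[j] for j in range(n_comp)) for c in centered]
    return vec, scores

# Toy data (made up): four cells, two cellular components lying on a line.
direction, scores = first_principal_component([[0, 0], [1, 2], [2, 4], [3, 6]])
```

In this sketch each score corresponds to one entry of a single column of a matrix such as M; additional columns would require additional components.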

Similarly, manifold learning may be performed using the reduced-dimensional components (e.g., in the form of matrix M) to create another data form, such as matrix N, in a variety of ways. In some embodiments of the present disclosure, performing manifold learning may include estimating a geometry of the data in matrix M to create matrix N. In such embodiments, performing manifold learning may include performing locally linear embedding (LLE), isometric mapping (ISOMAP), t-distributed stochastic neighbor embedding (t-SNE), Potential of Heat-diffusion for Affinity-based Trajectory Embedding (PHATE), or Uniform Manifold Approximation and Projection (UMAP). In further embodiments, performing manifold learning may include creating a force-directed layout based on the data in matrix M to generate matrix N. In one embodiment, the force-directed layout may be created using the ForceAtlas2 algorithm.

Clustering can also be performed in a number of different ways. In some embodiments, performing clustering assumes that there is no a priori knowledge of the organization of the plurality of points in each cluster. In additional embodiments of the disclosure, performing clustering includes performing HDBSCAN and/or Louvain community detection to generate the set of clusters C_j. In further embodiments, performing the clustering comprises assigning each point to one of the clusters C_j based on the time point at which the single-cell cellular component expression dataset associated with the point was collected. In some embodiments, performing clustering comprises analyzing the plurality of points using a diffusion path algorithm that assigns points to clusters based on a measure of how well the points act as cluster endpoints.

The set of differentially expressed cellular components E_k may be determined in a number of different ways. For example, in one embodiment, for each cellular component and for at least one cluster of the set of clusters C_j, the amount of the cellular component at the plurality of points in the at least one cluster may be compared to the amount of the cellular component at the plurality of points in at least one other cluster. Then, in response to the amount of the cellular component at the plurality of points in the at least one cluster exceeding a threshold level relative to the amount of the cellular component at the plurality of points in the at least one other cluster, the cellular component can be added to the set of differentially expressed cellular components E_k. In certain embodiments, the at least one cluster may include an on-lineage cluster of the set of clusters C_j containing a plurality of points with desired cell types. In further embodiments, the at least one other cluster may include an off-lineage cluster of the set of clusters C_j containing points with undesired cell types.
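By way of example only, the threshold comparison described above can be sketched as follows. The sketch assumes that the "threshold level" is a fold-change multiplier applied to the mean amount in the other cluster; the disclosure does not fix this choice. The function name `differentially_expressed`, the default threshold of 2.0, and all amounts are hypothetical.

```python
def differentially_expressed(cluster, other_cluster, threshold=2.0):
    """Return the cellular components whose mean amount in `cluster` exceeds
    `threshold` times the mean amount in `other_cluster`.

    Both arguments map a cellular component name to per-cell amounts.
    """
    selected = []
    for name, amounts in cluster.items():
        mean_here = sum(amounts) / len(amounts)
        other = other_cluster[name]
        mean_other = sum(other) / len(other)
        # Multiplying instead of dividing avoids a division by zero when the
        # component is absent from the other cluster.
        if mean_here > threshold * mean_other:
            selected.append(name)
    return selected

# Toy amounts (made up) for one cluster and one comparison cluster.
on_lineage = {"Ascl1": [8, 10, 12], "Fos": [1, 1, 1], "Actb": [5, 5, 5]}
off_lineage = {"Ascl1": [1, 2, 3], "Fos": [6, 6, 6], "Actb": [5, 5, 5]}
e_k = differentially_expressed(on_lineage, off_lineage)
```

With these toy amounts only the first component passes the threshold; swapping the two arguments would instead select components enriched in the comparison cluster.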

In a further embodiment, to determine the set of differentially expressed cellular components E_k, for each cellular component and for at least one cluster, a distance metric between the amount of the cellular component at the plurality of points in the at least one cluster and the amount of the cellular component at the plurality of points in at least one other cluster may be calculated. Then, in response to the distance metric being statistically significant, the cellular component can be added to the set of differentially expressed cellular components E_k.
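One way such a significance determination could be made is sketched below, assuming that the distance metric is the absolute difference of cluster means and that significance is assessed by a permutation test. Both choices are assumptions made for illustration, as the disclosure does not name a particular metric or test. The function names, the permutation count, and the seed are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(cluster_a, cluster_b, n_permutations=2000, seed=0):
    """Estimate how often randomly relabeled cells give a distance at least as
    large as the observed distance between the two clusters."""
    observed = abs(mean(cluster_a) - mean(cluster_b))
    pooled = list(cluster_a) + list(cluster_b)
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n_a = len(cluster_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance so ties are counted despite floating-point rounding.
        if permuted >= observed - 1e-12:
            hits += 1
    # The add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)

# Toy amounts (made up) of one cellular component in two clusters.
p_separated = permutation_p_value([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3])
```

A cellular component would be added to E_k when the estimated p-value falls below a chosen significance level, which is itself a design parameter.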

In some embodiments, the methods described herein can further comprise screening a database of transcription factors against the set of differentially expressed cellular components E_k to identify a set of differentially expressed transcription factors. In such embodiments where a set of differentially expressed transcription factors is identified, the method may further comprise the steps of: performing empirical mode decomposition on the set of differentially expressed cellular components E_k to generate a pseudo-temporal representation of the datasets; and identifying the set of differentially expressed transcription factors based on the pseudo-temporal representation.

In another aspect, the present disclosure provides a method comprising accessing a plurality of single-cell cellular component expression datasets. Each dataset is obtained from one cell of a plurality of cells that have been transformed from the same "progenitor" cell type. Each dataset comprises a cellular component vector r_i. Each entry in the cellular component vector r_i is associated with one of a plurality of cellular components, and the value of each entry represents the amount of that cellular component in the cell. The method further includes generating a k-nearest-neighbor (kNN) graph using a kNN algorithm and the single-cell cellular component expression datasets, performing clustering to generate a set of clusters C_j, and using the set of clusters C_j to determine a set of differentially expressed cellular components E_k for the plurality of cells. Each cluster includes a plurality of points, each point corresponding to the single-cell cellular component expression dataset for one of the plurality of cells. In some embodiments, determining the set of differentially expressed cellular components E_k includes determining a measure of the distance between the pluralities of points of the clusters C_j.
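For illustration, a brute-force construction of a kNN graph from the cellular component vectors r_i might look like the following sketch. The Euclidean distance, the function name `knn_graph`, and the toy vectors are assumptions; the disclosure does not specify a distance measure or a particular kNN implementation.

```python
import math

def knn_graph(vectors, k):
    """Map each cell index to the indices of its k nearest cells.

    vectors: one cellular component vector r_i per cell.
    """
    graph = {}
    for i, vi in enumerate(vectors):
        # Distance from cell i to every other cell, paired with that cell's index.
        dists = [(math.dist(vi, vj), j) for j, vj in enumerate(vectors) if j != i]
        dists.sort()  # nearest first; ties are broken by the lower cell index
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Toy vectors (made up): two well-separated pairs of cells.
graph = knn_graph([[0, 0], [0, 1], [5, 5], [5, 6]], k=1)
```

Community detection methods would then operate on the edges of such a graph. The brute-force search shown here scales quadratically with the number of cells, so large datasets would ordinarily call for an approximate neighbor search.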

In another aspect, the present disclosure provides a method comprising accessing a single cell transition signature representing a measure of differential cellular component expression between a first cellular state and an altered cellular state. The method further includes accessing perturbation characteristics representing a measure of differential cellular constituent expression between undisturbed cells that are not exposed to the perturbation and disturbed cells that are exposed to the perturbation. The method also includes determining whether the perturbation is associated with a transition of the cell between the first cell state and the altered cell state based on a comparison of the single cell transition characteristic and the perturbation characteristic.

In some embodiments, accessing the single-cell transition signature comprises determining the single-cell transition signature based on a first plurality of single-cell cellular component expression datasets, each first dataset obtained from one cell of a first plurality of cells in the first cellular state, and based on a second plurality of single-cell cellular component expression datasets, each second dataset obtained from one cell of a second plurality of cells in the altered cellular state. For each cell, each of the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets may include a cellular component vector r_i. Each entry in the vector is associated with one of a plurality of cellular components, and the value of each entry represents the amount of that cellular component in the cell. In some embodiments, determining the single-cell transition signature based on the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets comprises determining a difference in cellular component amounts between the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets using one of a difference-of-means test, a Wilcoxon rank-sum test (Mann-Whitney U test), a t-test, logistic regression, and a generalized linear model.
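As one illustration of the listed tests, the following sketch computes a Wilcoxon rank-sum (Mann-Whitney U) statistic for a single cellular component, with a two-sided p-value from the normal approximation. Tied values receive midranks, but no tie or continuity correction is applied to the variance, so the p-value is only approximate for small or heavily tied samples. The function name `mann_whitney_u` and the toy amounts are hypothetical.

```python
import math

def mann_whitney_u(first, second):
    """Return (U statistic of `first`, approximate two-sided p-value)."""
    n1, n2 = len(first), len(second)
    # Pool the amounts, remembering which group (0 or 1) each came from.
    pooled = sorted((v, g) for g, vals in enumerate((first, second)) for v in vals)
    rank_sum_first = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 are tied; they share the average of ranks i+1..j.
        midrank = (i + 1 + j) / 2.0
        rank_sum_first += midrank * sum(1 for t in range(i, j) if pooled[t][1] == 0)
        i = j
    u = rank_sum_first - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # Two-sided tail probability of a standard normal variable.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p

# Toy amounts (made up) of one cellular component in the two cellular states.
u_stat, p_value = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

Repeating the test for every cellular component, and recording the direction of change alongside the p-value, would yield one entry of the signature per cellular component.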

In such embodiments, where accessing the single-cell transition signature comprises determining the single-cell transition signature based on the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets, the method may further comprise obtaining the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets. The obtaining step further comprises performing dimensionality reduction on the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets to generate a matrix M. The matrix M includes rows in a first dimension and columns in a second dimension. Each row of the matrix M corresponds to one cell of the plurality of cells. The values of the matrix M include values generated from the amounts of cellular components located at points in the first and second dimensions. Obtaining the first plurality and the second plurality of single-cell cellular component expression datasets further comprises performing clustering to generate a set of clusters C_j. Each cluster includes a plurality of points corresponding to a subset of the rows in the matrix M, and their corresponding cells. Obtaining the first plurality and the second plurality of single-cell cellular component expression datasets may even further comprise identifying a first plurality of cells from a first cluster of the set of clusters C_j, identifying a second plurality of cells from a second cluster of the set of clusters C_j, obtaining the first plurality of single-cell cellular component expression datasets from the first plurality of cells, and obtaining the second plurality of single-cell cellular component expression datasets from the second plurality of cells.

In certain embodiments, obtaining the first single-cell cellular component expression dataset and the second single-cell cellular component expression dataset further comprises performing manifold learning with matrix M under a relative similarity approximation of points to create matrix N. The matrix N comprises a plurality of rows and two columns. Each row corresponds to one cell of the first and second pluralities of cells, and each column corresponds to one of the two dimensions of a two-dimensional space. The values of the matrix N are indicative of the relative cellular state of each cell relative to each other cell based on the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets.

In certain embodiments, accessing the perturbation signature may comprise determining the perturbation signature based on a plurality of undisturbed single-cell cellular component expression datasets for undisturbed cells that are not exposed to the perturbation and based on a plurality of disturbed single-cell cellular component expression datasets for disturbed cells that are exposed to the perturbation. The undisturbed cell can be a control cell that has not been exposed to the perturbation of the disturbed cell. Alternatively, the undisturbed cells can be an average of unrelated disturbed cells that have been exposed to the perturbation. In some embodiments, determining the perturbation signature based on the plurality of undisturbed single-cell cellular component expression datasets and the plurality of disturbed single-cell cellular component expression datasets may comprise determining a difference in cellular component amounts between the plurality of undisturbed single-cell cellular component expression datasets and the plurality of disturbed single-cell cellular component expression datasets using one of a difference-of-means test, a Wilcoxon rank-sum test (Mann-Whitney U test), a t-test, logistic regression, and a generalized linear model.

In some embodiments, the method further comprises filtering the single-cell transition signature and the perturbation signature to include only cellular components that are transcription factors. In additional embodiments, the method further comprises filtering the single-cell transition signature and the perturbation signature to reduce the number of cellular components that they include. In particular, the signatures may be filtered according to a threshold p-value or according to a threshold number of cellular components.
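The filtering options above can be sketched as follows, under the assumption that a signature is stored as a mapping from a cellular component name to a record holding a p-value. The data layout, the function name `filter_signature`, its parameter names, and the toy values are all hypothetical.

```python
def filter_signature(signature, max_p=None, top_n=None, allowed=None):
    """Reduce a signature to the requested cellular components.

    signature: maps a cellular component name to {"change": ..., "p": ...}.
    max_p:     keep only components with p-value <= max_p (threshold p-value).
    top_n:     keep only the top_n components with the smallest p-values.
    allowed:   optional set of names to keep, e.g. known transcription factors.
    """
    items = [(name, entry) for name, entry in signature.items()
             if allowed is None or name in allowed]
    if max_p is not None:
        items = [(name, entry) for name, entry in items if entry["p"] <= max_p]
    # Order by significance so that truncation keeps the strongest entries.
    items.sort(key=lambda pair: pair[1]["p"])
    if top_n is not None:
        items = items[:top_n]
    return dict(items)

# Toy signature (made-up values).
signature = {
    "Ascl1": {"change": 2.1, "p": 0.001},
    "Fos": {"change": -1.5, "p": 0.01},
    "Actb": {"change": 0.1, "p": 0.8},
}
by_p_value = filter_signature(signature, max_p=0.05)
by_count = filter_signature(signature, top_n=1)
```

Applying the same filter to both the single-cell transition signature and the perturbation signature keeps the later comparison restricted to a shared, smaller set of cellular components.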

In further embodiments of the methods disclosed herein, the perturbation signature may comprise a plurality of cellular components, each cellular component associated with a significance score that quantifies the association between a change in the amount of the cellular component and a change in the cellular state between an undisturbed cell and a disturbed cell. In such embodiments, determining whether the perturbation is associated with a transition of the cell between the first cellular state and the altered cellular state may comprise: replacing the significance score of each cellular component with a match score for the cellular component; combining the match scores of the plurality of cellular components to generate a match score for the perturbation; and determining whether the perturbation is associated with a transition of the cell between the first cellular state and the altered cellular state based on the match score for the perturbation. The match score may comprise a discrete score or a continuous score. Replacing the significance score may include: replacing the significance score with the first score if, for the cellular component, both the amount in the single-cell transition signature and the amount in the perturbation signature are up-regulated; replacing the significance score with a second score if the amount in the single-cell transition signature is up-regulated and the amount in the perturbation signature is down-regulated; and replacing the significance score with the third score if the amount in the perturbation signature is not significantly up-regulated or down-regulated.
Alternatively, replacing the significance score may include: replacing the significance score with the first score if, for the cellular component, both the amount in the single-cell transition signature and the amount in the perturbation signature are down-regulated; replacing the significance score with a second score if the amount in the single-cell transition signature is down-regulated and the amount in the perturbation signature is up-regulated; and replacing the significance score with the third score if the amount in the perturbation signature is not significantly up-regulated or down-regulated.
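A minimal sketch of discrete match scoring is given below. It assumes that each signature has already been reduced to a direction per cellular component (+1 up-regulated, -1 down-regulated, 0 not significant), that the first, second, and third scores are +1, -1, and 0, and that match scores are combined by summation. These values and the combination rule are illustrative assumptions, and the sketch handles the up-regulated and down-regulated alternatives in a single function.

```python
def match_score(transition, perturbation, first=1.0, second=-1.0, third=0.0):
    """Combine per-component match scores into one score for the perturbation.

    transition, perturbation: map a cellular component name to a direction
    (+1 up-regulated, -1 down-regulated, 0 not significantly changed).
    """
    total = 0.0
    for name, transition_dir in transition.items():
        if transition_dir == 0:
            continue  # assumption: components unchanged in the transition are ignored
        perturbation_dir = perturbation.get(name, 0)  # absent means not significant
        if perturbation_dir == 0:
            total += third    # perturbation does not significantly change the component
        elif perturbation_dir == transition_dir:
            total += first    # both change in the same direction
        else:
            total += second   # the two change in opposite directions
    return total

# Toy directions (made up) for four cellular components.
transition = {"Ascl1": 1, "Myt1": 1, "Fos": -1, "Atf3": -1}
perturbation = {"Ascl1": 1, "Myt1": -1, "Fos": -1}
score = match_score(transition, perturbation)
```

Here two components agree, one disagrees, and one is not significantly changed by the perturbation, so the combined score under these assumed values is 1.0.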

In an alternative embodiment of the methods disclosed herein, the match score is not used to replace the significance score associated with each cellular component of the perturbation signature. Rather, in alternative embodiments, the perturbation signature may comprise a plurality of cellular components, each cellular component being associated with a significance score that quantifies the association between a change in the amount of the cellular component and a change in the cellular state between an undisturbed cell and a disturbed cell. In such embodiments, determining whether the perturbation is associated with a transition of the cell between the first cellular state and the altered cellular state may comprise: combining the significance scores of the plurality of cellular components to generate a significance score for the perturbation; and determining whether the perturbation is associated with a transition of the cell between the first cellular state and the altered cellular state based on the significance score for the perturbation.

In some embodiments, a cellular component false discovery rate for the match score of the perturbation is estimated to determine a confidence level in the perturbation. In such embodiments, the cellular component false discovery rate is estimated by: calculating an empirical marginal expression frequency for each cellular component of the plurality of cellular components; summing the empirical marginal expression frequencies of the plurality of cellular components over combinations thereof to generate a probability of identifying a given number of cellular components by chance, assuming independently distributed expression; and estimating the cellular component false discovery rate for the match score of the perturbation based on the probability.
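The disclosure does not spell out the summation over combinations. One possible reading, offered only as an assumption, treats each cellular component as matching by chance with a probability equal to its empirical marginal frequency, independently of the others, so that the number of chance matches follows a Poisson binomial distribution. The sketch below computes the probability of at least a given number of chance matches by dynamic programming, which is equivalent to summing over all combinations. The function name and the frequencies are hypothetical.

```python
def chance_match_probability(frequencies, observed):
    """Probability that at least `observed` cellular components match by
    chance, assuming each matches independently with its own frequency."""
    dist = [1.0]  # dist[k] = probability of exactly k chance matches so far
    for freq in frequencies:
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1.0 - freq)   # this component does not match
            nxt[k + 1] += prob * freq       # this component matches
        dist = nxt
    return sum(dist[observed:])

# Toy frequencies (made up) for two cellular components.
p_two = chance_match_probability([0.5, 0.5], observed=2)
p_one = chance_match_probability([0.5, 0.5], observed=1)
```

A small probability for the observed number of matches would support higher confidence in the perturbation; how that probability is converted into a false discovery rate remains a design choice not fixed here.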

In certain embodiments, determining whether a perturbation is associated with a transition of a cell between a first cellular state and an altered cellular state depends on covariates of the perturbation. For example, in some embodiments, determining whether the perturbation is associated with a transition of the cell between the first cellular state and the altered cellular state may comprise: determining that a threshold amount of the covariates of the perturbation are associated with a transition of the cell between the first cellular state and the altered cellular state; and in response to the determining, determining that the perturbation is associated with a transition of the cell between the first cellular state and the altered cellular state. In certain embodiments, the perturbation may comprise exposing the cell to a small molecule. The covariates of the perturbation may include a specific dose of the small molecule, a time at which differential cellular component expression between the undisturbed cell and the disturbed cell is measured relative to the time of exposure of the disturbed cell to the small molecule, and a cell line of the disturbed cell.
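As a simple illustration, the covariate-based determination can be sketched as a count over covariate combinations, assuming that the "threshold amount" is a minimum number of dose, time, and cell line combinations in which an association was found. That reading, the function name, and the toy conditions are assumptions.

```python
def perturbation_is_associated(associated_by_condition, min_conditions):
    """Return True when at least `min_conditions` covariate combinations of a
    perturbation were each found to be associated with the transition.

    associated_by_condition: maps (dose, hours_after_exposure, cell_line)
    to a boolean result of the per-condition signature comparison.
    """
    hits = sum(1 for flag in associated_by_condition.values() if flag)
    return hits >= min_conditions

# Toy results (made up) for one small molecule under three conditions.
conditions = {
    (1.0, 6, "line_A"): True,
    (1.0, 24, "line_A"): True,
    (10.0, 24, "line_B"): False,
}
associated = perturbation_is_associated(conditions, min_conditions=2)
```

A fraction of conditions, rather than a count, could serve equally well as the threshold.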

In certain embodiments, the cellular component may comprise a gene. The single-cell cellular component expression dataset may be generated using a method selected from the group consisting of: single-cell ribonucleic acid (RNA) sequencing (scRNA-seq), scTag-seq, single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), CyTOF/SCoP, E-MS/Abseq, miRNA-seq, CITE-seq, and any combination or summary thereof.

In some embodiments of the methods disclosed herein, at least one of the single cell transition characteristics and the perturbation characteristics are obtained from a database. The perturbation signature may be obtained from a database of multiple perturbation signatures comprising multiple perturbations. In such embodiments, for each perturbation of the plurality of perturbations in the database, a perturbation signature of the perturbation is accessed from the database, and it is determined whether the perturbation is associated with a transition of the cell between the first cell state and the altered cell state based on a comparison of the single-cell transition signature to the perturbation signature.

In further embodiments of the methods disclosed herein, the methods may further comprise: accessing a plurality of perturbation signatures from a plurality of disturbed cells; and screening for perturbations that promote the altered cellular state by, for each of the plurality of perturbation signatures, determining whether the perturbation associated with the perturbation signature is associated with a transition of the cell between the first cellular state and the altered cellular state based on the comparison of the single-cell transition signature and the perturbation signature. In embodiments that screen for perturbations that promote the altered cellular state, accessing the plurality of perturbation signatures may comprise: exposing cells to a plurality of perturbations to generate the plurality of disturbed cells; and measuring the amounts of cellular components in the plurality of disturbed cells.

The method may further comprise identifying perturbations that promote the altered cellular state. Promoting the altered cellular state can include promoting a transition from the first cellular state to the altered cellular state in a population of cells that includes the first cellular state. Alternatively, promoting the altered cellular state may comprise, in a population of cells comprising the first cellular state, increasing the ratio of the number of cells in the altered cellular state to the number of cells in the first cellular state or, optionally, in a state other than the altered cellular state. In a further alternative embodiment, promoting the altered cellular state may comprise increasing the absolute number of cells in the altered cellular state in a population of cells comprising the first cellular state. In a still further alternative embodiment, promoting the altered cellular state may comprise, in a population of cells comprising the first cellular state, reducing the absolute number of cells in the first cellular state or, optionally, in a state other than the altered cellular state.

In certain embodiments, the cellular transformation signature and the perturbation signature can be generated using different types of cellular components. For example, a cellular transition signature can be generated based on RNA expression (e.g., a count of RNA transcripts), and a perturbation signature can be generated based on protein expression (e.g., a count of amino acids). In alternative embodiments, the cellular transformation signature and the perturbation signature may be generated using the same type of cellular component. For example, both cellular transformation characteristics and perturbation characteristics can be generated based on RNA expression (e.g., counts of RNA transcripts).

In another aspect, the present disclosure provides a method comprising accessing a single cell transition signature representing a measure of differential cellular component expression between a first cellular state and an altered cellular state. The method also includes accessing a plurality of perturbation features, each perturbation feature associated with a perturbation representing a measure of differential cellular constituent expression between undisturbed cells that are not exposed to the perturbation and disturbed cells that are exposed to the perturbation. The method also includes determining a subset of perturbations associated with a transition of the cell between the first cell state and the altered cell state based on a comparison of the single-cell transition characteristic and the plurality of perturbation characteristics.

In certain embodiments, each perturbation signature comprises a plurality of cellular components, and each cellular component is associated with a significance score that quantifies the association between a change in the amount of the cellular component and a change in the cellular state between an undisturbed cell and a disturbed cell. In such embodiments, determining the subset of perturbations associated with the transition of the cell between the first cellular state and the altered cellular state comprises, for each perturbation signature, replacing the significance score of each cellular component with a match score for that cellular component, and combining the match scores of the plurality of cellular components to generate a match score for the perturbation. The method then further includes sorting the perturbations according to their match scores, and selecting a subset of the perturbations based on the sorted list of perturbations.
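The sorting and selection steps can be sketched as follows, assuming that a higher match score indicates a stronger association and that the subset is the top-ranked perturbations. The selection rule, the function name `rank_perturbations`, the placeholder perturbation names, and the scores are all hypothetical.

```python
def rank_perturbations(match_scores, top_k):
    """Sort perturbations by descending match score and keep the first top_k.

    match_scores: maps a perturbation name to its combined match score.
    """
    ranked = sorted(match_scores.items(), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Placeholder perturbations with made-up match scores.
match_scores = {"P1": 7.0, "P2": 3.0, "P3": 5.0, "P4": -2.0}
selected = rank_perturbations(match_scores, top_k=2)
```

A score cutoff could be used in place of a fixed subset size, for example by keeping every perturbation whose match score exceeds a chosen value.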

In another aspect, the present disclosure provides a computer program product comprising a non-transitory computer-readable storage medium having instructions encoded thereon. When executed by a processor, the encoded instructions cause the processor to perform any embodiment of the methods disclosed herein. In yet another aspect, the present disclosure provides a system comprising a non-transitory computer-readable storage medium having instructions encoded thereon. When executed by a processor, the encoded instructions cause the processor to perform any embodiment of the methods disclosed herein.

In yet another aspect, the present disclosure provides a method for promoting neuronal and/or "progenitor" cells. The method includes exposing the starting fibroblast population to a perturbation having perturbation characteristics that promote the conversion of the starting fibroblast population to "progenitor" cells and/or neurons. In such embodiments, the perturbation signature is an increase in activity of one or more of Brn2, Ascl1, Myt1, Zfp941, Taf5B, St18, Zkscan16, Camta1, and Arnt2 and/or a decrease in activity of one or more of Ascl1, Atf3, Rorc, Scx, Satb1, Elf3, and Fos.

In certain embodiments of the methods for promoting neuronal and/or "progenitor" cells, the neuronal and/or "progenitor" cells are promoted by one or more of: increasing the absolute number of neurons and/or "progenitor" cells; reducing the absolute number of fibroblasts; promoting the conversion of fibroblasts to neuronal and/or "progenitor" cells; promoting the longevity of neuronal or "progenitor" cells; reducing the lifespan of fibroblasts; or increasing the ratio of neuronal and/or "progenitor" cells to fibroblasts. In further embodiments, the perturbation does not include forskolin, PP1, PP2, and trichostatin A.

In yet another aspect, the present disclosure provides a method of increasing the number of neurons and/or "progenitor" cells. The method comprises exposing a population of fibroblasts to a pharmaceutical composition having perturbation characteristics that promote the conversion of the population of fibroblasts into neurons. The pharmaceutical composition comprises forskolin, PP1, PP2, trichostatin A, BRD-K38615104, geldanamycin, manumycin A, mitoxantrone, curcumin, alvocidib, vatinostat, KI20227, or a combination of the foregoing, e.g., a combination of 2, 3, 4, 5 or more of the foregoing. In some embodiments, the pharmaceutical composition does not comprise forskolin, PP1, PP2, and trichostatin A.

In yet another aspect, the present disclosure provides a pharmaceutical composition for promoting neuronal and/or "progenitor" cells. The pharmaceutical composition comprises a perturbation selected from the group consisting of forskolin, PP1, PP2, trichostatin A, BRD-K38615104, geldanamycin, manumycin A, mitoxantrone, curcumin, alvocidib, valproate, KI20227, or a combination of the foregoing, and a pharmaceutically acceptable excipient. In some embodiments, the perturbation does not include forskolin, PP1, PP2, and trichostatin A.

In yet another aspect, the present disclosure provides a unit dosage form comprising one of the pharmaceutical compositions disclosed herein.

In yet another aspect, the present disclosure provides a method of identifying candidate perturbations that promote the conversion of a starting fibroblast population into neuronal and/or "progenitor" cells. The method includes exposing the starting fibroblast population to a perturbation and identifying perturbation characteristics of the perturbation. The perturbation characteristics of the perturbation include one or more cellular components and a significance score associated with each cellular component. The significance score for each cellular component quantifies the correlation between changes in cellular component expression after exposure of the population of fibroblasts to the perturbation and changes in the cellular state of the population of fibroblasts to neuronal and/or "progenitor" cells. The perturbation profile comprises an increase in activity of one or more of Brn2, Ascl1, Myt1, Zfp941, Taf5B, St18, Zkscan16, Camta1 and Arnt2 and/or a decrease in activity of one or more of Ascl1, Atf3, Rorc, Scx, Satb1, Elf3 and Fos. The method further includes identifying the perturbation as a candidate perturbation for promoting the conversion of the fibroblast population into neurons and/or "progenitor" cells based on the perturbation characteristics.

Yet another aspect of the present disclosure provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores instructions that, when executed by a computer system, cause the computer system to perform any of the methods for analyzing cells described in the present disclosure.

Drawings

The embodiments disclosed herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings. Like reference numerals refer to corresponding parts throughout the drawings.

FIG. 1 illustrates a block diagram of an exemplary system and computing device, according to one embodiment of the present disclosure;

fig. 2 provides a flow diagram of methods and features of a system for analyzing cells, wherein elements in the dashed box are optional, according to various embodiments of the present disclosure;

FIG. 3 is a flow diagram of a first example of a differential cellular component expression assay to determine a set of differentially expressed cellular components according to one embodiment of the present disclosure;

fig. 4A depicts a timeline tracking a trajectory of induced cellular state transitions over a period of time, according to one embodiment of the present disclosure;

FIG. 4B depicts a manifold generated by a force directed layout algorithm for the exemplary matrix N in supplemental Table 1, according to one embodiment of the present disclosure;

FIG. 5A depicts the manifold of FIG. 4B according to one embodiment of the present disclosure;

figure 5B depicts the expression level of each BAM transcription factor in each cell per measurement day depicted as a point in the manifold of figure 4B, according to one embodiment of the present disclosure;

fig. 6 depicts images of MEF cells in which Ascl1 transcription factor expression is forced, which have been stained with DAPI, Map2 antibody and Tuj1 antibody, mouse neurons which have been stained with DAPI, Map2 antibody and Tuj1 antibody, and MEF cells in which Ascl1 transcription factor expression is not forced, which have been stained with DAPI, Map2 antibody and Tuj1 antibody, according to one embodiment disclosed;

FIG. 7A depicts the manifold of FIG. 4B, where points in the manifold are grouped into clusters Cj identified by clustering, according to one embodiment of the present disclosure;

figure 7B depicts transcription factors known and unknown in the literature to be associated with MEF conversion to mouse neurons (and vice versa mouse myocytes), according to one embodiment of the present disclosure;

fig. 8A depicts a mapping of transition trajectories of MEF cells discussed with respect to fig. 4A according to one embodiment of the present disclosure;

FIG. 8B depicts a method for identifying a perturbation affecting a transition trajectory of a cell by altering gene expression in the cell such that the cell transitions from a first state to a second state in the mapping of the transition trajectory of FIG. 8A, according to one embodiment of the present disclosure;

Figure 9 depicts small molecule perturbations associated with the conversion of MEFs into mouse neurons (and vice versa mouse myocytes) according to one embodiment of the present disclosure;

FIG. 10A provides a histogram showing the total number of neurons under each treatment condition, wherein the total number of neurons was manually counted based on positive Tuj1/Map2 signals and neuron morphology, and wherein the data for each treatment condition was normalized by the number of neurons in DMSO-treated wells for each experiment, according to one embodiment of the present disclosure; and

Figure 10B provides a bar graph showing the percentage of neurons under each treatment condition, according to one embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.

Multiple instances may be provided for a component, operation, or structure described herein as a single instance. Finally, the boundaries between the various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other forms of functionality are contemplated and may fall within the scope of one or more embodiments. In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the one or more embodiments.

It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first data set may be referred to as a second data set, and similarly, a second data set may be referred to as a first data set, without departing from the scope of the present invention. The first data set and the second data set are both data sets, but they are not the same data set.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the claims. As used in the description of the embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

As used herein, the term "if" may be interpreted to mean "when" or "upon" or "in response to determining" or "in accordance with a determination" or "in response to detecting" that a stated prerequisite is true, depending on the context. Similarly, the phrase "if it is determined (that a stated prerequisite is true)" or "if (a stated prerequisite is true)" or "when (a stated prerequisite is true)" may be interpreted to mean "upon determining" or "in response to determining" or "in accordance with a determination" or "upon detecting" or "in response to detecting" that the stated prerequisite is true, depending on the context.

Further, when a reference number is given an "ith" designation, the reference number refers to a generic component, group, or embodiment. For example, a cellular component referred to as "cellular component i" refers to the ith cellular component of the plurality of cellular components.

The foregoing description includes exemplary systems, methods, techniques, instruction sequences, and computer program products that embody illustrative embodiments. For purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments of the inventive subject matter. It will be apparent, however, to one skilled in the art that embodiments of the present subject matter may be practiced without these specific details. In general, well-known instruction instances, protocols, structures, and techniques have not been shown in detail.

The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions below are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiment was chosen and described in order to best explain the principles and its practical application, to thereby enable others skilled in the art to best utilize the embodiment and various embodiments with various modifications as are suited to the particular use contemplated.

In the interest of clarity, not all of the routine features of the implementations described herein are shown and described. It should be appreciated that in the development of any such actual implementation, numerous implementation-specific decisions are made to achieve the developers' specific goals, such as compliance with use-case-related and business-related constraints, which will vary from one implementation to another and from one designer to another. Moreover, it should be appreciated that such a design effort might be complex and time consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of this disclosure.

Some portions of this description describe embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. Although these operations may be described functionally, computationally, or logically, it should be understood that these operations are performed by computer programs or equivalent electrical circuits, microcode, or the like.

The language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims set forth below that apply to their equivalents. Accordingly, the disclosure of embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention.

In general, terms used in the claims and the specification are intended to be interpreted to have ordinary meanings as understood by those of ordinary skill in the art. Certain terms are defined below to provide additional clarity. If the plain meaning conflicts with the provided definition, the provided definition is used.

Any terms not directly defined herein should be understood to have meanings commonly associated with their understanding within the field of the invention. Certain terms are discussed herein to provide additional guidance to the practitioner in describing the compositions, devices, methods, etc., and how to make or use them of various aspects of the invention. It should be understood that the same thing can be stated in more than one way. Accordingly, alternative languages and synonyms may be used for any one or more of the terms discussed herein. It is immaterial whether or not a term is set forth or discussed herein. Synonyms or alternatives are provided for methods, materials, etc. The recitation of one or more synonyms or equivalents does not exclude the use of other synonyms or equivalents, unless explicitly stated otherwise. Use examples (including examples of terms) are for illustrative purposes only and do not limit the scope and meaning of the inventive aspects herein.

As used herein, the term "perturbation" (e.g., a cell perturbation or a cellular perturbation) with respect to a cell refers to any treatment of a cell with one or more compounds. These compounds may be referred to as "perturbagens". In some embodiments, a perturbagen may include, for example, a small molecule, a biologic, a protein in combination with a small molecule, an ADC, a nucleic acid (such as an siRNA or interfering RNA), a cDNA that overexpresses a wild-type and/or mutant shRNA, a cDNA that overexpresses a wild-type and/or mutant guide RNA (e.g., Cas9 system or other gene editing system), or any combination of any of the foregoing.

As used herein, the term "progenitor" (e.g., progenitor cell) with respect to a cell refers to any cell that is capable of being converted from one cellular state to at least one other cellular state.

As used herein, the term "dataset" with respect to measurement of expression of a cellular component for a cell or cells may refer in some contexts to a high-dimensional dataset collected from a single cell (e.g., a single-cell cellular component expression dataset). In other contexts, the term "dataset" may refer to a plurality of high-dimensional datasets (e.g., a plurality of single-cell cellular component expression datasets) collected from single cells, each dataset in the plurality of datasets being collected from one cell in a plurality of cells.

As used herein, the term "affect" refers to a change in cellular transformation.

I. Exemplary System embodiments

Now that an overview of some aspects of the disclosure and some definitions used in the disclosure have been provided, details of an exemplary system are described in conjunction with FIG. 1.

Fig. 1 provides a block diagram illustrating a system 100 according to some embodiments of the present disclosure. The system 100 provides a prediction of whether a perturbation will affect cell transition. In fig. 1, system 100 is illustrated as a computing device. Of course, other topologies for computer system 100 are possible. For example, in some embodiments, system 100 may actually constitute several computer systems linked together in a network, or may be a virtual machine or container in a cloud computing environment. Thus, the exemplary topology shown in fig. 1 is only used to describe features of one embodiment of the present disclosure in a manner that will be readily understood by those skilled in the art.

Referring to FIG. 1, in some embodiments, a computer system 100 (e.g., a computing device) includes a network interface 104. In some embodiments, the network interface 104 interconnects the computing devices of system 100 with one another, and optionally with external systems and devices, through one or more communication networks (e.g., through the network communication module 118). In some embodiments, the network interface 104 optionally provides communication through the network communication module 118 via the internet, one or more Local Area Networks (LANs), one or more Wide Area Networks (WANs), other types of networks, or a combination of such networks.

Examples of networks include the World Wide Web (WWW), intranets, and/or wireless networks such as cellular telephone networks, wireless Local Area Networks (LANs), and/or Metropolitan Area Networks (MANs), as well as other devices that communicate wirelessly. Wireless communication optionally uses any of a number of communication standards, protocols, and technologies, including Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), High Speed Downlink Packet Access (HSDPA), High Speed Uplink Packet Access (HSUPA), Evolution Data-Only (EV-DO), HSPA+, Dual-Cell HSPA (DC-HSPDA), Long Term Evolution (LTE), Near Field Communication (NFC), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wireless Fidelity (Wi-Fi) (e.g., IEEE 802.11a, IEEE 802.11ac, IEEE 802.11ax, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n), Voice over Internet Protocol (VoIP), Wi-MAX, electronic mail protocols (e.g., Internet Message Access Protocol (IMAP) and/or Post Office Protocol (POP)), instant messaging (e.g., Extensible Messaging and Presence Protocol (XMPP), Session Initiation Protocol for Instant Messaging and Presence Leveraging Extensions (SIMPLE), Instant Messaging and Presence Service (IMPS)), and/or Short Message Service (SMS), or any other suitable communication protocol, including communication protocols not yet developed as of the filing date of this document.

In some embodiments, system 100 includes one or more processing units (CPUs) 102 (e.g., processors, processing cores, etc.), one or more network interfaces 104, a user interface 107 for use by a user including (optionally) a display 108 and an input system 110 (e.g., input/output interface, keyboard, mouse, etc.), memory (e.g., non-persistent memory 111, persistent memory 112), and one or more communication buses 114 for interconnecting the aforementioned components. The one or more communication buses 114 optionally include circuitry (sometimes referred to as a chipset) that interconnects and controls communications between system components. Non-persistent memory 111 typically includes high-speed random access memory such as DRAM, SRAM, DDR RAM, ROM, EEPROM, or flash memory, while persistent memory 112 typically includes CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, flash memory devices, or other non-volatile solid state storage devices. Persistent memory 112 optionally includes one or more storage devices located remotely from the one or more CPUs 102. Persistent memory 112, and the one or more non-volatile memory devices within non-persistent memory 111, comprise a non-transitory computer-readable storage medium. In some embodiments, non-persistent memory 111, or alternatively the non-transitory computer-readable storage medium, sometimes in conjunction with persistent memory 112, stores the following programs, modules, and data structures, or a subset thereof:

An optional operating system 116 (e.g., ANDROID, iOS, DARWIN, RTXC, LINUX, UNIX, OS X, WINDOWS, or an embedded operating system such as VxWorks) that includes procedures for handling various basic system services and for performing hardware-dependent tasks;

an optional network communication module (or instructions) 118 for connecting the system 100 with other devices and/or a communication network through the network interface 104;

a data set store 120 that stores a plurality of data sets 122, each data set including one or more identifiers (e.g., a sample identifier 124 and/or a cell/data set identifier 126), an associated time period 128, and a cellular component vector 130 including one or more cellular components 132; and

a feature storage area 140 storing one or more single-cell transition features 142 and one or more perturbation features 150.
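As a minimal sketch only, the data set store 120 and the feature storage area 140 listed above could be represented with plain record types. Every class and field name below is hypothetical and chosen for illustration; the disclosure does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataSet:
    """One data set 122 (hypothetical layout)."""
    sample_id: str                # sample identifier 124
    cell_id: str                  # cell/data set identifier 126
    time_period: Optional[str]    # associated time period 128, e.g. "t0"
    components: Dict[str, float]  # cellular component vector 130: component -> amount


@dataclass
class Feature:
    """A single-cell transition feature 142 or a perturbation feature 150.

    Maps each identified cellular component to its score 134.
    """
    name: str
    scores: Dict[str, float] = field(default_factory=dict)


@dataclass
class FeatureStore:
    """Feature storage area 140 (hypothetical layout)."""
    transition_features: List[Feature] = field(default_factory=list)
    perturbation_features: List[Feature] = field(default_factory=list)


# Illustrative values only; they are not taken from the disclosure.
example_set = DataSet("sample_1", "cell_1", "t0", {"component_1": 3.0, "component_2": 0.0})
store = FeatureStore()
store.transition_features.append(Feature("transition_1", {"component_1": 1.5}))
```

Keeping the scores in a mapping keyed by cellular component makes it straightforward to restrict a perturbation feature to the subset of components that it shares with a transition feature.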

As described above, the data set storage area 120 includes a plurality of data sets 122. Each data set is obtained (e.g., collected, communicated, etc.) from a single-cell measurement (e.g., single-cell measurement 310 of FIG. 3) of a population of cells (e.g., a corresponding sample). A sample identifier (ID) 124 associated with each data set 122 indicates the sample from which the data set of the cell came. The cell/data set identifier 126 indicates which cell and/or data sets (e.g., a subset of data sets) the data set 122 is associated with, and/or the state of the cell. In some embodiments, the time period 128 is associated with the capture time of the data set 122 (e.g., a first time period t₀ during cell growth, such as when the cells are initially cultured, a second time period t₁ when the measurement of cellular component expression is performed, etc.).

Further, in some embodiments, each data set 122 includes a cellular component vector 130 that includes one or more cellular components 132. In some embodiments, the one or more cellular components 132 include all cellular components of a cell or a subset of these cellular components of a cell. Each cellular component 132 represents a dimension of data associated with a measurement (e.g., single cell measurement 310 of FIG. 3). Typically, the data set 122 has a high dimensionality (e.g., greater than 3, greater than 5, greater than 10, greater than 100, etc.) and includes a large amount of data. Further, in some embodiments, each data set 122 is obtained from a cell (e.g., from a sample) in a plurality of cells that have been transitioned from a "progenitor" cell type (e.g., from a first state to an altered state).

In some embodiments, the system includes a feature storage area 140 that stores one or more single-cell transition features 142 and one or more perturbation features 150. In some embodiments, the one or more single-cell transition features 142 comprise one or more predetermined features (e.g., training features). In some embodiments, the one or more single-cell transition features 142 include single-cell transition features determined by system 100 and/or stored within the system for future use. Each single-cell transition feature 142 includes a cellular component identification 144, which in turn identifies a plurality of cellular components (e.g., cellular components 132-1-1 through 132-1-D of FIG. 1). Further, each cellular component 132 associated with a single-cell transition feature 142 has a corresponding prominence score 134. In some embodiments, the system 100 performs a dimension reduction (e.g., dimension reduction 320 of FIG. 3) on the data set 122 to generate a plurality of dimension reduction components 148 (e.g., dimension reduction components 148-1-1 through 148-1-F of FIG. 1, stored within dimension reduction component store 146-1 of FIG. 1 and/or forming matrix M of FIG. 3), thereby preserving the latent patterns present in the cellular components 132 of the data set 122. In some embodiments, the output of such a dimension reduction is a matrix (e.g., matrix M as mentioned below) that encodes the data set 122 in compressed form while maintaining the underlying structure of the data set.

In some embodiments, the feature storage area 140 comprises a manifold 149. In some embodiments, such a manifold 149 is associated with the corresponding dimension reduction components 148 of a single-cell transition feature 142. Such a manifold 149 is identified by performing manifold learning with the cellular component vectors 130 of the data sets 122 associated with the manifold (e.g., the data sets 122 associated with the single-cell transition feature 142).

The feature storage area 140 also includes one or more perturbation features 150 associated with corresponding perturbations. Each perturbation feature includes a cellular component identification 152, which identifies a plurality of cellular components (e.g., cellular components 132-1-1 through 132-1-H of FIG. 1). In some embodiments, the cellular components of cellular component identification 152 include some or all of the cellular components associated with the corresponding single-cell transition feature 142 (e.g., cellular component identification 152 of perturbation feature 150-1 includes a subset of cellular component identification 144 of single-cell transition feature 142-1 of FIG. 1). Further, each cellular component of the perturbation feature 150 has a corresponding prominence score 134.

In various embodiments, one or more of the above-identified elements are stored in one or more of the aforementioned memory devices and correspond to a set of instructions for performing the functions described above. The above identified modules, data, or programs (e.g., sets of instructions) need not be implemented as separate software programs, procedures, data sets, or modules, and thus various subsets of these modules and data may be combined or otherwise rearranged in various embodiments. In some embodiments, non-persistent memory 111 optionally stores a subset of the modules and data structures identified above. Further, in some implementations, the memory stores additional modules and data structures not described above. In some embodiments, one or more of the above-identified elements are stored in a computer system other than the computer system of system 100, which is addressable by system 100 so that system 100 may retrieve all or a portion of this data as needed.

Although FIG. 1 depicts a "system 100," the diagram is intended more as a functional description of the various features that may be present in a computer system than as a structural schematic of the embodiments described herein. In practice, and as recognized by one of ordinary skill in the art, items shown separately may be combined, and some items may be separated. Further, although FIG. 1 depicts certain data and modules in the non-persistent memory 111, some or all of these data and modules may alternatively be stored in the persistent memory 112 or in more than one memory. For example, in some embodiments, at least the data set storage area 120 is stored in a remote storage that may be part of a cloud-based infrastructure. In some embodiments, at least the data set storage area 120 is stored on a cloud-based infrastructure. In some embodiments, the data set store 120 and the feature store 140 may also be stored in one or more remote storage devices.

While a system according to the present disclosure has been disclosed with reference to fig. 1, a method 200 according to the present disclosure is now described in detail with reference to fig. 2.

Block 202. Referring to block 202 of FIG. 2, the method includes accessing (e.g., in electronic form) a single-cell transition feature (e.g., single-cell transition feature 142-1 of FIG. 1). The single-cell transition feature 142 represents a measure of differential cellular component expression between a first cellular state and an altered cellular state. The altered cellular state occurs by a cellular transition from the first cellular state to the altered cellular state. The single-cell transition feature 142 includes an identification of a plurality of cellular components (e.g., cellular component identification 144-1 of FIG. 1). For each respective cellular component (e.g., cellular component 132-1-1 through cellular component 132-1-D of FIG. 1) of the plurality of cellular components, a corresponding first prominence score (e.g., prominence score 134-1-1) quantifies a correlation between a change in expression of the respective cellular component and a change in cellular state between the first cellular state and the altered cellular state.

In some embodiments, accessing the single-cell transition feature comprises determining single-cell transition feature 142. Such a determination is based on a first plurality of first single-cell cellular component expression datasets (e.g., dataset 122-1, dataset 122-2, and dataset 122-3) and a second plurality of second single-cell cellular component expression datasets (e.g., dataset 122-4, dataset 122-5, and dataset 122-6). Each respective first single-cell cellular component expression dataset 122 in the first plurality of first single-cell cellular component expression datasets is obtained from a corresponding single cell (e.g., single-cell measurement 310 of fig. 3) of the first plurality of cells in the first cell state. Further, each respective second single-cell cellular component expression dataset of the second plurality of second single-cell cellular component expression datasets is obtained from a corresponding single cell (e.g., single-cell measurement 310 of fig. 3) of the second plurality of cells that is in the altered cellular state.

In some embodiments, determining the single cell transition characteristic comprises determining a difference in the amount of the cellular components across the plurality of cellular components 132. The difference is between the first plurality of first single-cell cellular component expression datasets and the second plurality of second single-cell cellular component expression datasets. In some embodiments, the difference is determined using one of a mean difference test, a Wilcoxon rank sum test, a t test, logistic regression, or a generalized linear model.
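For illustration, the per-component difference described above can be sketched with one of the listed options, a t test. The sketch below computes Welch's t statistic (an unequal-variance variant, chosen here as an assumption) for each cellular component. The function names and the example values are hypothetical, and each group is assumed to contain at least two cells.

```python
import math
from statistics import mean, variance
from typing import Dict, List


def welch_t(first: List[float], altered: List[float]) -> float:
    """Welch's t statistic for the altered group relative to the first group."""
    diff = mean(altered) - mean(first)
    # Standard error of the difference, using sample variances of each group.
    se = math.sqrt(variance(first) / len(first) + variance(altered) / len(altered))
    if se == 0:
        # Both groups are constant: no difference, or an unbounded one.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def differential_scores(first: Dict[str, List[float]],
                        altered: Dict[str, List[float]]) -> Dict[str, float]:
    """Score every cellular component measured in both groups of cells."""
    return {c: welch_t(first[c], altered[c]) for c in first if c in altered}


# Hypothetical amounts per cellular component, one value per single cell.
first_state = {"component_1": [1.0, 2.0, 3.0], "component_2": [5.0, 5.0, 6.0]}
altered_state = {"component_1": [4.0, 5.0, 6.0], "component_2": [5.0, 6.0, 5.0]}
scores = differential_scores(first_state, altered_state)
```

In this sketch a large positive or negative statistic marks a component whose amount differs between the two cell states, while a statistic near zero marks a component that does not distinguish them.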

In some embodiments, each respective dataset 122 of the first plurality of single cell component expression datasets includes a corresponding cell component vector of the first plurality of cell component vectors (e.g., cell component vector 130-1 of dataset 122-1 of fig. 1). In addition, each respective dataset of the second plurality of single-cell cellular component expression datasets includes a corresponding cellular component vector of the second plurality of cellular component vectors (e.g., cellular component vector 130-2 of dataset 122-2). Each respective cellular component vector of the first plurality of cellular component vectors and the second plurality of cellular component vectors includes a plurality of elements. Each respective element in respective cellular component vector 130 is associated with a respective cellular component 132 of the plurality of cellular components and includes a respective value representative of an amount of the respective cellular component of a respective single cell represented by a respective dataset (e.g., the cellular components and values of table 2) of the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets.

Further, in some embodiments, the cellular components 132 include a plurality of genes. Additionally, in some embodiments, one or more data sets 122 are generated using methods (e.g., the methods of Table 1) that include single-cell ribonucleic acid (RNA) sequencing (scRNA-seq), scTag-seq, single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), CyTOF/SCoP, E-MS/Abseq, miRNA-seq, CITE-seq, or any combination thereof.

Referring to block 204, the method further includes accessing (e.g., electronically) the perturbation signature (e.g., perturbation signature 150-1 of FIG. 1). Perturbation signature 150 represents a measure of differential cellular constituent expression between a plurality of undisturbed cells and a plurality of perturbed cells exposed to perturbation. Perturbation signature 150 includes identification of all or a portion of the plurality of cellular components (e.g., cellular component identification 152-1 of fig. 1). For each respective cellular component (e.g., cellular component 132-3-1 through cellular component 132-3-D of fig. 1) in all or a portion of the plurality of cellular components, a corresponding second prominence score (e.g., prominence score 134 of fig. 1) quantifies a correlation between changes in expression of the respective cellular component between the plurality of undisturbed cells and the plurality of disturbed cells and changes in cellular state between the plurality of undisturbed cells and the plurality of disturbed cells.
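Once both features are available, they can be compared over the cellular components that they share. The following sketch uses cosine similarity of the two score vectors as the comparison. This choice of measure is an assumption made purely for illustration and is not stated in this passage, and the function name and score values are hypothetical.

```python
import math
from typing import Dict


def compare_features(transition: Dict[str, float],
                     perturbation: Dict[str, float]) -> float:
    """Cosine similarity of two score mappings over their shared components."""
    shared = sorted(set(transition) & set(perturbation))
    dot = sum(transition[c] * perturbation[c] for c in shared)
    norm_t = math.sqrt(sum(transition[c] ** 2 for c in shared))
    norm_p = math.sqrt(sum(perturbation[c] ** 2 for c in shared))
    if norm_t == 0 or norm_p == 0:
        return 0.0  # no shared components, or all shared scores are zero
    return dot / (norm_t * norm_p)


# Hypothetical scores keyed by cellular component.
transition_scores = {"component_1": 2.0, "component_2": -1.0, "component_3": 1.0}
matching = {"component_1": 4.0, "component_2": -2.0}
opposing = {"component_1": -4.0, "component_2": 2.0}

similarity_matching = compare_features(transition_scores, matching)
similarity_opposing = compare_features(transition_scores, opposing)
```

Under this assumed measure, a value near 1 suggests that the perturbation shifts expression in the same direction as the cellular transition, and a value near -1 suggests that it shifts expression in the opposite direction.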

In some embodiments, method 200 includes performing dimension reduction (e.g., dimension reduction 320 of FIG. 3) on the first plurality of single-cell cellular component expression datasets and/or the second plurality of single-cell cellular component expression datasets 122. This dimension reduction generates a plurality of dimension reduction components (e.g., dimension reduction components 148 of FIG. 1). In some embodiments, the dimension reduction is a principal component analysis algorithm, a random projection algorithm, an independent component analysis algorithm, a feature selection method, a factor analysis algorithm, Sammon mapping, curvilinear component analysis, a stochastic neighbor embedding (SNE) algorithm, an Isomap algorithm, a maximum variance unfolding algorithm, a locally linear embedding algorithm, a t-SNE algorithm, a non-negative matrix factorization algorithm, a kernel principal component analysis algorithm, a graph-based kernel principal component analysis algorithm, a linear discriminant analysis algorithm, a generalized discriminant analysis algorithm, a Uniform Manifold Approximation and Projection (UMAP) algorithm, a LargeVis algorithm, a Laplacian Eigenmap algorithm, or a Fisher's linear discriminant analysis algorithm. See, e.g., Fodor, 2002, "A survey of dimension reduction techniques," Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Technical Report UCRL-ID-148494; Cunningham, 2007, "Dimension Reduction," University College Dublin, Technical Report UCD-CSI-2007-7; Zahorian et al., 2011, "Nonlinear Dimensionality Reduction Methods for Use with Automatic Speech Recognition," Speech Technologies, doi:10.5772/16863, ISBN 978-; and Lakshmi et al., 2016, "2016 IEEE 6th International Conference on Advanced Computing (IACC)," pp. 31-34, doi:10.1109/IACC.2016.16, ISBN 978-1-4673-.
Thus, in some embodiments, the dimensionality reduction is a Principal Component Analysis (PCA) algorithm, and each respective extracted dimensionality reduction component includes a respective principal component derived from the PCA. In such embodiments, the number of principal components in the plurality of principal components may be limited to a threshold number of principal components calculated by the PCA algorithm. The threshold number of principal components may be, for example, 5, 10, 20, 50, 100, 1000, 1500, or any other number. In some embodiments, each principal component calculated by the PCA algorithm is assigned an eigenvalue by the PCA algorithm, and the corresponding subset of the first plurality of extracted features is limited to the threshold number of principal components assigned the highest eigenvalues. For each respective cell component vector of the first and second plurality of cell component vectors 130, the plurality of dimension-reduced components are applied to the respective cell component vector to form a corresponding dimension-reduced vector that includes the dimension-reduced component value for each respective dimension-reduced component of the plurality of dimension-reduced components (e.g., forming the matrix M of FIG. 3). This forms a corresponding first plurality of reduced-dimension vectors and second plurality of reduced-dimension vectors. Further, in some embodiments, the method includes performing clustering to generate a set of clusters Cj (e.g., cluster 340 of FIG. 3). Each cluster includes a plurality of points corresponding to a subset of the first plurality of reduced-dimension vectors and the second plurality of reduced-dimension vectors. A first plurality of cells from a first cluster of the set of clusters Cj and a second plurality of cells from a second cluster of the set of clusters Cj are both identified.
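By way of non-limiting illustration, the PCA embodiment described above can be sketched as follows. The sketch assumes that the cell component vectors are stacked as the rows of a NumPy array; the function name `pca_reduce`, the parameter `n_components`, and the toy count data are hypothetical and are not part of the disclosure. The threshold number of principal components with the highest eigenvalues is kept and applied to each vector to form the reduced-dimension vectors.

```python
import numpy as np

def pca_reduce(vectors, n_components):
    """Project row vectors onto the n_components principal components
    with the largest eigenvalues (hypothetical helper)."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)            # center each cellular component
    # SVD of the centered data: rows of vt are the principal directions,
    # already sorted by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = s ** 2 / (X.shape[0] - 1)  # eigenvalues of the covariance matrix
    components = vt[:n_components]           # threshold number of components
    reduced = centered @ components.T        # one reduced-dimension vector per cell
    return reduced, components, eigenvalues[:n_components]

rng = np.random.default_rng(0)
cells = rng.poisson(lam=5.0, size=(30, 12))  # toy data: 30 cells x 12 components
M, components, eigenvalues = pca_reduce(cells, n_components=3)
print(M.shape)  # (30, 3)
```

Each row of `M` in this sketch plays the role of one reduced-dimension vector, and a clustering step could then operate on those rows.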

In some embodiments, the method 200 includes performing manifold learning (e.g., the manifold learning 330 of FIG. 3) with the corresponding first and second pluralities of reduced-dimension vectors 130. This manifold learning identifies the relative cellular state of each cell in the first and second pluralities of cells relative to each other cell (e.g., generating the matrix N of FIG. 3). For manifold learning, see, e.g., Wang et al., 2004, "Adaptive Manifold Learning," Advances in Neural Information Processing Systems 17, which is incorporated herein by reference.

In some embodiments, the plurality of undisturbed cells are control cells (e.g., cells that have not been exposed to the perturbation). Furthermore, in some embodiments, the undisturbed cells are an average over unrelated perturbed cells that have been exposed to a perturbation.

In some embodiments, the method comprises pruning the single cell transition signature and/or perturbation signature. Such pruning limits the plurality of cellular components 132 (e.g., limits cellular components to transcription factors).
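As a minimal sketch of such pruning, assume a signature is held as a mapping from cellular component name to prominence score. The function name `prune_signature`, the component names, and the transcription factor list below are hypothetical.

```python
def prune_signature(signature, allowed_components):
    """Keep only the cellular components that appear in allowed_components."""
    return {
        name: score
        for name, score in signature.items()
        if name in allowed_components
    }

# Toy transition signature: component name -> prominence score (made-up values).
transition_signature = {"TF_1": 2.4, "GENE_X": -0.3, "TF_2": -1.8, "GENE_Y": 0.9}
transcription_factors = {"TF_1", "TF_2", "TF_3"}  # hypothetical reference list

pruned = prune_signature(transition_signature, transcription_factors)
print(pruned)  # {'TF_1': 2.4, 'TF_2': -1.8}
```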

In some embodiments, the measure of differential cellular component expression (e.g., the differentially expressed cellular component 350 of fig. 3) quantifies a difference in cellular component amount between the third plurality of the third single-cell cellular component expression data sets and the fourth plurality of the fourth single-cell cellular component expression data sets. Similarly, in some embodiments, such differences are determined using one of the mean difference test, Wilcoxon rank sum test, t test, logistic regression, or generalized linear model. Further, each respective third single-cell cellular component expression dataset 122 in the third plurality of third single-cell cellular component expression datasets is obtained from a corresponding single cell in the plurality of undisturbed cells. Further, each respective fourth single-cell cellular component expression dataset of the fourth plurality of fourth single-cell cellular component expression datasets is obtained from a corresponding single cell of a fourth plurality of cells of the plurality of perturbed cells exposed to the perturbation.
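For illustration only, the sketch below computes two of the measures named above for each cellular component: a difference of means, and a t statistic in its Welch (unequal variance) form, which is one possible choice of t test. The component names and per-cell amounts are made-up toy values, and the helper `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = variance(group_a), variance(group_b)
    standard_error = sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean(group_b) - mean(group_a)) / standard_error

# Toy per-cell amounts, one list entry per single cell.
undisturbed = {"GENE_A": [1.0, 2.0, 1.5, 2.5], "GENE_B": [5.0, 6.0, 5.5, 6.5]}
perturbed = {"GENE_A": [6.0, 7.0, 6.5, 7.5], "GENE_B": [5.0, 6.5, 5.5, 6.0]}

scores = {
    name: {
        "mean_difference": mean(perturbed[name]) - mean(undisturbed[name]),
        "t_statistic": welch_t(undisturbed[name], perturbed[name]),
    }
    for name in undisturbed
}
for name, result in scores.items():
    print(name, result)
```

In this toy data GENE_A shifts strongly between the two groups while GENE_B does not, so only GENE_A would be a candidate for a high prominence score.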

In some embodiments, for each respective cellular component of the plurality of cellular components, determining the respective second prominence score for the respective cellular component comprises replacing the prominence score for the respective cellular component with a respective match score for the respective cellular component (e.g., replacing the prominence score 134-1-1 associated with cellular component 132-1-1 with prominence score 134-d-E of FIG. 1). The match scores of the plurality of cellular components are then combined to generate a match score for the perturbation. Whether the perturbation is associated with a transition of a cell between the first cellular state and the altered cellular state (e.g., whether the cell transition is affected) is then determined based on the match score of the respective perturbation. In some embodiments, a match score is a discrete score or a continuous score.

In some embodiments, replacing the prominence score 134 comprises replacing the prominence score with a first score if both the amount of cellular component 132 from the single-cell transition feature 142 of the respective cellular component and the amount of cellular component 132 from the perturbation feature 150 of the respective cellular component are up-regulated. Such replacement further includes replacing the prominence score 134 with a second score if the cellular component amount from the single-cell transition feature 142 of the respective cellular component is up-regulated and the cellular component amount from the perturbation feature 150 of the respective cellular component is down-regulated. Further, if the cellular component amount from the perturbation feature 150 of the respective cellular component is not significantly up- or down-regulated, the prominence score is replaced with a third score.
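The replacement rules above can be sketched as follows. The numeric values chosen for the first, second, and third scores, the use of a sum to combine per-component match scores, the neutral treatment of cases the passage does not spell out, and all names are assumptions made for illustration.

```python
def match_score(transition_direction, perturbation_direction,
                first_score=1, second_score=-1, third_score=0):
    """Replace a prominence score with a discrete match score.

    Directions are "up", "down", or "none" (not significantly changed).
    The three score values are illustrative assumptions.
    """
    if perturbation_direction == "none":
        return third_score
    if transition_direction == "up" and perturbation_direction == "up":
        return first_score
    if transition_direction == "up" and perturbation_direction == "down":
        return second_score
    # Cases not spelled out in the passage are treated as neutral here.
    return third_score

# Toy regulation directions per cellular component (hypothetical names).
transition = {"GENE_A": "up", "GENE_B": "up", "GENE_C": "up", "GENE_D": "up"}
perturbation = {"GENE_A": "up", "GENE_B": "down", "GENE_C": "none", "GENE_D": "up"}

per_component = {
    name: match_score(transition[name], perturbation.get(name, "none"))
    for name in transition
}
# Assumption: per-component match scores are combined by summation.
perturbation_match_score = sum(per_component.values())
print(per_component, perturbation_match_score)
```

A continuous variant could weight each term by the magnitude of the original prominence scores instead of using fixed values.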

Referring to block 206, method 200 includes comparing single cell transition signature 142-1 to perturbation signature 150-1. This comparison determines whether the perturbation will affect the cell transition.

In some embodiments, method 200 includes filtering single cell transition feature 142 and/or perturbation feature 150. Such filtering reduces the number of cellular components 132 included in single-cell transition features 142 and perturbation features 150, which helps to reduce the data volume of the features and the amount of time required to perform method 200 (e.g., perform post-processing 360 of fig. 3).

In some embodiments, method 200 includes identifying a perturbation as one that promotes an altered cellular state based on comparison 206 (e.g., based on post-processing 360 of fig. 3). In some embodiments, single cell transition feature 142 and/or perturbation feature 150 are generated using different types of cellular components. Similarly, in some embodiments, single-cell transition feature 142 and/or perturbation feature 150 are generated using the same type of cellular components.

II. Method of culturing cells in vitro to perform single cell analysis

In implementing the techniques described herein for identifying the cause of a cell fate, it is useful to generate data sets of cellular component measurements obtained from single cells. To generate these data sets (e.g., data set 122-1 of FIG. 1 via single cell measurement 310 of FIG. 3), a population of cells of interest is cultured in vitro. Single cell measurements of one or more cellular components of interest 132 are performed at one or more time periods during the culturing to generate data set 122. In some embodiments, the cellular components of interest include: nucleic acids, including DNA and modified (e.g., methylated) DNA; RNA, including coding (e.g., mRNA) or non-coding (e.g., sncRNA) RNA; proteins, including post-translationally modified proteins (e.g., phosphorylated, glycosylated, myristoylated, etc.); lipids; carbohydrates; nucleotides (e.g., adenosine triphosphate (ATP), adenosine diphosphate (ADP), and adenosine monophosphate (AMP)), including cyclic nucleotides, such as cyclic adenosine monophosphate (cAMP) and cyclic guanosine monophosphate (cGMP); other small molecule cellular components, such as the oxidized and reduced forms of nicotinamide adenine dinucleotide phosphate (NADP/NADPH); and any combination thereof. In some embodiments, the cellular component measurement comprises a gene expression measurement, such as an RNA level.

Any of a number of single-cell cellular component expression measurement techniques may be used to collect data set 122 (e.g., the techniques of Table 1, the techniques of single cell measurement 310 of FIG. 3, etc.). Examples include, but are not limited to, single-cell ribonucleic acid (RNA) sequencing (scRNA-seq), scTag-seq, single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), CyTOF/SCoP, E-MS/Abseq, miRNA-seq, CITE-seq, and the like. The cellular component expression measurement technique may be selected based on the cellular component to be measured. For example, scRNA-seq, scTag-seq, and miRNA-seq measure RNA expression. Specifically, scRNA-seq measures expression of RNA transcripts, scTag-seq enables detection of rare mRNA species, and miRNA-seq measures expression of microRNAs. CyTOF/SCoP and E-MS/Abseq measure protein expression in cells. CITE-seq measures gene expression and protein expression in cells simultaneously, and scATAC-seq measures chromatin conformation in cells. Table 1 below provides links to exemplary protocols for performing each of the single-cell cellular component expression measurement techniques described above.

Table 1-exemplary measurement protocol

The cellular component expression measurement technique used may result in cell death. Alternatively, the cellular components may be measured by extraction from living cells, for example by extracting cytoplasm without killing the cells. Techniques of this kind allow the same cell to be measured at multiple different time points.

If the cell population is heterogeneous such that multiple different cell types derived from the same "progenitor" cell are present in the population, single cell cellular component expression measurements can be performed at a single time point or at relatively few time points when the cells are grown in culture. Due to the heterogeneity of the cell population, the collected data set 122 will represent various types of cells along the transformation trajectory.

If the cell population is substantially homogeneous such that only a single or relatively few cell types, primarily "progenitor" cells of interest, are present in the population, the single-cell cellular component expression measurements may be performed multiple times over a period of cell transition.

A separate single-cell cellular component expression dataset 122 is generated for each cell and, where applicable, for each time period (e.g., time period 128 of FIG. 1). The collection of single-cell cellular component expression measurements from a population of cells at multiple different time points can be collectively interpreted as a "pseudo-time" representation of cell expression over time for cell types derived from the same "progenitor" cell. The term pseudo-time is used in two respects: first, the transition in cell state is not necessarily the same between cells, and therefore the population of cells provides a distribution of the transition processes that cells of the "progenitor" type may undergo over time; and second, the cellular component expression measurements of those multiple cells at multiple time points mimic possible transition behavior over time, even though the data sets come from measurements of different cells. As an intentionally simplified example, even though cell X gave a data set at time point A and cell Y gave a data set at time point B, the two data sets together represent the pseudo-time of the transition between time point A and time point B.

For ease of description, two such data sets 122 captured for the "same" cell (assuming the use of techniques that do not kill the cells as introduced above) at two different time periods (e.g., the first time period 128-1 of the first data set 122-1, the second time period 128-2 of the second data set 122-2, etc.) are referred to herein as different "cells" (and corresponding different data sets) because in practice such cells will often be slightly or significantly transformed from each other, in some cases with completely different cell types as determined from the relative amounts of the various cell components. From the context, these two measurements of a single cell at different time points can be interpreted as different cells for analysis purposes, since the cells themselves have changed.

Note that the separation of data sets by cell (e.g., cell/data set identifier 126 of fig. 1)/time period (e.g., time period 128 of fig. 1) described herein is for clarity of description, and in practice, these data sets may be stored in computer memory and logically operated on as one or more aggregated data sets (e.g., all cells and time periods at a time, by cell, over all time periods).
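One way such storage might look in memory is sketched below, with one record per cell and time period that can be regrouped by cell or by time period. The record type `ExpressionDataset`, its field names, and the toy values are hypothetical and are chosen only to mirror the sample, cell/data set, and time period identifiers discussed above.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class ExpressionDataset:
    sample_id: str     # sample the cell came from
    cell_id: str       # cell/data set identifier
    time_period: int   # time period at which the measurement was taken
    amounts: tuple     # one amount per cellular component

datasets = [
    ExpressionDataset("S1", "cell_X", 1, (3, 0, 7)),
    ExpressionDataset("S1", "cell_X", 2, (1, 4, 7)),
    ExpressionDataset("S1", "cell_Y", 1, (2, 2, 5)),
]

def group_by(records, key):
    """Aggregate records into lists keyed by the named attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, key)].append(record)
    return dict(groups)

by_cell = group_by(datasets, "cell_id")
by_time = group_by(datasets, "time_period")
print({k: len(v) for k, v in by_cell.items()})  # {'cell_X': 2, 'cell_Y': 1}
print({k: len(v) for k, v in by_time.items()})  # {1: 2, 2: 1}
```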

In some embodiments, it is useful to collect a data set 122 in which "progenitor" cells of interest have been perturbed from their baseline state. There are many possible reasons for doing so, such as knocking out (e.g., removing, abrogating, etc.) one or more cellular components, evaluating the difference between a healthy cellular state and a diseased cellular state, and so forth. In these embodiments, the method may further comprise the step of introducing the desired modification to the cell. For example, one or more perturbations can be introduced into a cell, a customized virus designed to knock out one or more cellular components can be introduced, a CRISPR can be used to edit cellular components, and the like. Examples of techniques that can be used include, but are not limited to, RNA interference (RNAi), transcription activator-like effector nucleases (TALENs), or Zinc Finger Nucleases (ZFNs).

Depending on the manner in which the perturbation is applied, not all cells will be perturbed in the same manner. For example, if a virus is introduced to knock out a particular gene, the virus may not affect all cells in the population. More generally, this property can be used advantageously to assess the effect of many different perturbations on a single population. For example, a large number of custom viruses may be introduced, each of which performs a different perturbation, such as knocking out a different gene. The viruses will infect different subsets of the cells, knocking out the genes of interest. Single cell sequencing or another technique can then be used to identify which viruses affected which cells. The resulting single cell sequencing datasets can then be evaluated to identify the effect of the gene knockouts on gene expression according to methods described elsewhere in this specification.

Other types of multi-perturbed cell modifications can be similarly performed, such as the introduction of a variety of different perturbations, barcoded CRISPRs, and the like. Furthermore, more than one type of perturbation may be introduced into the cell population to be analyzed. For example, cells may be affected differently (e.g., different viruses introduced), and different perturbations may be introduced into different cell subsets.

In addition, different subsets of the cell population may be perturbed in different ways, rather than simply mixing many perturbations and evaluating afterwards which cells are affected by which perturbations. For example, if the population of cells is physically separated into different wells of a multi-well plate, different perturbations can be applied to each well. Other ways of achieving different perturbations on different cells are also possible.

In the following, the method is exemplified using single cell gene expression measurement. It is to be understood that this is illustrative and not restrictive, as the invention encompasses similar methods using measurements of other cellular components obtained from single cells. It is also to be understood that the invention encompasses methods of using measurements obtained directly from experimental work performed by an individual or organization practicing the methods described in this disclosure, as well as methods of using measurements obtained indirectly, for example, from reports of the results of experimental work (including third party publications, databases, assays performed by contractors, or other sources of suitable input data that may be used to practice the disclosed methods) performed by other people and by any means or mechanism.

As discussed herein, gene expression in a cell can be measured by sequencing the cell and then counting the amount of each gene transcript identified during sequencing. In some embodiments, sequenced and quantified gene transcripts can include RNA, e.g., mRNA. In alternative embodiments, sequenced and quantified gene transcripts may include downstream products of mRNA, e.g., proteins, such as transcription factors. Generally, as used herein, the term "gene transcript" may be used to refer to any downstream product of gene transcription or translation (including post-translational modifications), and "gene expression" may be used to refer generally to any measure of gene transcript.

While the remainder of the description focuses on the analysis of gene transcripts and gene expression, all of the techniques described herein are equally applicable to any technique for obtaining data on a single cell basis for those cells. Examples include single cell proteomics (protein expression), chromatin conformation (chromatin state), methylation, or other quantifiable epigenetic effects.

The following description provides an exemplary general description of culturing a population of cells in vitro to perform a single cell component expression measurement (e.g., measurement 310 of fig. 3) over a plurality of time periods (e.g., plurality of time periods 128 of fig. 1). Generally, methods for culturing cells in vitro are known to those skilled in the art. One skilled in the art would also understand how to modify the method to grow for longer or shorter time periods, perform additional or fewer single cell measurement steps, etc.

In one embodiment, a method for culturing cells in a first cellular state into cells in an altered cellular state comprises one or more of the following steps:

Day 0: A plurality of cells in a first cellular state are thawed onto a medium suitable for cell growth on a plate.

Day 1: A quantity of cells in the first cellular state is seeded into a multi-well plate. If applicable, additional steps are performed to affect the cellular components of the cells, for example, simultaneous infection with one or more viruses to knock out a cellular component of interest.

Perform cellular component expression measurement iteration t_1 on cells in the wells.

Day 1 + l: If any additional processing has been performed, the medium is changed as needed.

If applicable, perform cellular component expression measurement iteration t_l on cells in the wells.

Day 1 + m: The medium is changed to a medium suitable for supporting the growth of cells in the altered cellular state.

If applicable, perform cellular component expression measurement iteration t_m on cells in the wells.

Days 1 + n, o, p, etc.: The medium is changed as necessary to support further cell state transitions from the first cellular state to the altered cellular state. If applicable, additional steps are performed to effect a further transition from the first cellular state to the altered cellular state. For example, perturbations of interest are added to push cells towards the altered cellular state.

If applicable, perform cellular component expression measurement iterations t_n, t_o, t_p, and so on, on cells in the wells.

Day q: Perform cellular component expression measurement iteration t_q on cells in the altered cellular state in the wells.

Plates are fixed and stained with antibodies matching the cellular components/proteins of interest to sort/identify cells without lysing/destroying the cells to be measured. This staining can also be used to identify surface markers that may not be visible at as high a resolution in the cytoplasmic environment. Each well is then scanned with a cell imaging system such as the Molecular Devices HCI IXM4, and the number of cells in each well in the desired altered cellular state is quantified.

Table 2 shows segments of a plurality of data sets 122, including exemplary data that may be collected from single cell expression measurements (e.g., single cell measurements 310 of FIG. 3) of a population of cells at one or more time points. The sample ID column indicates from which sample the data for the cell came (e.g., sample identifier 124-1 of FIG. 1). In practice, the cells in the population may be taken from more than one sample (e.g., first sample identifier 124-1, second sample identifier 124-2, etc.), each of which may be derived from the same or a different subject. The cell or dataset column indicates which cell or data set the data of a given row is associated with (e.g., cell/dataset identifier 126-1 of FIG. 1). The data set 122 may alternatively be represented as a vector of data r_i (e.g., cell component vector 130-1 of FIG. 1). The time period column indicates when, during growth of the cell, the data set for the row was captured, if relevant (e.g., time period 128-1 of FIG. 1).

The remaining columns of Table 2 correspond to the cellular components of interest of the cells (cellular components 132-1-1 through 132-1-B). These may be all of the cellular components of the cell, or only a subset. Each cellular component 132 is associated with a different column. If the data set is represented as a vector r_i, then each cellular component corresponds to one entry in the vector. In some embodiments, each value may be an (integer) count of the amount of the corresponding cellular component as measured by single cell expression measurement, or some normalized (rational) form thereof.

TABLE 2 exemplary data set
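As a simple illustration of the difference between integer counts and a normalized (rational) form, the sketch below divides each count in a vector r_i by the total count for that cell. Dividing by the per-cell total is only one possible normalization and is an assumption here, as are the function name `normalize_counts` and the toy counts.

```python
from fractions import Fraction

def normalize_counts(counts):
    """Return each count as an exact fraction of the cell's total count."""
    total = sum(counts)
    if total == 0:
        return [Fraction(0) for _ in counts]
    return [Fraction(count, total) for count in counts]

r_i = [3, 0, 7, 10]  # toy integer counts for one cell, one entry per component
normalized = normalize_counts(r_i)
print([str(value) for value in normalized])
```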

III. Method of analyzing single cell datasets to determine differential expression of cellular components

III.A. Overview

A cellular state transition (i.e., a transition of a cell from a first cellular state to an altered cellular state) is marked by a change in the expression of cellular components 132 in the cell, and thus by the identity and amount of cellular components (e.g., mRNA, transcription factors) produced by the cell. However, at least at present, cellular state transitions are not well defined because of the complexity of intracellular activity. To gain insight into this complexity, this description applies statistical techniques to the single cell datasets 122, which quantify cellular components 132 in the cells of the cell population, under the following theory: at different stages of the cellular state transition, different cellular component expression, reflected in the presence, absence, or amounts of one or more measured cellular components of interest, provides a high-dimensional dataset (e.g., cellular component vector 130 of FIG. 1) from which meaningful knowledge can be extracted. Here, the high dimensionality of the data derives from the individual cellular component measurements contained in the data set 122. Each cellular component 132 represents a dimension, and the cellular component measurement datasets 122 may collectively have a shape that encodes latent information about the biological process by which a "progenitor" cell transitions to a different cell type. In practice, the number of cellular components 132 may be on the order of thousands to tens of thousands, making the calculations described herein impractical, if not impossible, to perform mentally or manually.

In general, these statistical techniques may be characterized as methods in which high-dimensional data is compressed into a lower-dimensional space while maintaining the shape of any latent information encoded in the dataset (e.g., the cell component vector 130 of FIG. 1 is reduced by dimension reduction 320 to the matrix M of FIG. 3). The low-dimensional data is evaluated to identify cellular components that differ between different stages of the cellular state transition. Since the input data for the method is a single-cell cellular component expression dataset 122 of multiple cellular components of interest on a per-cell basis, the set of differentially expressed cellular components represents those cellular components that have a statistically significant over-representation or under-representation, in terms of presence, absence, or amount, relative to other cellular components of the cell. Any of a number of methods and metrics can be used to identify which cellular components are sufficiently differentially expressed relative to other cellular components to be labeled as "differentially expressed" according to this description. Since the cell population from which data set 122 is obtained can include cells of different types and at different transition stages, knowing which cellular components are differentially present (e.g., differentially expressed) provides insight into which cellular components affect, or are otherwise related to, the expression of cellular components that are active during the transition or another transformation.

III.B. Example

Regardless of the method used, the determination of the cellular components that are differentially expressed may vary depending on the results sought. For example, if the method used identifies a particular cell as being either within or outside of a lineage, the determination of which cell components are differentially expressed can be performed by comparing the expression levels of the cell components of the cells determined to be within the lineage to the expression levels of the cell components of the cells determined to be outside the lineage. The relative expression of these cellular components is indicative of which cellular components are active in one type of cell or another, alone or in combination. As described above, the expression data may be used to identify subsets of cellular components to be tagged as differentially expressed. The causal relationship can then be determined by knocking out the identified cellular components in vitro and assessing whether the cell fate of the experimental cell population is affected by a change in activity of the cellular components.
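As a non-authoritative sketch of the comparison described in this example, the code below compares mean expression inside and outside a lineage using a log2 fold change and tags components whose absolute change meets a cutoff. The choice of log2 fold change, the pseudocount, the cutoff of 1.0, and all names and values are assumptions for illustration; the description leaves the metric open.

```python
from math import log2
from statistics import mean

def log2_fold_change(inside, outside, pseudocount=1.0):
    """Log2 ratio of mean expression inside vs. outside the lineage.
    The pseudocount avoids division by zero and is an assumption."""
    return log2((mean(inside) + pseudocount) / (mean(outside) + pseudocount))

# Toy expression levels per component for cells inside/outside the lineage.
in_lineage = {"GENE_A": [7, 7, 7], "GENE_B": [3, 3], "GENE_C": [0, 0]}
out_lineage = {"GENE_A": [1, 1, 1], "GENE_B": [3, 3], "GENE_C": [3, 3]}

fold_changes = {
    name: log2_fold_change(in_lineage[name], out_lineage[name])
    for name in in_lineage
}
cutoff = 1.0  # hypothetical threshold for tagging a component
tagged = [name for name, change in fold_changes.items() if abs(change) >= cutoff]
print(fold_changes)
print(tagged)  # ['GENE_A', 'GENE_C']
```

The tagged components would then be candidates for the in vitro knockout step used to assess causality.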

As another example, if the method used identifies a particular cell as being within a lineage, while other cells are identified as being "progenitor" or intermediate cells along a transition trajectory to a cell type within the lineage, the determination of which cell components are differentially expressed can be performed by comparing the expression levels of the cell components of the cells determined to be within the lineage to the expression levels of the cell components of the cells determined to be "progenitor" and/or intermediate cells of the cells within the lineage. As described in the preceding paragraph, the relative expression of these cellular components is indicative of which cellular components are active in one type or another of cell, alone or in combination, and the expression data can again be used to identify a subset of cellular components to be tagged as differentially expressed. Also as described above, causal relationships can then be determined by knocking out the identified cellular components in vitro and assessing whether the cell fate of the experimental cell population is affected by changes in the activity of the cellular components.

As another example, the cell population may include two cell subsets, a healthy subset and an unhealthy subset. During cell culture, a variety of different perturbations can be introduced into the unhealthy subpopulation. By subsequent single cell expression measurements in conjunction with the methods described herein, it can be determined what effect the perturbation has on the differential cellular component expression of the cellular components in the unhealthy subpopulation, particularly the effect associated with the healthy subpopulation. For example, a subset of cells from an unhealthy subpopulation exposed to one or more perturbations may exhibit expression of a cellular component consistent with the healthy subpopulation of cells, indicating that the perturbation has a desired effect on the unhealthy subpopulation of cells.

III.C. Determination of differentially expressed cell components using low dimensional data

FIG. 3 is a flow diagram of a first example of a differential cellular component expression assay to determine a collection of differentially expressed cellular components 132, according to one embodiment. Note that FIG. 3 provides a non-limiting, illustrative embodiment of the general case described using differential cellular component expression. At step 310, single cell expression measurements are performed to generate a plurality of data sets 122 of the cell population, as discussed above in section II. As described above, the data set 122 for each cell may be represented as a vector of cell components r_i (e.g., cell component vector 130 of FIG. 1) that includes the amount of each of the cell components (e.g., cell components 132-1-1 through 132-1-B of FIG. 1). The data sets 122 obtained from single cell expression measurements 310 are typically stored in a digital format in a persistent memory (e.g., persistent memory 112 of FIG. 1) of a computing device (e.g., system 100 of FIG. 1); however, they may be loaded into an active memory (e.g., non-persistent memory 111 of FIG. 1) as needed in order to perform the remaining steps described herein. Typically, the remaining steps of the method of FIG. 3 are performed by one or more computing devices (e.g., system 100 of FIG. 1). An exemplary computing device is discussed with reference to FIG. 1; however, in practice, the method of FIG. 3 may include additional intermediate or subsequent steps that are performed outside the computer, such as additional in vitro tests or clinical decisions based on the results of the steps described herein.

III.C.1. Dimensionality reduction

As introduced above, the data set 122 is generally high-dimensional, as each cellular component 132 represents a different dimension of the data. At step 320, dimensionality reduction is performed by the computing device (e.g., system 100) to reduce the dimensionality of the data while maintaining the structure of any latent patterns present in the amounts of the cellular components 132 of the data set 122.

The input to dimension reduction step 320 is typically a matrix, similar to Table 2 above, that concatenates the expression vectors of the individual cells (e.g., cell component vectors 130 of FIG. 1). The output of dimension reduction 320 is a matrix, referred to herein for simplicity as matrix "M," that encodes the original data in compressed form while maintaining the underlying structure of the data. Each row in the matrix M is associated with a particular one of the cells. Each column in the matrix M is associated with a dimension in the reduced-dimension space provided by the dimension reduction. The value of the entry at each row-column pair is determined by the dimension reduction based on the original input data set.

In some embodiments, these dimension reduction techniques result in some lossy compression of the data. However, the resulting output matrix M is smaller in computational memory size and therefore requires less computational processing power to analyze with the downstream techniques discussed in the remaining steps of the method, which makes it computationally feasible to obtain the results of those steps with the current generation of computing devices in a reasonable time.

A variety of dimensionality reduction techniques may be used. Examples include, but are not limited to, Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), Linear Discriminant Analysis (LDA), diffusion maps, or (neural) network techniques such as autoencoders.

Each of the techniques mentioned in this paragraph operates differently to extract the primary drivers of variation and reduce the dimensionality of the original input data, but each technique outputs the matrix M in a lower dimensional space.
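As a minimal sketch of one of the techniques named above (PCA), the following pure-Python example finds the leading principal axis of a small cells-by-components table by power iteration and projects each cell onto it, giving a one-column matrix M. The function name, the toy data, and the choice of a single output dimension are assumptions made for illustration; a practical embodiment would typically retain many dimensions and rely on an optimized numerical library.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, M): the leading PCA axis and a one-column reduced matrix.

    rows: list of per-cell vectors r_i (cells x cellular components).
    Hypothetical helper, for illustration only.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance between cellular components.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # starting guess; assumed not orthogonal to the leading axis
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # One row per cell, one column per retained dimension (here: one).
    M = [[sum(c[j] * v[j] for j in range(d))] for c in centered]
    return v, M

cells = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
direction, M = first_principal_component(cells)
```

In this toy data the second component is always twice the first, so all variation lies along a single axis and one column of M captures it without loss.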

III.C.2. manifold learning

The reduced dimensionality data (e.g., reduced dimensionality component storage region 146) in matrix M is significantly reduced in dimensionality relative to the original high-dimensional data from single cell expression dataset 122. However, the resulting matrix M embeds a non-linear manifold (e.g., manifold 149 of fig. 1). At step 330, a manifold learning technique is applied to the matrix M to extract the manifold. Manifold 149 itself not only provides useful information about differential cell component expression between cells within a pseudo-time, but it can also be used to visualize that information.

The input to the manifold learning step 330 is the matrix M from the dimensionality reduction step 320. The output of manifold learning 330 is another matrix, referred to herein as matrix "N" or manifold (e.g., manifold 149 of fig. 1). The structure of matrix N is such that each row of matrix N corresponds to one of the original cells of the population, referred to herein as a "point" for the remaining steps of the method. In one embodiment, the matrix N has two columns, arbitrarily referred to as the X and Y dimensions, corresponding to the two dimensions that the manifold learning step 330 is configured to output, regardless of the particular manifold learning algorithm used. The X and Y dimensions are determined by the manifold learning step; depending on which manifold learning algorithm is used, the dimensions best suited to the data from matrix M are selected. As shown in FIG. 4B, a manifold with two such columns facilitates visualization. In other embodiments, the manifold matrix N has additional dimensions beyond the two-dimensional form described herein.

An exemplary matrix N is provided in table 3 below. Figure 4B provides a plot of the data from example 1 below in an embodiment using a force-directed layout in the dimension reduction step. The plot in fig. 4B is an example of the results obtained according to the method: in the described and similar exemplary experiments, the points are separated along one or more trajectories in the X/Y plane, where typically "progenitor" cells appear in one general region of the X/Y space, spread towards intermediate cells in another general region of the X/Y space, and end at one or more different regions of the X/Y space, which in practice are typically verified as either intra-lineage transformed cells or extra-lineage transformed cells. In general, the number of regions and trajectories identified depends on the type of "progenitor" cells and the type of cells into which the "progenitor" cells are known to transform. Furthermore, the regions of points typically have some amount of spread between them, indicating that the cells are at different stages of progression during the transition.

TABLE 3 output matrix N

A variety of manifold learning techniques may be applied to matrix M to generate matrix N. Examples include, but are not limited to, force-directed layouts (Fruchterman, T.M. and Reingold, E.M. (1991). Graph drawing by force-directed placement. Software: Practice and Experience, 21(11), 1129) (e.g., Force Atlas 2), t-distributed stochastic neighbor embedding (t-SNE), locally linear embedding (Roweis, S.T. and Saul, L.K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323), and local linear isometric mapping (ISOMAP; Tenenbaum, J.B., De Silva, V. and Langford, J.C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319). Discriminant analysis can be used, especially where some information about the specific cell type of each cell is known in advance. Force-directed layouts are useful in various particular embodiments because they are able to identify new, lower dimensions that encode non-linear aspects of the underlying data caused by underlying biological processes, such as cellular state transitions. The force-directed layout uses a physics-based model as a mechanism for determining the reduced dimensionality that best represents the data. As an example, the force-directed layout uses a form of physical simulation in which, in this embodiment, each cell/data set in the collection is assigned a "repulsive" force, and there is a global "gravitational force" that, when calculated over all cells, identifies sectors of the data that "spread" together under these competing "forces". The force-directed layout makes few assumptions about the structure of the data and does not impose a denoising approach.

Note that performing manifold learning 330 is an optional step. In some embodiments, no manifold learning is performed.

III.C.3. clustering

In step 340, clustering is performed to generate j clusters C_j to identify a pattern of locations of points in the low-dimensional space provided by dimension reduction 320 (e.g., corresponding to a subset of the associated plurality of dimension reduction vectors 146). These clusters are used to group similar points (cells/data sets) in order to extract statistically relevant information about groups of points (e.g., first cluster, second cluster, etc.) that are similar to each other in the low-dimensional space. Table 4 below shows an exemplary set of points and cluster assignments that may be output by clustering 340.

TABLE 4 clustering assignment

Any of a number of clustering techniques may be used, examples of which include, but are not limited to, hierarchical clustering, k-means clustering, and density-based clustering. In one particular embodiment, a hierarchical density-based clustering algorithm is used (referred to as HDBSCAN; Campello, R.J., Moulavi, D., Zimek, A. and Sander, J. (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(1), 5). In another embodiment, a clustering algorithm based on community detection is used, such as Louvain clustering (Blondel, V.D., Guillaume, J.L., Lambiotte, R. and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008).

For clustering, these techniques use the data of matrix M to determine clusters. Regardless of the algorithm, in general, points closer to each other in the multidimensional space of the matrix M are more likely to be assigned to the same cluster, while points further away from each other are less likely to be assigned to the same cluster. Fig. 7A provides a plot of the exemplary data from fig. 4B, with cluster assignments 1-10 indicated with different visual markers for each point. The number of clusters may be set or constrained by an operator and/or determined dynamically based on the algorithm used.
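To illustrate how points that are close together in the space of matrix M end up in the same cluster, the following is a minimal sketch of k-means clustering, one of the techniques listed above, written in plain Python. The deterministic seeding with the first k points, the fixed iteration count, and the toy matrix are assumptions made for reproducibility; they are not taken from the disclosure, which favors HDBSCAN or Louvain clustering in particular embodiments.

```python
def kmeans(points, k, iterations=20):
    """Assign each row of M to one of k clusters (plain Lloyd's algorithm)."""
    # Assumption: the first k points seed the centroids, for reproducibility.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy matrix M: five cells in a two-dimensional reduced space.
M = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2]]
labels = kmeans(M, k=2)
```

Here the number of clusters is set by the operator through k; density-based methods instead determine it from the data.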

III.C.4. determination of differential cell component expression

The dimensionality reduction 320, optional manifold learning 330, and clustering 340 steps are generally used to organize the cells of the population and their corresponding single-cell expression dataset 122 into clusters within a dimensionality reduction space so that the underlying per-cell component expression measurement data can be aggregated and analyzed to extract meaningful information. In some embodiments, such a dimension reduction space further reduces the amount of time and/or processing power required to complete the methods of the present disclosure.

One piece of information that can be obtained from this aggregation is which cellular components are differentially expressed in some cells of the population relative to other cells. This set of cellular components is referred to herein as the set of differentially expressed cellular components E_k, as discussed in step 350 of fig. 3. Some exemplary use cases for generating a set of differentially expressed cellular components are discussed above in section III.B.

There are many ways to use the clusters C_j and the data set information to determine the set of differentially expressed cellular components. In one embodiment, whether a given cellular component (e.g., cellular component A) is differentially expressed is determined by evaluating the amount of cellular component A for the points (cells) in a given cluster C₁ relative to the amount of cellular component A for the points in one or more other clusters C_m, where m is not equal to 1. Normalization may also be used. For example, the overall expression level of cellular components may vary from cell to cell for reasons independent of the biology of the cellular state transition. Thus, the cell component amounts can be normalized based on the total cell component amount of each cell in the data set.
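The per-cell normalization described above can be sketched as follows. This is a minimal example, assuming each cell is a list of raw cell component amounts; the function name, the `scale` parameter, and the handling of all-zero cells are hypothetical choices, not details from the disclosure.

```python
def normalize_by_cell_total(cell_vectors, scale=1.0):
    """Divide each cell component amount by the total amount measured in that cell."""
    normalized = []
    for r in cell_vectors:
        total = sum(r)
        # Assumption: cells with no measured components are left as zeros.
        normalized.append([scale * x / total if total else 0.0 for x in r])
    return normalized

# Two cells with the same composition but a tenfold difference in overall amount,
# plus one empty cell.
raw = [[10, 30, 60], [1, 3, 6], [0, 0, 0]]
norm = normalize_by_cell_total(raw)
```

After normalization the first two cells have identical vectors, so a difference in overall amount alone no longer looks like differential expression.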

As discussed in section III.B above, which other cluster or clusters have their amount of cellular component A compared against that of a given cluster C₁ may vary depending on the embodiment. Other clusters for comparison may be the cluster most strongly associated with cell types within the lineage, the cluster most strongly associated with cell types outside the lineage, the cluster most strongly associated with the "progenitor" cell type, the cluster most strongly associated with intermediate cell types, and so forth. The comparison may also be made against more than one other cluster.

In view of the comparison, the cellular component A can be identified as differentially expressed according to any of a number of metrics, such as a total cell component amount for each cluster (again, over all points in the cluster, or using some aggregation metric, such as an average), a normalized cell component amount for each cluster, a median, average, or other aggregated cell component amount for each cluster, a ratio of expression relative to the cell component amounts of other cellular components, and so forth. In one embodiment, the criterion for establishing differential expression of cellular component A is a threshold requirement.

For example, the normalized cell component amount of cellular component A in cluster C₁ may be required to exceed the normalized cell component amount of cellular component A in one or more other clusters C_m by at least a threshold value.
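A threshold criterion of this kind might be sketched as below. The example assumes that the aggregation metric is the mean normalized amount per cluster and that all other clusters are pooled for the comparison; both choices, along with the function and variable names, are illustrative assumptions.

```python
from statistics import mean

def differentially_expressed(amounts, labels, target, threshold):
    """Return indices of cellular components whose mean normalized amount in
    cluster `target` exceeds the mean over all other clusters by >= threshold."""
    inside = [r for r, lab in zip(amounts, labels) if lab == target]
    outside = [r for r, lab in zip(amounts, labels) if lab != target]
    hits = []
    for k in range(len(amounts[0])):
        diff = mean(r[k] for r in inside) - mean(r[k] for r in outside)
        if diff >= threshold:
            hits.append(k)
    return hits

# Four cells, two cellular components; cells 0-1 are in cluster 1, cells 2-3 in cluster 2.
amounts = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.25], [0.2, 0.25]]
labels = [1, 1, 2, 2]
E_k = differentially_expressed(amounts, labels, target=1, threshold=0.5)
```

Only the first component clears the threshold here; the second has the same mean in both clusters.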

The determination of the differentially expressed cellular components may also be relative. In one embodiment, a normalized cell component quantity for a plurality of cell component/cluster combinations, a distance metric for a plurality of cell component/cluster combinations, or other similar metric may be calculated. These metrics may be ranked according to ranking criteria (e.g., the highest normalized cell component quantity in a cluster), and the top ranked cell component or cell component/cluster combination may be determined as a differentially expressed cell component.

In one embodiment, the amount of a given cellular component in a given cluster can be used to identify which cellular components are differentially expressed. In one embodiment, the differentially expressed cellular components are identified using one of a mean difference test, a Wilcoxon rank sum test (Mann-Whitney U test), a t test, logistic regression, and a generalized linear model.
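As a rough illustration of the rank-based option, the following sketch computes only the Mann-Whitney U statistic for the amounts of one cellular component in one cluster versus the other cells. Computing a p-value from U (exactly or by normal approximation) is deliberately omitted, and the sample values are invented for the example.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y (ties count one half)."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

cluster_1 = [5.1, 6.3, 7.0, 6.8]   # amounts of component A in cluster C_1 (made up)
other = [1.2, 2.4, 5.1]            # amounts of component A in other clusters (made up)
u = mann_whitney_u(cluster_1, other)
# U lies between 0 and len(x) * len(y); values near either end suggest a shift.
effect = u / (len(cluster_1) * len(other))
```

In practice a statistics library would normally be used to obtain the test's p-value, which could then serve as the score for the component.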

One skilled in the art will appreciate that other metrics relating to the amount of cell components per cell component/cluster combination are possible.

III.C.5. post-processing

The set of differentially expressed cellular components E_k itself represents a useful output. However, it may be useful to further analyze 360 the set of differentially expressed cellular components to identify a subset of the set.

In one embodiment, a set of differentially expressed cellular components is screened against a transcription factor database (e.g., the characteristic storage region 140 of fig. 1) to identify a set of transcription factors associated with the cellular components present in the set. As an example, the information can be obtained from the ChIP-seq dataset (information about which transcription factors bind to which regions of DNA, which is aligned to cellular components).

The data sets 122 discussed herein for a particular cell, such as the raw input data set r (e.g., data set 122-1 of FIG. 1) or the corresponding data for the set of differentially expressed cellular components E_k, may be missing cell component amounts for a number of reasons (e.g., technical noise, dropout, low cell component amounts, etc.). Given these and any additional confounding factors, a simple model may be fitted to the data set.

Prediction of perturbation affecting a transition in cell state

By matching the differential cellular component expression that characterizes a particular cellular transition to the differential cellular component expression caused by exposure of the cell to a perturbation, perturbations that affect a particular cellular state transition can be predicted. Cell perturbation includes any treatment of a cell with one or more compounds. The one or more compounds can include, for example, a small molecule, a biological agent, a protein in combination with a small molecule, an ADC, a nucleic acid (such as an siRNA or interfering RNA), a cDNA that overexpresses a wild-type and/or mutant shRNA, a cDNA that overexpresses a wild-type and/or mutant guide RNA (e.g., Cas9 system or other cellular component editing system), or any combination of any of the foregoing. Differentially expressed cellular components of a particular cell transition can be compared to those caused by exposure of the cell to a perturbation. Perturbations that cause differential cellular component expression matching that of a particular cellular transition can then be predicted to affect that cellular transition.

To predict the perturbation that affects a particular cellular transition by matching the differential cellular component expression that characterizes the transition with the differential cellular component expression that results from exposure of the cell to the perturbation, first, the cellular components whose expression differs most across the particular cellular transition are identified. In some embodiments, these differentially expressed cellular components are identified using one of a mean difference test, a Wilcoxon rank sum test (Mann-Whitney U test), a t test, logistic regression, and a generalized linear model. In alternative embodiments, any statistical method may be used to identify the cellular components whose expression differs most for a particular cell transition. The resulting sorted list of cellular component 132 names and prominence scores 134 may also be referred to as a 'single cell transition feature' (e.g., single cell transition feature 142 of fig. 1). The prominence score 134 for each cellular component 132 quantifies the association between the change in expression of the cellular component and the change in cell type between the original cell type and the transformed cell type. Taken together, these scores 134 form an overall measure of the differential cellular component expression associated with the transition between the original cell type (first cellular state) and the transformed cell type (altered cellular state).

Similarly, differential cellular component expression caused by exposure of the cells to the perturbation is identified for one or more perturbations. In some embodiments, to identify differential cellular component expression caused by exposure of cells to a perturbation, the cellular component expression in cells exposed to the perturbation is compared to an average of cellular component expression in one or more control cells that have not been exposed to the perturbation or in samples exposed to an unrelated perturbation (e.g., post-processing 360 of fig. 3). In some embodiments, the comparison is performed using one of a mean difference test, a Wilcoxon rank sum test (Mann-Whitney U test), a t test, logistic regression, and a generalized linear model. In alternative embodiments, any statistical method may be used to perform the comparison. In a still further alternative embodiment, differential cellular component expression caused by exposure of the cells to the perturbation may be known and identified from the literature. The resulting similarly sorted list of cellular component names and prominence scores may be referred to as a 'perturbation feature'.
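A perturbation feature built from a simple mean difference against control cells could look like the sketch below. It assumes the score is the raw difference of means rather than a p-value or t-score, and the component names and amounts are invented; any of the statistical tests listed above could be used in place of the mean difference.

```python
from statistics import mean

def perturbation_signature(names, perturbed_cells, control_cells):
    """Return a sorted list of (component name, score), where score is the mean
    amount in perturbed cells minus the mean amount in control cells."""
    sig = []
    for k, name in enumerate(names):
        score = mean(c[k] for c in perturbed_cells) - mean(c[k] for c in control_cells)
        sig.append((name, score))
    # Largest absolute change first; ties keep their original order.
    return sorted(sig, key=lambda item: abs(item[1]), reverse=True)

names = ["A", "B", "C"]                          # hypothetical component names
control = [[1.0, 4.0, 2.0], [1.0, 6.0, 2.0]]     # unperturbed cells
perturbed = [[3.0, 2.0, 2.0], [5.0, 2.0, 2.0]]   # cells exposed to the perturbation
signature = perturbation_signature(names, perturbed, control)
```

A positive score marks a component that increases on exposure, a negative score one that decreases, which is the direction information the later matching step relies on.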

In some embodiments, to reduce confounding due to technical variation, differences between experimental assays, and other variables in identifying the single cell transition feature and the perturbation feature, one or both of the features are filtered to include only transcription factors, which are proteins known to drive expression of certain cellular components. These transcription factors can be identified, for example, from the literature.

In some embodiments, to further reduce confounding due to technical variation and ambiguity in cell transitions, the list of most differentially expressed cellular components of one or both features is truncated (or filtered or subset) at a given p-value and/or at a threshold number of cellular components. The resulting truncated sets of differentially expressed cellular components for the cellular transition and for the perturbation exposure are unordered and may contain between 10 and 25 cellular components, or more or fewer, depending on the embodiment.
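The truncation step can be sketched as follows, assuming each feature is a list of (name, p-value, direction) tuples. The tuple layout, the default cutoffs, and the function name are hypothetical; the disclosure only specifies truncation at a given p-value and/or a threshold number of cellular components.

```python
def truncate_signature(signature, p_max=0.05, top_n=25):
    """Keep at most top_n components with p_value <= p_max.

    signature: list of (name, p_value, direction) tuples.
    Returns an unordered set of (name, direction) pairs.
    """
    kept = sorted((s for s in signature if s[1] <= p_max), key=lambda s: s[1])
    return {(name, direction) for name, _, direction in kept[:top_n]}

signature = [
    ("A", 0.001, "up"),
    ("B", 0.2, "down"),
    ("C", 0.03, "down"),
    ("D", 0.01, "up"),
]
truncated = truncate_signature(signature, p_max=0.05, top_n=2)
```

Returning a set reflects the statement above that the truncated collection is unordered.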

After identification and any processing of one or both features (e.g., single cell transition feature 142 and/or perturbation feature 150 of fig. 1), the differentially expressed cellular components of single cell transition feature 142 are compared to the differentially expressed cellular components of perturbation feature 150. In one embodiment, to perform the comparison, the differentially expressed cellular components of the perturbations are represented as a matrix (e.g., matrix M of fig. 3, cell component vector 130 of fig. 1, etc.). Each row of the matrix is associated with a single perturbation. Each column of the matrix is associated with one of the differentially expressed cellular components. Each entry in the matrix includes a prominence score 134 (e.g., p-value, t-score) for a differentially expressed cellular component 132 identified for a particular perturbation. The matrix is subset to include only the differentially expressed cellular components identified for single cell transition feature 142. Such filtering can be accomplished using the methods described in the preceding paragraph (e.g., by a threshold p-value, by a threshold number of cellular components, etc.).

Each prominence score 134 in the matrix is replaced with a discrete match score. To replace each prominence score with a discrete match score, the significantly upregulated cellular components 132 of the cellular transition and the significantly downregulated cellular components of the cellular transition are identified. For each of the significantly upregulated cellular components identified in single cell transition feature 142, if the cellular component is also significantly upregulated in the perturbation feature 150 of the perturbation, the prominence score in the matrix for that cellular component/perturbation combination is replaced with a discrete match score of '1'. If the cellular component is instead significantly downregulated in the perturbation feature, the prominence score in the matrix for that cellular component/perturbation combination is replaced with a discrete match score of '-2'. If the cellular component is not significantly up- or downregulated in the perturbation feature, the prominence score in the matrix for that cellular component/perturbation combination is replaced with a discrete match score of '0'.

Conversely, for each of the significantly downregulated cellular components identified in the single cell transition feature, if the cellular component is also significantly downregulated for the perturbation, the prominence score in the matrix for that cellular component/perturbation combination is replaced with a discrete match score of '-1'. If the cellular component is significantly upregulated for the perturbation, the prominence score in the matrix for that cellular component/perturbation combination is replaced with a discrete match score of '2'. If the cellular component is not significantly up- or downregulated for the perturbation, the prominence score in the matrix for that cellular component/perturbation combination is replaced with a discrete match score of '0'. Those skilled in the art will appreciate that in some embodiments, these particular score values may be replaced with other numerical values.
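The replacement of scores by discrete match scores can be sketched as a table lookup. The score values below are copied from the text as written and, as noted there, may be replaced with other numerical values. Everything else is an assumption for illustration: each feature is taken to be a dictionary mapping a component name to a signed score and a p-value, significance is decided by a hypothetical cutoff `alpha`, and a component absent from the perturbation feature is treated as not significant.

```python
# Discrete match scores, keyed by (direction in transition feature,
# direction in perturbation feature). Values as stated in the text.
MATCH_SCORES = {
    ("up", "up"): 1, ("up", "down"): -2, ("up", "none"): 0,
    ("down", "down"): -1, ("down", "up"): 2, ("down", "none"): 0,
}

def direction(score, p_value, alpha=0.05):
    """Classify a component as 'up', 'down', or 'none' (not significant)."""
    if p_value > alpha or score == 0:
        return "none"
    return "up" if score > 0 else "down"

def match_row(transition, perturbation, alpha=0.05):
    """Build one matrix row of discrete match scores for one perturbation.

    transition, perturbation: dicts of name -> (signed score, p_value).
    """
    row = {}
    for name, (t_score, t_p) in transition.items():
        t_dir = direction(t_score, t_p, alpha)
        if t_dir == "none":
            continue  # only significant transition components form columns
        p_score, p_p = perturbation.get(name, (0.0, 1.0))
        row[name] = MATCH_SCORES[(t_dir, direction(p_score, p_p, alpha))]
    return row

transition = {"A": (2.0, 0.01), "B": (-1.5, 0.02), "C": (0.3, 0.6)}
perturbation = {"A": (1.0, 0.03), "B": (0.9, 0.01)}
row = match_row(transition, perturbation)
```

Keeping the score values in one table makes it straightforward to substitute a different scoring scheme without touching the matching logic.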

The result is a matrix in which the number of rows is given by the number of perturbations, the number of columns is given by the number of differentially expressed cellular components from the single cell transition feature, and the entries are the discrete match scores described above.

After replacing the prominence scores in the matrix with discrete match scores as described above, the discrete match scores in each row of the matrix are summed to generate a summed match score for each row. The rows of the matrix, each of which corresponds to a perturbation, are then sorted in order of decreasing summed match score. The top-ranked row is associated with the perturbation most likely to be associated with the cellular transition identified by the single cell transition feature.
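Summing and ranking the rows can be illustrated with a short, self-contained sketch. The perturbation names and the match score values in the toy matrix are made up for the example.

```python
def rank_perturbations(match_matrix):
    """Sum each row of discrete match scores and sort rows by decreasing total.

    match_matrix: dict of perturbation name -> list of discrete match scores.
    Returns a list of (name, summed match score) pairs.
    """
    totals = [(name, sum(scores)) for name, scores in match_matrix.items()]
    return sorted(totals, key=lambda item: item[1], reverse=True)

matrix = {
    "perturbation_1": [1, 0, -2, 1],
    "perturbation_2": [1, 1, 0, 1],
    "perturbation_3": [0, 0, 0, 0],
}
ranking = rank_perturbations(matrix)
```

The first entry of the ranking is the perturbation with the highest summed match score; rows with equal totals keep their original relative order.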

In some embodiments, a false cell component discovery rate is estimated for the summed match score of each row in the matrix. To estimate the false cell component discovery rate, the empirical marginal expression frequency of each cellular component is calculated, and these frequencies are combined across the cellular components, which, assuming independently distributed expression, gives the probability of identifying a given number of cellular components by chance (that is, the likelihood of observing expression at least as rare as that seen in the data set used to generate the features). This probability can then be used to calculate the false cell component discovery rate.

In certain embodiments, a perturbation may have covariates. For example, if the perturbation is a small molecule, the covariates of the small molecule may include a specific dose of the small molecule, the time after exposure of the cell to the small molecule at which cellular components are quantified, and/or the identity of the cell exposed to the small molecule (e.g., a cell line). In some embodiments, the perturbation is predicted to affect a particular cell transition only if a threshold number of the covariates of the perturbation are also predicted to affect the particular cell transition. For example, a perturbation may be predicted to affect a particular cell transition only if at least two of the covariates of the perturbation are also predicted to affect the particular cell transition.
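The covariate threshold can be sketched as a simple count. This example assumes each covariate setting (dose, time, cell line) has already been scored separately and reduced to a true/false prediction; the dictionary layout, the labels, and the function name are hypothetical.

```python
def perturbation_predicted(covariate_hits, min_covariates=2):
    """Return True if at least min_covariates covariate settings of a
    perturbation were individually predicted to affect the transition.

    covariate_hits: dict of covariate setting -> bool.
    """
    return sum(1 for hit in covariate_hits.values() if hit) >= min_covariates

# Hypothetical (dose, time, cell line) settings for one small molecule.
hits = {
    ("10 uM", "24 h", "cell line X"): True,
    ("1 uM", "24 h", "cell line X"): True,
    ("10 uM", "6 h", "cell line Y"): False,
}
predicted = perturbation_predicted(hits)
```

With the default threshold of two, this small molecule is predicted to affect the transition; raising `min_covariates` to three would exclude it.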

Alternative matching methods may be used. For example, a web interface (e.g., L1000CDS2, an ultra-fast LINCS L1000 characteristic direction signature search engine, amp.pharm.mssm.edu/L1000CDS2/#/index on the World Wide Web) may be used to match cellular components to a database. This matching method does not perform as well as the matching method described in the previous paragraphs, which yields much more sensitive results, is more scalable, covers much more data (millions of samples, rather than tens of thousands of samples), takes into account significant overlap, disregards significant inconsistencies, and ignores insignificant information in the features.

Since the expression of cellular components for a particular single cell state transition is highly variable and since the expression of cellular components affected by perturbations is highly variable, it may be difficult to find perturbations that match a particular single cell state transition. To alleviate this problem, in some alternative embodiments, matching and subsequent identification of perturbations that affect the transition of cell states along a particular trajectory may be performed by a trained neural network model.

Examples of using the methods described above to identify perturbations that affect a particular cellular state transition are provided in section IV.E below.

Method for identifying the biological utility of a perturbation

In some embodiments, the disclosed methods are used to identify the biological utility of a perturbation. These methods encompass the measurement of any cellular component (or combination of different cellular components) that can show differential presence in cells having different states or phenotypes (e.g., diseased and normal phenotypes). That is, the presence, absence, or amount of a cellular component is associated with the cellular state or phenotype. In one embodiment, the method comprises: exposing a plurality of cells to a perturbation; performing a first differential cellular component expression assay, the assay comprising accessing a first plurality of single-cell expression datasets obtained from the plurality of cells before and after exposure of the cells to the perturbation, each dataset comprising a vector r_i of cellular components, each entry in the vector being associated with one of a plurality of cellular components, and the value of each entry representing the amount of that cellular component in the cell; applying a statistical technique to the first plurality of data sets to generate a set of cellular components E_k differentially expressed in response to exposure to the perturbation; and determining a level of similarity between the set of cellular components E_k differentially expressed in response to exposure to the perturbation and a set of differentially expressed cellular components E_l associated with a difference between the diseased cell phenotype and the normal cell phenotype, wherein a significant level of similarity between E_k and E_l indicates the utility of the perturbation in the transition of the cell between the diseased cell phenotype and the normal cell phenotype.

In some embodiments, applying the statistical technique comprises: performing a dimension reduction (e.g., dimension reduction 320 of fig. 3) on the first plurality of data sets 132 to generate a first matrix M comprising rows in the first dimension and columns in the second dimension, the values of the matrix M comprising values generated from the amounts of cellular components located at points in the first and second dimensions; performing clustering to generate clusters C_j, each cluster comprising a plurality of points corresponding to a subset of the rows in the first matrix M and their corresponding cellular response states; and using the clusters C_j to determine the set of cellular components E_k differentially expressed by the cells in response to exposure to the perturbation.

In some embodiments, the set of differentially expressed cellular components E_l associated with a difference between a diseased cell phenotype and a normal cell phenotype can be determined by performing a second differential cellular component expression assay, the assay comprising accessing a second plurality of single-cell cellular component expression datasets obtained from a plurality of cells in different states, such as normal cells and diseased cells, each dataset comprising a vector r_i of cellular components, each entry in the vector being associated with one of a plurality of cellular components, and the value of each entry representing the amount of that cellular component in the cell; and applying a statistical technique to the second plurality of data sets.

In some embodiments, applying the statistical technique to the second plurality of data sets comprises: performing dimension reduction on the second plurality of data sets to generate a second matrix M comprising rows in the first dimension and columns in the second dimension, the values of the second matrix M comprising values generated from the amounts of the one or more cellular components located at points in the first dimension space and the second dimension space; performing manifold learning with the second matrix M under a relative similarity approximation of points to create a second matrix N comprising a plurality of rows, each row corresponding to one of the cells, and two columns, each column corresponding to one of two dimensions in a two-dimensional space, the values of the second matrix N indicating relative differences in cell phenotype between each cell and each other cell based on the dataset; performing clustering to generate a second set of clusters C_j, each cluster comprising a plurality of points corresponding to a subset of the rows in the matrix N and their corresponding cellular response states; and using the second set of clusters C_j to determine the set of differentially expressed cellular components E_l of the cells that is indicative of a difference between a diseased cell phenotype and a normal cell phenotype.

In some embodiments, the perturbation is known to have an acceptable human safety profile as determined by results obtained in regulatory clinical trials.

In some embodiments, the diseased cell phenotype is identified by a difference between the diseased cell and the normal cell. For example, in some embodiments, identification may be by: loss of cell function, gain of cell function, progression of cells (e.g., transition of cells to a differentiated state), arrest of cells (e.g., failure of cells to transition to a differentiated state), invasion of cells (e.g., presence of cells in abnormal locations), disappearance of cells (e.g., absence of cells in locations where cells normally exist), disorder of cells (e.g., structural, morphological, and/or spatial changes within and/or around cells), loss of cell network (e.g., cellular changes that eliminate normal effects in progeny cells or cells downstream of the cell), gain of cell network (e.g., cellular changes that trigger new downstream effects in progeny cells or cells downstream of the cell), excess of cells (e.g., overabundance of cells), deficiency of cells (e.g., cell density below a critical threshold), differences in cellular component ratios and/or amounts in cells, a difference in turnover rate in a cell, or any combination thereof.

In some embodiments, the diseased cells include cell lines, biopsy sample cells, and cultured primary cells. In some embodiments, normal cells include cultured primary cells and biopsy sample cells. In some embodiments, the cell is a human cell.

In some embodiments, the methods are used to select perturbations that can be used to treat a disease based on the indicated utility identified using the methods described above. In some embodiments, the method comprises treating a subject having a disease by administering to the subject an effective amount of a selected perturbation or of a drug substance developed from a lead compound of the perturbation.

Detailed description of the preferred embodiments

Embodiment 1. a method comprising the steps of: accessing a plurality of single-cell cellular component expression datasets, each dataset obtained from one cell of a plurality of cells that have been transformed from the same "progenitor" cell type, each dataset comprising a cellular component vector r_i, each entry in the vector being associated with one of a plurality of cellular components, and the value of each entry representing the amount of the cellular component in the cell; performing dimension reduction on the datasets to generate a matrix M comprising rows in a first dimension and columns in a second dimension, each row corresponding to one cell of the plurality of cells, the values of the matrix M comprising values generated from the amounts of the cellular components located at points in the first dimension space and the second dimension space; performing clustering to generate a set of clusters C_j, each cluster comprising a plurality of points corresponding to a subset of the rows in the matrix M and their corresponding cells; and using the set of clusters C_j to determine a set of differentially expressed cellular components E_k of the cells.

Embodiment 2. the method of embodiment 1, further comprising performing manifold learning with the matrix M under a relative similarity approximation of points to create a matrix N comprising a plurality of rows and two columns, each row corresponding to one of the plurality of cells and each column corresponding to one of two dimensions in a two-dimensional space, the values of the matrix N indicating the relative cell type of each cell with respect to each other cell based on the dataset.

Embodiment 3. the method of any one of embodiments 1 to 2, wherein when the single cell cellular component expression dataset is obtained, the cells are a heterogeneous population of cells having various cell types.

Embodiment 4. the method of any one of embodiments 1 to 2, wherein the cells are a substantially homogeneous population of cells having the "progenitor" cell type; and wherein the single-cell cellular component expression dataset is obtained at each of a plurality of time points when the cells are transformed from the "progenitor" cell type, such that a different dataset of the plurality of datasets is collected for each cell and time point combination.

Embodiment 5 the method of embodiment 4, wherein the plurality of time points comprises at least three time points.

Embodiment 6. the method of any one of embodiments 4 to 5, wherein said plurality of time points comprises "progenitor" time points at which a substantial portion of said cells have not been converted from said "progenitor" cell type.

Embodiment 7 the method of any one of embodiments 4 to 6, wherein said plurality of time points comprises transition time points at which a substantial portion of cells have transitioned from said "progenitor" cell type.

Embodiment 8 the method of any one of embodiments 4 to 7, wherein said plurality of time points comprises intermediate time points at which at least a substantial portion of cells have been at least partially converted from said "progenitor" cell type.

Embodiment 9 the method of any one of embodiments 1 to 8, wherein the plurality of cell components is selected from the group consisting of: nucleic acids, proteins, lipids, carbohydrates, nucleotides, and any combination thereof.

Embodiment 10 the method of embodiment 9, wherein the nucleic acid is selected from the group consisting of DNA and RNA.

Embodiment 11 the method of embodiment 10, wherein the RNA is selected from the group consisting of coding RNA and non-coding RNA.

Embodiment 12. the method of any one of embodiments 1 to 11, wherein said single cell cellular component expression dataset is generated using a method selected from the group consisting of: single-cell ribonucleic acid (RNA) sequencing (scRNA-seq), scTag-seq, single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), CyTOF/SCoP, E-MS/Abseq, miRNA-seq, CITE-seq, and any combination thereof, as well as summaries thereof, including combinations, such as linear combinations, representing activation pathways in the single cell cellular component expression dataset.

Embodiment 13 the method of any one of embodiments 1 to 12, wherein performing dimension reduction comprises performing Principal Component Analysis (PCA) on the single cell cellular component expression dataset to generate the matrix M.

Embodiment 14. the method of any one of embodiments 1 to 13, wherein performing dimension reduction comprises using a diffusion map on the single cell cellular component expression dataset to generate the matrix M.

Embodiment 15 the method of any one of embodiments 1 to 14, wherein performing dimension reduction comprises using a neural network autoencoder on the single cell cellular constituent expression dataset to generate the matrix M.

Embodiment 16 the method of embodiment 2, wherein performing manifold learning comprises estimating a geometry of data in the matrix M to create the matrix N.

Embodiment 17. the method of embodiment 16, wherein performing manifold learning comprises performing locally linear embedding (LLE).

Embodiment 18. the method of embodiment 16, wherein performing manifold learning comprises performing isometric mapping (ISOMAP).

Embodiment 19. the method of embodiment 16, wherein performing manifold learning comprises performing t-distributed stochastic neighbor embedding (t-SNE).

Embodiment 20. the method of embodiment 16, wherein performing manifold learning comprises performing potential of heat-diffusion for affinity-based trajectory embedding (PHATE).

Embodiment 21. the method of embodiment 16, wherein performing manifold learning comprises performing Uniform Manifold Approximation and Projection (UMAP).

Embodiment 22 the method of embodiment 16, wherein performing manifold learning comprises creating a force-directed layout.

Embodiment 23. the method of embodiment 22, wherein the force-directed layout is created using the ForceAtlas2 algorithm.

Embodiment 24 the method of any one of embodiments 1 to 23, wherein clustering is performed assuming no a priori knowledge about the organization of the plurality of points in each cluster.

Embodiment 25. the method of any one of embodiments 1 to 24, wherein performing clustering comprises performing HDBSCAN to generate the set of clusters C_j.

Embodiment 26. the method of any one of embodiments 1 to 25, wherein performing clustering comprises performing Louvain community detection to generate the set of clusters C_j.

Embodiment 27. the method of any one of embodiments 1 to 26, wherein performing clustering comprises assigning each point to one of the clusters C_j based on the time point at which the single cell cellular component expression dataset associated with said point was collected.

Embodiment 28 the method of any one of embodiments 1 to 27, wherein performing clustering comprises analyzing the plurality of points using a diffusion path algorithm that assigns points to clusters based on a measure of how well the points are ends of the clusters.

Embodiment 29. the method of any one of embodiments 1 to 28, wherein determining the set of differentially expressed cellular components E_k comprises: for each cellular component, for at least one of the clusters, comparing the amount of the cellular component of the plurality of points in the at least one cluster to the amount of the cellular component of the plurality of points in at least one other cluster; and responsive to the amount of the cellular component of the plurality of points in the at least one cluster being greater than a threshold level of the amount of the cellular component of the plurality of points in the at least one other cluster, adding the cellular component to the set of differentially expressed cellular components E_k.
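The threshold comparison of embodiment 29 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: each cluster is summarized by the mean amount of a component, and the "threshold level" is taken to be a minimum ratio of means. The function name `differentially_expressed`, the parameter `min_ratio`, and the amounts are hypothetical.

```python
from statistics import mean

def differentially_expressed(cluster_a, cluster_b, min_ratio=2.0):
    """Return components whose mean amount in cluster_a is at least
    min_ratio times the mean amount in cluster_b.

    cluster_a, cluster_b: dict mapping component name -> list of per-cell amounts.
    min_ratio: hypothetical threshold (the source does not fix a value).
    """
    selected = set()
    for component in cluster_a.keys() & cluster_b.keys():
        mean_a = mean(cluster_a[component])
        mean_b = mean(cluster_b[component])
        # A small pseudocount avoids division by zero for unexpressed components.
        if (mean_a + 1e-9) / (mean_b + 1e-9) >= min_ratio:
            selected.add(component)
    return selected

# Hypothetical amounts for three components in two clusters.
cluster_1 = {"Ascl1": [8.0, 9.0, 10.0], "Fos": [1.0, 1.0, 2.0], "Actb": [5.0, 5.0, 5.0]}
cluster_2 = {"Ascl1": [1.0, 2.0, 1.5], "Fos": [6.0, 7.0, 8.0], "Actb": [5.0, 4.0, 6.0]}
e_k = differentially_expressed(cluster_1, cluster_2)
```

With these made-up values, only Ascl1 passes the ratio threshold in the first cluster relative to the second; swapping the arguments would select Fos instead.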

Embodiment 30. the method of embodiment 29, wherein said at least one cluster comprises a cluster of the set of clusters C_j that contains a plurality of points identifiable as having a desired cell type.

Embodiment 31. the method of embodiment 30, wherein said at least one other cluster comprises an out-of-lineage cluster of the set of clusters C_j that contains points identifiable as having undesirable cell types.

Embodiment 32. the method of any one of embodiments 1 to 31, wherein determining the set of differentially expressed cellular components E_k comprises: for each cellular component, calculating, for at least one of the clusters, a distance metric between the amount of the cellular component of the plurality of points in the at least one cluster and the amount of the cellular component of the plurality of points in the at least one other cluster; and responsive to the distance metric being statistically significant, adding the cellular component to the set of differentially expressed cellular components E_k.

Embodiment 33. the method of any one of embodiments 1 to 32, further comprising screening the set of differentially expressed cellular components E_k against a transcription factor database to identify a set of differentially expressed transcription factors.

Embodiment 34. the method of embodiment 33, further comprising: performing empirical mode decomposition on the set of differentially expressed cellular components E_k to generate a pseudo-temporal representation of the data set; and identifying the set of differentially expressed transcription factors based on the pseudo-temporal representation.

Embodiment 35. a method comprising the steps of: accessing a plurality of single-cell cellular component expression datasets, each dataset obtained from one cell of a plurality of cells that have been transformed from the same "progenitor" cell type, each dataset comprising a cellular component vector r_i, each entry in the vector being associated with one of a plurality of cellular components, and the value of each entry representing the amount of the cellular component in the cell; generating a kNN graph using a k-nearest-neighbor (kNN) algorithm and the single-cell cellular component expression datasets; performing clustering to generate a set of clusters C_j, each cluster comprising a plurality of points, each point corresponding to the single-cell cellular component expression dataset for one cell of the plurality of cells; and using the set of clusters C_j to determine a set of differentially expressed cellular components E_k of the plurality of cells.
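The k-nearest-neighbor step of embodiment 35 can be illustrated with a brute-force sketch. It assumes Euclidean distance between the per-cell vectors r_i, which the source does not specify; the function name `knn_graph` and the toy vectors are hypothetical, and a real dataset would normally use an approximate neighbor search instead.

```python
import math

def knn_graph(vectors, k):
    """Return {cell index: [indices of its k nearest cells]}.

    vectors: list of equal-length numeric vectors, one per cell.
    Distance is Euclidean (an assumption); ties are broken by cell index.
    """
    graph = {}
    for i, vi in enumerate(vectors):
        distances = sorted(
            (math.dist(vi, vj), j) for j, vj in enumerate(vectors) if j != i
        )
        graph[i] = [j for _, j in distances[:k]]
    return graph

# Four hypothetical cells measured on two components.
cells = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
graph = knn_graph(cells, k=1)
```

In this toy input, cells 0 and 1 are each other's nearest neighbor, as are cells 2 and 3, which is the structure a subsequent clustering step would pick up.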

Embodiment 36. the method of embodiment 35, wherein determining the set of differentially expressed cellular components E_k comprises determining a distance metric between the plurality of points of the clusters C_j.

Embodiment 37. a method comprising the steps of: accessing a single-cell transition signature representing a measure of differential cellular component expression between a first cellular state and an altered cellular state; accessing a perturbation signature representing a measure of differential cellular component expression between undisturbed cells that are not exposed to a perturbation and disturbed cells that are exposed to the perturbation; and determining whether the perturbation is associated with a transition of a cell between the first cellular state and the altered cellular state based on a comparison of the single-cell transition signature and the perturbation signature.

Embodiment 38. the method of embodiment 37, wherein accessing the single-cell transition signature comprises: determining the single-cell transition signature based on a first plurality of single-cell cellular component expression data sets, each first data set obtained from one cell of a first plurality of cells in the first cellular state, and a second plurality of single-cell cellular component expression data sets, each second data set obtained from one cell of a second plurality of cells in the altered cellular state.

Embodiment 39 the method of embodiment 38, wherein each data set of the first plurality of single-cell cellular component expression data sets and the second plurality of single-cell cellular component expression data sets comprises a vector r of cellular components_iEach entry in the vector is associated with one of a plurality of cellular components, and the value of each entry represents the amount of the cellular component of the cell.

Embodiment 40. the method of any one of embodiments 38 to 39, further comprising: obtaining the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets, the obtaining comprising: performing dimension reduction on the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets to generate a matrix M comprising rows in a first dimension and columns in a second dimension, each row corresponding to one cell of the plurality of cells, the values of the matrix M comprising values generated from the amounts of cellular components located at points in a first dimension space and a second dimension space; performing clustering to generate a set of clusters C_j, each cluster comprising a plurality of points corresponding to a subset of the rows in the matrix M and their corresponding cells; identifying the first plurality of cells from a first cluster of the set of clusters C_j; identifying the second plurality of cells from a second cluster of the set of clusters C_j; obtaining the first plurality of single-cell cellular component expression datasets from the first plurality of cells; and obtaining the second plurality of single-cell cellular component expression datasets from the second plurality of cells.

Embodiment 41 the method of embodiment 40, further comprising performing manifold learning with the matrix M under a relative similarity approximation of points to create a matrix N comprising a plurality of rows and two columns, each row corresponding to one cell of the first plurality of cells and the second plurality of cells, each column corresponding to one of two dimensions in a two-dimensional space, the values of the matrix N indicating the relative cellular state of each cell relative to each other cell based on the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets.

Embodiment 42 the method of any one of embodiments 40 to 41, wherein said steps are performed according to any one of the methods of embodiments 1 to 34.

Embodiment 43 the method of any one of embodiments 37 to 42, wherein accessing the perturbation signature comprises: determining the perturbation signature based on a plurality of undisturbed single-cell component expression datasets for the undisturbed cells that are not exposed to the perturbation and based on a plurality of disturbed single-cell component expression datasets for the disturbed cells that are exposed to the perturbation.

Embodiment 44 the method of any one of embodiments 37 to 43, wherein said undisturbed cells are control cells that have not been exposed to said perturbation of said perturbed cells or wherein said undisturbed cells are an average of unrelated perturbed cells that have been exposed to said perturbation.

Embodiment 45. the method of any one of embodiments 37 to 44, further comprising the step of: filtering the single-cell transition signature and the perturbation signature to include only cellular components that are transcription factors.

Embodiment 46. the method of any one of embodiments 38 to 42, wherein determining the single-cell transition signature based on the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets comprises: determining a difference in cellular component amounts between the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets using one of a mean difference test, a Wilcoxon rank sum test (Mann-Whitney U test), a t test, logistic regression, and a generalized linear model.

Embodiment 47 the method of embodiment 43, wherein determining the perturbation signature based on the undisturbed plurality of single-cell cellular component expression data sets and the perturbed plurality of single-cell cellular component expression data sets comprises: determining differences in the amount of the cellular constituent between the undisturbed plurality of single-cell cellular constituent expression datasets and the disturbed plurality of single-cell cellular constituent expression datasets using one of a mean difference test, a Wilcoxon rank sum test (Mann-Whitney U test), a t test, logistic regression, and a generalized linear model.
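One of the listed options, the Wilcoxon rank sum test, can be sketched in plain Python to show how a signature might be assembled per cellular component. This is an illustration under stated assumptions rather than the method of the embodiments: it uses the normal approximation without tie or continuity correction, treats the p-value as the score, and the names `rank_sum_test` and `build_signature` and all amounts are hypothetical.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum (Mann-Whitney U) test, normal approximation.

    Returns (z, p). Tied values receive average ranks; no tie or continuity
    correction is applied, so this is only a rough sketch for larger samples.
    """
    pooled = sorted((value, group) for group, values in enumerate((x, y)) for value in values)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

def build_signature(reference, altered):
    """Map each component to (direction, p-value) for altered versus reference."""
    signature = {}
    for component in reference.keys() & altered.keys():
        z, p = rank_sum_test(altered[component], reference[component])
        direction = "up" if z > 0 else "down" if z < 0 else "none"
        signature[component] = (direction, p)
    return signature

# Hypothetical per-cell amounts in the first (or undisturbed) and altered (or disturbed) cells.
first_state = {"Ascl1": [1, 2, 3, 4, 5], "Fos": [10, 11, 12, 13, 14]}
altered_state = {"Ascl1": [6, 7, 8, 9, 10], "Fos": [1, 2, 3, 4, 5]}
signature = build_signature(first_state, altered_state)
```

The same function could be applied to undisturbed and disturbed cells to obtain a perturbation signature in the same format, which makes the two signatures directly comparable.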

Embodiment 48 the method of any one of embodiments 37 to 47, further comprising: filtering the single cell transition feature and the perturbation feature to reduce a number of cellular components included in the single cell transition feature and the perturbation feature.

Embodiment 49 the method of embodiment 48, wherein filtering the single cell transition feature and the perturbation feature comprises reducing the number of cellular components included in the single cell transition feature and the perturbation feature according to a threshold p-value or according to a threshold number of cellular components.
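The two filtering criteria of embodiment 49 can be sketched together. The sketch assumes a signature is represented as a mapping from cellular component to p-value; the function name `filter_signature`, its parameters, and the example values are hypothetical.

```python
def filter_signature(signature, max_p=None, max_components=None):
    """Keep the most significant components of a signature.

    signature: dict mapping component name -> p-value.
    max_p: optional threshold p-value; components above it are dropped.
    max_components: optional cap on how many components are kept.
    """
    # Sort by p-value, breaking ties by name so the result is deterministic.
    items = sorted(signature.items(), key=lambda item: (item[1], item[0]))
    if max_p is not None:
        items = [(name, p) for name, p in items if p <= max_p]
    if max_components is not None:
        items = items[:max_components]
    return dict(items)

# Hypothetical p-values.
p_values = {"Ascl1": 0.001, "Fos": 0.02, "Actb": 0.6, "Atf3": 0.04}
by_p_value = filter_signature(p_values, max_p=0.05)
by_count = filter_signature(p_values, max_components=2)
```

Applying the same filter to both the transition signature and the perturbation signature keeps the two lists comparable in size.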

Embodiment 50. the method of any one of embodiments 37 to 49, wherein the perturbation signature comprises a plurality of cellular components, each cellular component associated with a significance score that quantifies a correlation between a change in an amount of the cellular component and a change in a cellular state between the undisturbed cell and the disturbed cell, and wherein determining whether the perturbation is associated with a transition of a cell between the first cellular state and the altered cellular state comprises: replacing the significance score of each cellular component with a match score for the cellular component; combining the match scores of the plurality of cellular components to generate a match score for the perturbation; and determining whether the perturbation is associated with a transition of a cell between the first cellular state and the altered cellular state based on the match score for the perturbation.

Embodiment 51 the method of embodiment 50, wherein the match score comprises a discrete score or a continuous score.

Embodiment 52. the method of any one of embodiments 50 to 51, wherein replacing each significance score comprises: replacing the significance score with a first score if the amount of the cellular component is up-regulated in both the single-cell transition signature and the perturbation signature; replacing the significance score with a second score if the amount of the cellular component is up-regulated in the single-cell transition signature and down-regulated in the perturbation signature; and replacing the significance score with a third score if the amount of the cellular component in the perturbation signature is not significantly up-regulated or down-regulated.

Embodiment 53. the method of any one of embodiments 50 to 51, wherein replacing each significance score comprises: replacing the significance score with a first score if the amount of the cellular component is down-regulated in both the single-cell transition signature and the perturbation signature; replacing the significance score with a second score if the amount of the cellular component is down-regulated in the single-cell transition signature and up-regulated in the perturbation signature; and replacing the significance score with a third score if the amount of the cellular component in the perturbation signature is not significantly up-regulated or down-regulated.
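The score replacement of embodiments 52 and 53 and the combination step of embodiment 50 can be sketched as follows. The source only states that a first, second, and third score exist; the values +1, -1, and 0, the use of a sum to combine them, and the names `match_score` and `perturbation_match_score` are assumptions made for illustration.

```python
def match_score(transition_direction, perturbation_direction,
                first=1, second=-1, third=0):
    """Score one cellular component by directional agreement.

    transition_direction: "up" or "down" in the transition signature.
    perturbation_direction: "up", "down", or "none" (not significantly changed).
    first/second/third: hypothetical score values.
    """
    if perturbation_direction == "none":
        return third
    if perturbation_direction == transition_direction:
        return first
    return second

def perturbation_match_score(transition, perturbation):
    """Sum the per-component match scores over components in both signatures."""
    shared = transition.keys() & perturbation.keys()
    return sum(match_score(transition[c], perturbation[c]) for c in shared)

# Hypothetical directions for four components.
transition = {"Ascl1": "up", "Brn2": "up", "Fos": "down", "Atf3": "down"}
perturbation = {"Ascl1": "up", "Brn2": "none", "Fos": "up", "Atf3": "down"}
score = perturbation_match_score(transition, perturbation)
```

Here Ascl1 and Atf3 agree in direction, Fos disagrees, and Brn2 is unchanged by the perturbation, so the combined score under these assumed values is 1.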

Embodiment 54. the method of any one of embodiments 37 to 49, wherein the perturbation signature comprises a plurality of cellular components, each cellular component associated with a significance score that quantifies a correlation between a change in an amount of the cellular component and a change in a cellular state between the undisturbed cell and the disturbed cell, and wherein determining whether the perturbation is associated with a transition of a cell between the first cellular state and the altered cellular state comprises: combining the significance scores of the plurality of cellular components to generate a significance score for the perturbation; and determining whether the perturbation is associated with a transition of a cell between the first cellular state and the altered cellular state based on the significance score for the perturbation.

Embodiment 55. the method of any one of embodiments 50 to 53, further comprising: estimating a cellular component false discovery rate for the match score of the perturbation by: calculating an empirical marginal expression frequency for each cellular component of the plurality of cellular components; summing the empirical marginal expression frequencies of the plurality of cellular components over combinations thereof to generate a probability of identifying a number of cellular components by chance, assuming independently distributed expression; and estimating the cellular component false discovery rate of the match score of the perturbation based on the probability.
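One way to read the chance-probability step of embodiment 55 is as a Poisson-binomial tail: if each cellular component is independently "hit" with its own empirical marginal frequency, the probability of seeing at least a given number of hits is the sum over all combinations of components. The sketch below computes that sum by dynamic programming. The interpretation, the definition of the marginal frequency as the fraction of signatures in which a component is significantly changed, and the names `marginal_frequencies` and `chance_match_probability` are assumptions; the sketch stops at the probability and does not convert it into a false discovery rate.

```python
def marginal_frequencies(signatures, components):
    """Fraction of signatures in which each component is significantly changed.

    signatures: list of dicts mapping component -> "up", "down", or "none".
    """
    total = len(signatures)
    return [
        sum(1 for s in signatures if s.get(c, "none") != "none") / total
        for c in components
    ]

def chance_match_probability(frequencies, observed_matches):
    """P(at least observed_matches hits), with component i hit independently
    with probability frequencies[i] (Poisson-binomial upper tail)."""
    distribution = [1.0]  # distribution[k] = P(exactly k hits so far)
    for p in frequencies:
        updated = [0.0] * (len(distribution) + 1)
        for k, probability in enumerate(distribution):
            updated[k] += probability * (1.0 - p)
            updated[k + 1] += probability * p
        distribution = updated
    return sum(distribution[observed_matches:])

# Hypothetical collection of two perturbation signatures.
library = [{"Ascl1": "up", "Fos": "none"}, {"Ascl1": "up", "Fos": "down"}]
frequencies = marginal_frequencies(library, ["Ascl1", "Fos"])
probability = chance_match_probability(frequencies, observed_matches=2)
```

The dynamic program is equivalent to enumerating every combination of components but runs in time quadratic in the number of components instead of exponential.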

Embodiment 56. the method of any one of embodiments 37 to 55, wherein determining whether the perturbation is associated with a transition of a cell between the first cellular state and the altered cellular state comprises: determining that a threshold number of covariates of the perturbation are associated with a transition of a cell between the first cellular state and the altered cellular state; and in response to the determining, determining that the perturbation is associated with a transition of the cell between the first cellular state and the altered cellular state.

Embodiment 57 the method of embodiment 56, wherein the perturbation comprises exposing the cell to a small molecule, and wherein the one or more covariates of the perturbation comprise: a specific dose of the small molecule, a time of differential cellular component expression between the undisturbed cell and the disturbed cell relative to the time of exposure of the disturbed cell to the small molecule, and a cell line of the disturbed cell.

Embodiment 58 the method of any one of embodiments 37 to 57, wherein the cellular component comprises a gene.

Embodiment 59. the method of any one of embodiments 37 to 58, wherein said single cell cellular component expression dataset is generated using a method selected from the group consisting of: single-cell ribonucleic acid (RNA) sequencing (scRNA-seq), scTag-seq, single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), CyTOF/SCoP, E-MS/Abseq, miRNA-seq, CITE-seq, and any combination or summary thereof.

Embodiment 60. the method of any one of embodiments 37 to 59, wherein at least one of said single-cell transition signature and said perturbation signature is obtained from a database.

Embodiment 61 the method of embodiment 60, wherein the perturbation signature is obtained from a database of a plurality of perturbation signatures comprising a plurality of perturbations, and wherein the method further comprises: for each perturbation of the plurality of perturbations in the database: accessing the perturbation characteristics of the perturbation from the database; and determining whether the perturbation is associated with a transition of a cell between the first cell state and the altered cell state based on a comparison of the single cell transition characteristic and the perturbation characteristic.

Embodiment 62 the method of any one of embodiments 37 to 61, further comprising accessing a plurality of perturbation characteristics of a plurality of perturbed cells; and for each of the plurality of perturbation characteristics, performing the determining step to screen for perturbations that promote the altered cellular state.

Embodiment 63 the method of embodiment 62, wherein accessing the plurality of perturbation features comprises exposing cells to a plurality of perturbations to generate the plurality of perturbed cells; and measuring a plurality of cell component quantities from the plurality of disturbed cells.

Embodiment 64 the method of any one of embodiments 37 to 63, further comprising identifying a perturbation that promotes said altered cellular state.

Embodiment 65 the method of embodiment 64, wherein promoting the altered cellular state comprises promoting a transition from the first cellular state to the altered cellular state in a population of cells comprising the first cellular state.

Embodiment 66. the method of embodiment 64, wherein promoting the altered cellular state comprises, in a population of cells comprising the first cellular state, increasing the ratio of the number of cells in the altered cellular state to the number of cells in the first cellular state or, optionally, in a state different from the altered cellular state.

Embodiment 67. the method of embodiment 64, wherein promoting said altered cellular state comprises, in a population of cells comprising said first cellular state, increasing the absolute number of cells in said altered cellular state.

Embodiment 68 the method of embodiment 64, wherein promoting the altered cellular state comprises, in a population of cells comprising the first cellular state, reducing the absolute number of cells in the first cellular state or optionally a state different from the altered cellular state.

Embodiment 69. the method of any one of embodiments 37 to 68, wherein said single-cell transition signature and said perturbation signature are generated using different types of cellular components.

Embodiment 70. the method of any one of embodiments 37 to 68, wherein the single-cell transition signature and the perturbation signature are generated using the same type of cellular component.

Embodiment 71. a method comprising the steps of: accessing a single-cell transition signature representing a measure of differential cellular component expression between a first cellular state and an altered cellular state; accessing a plurality of perturbation features, each perturbation feature being associated with a perturbation and representing a measure of differential cellular component expression between undisturbed cells not exposed to the perturbation and disturbed cells exposed to the perturbation; and determining a subset of the perturbations associated with a transition of a cell between the first cell state and the altered cell state based on a comparison of the single-cell transition feature and the plurality of perturbation features.

Embodiment 72. the method of embodiment 71, wherein each perturbation signature comprises a plurality of cellular components, each cellular component associated with a significance score that quantifies a correlation between a change in an amount of the cellular component and a change in a cellular state between the undisturbed cell and the disturbed cell, and wherein determining a subset of the perturbations associated with a transition of a cell between the first cellular state and the altered cellular state comprises: for each perturbation signature: replacing the significance score of each cellular component with a match score for the cellular component; and combining the match scores of the plurality of cellular components to generate a match score for the perturbation; ranking the perturbations based on their match scores; and selecting the subset of the perturbations based on the ranked list of perturbations.
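The ranking and selection of embodiment 72 can be sketched as a small screening loop. As above, the score values (+1 for agreement, -1 for disagreement, 0 for no significant change), the summation, the `top_n` cutoff, the function name `rank_perturbations`, and the perturbation names are hypothetical choices for illustration only.

```python
def rank_perturbations(transition, perturbation_signatures, top_n=2):
    """Rank perturbations by match score and return the names of the top_n.

    transition: dict mapping component -> "up" or "down".
    perturbation_signatures: dict mapping perturbation name -> dict of
        component -> "up", "down", or "none".
    """
    def score(signature):
        total = 0
        for component, direction in transition.items():
            observed = signature.get(component, "none")
            if observed == "none":
                continue  # assumed third score of 0
            total += 1 if observed == direction else -1
        return total

    scored = [(score(sig), name) for name, sig in perturbation_signatures.items()]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]

# Hypothetical transition signature and perturbation library.
transition = {"Ascl1": "up", "Fos": "down"}
library = {
    "compound_A": {"Ascl1": "up", "Fos": "down"},
    "compound_B": {"Ascl1": "up", "Fos": "up"},
    "compound_C": {"Ascl1": "down"},
}
selected = rank_perturbations(transition, library)
```

With these made-up signatures the match scores are 2, 0, and -1, so the first two perturbations are selected.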

Embodiment 73. a computer program product comprising a non-transitory computer readable storage medium having instructions encoded thereon, which when executed by a processor, cause the processor to perform the method of any of embodiments 37 to 72.

Embodiment 74. a system, comprising: a non-transitory computer readable storage medium having instructions encoded thereon, which when executed by a processor, cause the processor to perform the method of any of embodiments 37-72.

Embodiment 75 a method for promoting neurons and/or progenitor cells, comprising: exposing a starting population of cells comprising fibroblasts to a perturbation having perturbation characteristics that promote conversion of the starting population of cells comprising fibroblasts into progenitor cells and/or neurons, wherein the perturbation characteristics are an increase in activity of one or more of Brn2, Ascl1, Myt1, Zfp941, Taf5B, St18, Zkscan16, Camta1, and Arnt2 and/or a decrease in activity of one or more of Ascl1, Atf3, Rorc, Scx, Satb1, Elf3, and Fos.

Embodiment 76 the method of embodiment 75, wherein the neurons and/or progenitor cells are promoted by one or more of: increasing the absolute number of neurons and/or progenitor cells; reducing the absolute number of fibroblasts; promoting the conversion of fibroblasts to neurons and/or progenitor cells; promoting the longevity of neurons or progenitor cells; reducing the lifespan of fibroblasts; or increasing the ratio of neurons and/or progenitor cells to fibroblasts.

Embodiment 77. the method of embodiment 75, wherein the perturbation does not comprise forskolin, PP1, PP2, and trichostatin A.

Embodiment 78. a method of increasing the amount of neurons and/or progenitor cells, comprising exposing a population of cells comprising fibroblasts to a pharmaceutical composition having perturbation characteristics that promote the conversion of the population of cells comprising fibroblasts to neurons, wherein the pharmaceutical composition comprises forskolin, PP1, PP2, trichostatin A, BRD-K38615104, geldanamycin, manumycin A, mitoxantrone, curcumin, avocado, vallisotat, KI20227, or a combination of the foregoing (e.g., 2, 3, 4, 5 or more of the foregoing).

Embodiment 79. the method of embodiment 78, wherein the pharmaceutical composition does not comprise forskolin, PP1, PP2, and trichostatin A.

Embodiment 80. a pharmaceutical composition for promoting neuronal and/or progenitor cells comprising: a perturbation selected from the group consisting of forskolin, PP1, PP2, trichostatin A, BRD-K38615104, geldanamycin, manumycin A, mitoxantrone, curcumin, avocado, valproate, KI20227, or a combination of the foregoing; and a pharmaceutically acceptable excipient.

Embodiment 81. the pharmaceutical composition of embodiment 80, wherein the perturbation does not comprise forskolin, PP1, PP2, and trichostatin A.

Embodiment 82. a unit dosage form comprising the pharmaceutical composition of embodiment 80 or 81.

Embodiment 83 a method of identifying a candidate perturbation for promoting a conversion of a starting cell population comprising fibroblasts to neurons and/or progenitor cells, the method comprising: exposing the starting cell population comprising fibroblasts to a perturbation; identifying a perturbation signature of the perturbation, the perturbation signature comprising one or more cellular components and a prominence score associated with each cellular component, the prominence score for each cellular component quantifying an association between a change in expression of the cellular component after exposure of the population of cells to the perturbation and a change in cellular state of the population of cells from fibroblasts to neurons and/or progenitor cells; and identifying the perturbation as a candidate perturbation for promoting conversion of a population of cells comprising fibroblasts to neurons and/or progenitors based on the perturbation signature, wherein the perturbation signature is an increase in activity of one or more of Brn2, Ascl1, Myt1, Zfp941, Taf5B, St18, Zkscan16, Camta1, and Arnt2 and/or a decrease in activity of one or more of Ascl1, Atf3, Rorc, Scx, Satb1, Elf3, and Fos.

Examples 0, 1, 2 and 3: Identifying causal relationships and controlling cell fate in mouse embryonic fibroblasts differentiated into neurons and myocytes

The following examples demonstrate the methods described above in sections II and III. In more detail, the examples demonstrate the ability of the methods of sections II and III to accurately identify genes and/or perturbations known to affect the trajectory of a cellular state transition. In addition, the examples discussed below demonstrate the ability of the methods of sections II and III to generate new biological insights that can be used to control the trajectory of a cellular state transition. In particular, the examples demonstrate the ability of the methods of sections II and III to identify previously unknown factors (e.g., genes and perturbations) that affect a transition in cellular state.

The examples discussed below apply the methods of section II and section III to a combination of publicly available data and in vitro experimental data to validate several known and previously unknown factors (e.g., genes and perturbations) that affect the cell state transition trajectory. The results of applying the methods of section II and section III to the combination of publicly available data and in vitro experimental data are shown in fig. 4B to 5A and fig. 7A to 9.

Some of these results were also validated using in vitro experimental data alone. The results of this in vitro validation are shown in figure 6. In vitro experimental data were obtained by growing and measuring cells according to the protocol discussed in section IV.A below.

IV.A. Example 0: In vitro cell processing and data set acquisition

This section describes the protocol for the in vitro experiments mentioned above. The data from this in vitro experiment was combined with publicly available data to generate fig. 4B-5A and 7A-9, and independently used to generate fig. 6.

This section applies the general protocol described in section II to a specific example of evaluating Mouse Embryonic Fibroblasts (MEFs) for differentiation into neurons or muscle cells. In this particular embodiment, the neuron is an on-lineage cell, the myocyte is an off-lineage cell, and the MEF is a "progenitor" cell. The protocol also included additional steps, including lentiviral overexpression of the gene Ascl1 and mediation by perturbations.

MEF medium is 10% fetal bovine serum (FBS), 1X GlutaMAX, 1X non-essential amino acids, Pen/Strep and beta-mercaptoethanol in Dulbecco's Modified Eagle Medium (DMEM). The neuronal medium is DMEM/F12, N2, B27, 1X GlutaMAX and 25 μg/ml insulin.

The protocol followed is listed below:

Day 0: 1 million MEF cells were thawed into MEF medium in 10 cm plates.

Day 1: the cells were seeded in 24-well plates at 20K/well.

If applicable, the cells were simultaneously infected with Ascl1 virus by spin infection (centrifugation) at a multiplicity of infection (MOI) of 8. Centrifugation was carried out at 2000 rpm for 1 hour at 32 ℃ in the presence of MEF medium (250 μl/well) and 8 μg/ml polybrene.

Single cell ribonucleic acid (RNA) sequencing (scRNA-seq) was performed to obtain a d2 dataset for each cell.

Day 2: for virus experiments, the medium was changed (to fresh MEF medium) to wash off the polybrene.

For perturbation experiments, small molecules (resuspended in Dimethylsulfoxide (DMSO) or ethanol) were added.

Day 3: the medium was changed to neuronal medium.

For perturbation experiments, small molecules (resuspended in DMSO or ethanol) were added.

Day 5: half medium changes (addition of small molecules, if applicable)

Day 8: half medium changes (addition of small molecules, if applicable)

Day 9: half medium changes (addition of small molecules, if applicable)

Day 11: half medium changes (addition of small molecules, if applicable)

Day 13: half medium changes (addition of small molecules, if applicable)

Day 15: plates were fixed and stained with Map2 and Tuj1 antibodies. Each well was scanned on a Molecular Devices HCI IXM4 or another high-content imaging microscope. The number of Map2/Tuj1-positive neurons per well was quantified.
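The per-well quantities implied by the Day 1 spin-infection step can be worked out with simple arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes that MOI means infectious (transducing) units per seeded cell, which is the usual definition but is not spelled out in the protocol.

```python
# Values taken from the protocol above; all names are hypothetical.
CELLS_PER_WELL = 20_000        # Day 1 seeding density (20K/well)
MOI = 8                        # multiplicity of infection
MEDIUM_UL_PER_WELL = 250       # MEF medium volume during spin infection
POLYBRENE_UG_PER_ML = 8        # polybrene concentration


def transducing_units_per_well(cells: int, moi: float) -> float:
    """Assumes MOI = infectious units per cell, so units = cells * MOI."""
    return cells * moi


def polybrene_ug_per_well(conc_ug_per_ml: float, volume_ul: float) -> float:
    """Mass of polybrene in one well: concentration times volume in ml."""
    return conc_ug_per_ml * volume_ul / 1000.0


units = transducing_units_per_well(CELLS_PER_WELL, MOI)
polybrene = polybrene_ug_per_well(POLYBRENE_UG_PER_ML, MEDIUM_UL_PER_WELL)
print(units, polybrene)
```

Under these assumptions each well receives 160,000 transducing units and 2 μg of polybrene.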

IV.B. Examples 1, 2 and 3: In vitro cell processing and data set acquisition

Fig. 4A depicts a timeline that tracks the trajectory of induced cellular state transitions over a period of time, according to one embodiment. More specifically, fig. 4A depicts a timeline that tracks induced MEF transition trajectories over a 23 day period (day 0 to day 22).

As shown in fig. 4A, on day 0 of the 23 day period, MEFs were obtained. In alternative embodiments, the transition trajectories of any single cell may be studied according to similar methods. For example, in an alternative embodiment, the transformation trajectory of mouse embryo blood cells can be studied according to a similar method.

On day 0 of the 23 day period, each MEF in the MEF population is transduced with one or more appropriate transcription factors. As shown in fig. 4A, either Ascl1 alone, or Brn2, Ascl1 and Myt1l together (collectively the BAM transcription factors), were overexpressed in MEFs. Specifically, in the in vitro experiment that generated fig. 6 using the protocol described above in section IV.A, only Ascl1 was overexpressed in MEFs. In contrast, in the publicly available data to which the methods of sections II and III were applied to generate fig. 4B-5A and 7A-9, each of Brn2, Ascl1, and Myt1l was overexpressed in MEFs.

In embodiments disclosed herein, inducible expression of Ascl1 forces expression of Ascl1 transcription factor following lentiviral delivery. In alternative embodiments, any alternative means may force the expression of one or more transcription factors. For example, in alternative embodiments, transposon, mRNA delivery, or another type of viral delivery may force expression of one or more transcription factors.

Forced expression of one or more of the BAM transcription factors is known to result, more generally, in the conversion of one or more of the transduced MEFs into mouse "progenitor" cells, mouse neurons, and/or mouse muscle cells. Specifically, as known in the literature, Ascl1 initiates the conversion of MEFs to mouse "progenitor" cells, expression of Ascl1 alone induces the conversion of mouse "progenitor" cells to mouse neurons and mouse muscle cells, and expression of Brn2 and Myt1l induces the conversion of mouse "progenitor" cells to mouse neurons. However, this induction of a cellular state transition by one or more of the BAM transcription factors does not occur with 100% efficiency. Specifically, as known in the literature, the BAM transcription factors induce MEF to mouse neuron conversion with 20% efficiency. In other words, despite expression of one or more of the BAM transcription factors, some cells may not convert as expected. In some embodiments, this failed transition is referred to as failed reprogramming.

Mouse cells in which one or more of the BAM factors are forced to be expressed are monitored over a 23 day period. More specifically, for mouse cells in which expression of Ascl1 was forced, single cell RNA sequencing (scRNA-seq) measurements of each single mouse cell in the population were obtained on days 2, 5 and 22 of the 23 day period. Alternatively, for mouse cells in which expression of all BAM factors is forced, scRNA-seq measurements of each single mouse cell in the population are obtained only on day 22 of the 23 day period.

In alternative embodiments, RNA sequencing measurements can be made at any frequency at any number of time points. More specifically, to accurately capture the transition trajectory of the cell state, the time points at which RNA sequencing measurements are taken ideally generally correspond to the time points at which one or more transition trajectories diverge. Performing RNA sequencing measurements on single cells on a particular day includes quantifying mRNA expression in single cells on the particular day. In other words, performing RNA sequencing measurements on a single cell on a particular day includes counting each mRNA transcript in the single cell on that particular day. Furthermore, because each mRNA transcript is associated with a particular gene, RNA sequencing measurements of single cells on a particular day include quantification of gene expression in single cells on that particular day. However, in practice, cells will generally not be completely homogeneous in the state of their cellular state transition, and therefore measurements of cellular state transitions on a given day are predicted to capture the distribution of cells at various stages of the cellular state transition.

An in vitro protocol for overexpression of Ascl1 in MEFs was used to perform the validation experiment depicted in fig. 6 and described in detail below. In addition, gene expression measurements obtained from an in vitro protocol for overexpression of Ascl1 in MEFs were combined with publicly available gene expression measurements from MEFs in which all BAM factors were overexpressed. The in vitro and publicly available data pools were then used to generate the data depicted in fig. 4B-5A and 7A-9. As described above, these figures are used both to verify the ability of the methods of sections II and III to accurately identify genes that affect cell state transitions, and to demonstrate the ability of the methods described in sections II and III to generate new biological insights that can be used to control the trajectory of cell state transitions and thus control cell fate.

IV.C. Example 1: Validating dimension reduction of literature-identified transitions

As discussed above, gene expression measurements obtained from MEFs overexpressing only Ascl1 on days 2, 5, and 22 were merged with publicly available gene expression measurements obtained from MEFs overexpressing all BAM factors on day 22. Using the method described above in section II, for each day on which gene expression in cells is measured, a data set of transcript vectors r_i is generated using the gene expression measurements for each cell. Each transcript vector r_i is associated with the specific cell, on the specific day, from which the gene expression measurements contained in r_i were obtained. Each entry of a transcript vector r_i is associated with a particular gene in the genome of the cell, and the value of each entry in r_i represents the sequencing depth (transcript count) of the transcripts associated with that gene on the particular day.
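As an illustration of the data set of transcript vectors r_i, the following sketch builds one count vector per (cell, day) pair from a toy list of sequenced transcripts. The input format, the cell identifiers, and the function name are assumptions made for this example; only the idea that each entry holds the transcript count for one gene comes from the text.

```python
from collections import Counter

# Hypothetical toy input: one (cell_id, day, gene) tuple per sequenced transcript.
reads = [
    ("cell_1", 2, "Ascl1"), ("cell_1", 2, "Ascl1"), ("cell_1", 2, "Fos"),
    ("cell_2", 22, "Ascl1"), ("cell_2", 22, "Brn2"), ("cell_2", 22, "Brn2"),
]

# Fixed gene order so that entry k of every vector refers to the same gene.
genes = sorted({gene for _, _, gene in reads})


def transcript_vectors(reads, genes):
    """Return {(cell_id, day): r_i}, where r_i[k] is the count for genes[k]."""
    counts = Counter(reads)  # missing (cell, day, gene) keys count as 0
    cells = sorted({(cell, day) for cell, day, _ in reads})
    return {
        (cell, day): [counts[(cell, day, g)] for g in genes]
        for cell, day in cells
    }


vectors = transcript_vectors(reads, genes)
for key, r_i in vectors.items():
    print(key, dict(zip(genes, r_i)))
```

Stacking these vectors row by row gives the cells-by-genes data set on which the dimension reduction operates.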

Dimension reduction was performed on the data set encoding the gene expression measurements for each cell on each measurement day, as discussed above with respect to section III.C. In this embodiment, Principal Component Analysis (PCA) is used to perform the dimension reduction and produce a reduced-dimension matrix M.
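A minimal sketch of the PCA step is given below, assuming the data set is arranged as a cells-by-genes array. The function name, the random toy counts, and the choice of three components are illustrative assumptions; the text does not state the number of components or any normalization applied before PCA.

```python
import numpy as np


def pca_reduce(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project the rows of X (cells x genes) onto the top principal components."""
    centered = X - X.mean(axis=0)  # center each gene (column)
    # SVD of the centered data: the rows of Vt are the principal axes,
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T  # cells x n_components


rng = np.random.default_rng(0)
X = rng.poisson(lam=5.0, size=(50, 20)).astype(float)  # toy transcript counts
M = pca_reduce(X, n_components=3)
print(M.shape)
```

Each row of M is one cell on one measurement day, expressed in the reduced coordinate system.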

Next, manifold learning is performed on the matrix M to generate a further reduced-dimension matrix N. In this embodiment, a force directed layout algorithm is used to generate the matrix N. Matrix N is depicted in supplementary table 1. The matrix N is also drawn as a force directed layout manifold depicted in fig. 4B. The rendering data in the manifold of fig. 4B corresponds to the matrix N data in supplementary table 1. Note that matrix N is used primarily for visualization purposes and need not be generated in some embodiments. In other words, in some embodiments, no manifold learning is performed on matrix M.
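The following is a small, self-contained sketch of a force-directed layout in the Fruchterman-Reingold style, shown only to make the idea concrete. It is not the layout algorithm used to produce fig. 4B, which the text does not specify beyond "force directed". It assumes that a graph over cells has already been built from matrix M (for example a nearest-neighbor graph, which is an assumption), and all names and parameter values are hypothetical.

```python
import math
import random


def force_directed_layout(n_nodes, edges, iterations=200, seed=0):
    """Return a list of [x, y] positions, one per node."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n_nodes)]
    k = 1.0 / math.sqrt(n_nodes)  # preferred distance between nodes
    temperature = 0.1             # caps the step size; cooled every iteration
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n_nodes)]
        # Repulsive force between every pair of nodes.
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9
                f = k * k / dist
                disp[i][0] += dx / dist * f
                disp[i][1] += dy / dist * f
                disp[j][0] -= dx / dist * f
                disp[j][1] -= dy / dist * f
        # Attractive force along each edge.
        for i, j in edges:
            dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = dist * dist / k
            disp[i][0] -= dx / dist * f
            disp[i][1] -= dy / dist * f
            disp[j][0] += dx / dist * f
            disp[j][1] += dy / dist * f
        # Move each node along its net force, limited by the temperature.
        for i in range(n_nodes):
            length = math.hypot(disp[i][0], disp[i][1]) or 1e-9
            step = min(length, temperature)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step
        temperature *= 0.98
    return pos


# Toy graph: two triangles of similar cells joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
N = force_directed_layout(6, edges)
```

Each row of N holds the (x, y) position of one point, which is the role matrix N plays in the manifold figures.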

As discussed above, each point in the manifold is associated with a row of the matrix N that is associated with a particular cell of the cells of a particular day of the four days that gene expression of the cell is measured. In addition, each point is associated with a dataset of gene transcript counts measured on a particular cell on a particular day. In explaining the manifold of fig. 4B, because the values of dimension x and dimension y in a row of the manifold are based on the gene transcript count of the cell of the day associated with the row, the positioning of a point in the manifold reflects the gene transcript count of the cell of the day associated with the point in the manifold relative to other points, and thus relative to other cells of other days. Thus, visual observation of the manifold allowed for observation of altered gene transcript counts of various genes of the cells over a 23 day period.

In the manifold depicted in fig. 4B, all points are represented by the same shape having the same color. Thus, in the manifold of FIG. 4B, the only discernable information provided by a point is its position (x, y) in the manifold. The gene-by-gene transcript counts, and the particular day on which the transcript counts for each point were obtained, cannot be discerned in fig. 4B. As discussed in further detail below, the shape of the points in the manifold of fig. 5A is varied to indicate, in part, the day on which the gene transcript counts of each point, and thus each cell, were obtained. Similarly, the shading of the points in the manifold of fig. 5B is varied to indicate the gene transcript count on a gene-by-gene basis for each point, and thus each cell, on each measurement day.

Fig. 5A depicts the manifold of fig. 4B according to one embodiment. In the embodiment of the manifold depicted in fig. 5A, each point in the manifold is labeled with the day on which the transcription factor expression of the cell associated with the point was measured and the qualitative phase of the transformation process that the cell was in. For example, the points marked with squares in the manifold in fig. 5A are associated with day 5 cells that are qualitatively characterized as early induced neuronal (iN) cells.

By labeling each point in the manifold with a day of measuring gene expression of the cells associated with the point and a qualitative phase of the cellular transformation, the transformation trajectory can be identified. For example, two different transition trajectories are indicated by arrows below the manifold in fig. 5A. One identified trajectory traces the transition of MEF cells to mouse neurons. Another identified trajectory in fig. 5A depicts the transition trajectory of MEF cells to mouse myocytes.

By identifying differences in gene expression between points (e.g., cells) at different stages along a transformation trajectory, genes that contribute to the transformation of a cell along a particular trajectory can be identified. But perhaps more importantly, by identifying differences in gene expression between points (e.g., cells) at the junction where two or more transition trajectories diverge, genes that contribute to the divergence in the transition trajectories can be identified. These identified genes can then be predicted to be associated with particular transition trajectories and/or phases. For example, if an increased level of gene A expression is identified in cells labeled as day 5 early iN cells relative to cells labeled as day 5 early myocytes, it can be assumed that gene A expression is associated with a transition trajectory from MEF to mouse neurons and vice versa.
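The "gene A" comparison at a branch point can be illustrated with a log2 fold change of mean transcript counts between the two groups of cells. This is only a sketch: the counts are made up, and the pseudocount and the use of fold change as the effect-size measure are assumptions, not something prescribed by the text.

```python
import math


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of mean counts; the pseudocount avoids division by zero."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))


# Hypothetical transcript counts of gene A in day 5 cells on each branch.
early_in_cells = [6, 8, 7, 7]     # mean 7
early_myocytes = [1, 0, 2, 1]     # mean 1

lfc = log2_fold_change(early_in_cells, early_myocytes)
print(lfc)  # positive values mean higher expression on the neuronal branch
```

A positive value here would point to gene A being associated with the MEF-to-neuron trajectory, and a negative value with the myocyte trajectory.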

As discussed above, fig. 5A establishes a transition trajectory based on both the quantitative time points during the cell transition process and the qualitative phase of the cell transition process. However, fig. 5A does not indicate the level of gene expression on a gene-by-gene basis at each point (e.g., cells at different time points). Thus, based on the information depicted in fig. 5A, it is not possible to predict which genes are associated with which transition trajectories. However, as described above, the shading of the dots in the manifold of fig. 5B is changed to indicate the relative gene transcript count on a gene-by-gene basis for each dot. Based on this description of gene expression on a gene-by-gene basis at these points (e.g., cells at different time points), it can be predicted which genes are associated with which transition trajectories.

Fig. 5B depicts the expression levels of each of the three BAM transcription factors in each cell on each measurement day (day 2, day 5, and day 22 for Ascl1, and day 22 for Brn2 and Myt1l), with each cell depicted as a point in the manifold of fig. 4B, according to one embodiment. In particular, fig. 5B depicts three different forms of the manifold of fig. 4B. The first form of manifold depicted in fig. 5B depicts the expression level of Ascl1 transcription factor for each point of the manifold, the second form depicts the expression level of Brn2 transcription factor for each point of the manifold, and the third form depicts the expression level of Myt1l transcription factor for each point of the manifold.

In fig. 5B, the expression level of a transcription factor for a point in the manifold (e.g., a cell at a time point) is measured as the log of the fragments per kilobase of transcript per million mapped reads (FPKM) of the transcription factor. A relatively low log(FPKM) value indicates a relatively low expression level of the transcription factor. On the other hand, a relatively high log(FPKM) value indicates a relatively high expression level of the transcription factor. In the manifold of fig. 5B, a relatively low transcription factor expression level (e.g., a relatively low log(FPKM) value) for a point is indicated by shading the point relatively darker. Conversely, a relatively high transcription factor expression level (e.g., a relatively high log(FPKM) value) for a point is indicated by shading the point relatively lighter.
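As a worked example of the log(FPKM) measure and the dark-to-light shading convention, the sketch below computes FPKM from a fragment count and maps the log value onto a 0 to 1 shade. The logarithm base, the pseudocount, and the linear shading scale are assumptions chosen for illustration; the text does not specify them.

```python
import math


def fpkm(fragments: int, transcript_length_bp: int, total_mapped: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped)


def log_fpkm(value: float, pseudocount: float = 1.0) -> float:
    """log10(FPKM + pseudocount); base and pseudocount are assumptions."""
    return math.log10(value + pseudocount)


def shade(log_value: float, lo: float, hi: float) -> float:
    """Map log(FPKM) to [0, 1]: 0.0 is darkest (low), 1.0 is lightest (high)."""
    fraction = (log_value - lo) / (hi - lo)
    return min(1.0, max(0.0, fraction))


# Hypothetical numbers: 200 fragments on a 2,000 bp transcript,
# in a library of 10 million mapped fragments.
value = fpkm(200, 2_000, 10_000_000)
print(value, log_fpkm(value), shade(log_fpkm(value), lo=0.0, hi=2.0))
```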

By comparing the transition trajectory depicted in fig. 5A with the manifold of fig. 5B depicting BAM transcription factors on a gene-by-gene expression level basis, transcription factors that influence the progression of cells along a particular transition trajectory are identified.

Turning first to the manifold of fig. 5B depicting expression of Ascl1 transcription factor: on day 0 of the 23-day period, mouse cells were transduced with Ascl1 alone or with the BAM factors. Thus, day 0 cells did not yet express Ascl1 at detectable levels. These day 0 cells that do not express Ascl1 are MEFs. Then, on day 2 of the 23 day period, Ascl1 was expressed at a relatively low level, as depicted by the relatively dark shading of the points associated with the day 2 cells. These day 2 cells expressing Ascl1 began to progress along the transition trajectories shown in fig. 5A. Specifically, some day 2 cells become mouse progenitor cells, some day 2 cells become intermediate cells on the transition trajectory from MEF to neurons, and some day 2 cells become induced cells on the transition trajectory from MEF to myocytes. Similarly, on day 5 of the 23-day period, expression of Ascl1 was increased in day 5 cells relative to day 2 cells, as depicted by the relatively lighter shading of the points associated with the day 5 cells. These day 5 cells with increased expression of Ascl1 further progressed along the transition trajectories shown in fig. 5A. Specifically, day 5 cells on the transition trajectory from MEF to neurons become intermediate and early iN cells, while day 5 cells on the transition trajectory from MEF to myocytes become early myocytes. Finally, on day 22 of the 23 day period, expression of Ascl1 increased or remained the same in day 22 cells relative to day 5 cells. These day 22 cells expressing Ascl1 further progressed along the transition trajectories shown in fig. 5A. Specifically, the day 22 cells on the transition trajectory from MEF to neurons became mature mouse neurons, and the day 22 cells on the transition trajectory from MEF to myocytes became mature mouse myocytes. No mouse progenitor cells remained on day 22.

These observations of MEF cell state transition after induction of Ascl1 expression follow trends known in the literature. Specifically, as briefly discussed above, Ascl1 elicited induction of MEF conversion to mouse progenitor cells, and expression of Ascl1 alone induced conversion of mouse progenitor cells to mouse neurons and mouse muscle cells. As discussed above with regard to the Ascl1 manifold of fig. 5B, after forced expression of Ascl1 in day 0 MEFs, MEFs were transformed into any of mouse progenitor cells, mouse muscle cells, and mouse neurons.

Turning next to the manifold of fig. 5B depicting expression of Brn2 transcription factor, MEFs were transduced with the BAM factors on day 0 of the 23-day period. Brn2 expression was measured only on day 22 of the 23 day period. As shown in figure 5B, on day 22 the mouse neurons strongly expressed Brn2. Thus, it can be concluded that expression of Brn2 correlates with the progression of MEF cells along the MEF-to-mouse neuronal transition trajectory.

This observation of MEF cell state transition after induction of Brn2 expression follows a trend known in the literature. Specifically, as briefly discussed above, Brn2 expression induces the conversion of mouse progenitor cells into mouse neurons. MEFs expressing Brn2 were transformed into mouse neurons as discussed above with respect to Brn2 manifolds of fig. 5B.

Finally turning to the manifold of fig. 5B depicting expression of Myt1l transcription factor, MEFs were transduced with the BAM factors on day 0 of the 23 day period. Myt1l expression was measured only on day 22 of the 23 day period. On day 22, the mouse neurons strongly expressed Myt1l. Thus, similar to Brn2 transcription factor, it can be concluded that expression of Myt1l is associated with the progression of MEF cells along the transition trajectory from MEF to mouse neurons.

This observation of MEF cell state transition after induction of Myt1l expression follows a trend known in the literature. Specifically, as briefly discussed above, Myt1l expression induces the conversion of mouse progenitor cells into mouse neurons. As discussed above with respect to the Myt1l manifold of fig. 5B, MEFs expressing Myt1l were converted into mouse neurons.

Thus, these observations, obtained by generating the Ascl1, Brn2, and Myt1l manifolds in fig. 5B using the methods of sections II and III, are consistent with the observations reported in the literature. This agreement in the observations of the Ascl1-assisted, Brn2-assisted, and Myt1l-assisted transitions helps to validate the ability of the methods of sections II and III to accurately identify genes that influence transitions in cell state.

To further verify the ability of the methods of sections II and III to accurately identify genes affecting cellular state transitions, in vitro experiments were performed to confirm the above observations based on the manifolds of fig. 5A and 5B. In particular, in vitro experiments were performed to confirm the above observations that Ascl1 expression induced MEF conversion to mouse "progenitor" cells, mouse neurons, and/or mouse muscle cells.

In vitro experiments were performed according to the protocol listed above in section IV.A. As discussed above, in the protocol, only expression of Ascl1 was forced in MEFs. After forcing expression of the Ascl1 transcription factor in MEFs on day 0 of the 23-day period, mouse cells were stained with DAPI, Map2 antibody, and Tuj1 antibody on day 15 of the 23-day period. DAPI is known to stain regions of DNA rich in adenine-thymine. Thus, DAPI stains nuclei. Map2 antibody and Tuj1 antibody are known to stain nerve cells. Thus, by staining mouse cells with DAPI, Map2 antibody and Tuj1 antibody, the amount of mouse neurons relative to the amount of total mouse cells can be identified, and thus the effect of Ascl1 overexpression on MEF conversion can be determined. In the in vitro experiments, the set of mouse cells in which expression of the Ascl1 transcription factor was forced is referred to herein as the experimental group.

As a positive control group in the in vitro experiments, samples of mouse cells including only mouse neurons were also stained with DAPI, Map2 antibody, and Tuj1 antibody. As a negative control group, a sample of MEF cells without forced Ascl1 expression was also stained with DAPI, Map2 antibody and Tuj1 antibody.

After staining the experimental, positive control and negative control groups with DAPI, Map2 antibody and Tuj1 antibody, each group stained with each dye was imaged on a Molecular Devices HCI IXM4. The resulting images are shown in fig. 6. Fig. 6 depicts images of MEF cells with forced Ascl1 expression stained with DAPI, Map2 antibody and Tuj1 antibody, mouse neurons stained with DAPI, Map2 antibody and Tuj1 antibody, and MEF cells without forced Ascl1 expression stained with DAPI, Map2 antibody and Tuj1 antibody, according to one embodiment.
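The ratio of neurons to total cells described above amounts to dividing the count of Map2/Tuj1-positive cells by the count of DAPI-stained nuclei. A minimal sketch, with a hypothetical function name and made-up counts that are not measurements from the experiment:

```python
def neuron_fraction(dapi_nuclei: int, map2_tuj1_positive: int) -> float:
    """Fraction of all cells (DAPI nuclei) that are Map2/Tuj1-positive."""
    if dapi_nuclei <= 0:
        raise ValueError("need at least one DAPI-stained nucleus")
    return map2_tuj1_positive / dapi_nuclei


# Made-up per-well counts, for illustration only.
fraction = neuron_fraction(dapi_nuclei=1000, map2_tuj1_positive=50)
print(f"{fraction:.1%} of cells in this well are neurons")
```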

Turning first to the images of the negative control group, as shown in fig. 6, nuclei of DAPI-stained MEF cells without forced Ascl1 expression can be seen, but almost no neurons are visible in the images depicting Map2 and Tuj1 staining of the MEF cells without forced Ascl1 expression. In other words, although there are many mouse cells (particularly MEFs) in the sample, no neurons are present. This is the expected result, since Ascl1 expression was not forced in the MEF cells of the sample, and thus no MEF cell to neuron conversion was induced.

Turning next to the images of the positive control group, as shown in fig. 6, nuclei of DAPI-stained mouse neurons were visible, and these same mouse neurons were also visible in the images depicting Map2 and Tuj1 staining of mouse neurons. In other words, all cells in the positive control sample are accurately identified as neurons.

Turning finally to the images of the experimental groups, as shown in fig. 6, nuclei of DAPI-stained MEF cells forced to express Ascl1 can be seen. In addition, some of these DAPI-stained cells were also stained with Map2 and Tuj1, indicating that these selected cells were mouse neurons. Thus, it can be concluded that forced expression of Ascl1 is associated with induction of MEF-to-mouse neuron conversion.

The in vitro experiments of figure 6 demonstrate that forced expression of Ascl1 in MEF cells can lead to conversion of MEF cells to mouse neurons, as observed in the in silico analysis described above with respect to figures 5A and 5B. This confirmation of the observations in fig. 5A and 5B further validates the ability of the methods of sections II and III to accurately identify genes that affect the transition of cellular states.

IV.D. Example 2: Clustering

As discussed above in section III.C, after generating matrix M by dimensionality reduction, clustering is performed to group the data in matrix M, generating a set of clusters C_j. Each cluster in the set of clusters C_j includes a set of points.

FIG. 7A depicts the manifold of FIG. 4B, where points in the manifold are grouped into the clusters C_j identified by the clustering, according to one embodiment. In the embodiment of fig. 7A, clustering is performed using Louvain community detection, in particular GenLouvain community detection. As seen in FIG. 7A, the clustering identifies 10 unique clusters C_j of points in the manifold.

Typically, clustering assigns points in the manifold to a given cluster based on a threshold similarity of values associated with the points, e.g., their positions in a reduced dimensional space of the manifold, their associated gene transcript counts, etc. In particular, for the manifold of FIG. 7A, the clustering assigns points to a given cluster based on a threshold similarity between the points in the manifold. For example, the points included in cluster 8 in the manifold of fig. 7A may all be associated with mouse neurons or other cells that are genetically similar to mouse neurons. Similarly, the points included in cluster 9 in the manifold of fig. 7A may all be associated with mouse muscle cells or other cells that are genetically similar to mouse muscle cells.
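Louvain-style community detection searches for a partition of a graph that scores highly on modularity. The sketch below does not implement Louvain or GenLouvain; it only evaluates the modularity of a given partition on a toy unweighted graph, to show the quantity such methods try to maximize. The graph, the cluster labels, and the function name are hypothetical.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> cluster label.
    Q = sum over clusters of [L_c / m - (d_c / (2 m))^2], where L_c is the
    number of edges inside cluster c, d_c the total degree of its nodes,
    and m the total number of edges.
    """
    m = len(edges)
    degree, intra = {}, {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + 1
    total_degree = {}
    for node, d in degree.items():
        c = community[node]
        total_degree[c] = total_degree.get(c, 0) + d
    return sum(
        intra.get(c, 0) / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Toy graph: two tightly connected groups of cells joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {n: "a" for n in range(6)}

q_split = modularity(edges, two_clusters)
q_single = modularity(edges, one_cluster)
print(q_split, q_single)
```

Splitting the toy graph into its two natural groups scores higher than lumping every point into one cluster, which is the kind of preference a modularity-based method expresses.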

As discussed above, in addition to being able to accurately identify genes known in the literature to induce cellular state transitions, the methods of sections II and III allow for the identification of factors (e.g., genes and perturbations) unknown in the literature that affect cellular state transitions. Figure 7B depicts transcription factors known and unknown in the literature to correlate with MEF conversion to mouse neurons (and vice versa mouse myocytes), according to one embodiment. In particular, fig. 7B depicts the transcription factors associated with inhibiting the conversion of mouse "progenitor" cells to mouse muscle cells when underexpressed in mouse "progenitor" cells, and the transcription factors associated with the conversion of mouse "progenitor" cells to mouse neurons when overexpressed in mouse "progenitor" cells. The conversion of mouse "progenitor" cells to mouse neurons (and vice versa mouse muscle cells) can be induced by under-expressing in the mouse "progenitor" cells a transcription factor associated with inhibiting the conversion of the mouse "progenitor" cells to mouse muscle cells, and by over-expressing in the mouse "progenitor" cells a transcription factor associated with inducing the conversion of the mouse "progenitor" cells to mouse neurons.

To identify transcription factors associated with a transition from a first cellular state to an alternative specific cellular state, or from a first cellular state to any other cellular state, clustering may be used. In particular, the gene transcript counts associated with the points in a cluster associated with the first cellular state are identified and compared to the gene transcript counts associated with the points in another cluster, where that other cluster is associated either with the alternative specific cellular state or with any cellular state other than the first cellular state. This comparison of gene transcript counts between clusters can be performed using any differential expression test (e.g., mean difference test, Wilcoxon rank sum test, t test, logistic regression, and generalized linear model).
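One of the listed tests, the Wilcoxon rank sum test, can be sketched in a few lines for a single gene compared between two clusters. This version uses the normal approximation with average ranks for ties and no tie or continuity correction, so its p-values are approximate; the counts are made up and the function name is hypothetical. An established statistics library would normally be used in practice.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). A positive z means values in x tend to rank higher than y.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return z, p


# Made-up transcript counts of one gene in two clusters.
neuron_cluster = [12, 15, 11, 14, 13]
other_cluster = [1, 2, 0, 3, 4]

z, p = rank_sum_test(neuron_cluster, other_cluster)
print(z, p)
```

Running the test once per gene and ranking genes by z or p gives a list of candidates that are over- or under-expressed in the cluster of interest.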

As an example, to identify transcription factors associated with the transition from MEFs to mouse neurons, the clustering discussed with respect to fig. 7A was used. First, to identify transcription factors associated with the transition of mouse "progenitor" cells to mouse neurons when overexpressed in the mouse "progenitor" cells, the gene transcript counts associated with the points included in the cluster associated with mouse neurons of fig. 7A (e.g., cluster 8 of fig. 7A) were identified and compared to the gene transcript counts associated with the points included in the alternative cluster not associated with mouse neurons of fig. 7A. In the embodiment of fig. 7B, this comparison is performed using the Wilcoxon rank sum test. However, in alternative embodiments, any other statistical analysis method may be used to perform the comparison. Based on this comparison, genes overexpressed in cells associated with points in the mouse neuron-associated cluster of fig. 7A were predicted to be associated with the transition of mouse "progenitor" cells to mouse neurons. The transcription factors resulting from the transcription and translation of these genes are identified as the transcription factors in fig. 7B that, when overexpressed in mouse "progenitor" cells, are associated with the conversion of mouse "progenitor" cells to mouse neurons.

Similarly, to identify transcription factors associated with inhibiting the conversion of mouse "progenitor" cells to mouse muscle cells when underexpressed in the mouse "progenitor" cells, the gene transcript counts associated with the points included in the cluster associated with mouse muscle cells of fig. 7A (e.g., cluster 9 of fig. 7A) were identified and compared to the gene transcript counts associated with the points included in the alternative cluster not associated with mouse muscle cells of fig. 7A. As described above, in the embodiment of fig. 7B, this comparison is performed using the Wilcoxon rank sum test. However, in alternative embodiments, any other statistical analysis method may be used to perform the comparison. Based on this comparison, it was predicted that genes that were under-expressed in cells associated with points in the mouse muscle cell-associated cluster of fig. 7A were associated with inhibiting the conversion of mouse "progenitor" cells to mouse muscle cells. The transcription factors resulting from the transcription and translation of these genes are identified as the transcription factors in fig. 7B that, when underexpressed in mouse "progenitor" cells, are associated with inhibiting the conversion of mouse "progenitor" cells to mouse muscle cells.

As shown in fig. 7B, transcription factors associated with the conversion of mouse "progenitor" cells to mouse neurons when overexpressed in mouse "progenitor" cells include Zfp941, Brn2, Myt1l, Taf5B, St18, Zkscan16, Camta1, and Arnt2. Transcription factors associated with inhibiting the conversion of mouse "progenitor" cells to mouse myocytes when underexpressed in mouse "progenitor" cells include Atf3, Rorc, Scx, Satb1, Elf3, and Fos. As discussed in detail above with respect to example 1, the Brn2 and Myt1l transcription factors are known in the literature to be associated with inducing the conversion of mouse "progenitor" cells to mouse neurons. However, the remaining transcription factors depicted in fig. 7B are not known in the literature to be associated with MEF conversion to mouse neurons (as opposed to mouse myocytes). Thus, by using the methods of sections II and III above, genes and/or transcription factors, both known and unknown in the literature, that induce cells to follow specific transition trajectories can be identified. These identified transcription factors can then be used to control cellular state transitions, and thus cell fate.

IV.E. Example 3: Perturbation-Induced Transitions

As discussed in sections III.D and III.E, in addition to being able to identify genes and transcription factors that affect the transition of cellular states, the methods of sections II and III are also able to identify perturbations, such as small molecules, that affect the transition of cellular states. First, to identify perturbations that induce cells to follow a particular transition trajectory, possible transition trajectories are identified.

Fig. 8A depicts a map of transition trajectories of the MEF cells discussed with respect to fig. 4A, according to one embodiment. To construct the map, the manifold of fig. 4B is used. Specifically, points in the manifold associated with similar gene transcript counts are grouped into states (represented as circles in fig. 8A). Points with variable gene transcript counts located between states are used to identify transition paths (represented as lines in fig. 8A) between states. The map of transition trajectories depicted in fig. 8A can be used to identify perturbations that affect a cell's transition trajectory by altering gene expression in the cell, thereby causing the cell to progress from one state to another in the map. In some embodiments, the map of fig. 8A is generated by cell typing via a set of canonical marker genes. In such embodiments, cells identified as the same cell type are predicted to lie along the same transition trajectory in the map. In an alternative implementation, branches of the manifold of fig. 4B are identified and predicted to define different transition trajectories in the map.

Fig. 8B depicts one example of the method described in section III.D for identifying perturbations that affect a transition trajectory of a cell by altering gene expression in the cell such that the cell transitions from a first state to a second state in the map of transition trajectories of fig. 8A, according to one embodiment of the present disclosure. Specifically, to identify a perturbation that, when a cell is exposed to it, alters gene expression in the cell such that the cell transitions from a first state to a second state, the method of fig. 8B compares the change in gene expression in the cell after the cell transitions from the first state to the second state to the change in gene expression in a vehicle cell after the vehicle cell is exposed to the perturbation. If the change in gene expression after the cell transitions from the first state to the second state matches (e.g., is equivalent or similar to) the change in gene expression in the vehicle cell after the vehicle cell is exposed to the perturbation, then the perturbation can be predicted to induce a cell exposed to it to transition from the first state to the second state by altering gene expression in the cell. In this way, the perturbation can be predicted to be associated with a particular trajectory of cell state transition.

Turning specifically to the example depicted in fig. 8B, the figure depicts the gene expression levels of six different genes (genes 1 to 6) in cells in state 1, cells in state 2, vehicle cells, and vehicle cells exposed to a small molecule perturbation. The gene expression level of a given gene is depicted by shading. Polka-dot shading indicates undetectable gene expression, while cross-hatched shading indicates detectable gene expression. In other words, in the embodiment of fig. 8B, gene expression is measured on a binary (detectable or undetectable) basis. However, in alternative embodiments, gene expression levels are measured on a more quantitative basis rather than a binary one.

Turning to examination of the gene expression level of each gene in each cell, the expression of genes 1 to 3 was undetectable, but the expression of genes 4 to 6 was detectable for cells in state 1. In contrast, for cells in state 2, expression of genes 4 to 6 is undetectable, but expression of genes 1 to 3 is detectable. For the vehicle cells, expression of genes 1 to 3 is undetectable, but expression of genes 4 to 6 is detectable. In contrast, for vehicle cells exposed to perturbation, expression of genes 4 to 6 is undetectable, but expression of genes 1 to 3 is detectable.

Next, for each gene, the gene expression level in the cell in state 1 is compared with the gene expression level in the cell in state 2 to determine the change in gene expression level after the cell transitions from state 1 to state 2. As indicated by the dark cross-hatched shading associated with genes 1 to 3, the expression of genes 1 to 3 increased after the cell transitioned from state 1 to state 2. On the other hand, expression of genes 4 to 6 decreased after the cell transitioned from state 1 to state 2, as indicated by the dark polka-dot shading associated with genes 4 to 6.

Similarly, for each gene, the gene expression level in the vehicle cell is compared to the gene expression level in the vehicle cell exposed to the perturbation to determine the change in gene expression level after exposure of the vehicle cell to the perturbation. As indicated by the dark cross-hatched shading associated with genes 1 to 3, expression of genes 1 to 3 increased after exposure of the vehicle cells to the perturbation. On the other hand, expression of genes 4 to 6 decreased after exposure of the vehicle cells to the perturbation, as indicated by the dark polka-dot shading associated with genes 4 to 6.

Finally, the change in gene expression in the cell after the cell has transitioned from state 1 to state 2 is compared to the change in gene expression in the vehicle cell after the vehicle cell has been exposed to the perturbation. To compare changes in gene expression in transformed cells to changes in gene expression in vehicle cells, any differential expression assay can be used. For example, any of the mean difference test, Wilcoxon rank sum test, t test, logistic regression, and generalized linear model comparison algorithms may be used.

As shown in fig. 8B, expression of genes 1 to 3 was increased in both cells transformed from state 1 to state 2 and in vehicle cells exposed to perturbation. In addition, expression of genes 4 to 6 was reduced in both cells transitioning from state 1 to state 2 and in vehicle cells exposed to perturbation. Based on this similarity of gene expression changes in cells transitioning from state 1 to state 2 and in vehicle cells exposed to perturbation, it can be predicted that exposure of cells in state 1 to perturbation can induce a transition from cells in state 1 to state 2 by altering gene expression in the cells. Thus, perturbation can be used to control the transition of a cell from state 1 to state 2.
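The binary comparison walked through for fig. 8B can be written out as a short sketch. The expression values below follow the description of genes 1 to 6 given above (1 for detectable, 0 for undetectable). The function names and the rule that every changed gene must agree (or every changed gene must be reversed) are assumptions chosen for illustration; the text itself only requires the changes to be equivalent or similar.

```python
# Binary expression for genes 1 to 6 (1 = detectable, 0 = undetectable),
# following the description of fig. 8B.
state_1 = [0, 0, 0, 1, 1, 1]
state_2 = [1, 1, 1, 0, 0, 0]
vehicle = [0, 0, 0, 1, 1, 1]
vehicle_perturbed = [1, 1, 1, 0, 0, 0]


def expression_change(before, after):
    """Per-gene change: +1 increased, -1 decreased, 0 unchanged."""
    return [a - b for b, a in zip(before, after)]


def classify_perturbation(transition_change, perturbation_change):
    """Compare the two change vectors gene by gene.

    Assumed rule (hypothetical): the perturbation "induces" the transition if
    every gene that changes during the transition changes the same way under
    the perturbation, and "inhibits" it if every such gene changes the
    opposite way. Anything else is reported as "inconclusive".
    """
    changed = [(t, p) for t, p in zip(transition_change, perturbation_change)
               if t != 0]
    if not changed:
        return "inconclusive"
    if all(t == p for t, p in changed):
        return "induces"
    if all(t == -p for t, p in changed):
        return "inhibits"
    return "inconclusive"


transition = expression_change(state_1, state_2)
perturbation = expression_change(vehicle, vehicle_perturbed)
print(classify_perturbation(transition, perturbation))
```

The "inhibits" branch corresponds to the reversed-change case used later for dasatinib, where the perturbation's expression changes are the reverse of those seen during the transition.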

The method described above with respect to fig. 8B involves identifying a perturbation associated with inducing a transition of a cell from a generic state 1 to a generic state 2. Thus, the method can be used to identify perturbations associated with inducing a cell to transition from any state to any other state in the map of transition trajectories of fig. 8A. Fig. 9, by contrast, identifies specific states in the map of fig. 8A and then identifies specific perturbations associated with inducing or inhibiting the transition of a cell from one identified state to another, such that the cell becomes a mouse neuron (as opposed to a mouse muscle cell). Specifically, fig. 9 identifies the MEF state, the mouse "progenitor" cell state, the mouse myocyte state, and the mouse neuron state, and then identifies the specific perturbations associated with inducing or inhibiting the transition of a cell from one of these states to another, such that the cell becomes a mouse neuron (as opposed to a mouse muscle cell).

Fig. 9 depicts small molecule perturbations associated with the conversion of MEFs into mouse neurons (as opposed to mouse myocytes), according to one embodiment. In particular, fig. 9 depicts a set of small molecule perturbations associated with the conversion of MEFs into mouse "progenitor" cells upon exposure to MEFs, a set of small molecule perturbations associated with inhibiting the conversion of mouse "progenitor" cells into mouse myocytes upon exposure to mouse "progenitor" cells, and a set of small molecule perturbations associated with the conversion of mouse "progenitor" cells into mouse neurons upon exposure to mouse "progenitor" cells. MEFs can be induced to transition to mouse neurons (as opposed to mouse myocytes) by exposing them to perturbations associated with inducing the transition of MEFs to mouse "progenitor" cells, to perturbations associated with inhibiting the transition of mouse "progenitor" cells to mouse myocytes, and to perturbations associated with inducing the transition of mouse "progenitor" cells to mouse neurons.

Each of the small molecule perturbations depicted in fig. 9 is identified by performing the method described above with respect to fig. 8B. For example, to identify the small molecule perturbation BRD-K38615104 as associated with the conversion of MEFs into mouse "progenitor" cells, the method of fig. 8B was used to determine that the change in gene expression in MEFs after conversion of MEFs into mouse "progenitor" cells matched (e.g., was equivalent or similar to) the change in gene expression in vehicle cells after exposure of the vehicle cells to BRD-K38615104. Therefore, BRD-K38615104 is predicted to induce the conversion of MEFs into mouse "progenitor" cells by altering gene expression in MEFs. Similarly, to identify the small molecule perturbation dasatinib as associated with inhibiting the conversion of mouse "progenitor" cells to mouse myocytes, the method of fig. 8B was used to determine that the change in gene expression in mouse "progenitor" cells following their conversion to mouse myocytes was the reverse of the change in gene expression in vehicle cells following exposure of the vehicle cells to dasatinib. Therefore, dasatinib is predicted to inhibit the conversion of mouse "progenitor" cells to mouse myocytes.

As seen in fig. 9, small molecule perturbations associated with the conversion of MEFs to mouse "progenitor" cells upon exposure to MEFs include BRD-K38615104, geldanamycin, manumycin A, mitoxantrone, curcumin, and trichostatin A. Small molecule perturbations associated with the conversion of mouse "progenitor" cells to mouse neurons upon exposure to mouse "progenitor" cells include alvocidib, valeprinotat, KI20227, forskolin, PP1, and PP2. Small molecule perturbations associated with inhibiting the conversion of mouse "progenitor" cells to mouse myocytes upon exposure to mouse "progenitor" cells include alvocidib, geldanamycin, quinacrine, CGP-60474, and dasatinib.

Two of the small molecule perturbations identified in figure 9, alvocidib and geldanamycin, were associated with inducing the transition of mouse cells to mouse neurons by inducing and/or inhibiting the transition of mouse cells in two different states. Specifically, as shown in fig. 9, alvocidib is associated with both inducing the conversion of mouse "progenitor" cells to mouse neurons and inhibiting the conversion of mouse "progenitor" cells to mouse muscle cells. Similarly, geldanamycin is associated with both inducing the conversion of MEFs into mouse "progenitor" cells and inhibiting the conversion of mouse "progenitor" cells into mouse muscle cells. Thus, by exposing MEFs to alvocidib and geldanamycin, MEFs can be predicted to be converted into mouse neurons.

Some of the small molecule perturbations identified in figure 9 are known in the literature to be associated with the indicated transition trajectories. In particular, forskolin, PP1 and PP2 are known in the literature to be associated with the induction of the conversion of mouse "progenitor" cells to mouse neurons. Similarly, trichostatin a is known in the literature to be associated with the induction of MEF conversion to mouse "progenitor" cells. This consistency of prediction by the method of fig. 8B and information known in the literature demonstrates the ability of the method of fig. 8B to accurately identify perturbations that affect a transition in cellular state.

In addition to accurately identifying perturbations known in the literature to affect a transition in cellular state, the method of fig. 8B is also capable of identifying perturbations that are not known in the literature to affect a transition in cellular state. In particular, the remaining small molecule perturbations depicted in fig. 9 are not known in the literature to be associated with MEF conversion to mouse neurons (as opposed to mouse myocytes). Thus, by using the method described above with respect to fig. 8B, perturbations, both known and unknown in the literature, that induce cells to follow a particular transition trajectory can be identified. These identified perturbations can then be used to control cell state transitions, and thus cell fate.

V. Example 4

The experiments of this example demonstrate methods for promoting neuronal and/or progenitor cells. In the experiments described herein, a starting population of fibroblasts (i.e., primary mouse fibroblasts) was exposed to a composition comprising Ascl1-overexpressing lentivirus. After 48 hours, a compound (e.g., forskolin, glesatinib, PD-0325901) or vehicle (i.e., DMSO or ethanol) was added to the composition. The total number of neurons was counted manually based on positive Tuj1/Map2 signals and neuron morphology. For each experiment, the total number of neurons per treatment condition was normalized to the number of neurons in DMSO-treated wells from the same experiment. As shown in fig. 10A and 10B, neurons that developed from the starting fibroblast population were detected in these experiments. The fold change of both the total number of neurons and the percentage of neurons increased, decreased, or remained the same, depending on the compound added to the composition. These experiments indicate that the methods of the invention can be used to promote neuronal and/or progenitor cells from a starting cell population comprising fibroblasts.

Cell culture and Compound treatment

Primary mouse embryonic fibroblasts (MEFs) at passage 2 were plated at 20,000 to 45,000 cells/well (depending on batch) on 24-well plates in MEF medium containing 10% FBS, 1X GlutaMAX, 1X MEM non-essential amino acids, 1 mM sodium pyruvate, 0.05 U/ml Pen/Strep, and 55 μM β-mercaptoethanol in DMEM. After 24 hours of culture, MEFs were infected with Ascl1-overexpressing lentivirus in MEF medium containing 8 μg/ml polybrene by spinfection (plates centrifuged at 2000 rpm for 90 min at 32 °C). See below for lentivirus production. After 48 hours, the medium was changed to neuronal medium containing DMEM/F12, 1% N2, 2% B27 (1:50), 1X GlutaMAX, 25 μg/ml insulin, and 0.05 U/ml Pen/Strep, with compound or vehicle (DMSO or ethanol). The compounds and their concentrations were selected from the following: BI-2536 (200 nM), cilostazol (1000 nM), dabrafenib (2500 nM), estradiol cypionate (2000 nM), EX-527 (5000 nM), fedratinib (1000 nM), foretinib (200 nM), forskolin (5000 nM), glesatinib (2500 nM), indirubin-3'-oxime (2000 nM), KI20227 (250 nM), KU-0060648 (200 nM), m-3M3FBS (1000 nM), manumycin (800 nM), PD-0325901 (5000 nM), PHA-665752 (1000 nM), quinacrine (200 nM), rottlerin (1000 nM), selumetinib (100 nM), troglitazone (5000 nM), and vemurafenib (5000 nM). Half-medium changes, supplemented with compound, were performed every 2 to 3 days.

Immunofluorescence staining

On day 12 after Ascl1 infection, cells were fixed with 4% paraformaldehyde, permeabilized (0.2% Triton X-100), blocked in 5% serum (a mixture of donkey, calf, and goat serum), and stained with rabbit anti-Tuj1 (1:1000) and mouse anti-Map2 (1:500) antibodies overnight at 4 °C or for 2 hours at room temperature, followed by secondary antibody and DAPI staining.

Imaging and analysis

Imaging was performed on a Molecular Devices ImageXpress Micro; 36 images per well were taken with a 10x objective. The total number of neurons was counted manually based on positive Tuj1/Map2 signals and neuron morphology. For each experiment, the total number of neurons per treatment condition was normalized to the number of neurons in DMSO-treated wells from the same experiment.
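The normalization step amounts to dividing each condition's neuron count by the DMSO count from the same experiment. A minimal sketch follows; the function name, the condition labels other than DMSO, and the counts are invented for illustration and are not data from these experiments.

```python
def fold_change_vs_control(neuron_counts, control="DMSO"):
    """Normalize per-condition neuron counts to the control wells.

    neuron_counts maps a treatment condition to its total neuron count
    within a single experiment.
    """
    baseline = neuron_counts[control]
    if baseline == 0:
        raise ValueError("control wells contain no neurons; cannot normalize")
    return {condition: count / baseline
            for condition, count in neuron_counts.items()}


# Hypothetical counts from one experiment (not measured values).
experiment = {"DMSO": 40, "compound_x": 80, "compound_y": 10}
print(fold_change_vs_control(experiment))
```

A value above 1 indicates more neurons than in DMSO-treated wells and a value below 1 indicates fewer. Normalizing within each experiment keeps batch-to-batch differences in plating density from dominating the comparison.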

Lentivirus production

Lentiviruses were packaged by transfecting 293T cells with a packaging plasmid (SystemsBio, LV510A-1) or an equivalent and an Ascl1 overexpression plasmid (Ascl1 cDNA cloned into the Origene lentivirus expression vector, catalog number PS100064) via Mirus TransIT Lenti transfection reagent (Mirus, MIR 6603), and were concentrated in a Beckman Coulter ultracentrifuge at 16,500 rpm for 1.5 hours. Experiments proceeded only if 90% or more of the cells were infected with lentivirus, as judged by immunofluorescence staining with rabbit anti-Ascl1 (1:200; Abcam, ab74065-100UG) at 48 hours.

V. Example 5

Embodiment 1. A method for predicting whether a perturbation will affect a cellular transition, the method comprising: on a computer system comprising memory and one or more processors: electronically accessing a single-cell transition signature representing a measure of differential cellular component expression between a first cellular state and an altered cellular state, wherein the altered cellular state occurs by a cellular transition from the first cellular state to the altered cellular state, and wherein the single-cell transition signature comprises an identification of a plurality of cellular components and, for each respective cellular component of the plurality of cellular components, a corresponding first significance score quantifying an association between a change in expression of the respective cellular component and a change in cellular state between the first cellular state and the altered cellular state; accessing, in electronic form, a perturbation signature representing a measure of differential cellular component expression between a plurality of undisturbed cells and a plurality of disturbed cells exposed to the perturbation, wherein the perturbation signature comprises an identification of all or a portion of the plurality of cellular components and, for each respective cellular component in the all or the portion of the plurality of cellular components, a corresponding second significance score quantifying an association between (i) a change in expression of the respective cellular component between the plurality of undisturbed cells and the plurality of disturbed cells and (ii) a change in cellular state between the plurality of undisturbed cells and the plurality of disturbed cells; and comparing the single-cell transition signature and the perturbation signature, thereby determining whether the perturbation will affect the cellular transition.

Embodiment 2. the method of embodiment 1, wherein accessing the single cell transition signature comprises: determining the single-cell transition signature based on (i) a first plurality of first single-cell cellular component expression datasets and (ii) a second plurality of second single-cell cellular component expression datasets, wherein: obtaining each respective first single-cell cellular component expression dataset of the first plurality of first single-cell cellular component expression datasets from a corresponding single cell of a first plurality of cells in the first cell state, and obtaining each respective second single-cell cellular component expression dataset of the second plurality of second single-cell cellular component expression datasets from a corresponding single cell of a second plurality of cells in the altered cell state.

Embodiment 3. The method of embodiment 2, wherein: each respective dataset of the first plurality of single-cell cellular component expression datasets comprises a corresponding cellular component vector of a first plurality of cellular component vectors; each respective dataset of the second plurality of single-cell cellular component expression datasets comprises a corresponding cellular component vector of a second plurality of cellular component vectors; each respective cellular component vector of the first plurality of cellular component vectors and the second plurality of cellular component vectors comprises a plurality of elements; and each respective element in the respective cellular component vector is associated with a corresponding cellular component in the plurality of cellular components and comprises a corresponding value representing an amount of the corresponding cellular component in the corresponding single cell represented by the respective dataset of the first plurality of single-cell cellular component expression datasets and the second plurality of single-cell cellular component expression datasets.

Embodiment 4. The method of embodiment 3, further comprising: performing dimension reduction on the first plurality of single-cell cellular component expression datasets and/or the second plurality of single-cell cellular component expression datasets to generate a plurality of dimension reduction components; for each respective cellular component vector in the first plurality of cellular component vectors and the second plurality of cellular component vectors, applying the plurality of dimension reduction components to the respective cellular component vector to form a corresponding dimension-reduced vector comprising a dimension reduction component value for each respective dimension reduction component of the plurality of dimension reduction components, thereby forming corresponding first and second pluralities of dimension-reduced vectors; performing clustering to generate a set of clusters C_j, each cluster containing a plurality of points corresponding to a subset of the first plurality of dimension-reduced vectors and the second plurality of dimension-reduced vectors; identifying the first plurality of cells from a first cluster of the set of clusters C_j; and identifying the second plurality of cells from a second cluster of the set of clusters C_j, the method optionally further comprising performing manifold learning with the corresponding first and second pluralities of dimension-reduced vectors to identify a relative cellular state of each cell of the first and second pluralities of cells relative to each other cell.

Embodiment 5. The method of any one of embodiments 1 to 4, wherein the plurality of undisturbed cells are control cells that have not been exposed to the perturbation, or wherein the undisturbed cells are averages of unrelated disturbed cells that have been exposed to the perturbation.

Embodiment 6. the method of any one of embodiments 1 to 5, further comprising: pruning the single cell transformation signature and the perturbation signature to confine the plurality of cellular components to transcription factors, optionally measured at the RNA level.

Embodiment 7 the method of embodiment 2, wherein said determining said single cell transition characteristic comprises: determining differences in cellular constituent amounts of the plurality of cellular constituents between (i) the first plurality of first single-cell cellular constituent expression datasets and (ii) the second plurality of second single-cell cellular constituent expression datasets using one of a mean difference test, a Wilcoxon rank sum test, a t test, logistic regression, and a generalized linear model.

Embodiment 8. The method of embodiment 1, wherein the measure of differential cellular component expression quantifies a difference in cellular component quantity between (i) a third plurality of third single-cell cellular component expression datasets and (ii) a fourth plurality of fourth single-cell cellular component expression datasets using one of a Wilcoxon rank sum test, a t test, logistic regression, and a generalized linear model, wherein: each respective dataset of the third plurality of single-cell cellular component expression datasets is obtained from a corresponding single cell of the plurality of undisturbed cells, and each respective dataset of the fourth plurality of single-cell cellular component expression datasets is obtained from a corresponding single cell of the plurality of disturbed cells exposed to the perturbation.

Embodiment 9. The method of any one of embodiments 1 to 8, further comprising: filtering the single-cell transition signature and the perturbation signature to reduce the number of cellular components included in the single-cell transition signature and the perturbation signature, optionally wherein the filtering comprises reducing the number of cellular components included in the single-cell transition signature and the perturbation signature according to a threshold p-value or according to a threshold number of cellular components.
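The two optional filtering criteria of embodiment 9 (a threshold p-value or a threshold number of cellular components) can be illustrated with a small helper. This is a sketch under stated assumptions: a signature is represented here as a dictionary mapping a cellular component name to a (direction, p-value) pair, and the function name, parameter names, and toy values are hypothetical.

```python
def filter_signature(signature, max_p=None, top_n=None):
    """Reduce a signature to its most significant cellular components.

    signature: dict mapping component name -> (direction, p_value), where
    direction is +1 (up-regulated) or -1 (down-regulated).
    max_p:  keep only components with p_value <= max_p, if given.
    top_n:  keep at most top_n components with the smallest p-values, if given.
    """
    # Sort by p-value so that truncating keeps the most significant entries.
    ranked = sorted(signature.items(), key=lambda item: item[1][1])
    if max_p is not None:
        ranked = [item for item in ranked if item[1][1] <= max_p]
    if top_n is not None:
        ranked = ranked[:top_n]
    return dict(ranked)


# Toy signature (values invented for illustration only).
toy_signature = {
    "gene_1": (+1, 0.001),
    "gene_2": (+1, 0.20),
    "gene_3": (-1, 0.01),
    "gene_4": (-1, 0.04),
}
print(filter_signature(toy_signature, max_p=0.05))
print(filter_signature(toy_signature, top_n=2))
```

In practice the same filter would be applied to both the transition signature and the perturbation signature before they are compared.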

Embodiment 10. The method of any one of embodiments 1 to 9, wherein determining the corresponding second significance score for a respective cellular component comprises: for each respective cellular component of the plurality of cellular components, replacing the significance score of the respective cellular component with a corresponding match score of the respective cellular component; combining the match scores of the plurality of cellular components to generate a match score for the perturbation; and determining, based on the match score for the perturbation, whether the perturbation is associated with a transition of a cell between the first cell state and the altered cell state, optionally wherein the corresponding match score comprises a discrete or continuous score.

Embodiment 11. The method of embodiment 10, wherein replacing the significance score comprises: replacing the significance score with a first score if the cellular component amount of the respective cellular component from the single-cell transition signature and the cellular component amount of the respective cellular component from the perturbation signature are both up-regulated; replacing the significance score with a second score if the cellular component amount of the respective cellular component from the single-cell transition signature is up-regulated and the cellular component amount of the respective cellular component from the perturbation signature is down-regulated; and replacing the significance score with a third score if the cellular component amount of the respective cellular component from the perturbation signature is not significantly up-regulated or down-regulated.

Embodiment 12. The method of embodiment 10, wherein replacing the significance score comprises: replacing the significance score with a first score if the cellular component amount of the respective cellular component from the single-cell transition signature and the cellular component amount of the respective cellular component from the perturbation signature are both down-regulated; replacing the significance score with a second score if the cellular component amount of the respective cellular component from the single-cell transition signature is down-regulated and the cellular component amount of the respective cellular component from the perturbation signature is up-regulated; and replacing the significance score with a third score if the cellular component amount of the respective cellular component from the perturbation signature is not significantly up-regulated or down-regulated.
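The score replacement of embodiments 10 to 12 can be sketched as follows. The embodiments do not fix the values of the first, second, and third scores or the way the per-component scores are combined; the values +1, -1, and 0 and the use of a plain sum are assumptions made here for illustration, as are the function names. Treating a component that is unchanged in the transition signature as a third-score case is also an assumption, since the embodiments do not address it.

```python
def component_match_score(transition_dir, perturbation_dir,
                          first=1, second=-1, third=0):
    """Match score for one cellular component.

    Directions are +1 (up-regulated), -1 (down-regulated), or 0 (not
    significantly changed).
    """
    if perturbation_dir == 0 or transition_dir == 0:
        return third      # perturbation (or transition) shows no significant change
    if transition_dir == perturbation_dir:
        return first      # both up-regulated, or both down-regulated
    return second         # regulated in opposite directions


def perturbation_match_score(transition_sig, perturbation_sig):
    """Combine per-component scores over the components shared by both signatures."""
    shared = transition_sig.keys() & perturbation_sig.keys()
    return sum(component_match_score(transition_sig[c], perturbation_sig[c])
               for c in shared)


# Toy direction-only signatures (invented for illustration only).
transition_sig = {"gene_1": +1, "gene_2": +1, "gene_3": -1, "gene_4": -1}
promoting = {"gene_1": +1, "gene_2": +1, "gene_3": -1, "gene_4": 0}
opposing = {"gene_1": -1, "gene_2": -1, "gene_3": +1, "gene_4": +1}

print(perturbation_match_score(transition_sig, promoting))
print(perturbation_match_score(transition_sig, opposing))
```

Under these assumed values, a strongly positive total suggests a perturbation that promotes the transition and a strongly negative total suggests one that opposes it. Scoring each perturbation in a library this way and ranking the totals is one way to select the subset described in embodiment 18.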

Embodiment 13 the method of any one of embodiments 1 to 12, wherein the plurality of cellular components comprise a plurality of genes, optionally measured at the RNA level.

Embodiment 14. The method of embodiment 2, wherein each single-cell cellular component expression dataset in the first plurality of first single-cell cellular component expression datasets and the second plurality of second single-cell cellular component expression datasets is generated using a method selected from the group consisting of: single-cell ribonucleic acid (RNA) sequencing (scRNA-seq), scTag-seq, single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), CyTOF/SCoP, E-MS/Abseq, miRNA-seq, CITE-seq, and any combination thereof, as well as summaries thereof, including combinations, such as linear combinations, representing activation pathways in the single-cell cellular component expression dataset.

Embodiment 15 the method of any one of embodiments 1 to 14, further comprising: identifying the perturbation as a perturbation that promotes the altered cellular state based on the comparison, or identifying the perturbation as a perturbation that inhibits the altered cellular state based on the comparison.

Embodiment 16 the method of any one of embodiments 1 to 15, wherein the cellular transformation signature and the perturbation signature are generated using different types of cellular components.

Embodiment 17 the method of any one of embodiments 1 to 16, wherein the cellular transformation signature and the perturbation signature are generated using the same type of cellular component.

Embodiment 18. The method of any one of embodiments 1 to 17, wherein said accessing in electronic form is performed for each respective perturbation of a plurality of perturbations, thereby obtaining a plurality of perturbation signatures, and said comparing is performed between the single-cell transition signature and each respective perturbation signature of the plurality of perturbation signatures, thereby determining a subset of the plurality of perturbations associated with a transition of a cell between the first cell state and the altered cell state.

Embodiment 19 a computer system comprising one or more processors and memory storing instructions for performing the method of any one of embodiments 1-18.

Embodiment 20 a non-transitory computer readable medium storing one or more computer programs executable by a computer comprising one or more processors and memory for predicting whether a perturbation will affect a cellular transformation, the one or more computer programs collectively encoding computer executable instructions for performing the method of any one of embodiments 1-18.

Cited references and alternative embodiments

All references cited herein are incorporated by reference in their entirety and for all purposes to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.

The invention may be implemented as a computer program product comprising a computer program mechanism embedded in a non-transitory computer readable storage medium. For example, the computer program product may include program modules illustrated in any combination of FIG. 1 or FIG. 2. These program modules may be stored on a CD-ROM, DVD, magnetic disk storage product, or any other non-transitory computer readable data or program storage product.

As will be apparent to those skilled in the art, many modifications and variations of the present invention can be made without departing from its spirit and scope. The specific embodiments described herein are provided by way of example only. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The invention is limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled.

Supplementary Table 1
