Mass spectrometry for determining the presence or absence of a chemical element in an analyte

Document No.: 1298345  Publication date: 2020-08-07

Reading note: This technology, "Mass spectrometry for determining the presence or absence of a chemical element in an analyte", was created by 维布克·安德烈亚·蒂姆, 塞巴斯蒂安·魏纳 and 尼古拉斯·谢斯勒 on 2020-01-19. Its main content is as follows: The present invention relates to a mass spectrometry method for determining (predicting) the presence or absence of a chemical element in an analyte, which provides valuable information for reducing the complexity of annotating the analyte with a chemical formula. The method is based on representing the measured isotope pattern of the analyte as a feature vector and assigning the feature vector to a presence/absence classification using a machine learning algorithm, such as a support vector machine (SVM) or an artificial neural network (NN).

1. A mass spectrometry method for determining the presence or absence of a chemical element in an analyte comprising the steps of:

(a) generating analyte ions;

(b) measuring an isotope pattern of the analyte ion by mass spectrometry, wherein the isotope pattern includes a plurality of isotope peaks, and each isotope peak is characterized by a mass value and an intensity value;

(c) representing the isotope pattern as a feature vector;

(d) applying the feature vector to a supervised element classifier which assigns the feature vector to a first classification (chemical element present) or a second classification (chemical element absent), wherein the supervised element classifier is trained on a set of feature vectors representing isotope patterns of compounds with known elemental composition, and wherein the chemical element is present in a proper subset of the compounds.

2. The method of claim 1, wherein each feature vector representing a corresponding isotope pattern includes the mass values and the normalized intensity values of the isotope peaks.

3. The method of claim 2, wherein each feature vector representing a corresponding isotope pattern includes the mass value of the monoisotopic peak, the mass differences between the monoisotopic peak and the other isotope peaks, and the normalized intensity values of the isotope peaks.

4. The method of claim 3, wherein each feature vector further comprises the mass difference between the monoisotopic peak and the nominal mass.

5. The method of claim 4, wherein the feature vectors are arranged as follows: [m_0, ŝ_0, d(m_0, m_i), ŝ_i, d(m_0, M_0)], where i = 1 … N, wherein m_0 is the mass value of the monoisotopic peak, ŝ_0 is the normalized intensity value of the monoisotopic peak, d(m_0, m_i) is the mass difference between the monoisotopic peak and the i-th isotope peak, ŝ_i is the normalized intensity value of the i-th isotope peak, and d(m_0, M_0) is the difference between the mass value of the monoisotopic peak and the nominal mass M_0.

6. The method of claims 2 to 5, wherein the normalized intensity values ŝ_i of the feature vectors are calculated from the intensity values s_i of the corresponding isotope peaks by using a p-norm:

ŝ_i = s_i / ||s||_p, where ||s||_p = (Σ_i |s_i|^p)^(1/p) and p ≥ 1.

7. The method of claim 1, wherein each feature vector representing a corresponding isotope pattern includes the mass values of the isotope peaks and transformed intensity values.

8. The method of claim 7, wherein the intensity values of the isotope peaks of the corresponding isotope pattern are transformed by a centered log-ratio (clr) transform or by an isometric log-ratio (ilr) transform.

9. The method of claim 8, wherein the feature vectors are arranged as follows: [m_0, clr_0, d(m_0, m_i), clr_i, d(m_0, M_0)], where i = 1 … N,

wherein m_0 is the mass value of the monoisotopic peak, clr_0 is the clr-transformed intensity value of the monoisotopic peak, d(m_0, m_i) is the mass difference between the monoisotopic peak and the i-th isotope peak, clr_i is the clr-transformed intensity value of the i-th isotope peak, and d(m_0, M_0) is the difference between the mass value of the monoisotopic peak and the nominal mass, and

wherein the clr transform is defined by:

clr_i = log(s_i / (s_0 · s_1 · … · s_N)^(1/(N+1))), where s_{i=0…N} are the intensity values of the isotope peaks.

10. The method of claim 9, wherein the feature vectorAndis arranged as follows:

[m0,ilr0,d(m0,mi),ilri,d(m0,mN),d(m0,M0)]wherein i is 1 … N-1,

wherein m is0Is the mass value of the monoisotopic peak, ilriIs the ilr transformed intensity value of the isotope peak, d (m)0,mi) Is the mass difference between the monoisotopic peak and the ith isotopic peak, and d (m)0,M0) Is the difference between the mass value of the monoisotopic peak and the nominal mass, an

Wherein the clr transform is defined by:

wherein the content of the first and second substances,the balance matrix B with reduced dimensionality is dim (B) (N +1) × N, andB·B TI N

11. The method of claim 1, wherein the supervised element classifier is one of a support vector machine (SVM), an artificial neural network (NN) and a random forest (RF, random decision forest) classifier.

12. The method of claim 11, wherein intrinsic parameters (hyper-parameters) of the supervised element classifier are optimized during training of the supervised element classifier.

13. The method of claim 1, wherein the representation of the isotope patterns as feature vectors is optimized during training of the supervised element classifier.

14. The method of claim 13, wherein the selection of features or the estimation of feature importance is performed during training of the supervised element classifier.

15. The method of claim 1, wherein the chemical element is one of Br, Cl, S, I, F, P, K, Na, and Pt.

16. The method of claim 15, wherein, in step (d), the first classification corresponds to the presence of two or more of the chemical elements and the second classification corresponds to the absence of the two or more chemical elements, wherein the supervised element classifier is trained on a set of feature vectors representing isotope patterns of compounds with known elemental composition, and wherein the two or more chemical elements are present in a proper subset of the compounds.

17. The method of claim 1, wherein isotopic patterns of compounds used to train the supervised element classifier are derived theoretically.

18. The method of claim 1, wherein the isotope patterns of the compounds used to train the supervised element classifier are measured experimentally.

19. The method of claim 18, wherein the isotope patterns of the compounds used to train the supervised element classifier and the isotope pattern of the analyte ions are measured on the same mass spectrometry system.

20. The method of claim 1, wherein the determination of whether the chemical element is present is used to reduce or increase the number of chemical elements considered during annotation of a chemical formula to the analyte.

Technical Field

The present invention relates to mass spectrometry for determining the presence or absence of a chemical element in a compound.

Background

Mass spectrometry (MS) is a widely used analytical method for the qualitative and quantitative identification of compounds in a variety of fields, including metabolomics, proteomics, pesticide analysis, natural compound identification and pharmaceutics.

Mass spectrometry techniques involve converting the compounds of a sample into the gas phase, ionizing the compounds in an ion source, separating the molecular ions of the compounds according to their mass-to-charge ratio in a mass analyzer, and detecting the separated molecular ions. The mass-to-charge ratio is generally represented by the symbol "m/z", where "m" is the mass of the ion and "z" is the number of elementary charges of the molecular ion. Alternatively, the molecular ions may be fragmented to form fragment ions, which are then separated according to their mass-to-charge ratio and detected.

There are many different types of ion sources, such as chemical ionization, fast atom bombardment, matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI), and there are many different types of mass analyzers, such as quadrupole mass filters, time-of-flight mass analyzers with orthogonal ion injection, RF ion traps, DC ion traps (such as the Orbitrap or the Cassini trap) and ion cyclotron resonance traps.

The measured mass spectrum includes peaks (signals) of molecular ions, and the mass-to-charge ratio of each peak is shown on the abscissa and the corresponding relative intensity of the peak is shown on the ordinate. Each peak is characterized by an m/z value and an intensity value (signal height).

Due to the isotopes of the chemical elements, the mass spectrum of the molecular ions of a single compound presents a set of peaks with different m/z values. This set of (isotope) peaks is called the "isotope pattern". Compounds having the same chemical formula show the same isotope pattern. The measured intensity of an isotope peak is related to the abundance of the particular molecular ion within the isotope pattern of the compound. The relative intensities of the isotope peaks correlate with the relative abundances of the isotopes.

The measured isotope pattern of a compound can be used to annotate the compound with a chemical formula and is typically used to identify the compound. While this is easily done for very small molecules, it quickly becomes a difficult task for larger molecules.

The mass spectra obtained for complex mixtures of compounds contain multiple isotope patterns. Matching the measured isotope patterns in terms of m/z and intensity values with theoretically derived isotope patterns in order to annotate the corresponding compound with a chemical formula is not an easy task. First, a set of isotope patterns is calculated for subsequent comparison with the measured isotope patterns. Conceptually, this is done by convolving the isotope patterns of preselected chemical elements for all possible element combinations (chemical formulae) that match the m/z value of the monoisotopic peak within a predetermined mass tolerance. For each of the possible chemical formulae, the isotope pattern is calculated and compared to the measured isotope pattern. The comparison can be done in different ways, e.g., using a Pearson χ² statistic.
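
As an illustration of such a pattern comparison, the following minimal Python sketch scores a measured isotope pattern against a theoretical one with a Pearson χ²-like statistic; the peak lists, the mass tolerance and the helper name are illustrative assumptions, not part of the invention.

    # Minimal sketch: score a measured isotope pattern against a theoretical one
    # with a Pearson chi-squared statistic (smaller score = better match).
    # The peak lists and the mass tolerance below are illustrative assumptions.
    def chi_squared_match(measured, theoretical, mz_tol=0.005):
        """measured / theoretical: lists of (m/z, relative intensity) tuples."""
        score = 0.0
        for mz_t, int_t in theoretical:
            # intensity of the closest measured peak within the mass tolerance
            hits = [inten for mz_m, inten in measured if abs(mz_m - mz_t) <= mz_tol]
            int_m = max(hits) if hits else 0.0
            if int_t > 0:
                score += (int_m - int_t) ** 2 / int_t
        return score

    measured_pattern    = [(263.0637, 100.0), (264.0668, 11.2), (265.0610, 4.9)]
    theoretical_pattern = [(263.0636, 100.0), (264.0667, 11.0), (265.0607, 4.8)]
    print(chi_squared_match(measured_pattern, theoretical_pattern))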

However, the amount of possible chemical formula becomes very large due to the combinatorics of the chemical elements involved. The number of possible chemical formulae around a particular m/z value and within a predetermined mass tolerance can be calculated for different sets of chemical elements. FIG. 1 shows the number of possible formulas within a mass tolerance of 5mDa in the m/z range between 100 and 600Da for three groups of chemical elements of interest ({ C, H, N, O }, { C, H, N, S, K, Cl }, { C, H, N, O, P, S, NA, K, Cl, BR, F, I }.

Since not all theoretically derived chemical formulas are chemically valid, the possible chemical formulas can be reduced by applying heuristic rules. However, since the possible chemical formulae grow exponentially with m/z, the number of remaining candidates can still be very large.

Current software tools typically rely on pattern comparisons as described above. This method has been refined, and new methods for calculating isotope patterns, for example using Markov chains, have been proposed which reduce the computational cost by making trade-offs. However, they do not solve the problem of the exponentially growing number of possible chemical formulae. The set of chemical elements used to calculate the possible chemical formulae is at the core of the combinatorial problem. Providing or excluding certain elements in advance reduces the number of possible chemical formulae to be calculated and matched. Therefore, there is a need to determine (predict) the chemical elements present in a compound in order to reduce the complexity of annotating the compound with a chemical formula.

Disclosure of Invention

The invention provides a mass spectrometry method for determining the presence or absence of a chemical element in an analyte, comprising the steps of:

(a) generating molecular ions of the analyte;

(b) measuring an isotope pattern of the molecular ion by mass spectrometry, wherein the isotope pattern includes a plurality of isotope peaks, and each isotope peak is characterized by a mass value and an intensity value;

(c) representing the isotope pattern as a feature vector;

(d) applying the feature vector to a supervised element classifier which assigns the feature vector to a first classification (chemical element present) or a second classification (chemical element absent), wherein the supervised element classifier is trained on a set of feature vectors representing isotope patterns of compounds with known elemental composition, and wherein the chemical element is present in a proper subset of the compounds.

The set of compounds with known elemental composition includes a proper subset of compounds in which the chemical element is present and a proper subset of compounds in which the chemical element is not present, i.e., neither subset is empty. Preferably, the compounds are distributed between the two subsets in a ratio of at least 20/80. More preferably, the ratio is substantially 50/50. The molecular mass of these compounds is preferably less than 1000 Da, more preferably less than 600 Da, in particular between 100 and 600 Da. The mass spectrometric measurement of the isotope pattern of the analyte can, for example, be performed at a mass resolution R ≤ 100,000 (in particular R ≤ 50,000, more particularly R ≤ 25,000).

The isotope patterns of known compounds, collected for example in the KEGG database (Kyoto Encyclopedia of Genes and Genomes), can be used to select compounds containing the chemical element to be determined. Theoretically derived isotope patterns can also be used; these can be selected by applying known chemical construction rules, such as the "Lewis rule" and the "nitrogen rule", to the respective chemical formulae.

The chemical element to be determined is preferably one of Br, Cl, S, I, F, P, K, Na and Pt. The assignment in step (d) may be performed for a plurality of chemical elements by using different supervised element classifiers to achieve a multi-element determination. Preferably, the supervised element classifier inherently performs a multi-label classification for a set of two or more chemical elements. The assigned classification may also correspond to the presence or absence of a set of two or more chemical elements, wherein the supervised element classifier is trained on a set of feature vectors representing isotope patterns of compounds with known elemental composition, and wherein the two or more chemical elements are present in a proper subset of the compounds.

In a first embodiment, each feature vector representing a corresponding isotope pattern includes the mass values and the normalized intensity values of the isotope peaks. The feature vectors preferably include the mass value of the monoisotopic peak, the mass differences between the monoisotopic peak and the other isotope peaks, and the normalized intensity values of the isotope peaks. More preferably, each feature vector further comprises the mass difference between the monoisotopic peak and the nominal mass.

Each feature vector may, for example, be arranged as follows:

[m_0, ŝ_0, d(m_0, m_i), ŝ_i, d(m_0, M_0)], where i = 1 … N,

wherein m_0 is the mass value of the monoisotopic peak, ŝ_0 is the normalized intensity value of the monoisotopic peak, d(m_0, m_i) is the mass difference between the monoisotopic peak and the i-th isotope peak, ŝ_i is the normalized intensity value of the i-th isotope peak, and d(m_0, M_0) is the difference between the mass value of the monoisotopic peak and the nominal mass M_0. The difference is preferably the result of a numerical subtraction, but may be a more general distance measure. N is preferably greater than 1, more preferably greater than 4, in particular equal to 9. For N = 2, the feature vector reads [m_0, ŝ_0, d(m_0, m_1), ŝ_1, d(m_0, m_2), ŝ_2, d(m_0, M_0)].

The normalized intensity values ŝ_i of the feature vectors are calculated from the intensity values s_i of the corresponding isotope peaks by using a p-norm:

ŝ_i = s_i / ||s||_p, where ||s||_p = (Σ_i |s_i|^p)^(1/p) and p ≥ 1, in particular p = 1.
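
The following minimal Python sketch shows how such a feature vector could be assembled from a measured peak list for p = 1 (closure); the peak list, the value of N and the function name are illustrative assumptions rather than a prescribed implementation.

    import numpy as np

    def closure_feature_vector(masses, intensities, nominal_mass, p=1):
        """masses, intensities: isotope peaks m_0..m_N, monoisotopic peak first.
        Returns [m_0, s^_0, d(m_0,m_1), s^_1, ..., d(m_0,m_N), s^_N, d(m_0,M_0)]."""
        s = np.asarray(intensities, dtype=float)
        s_norm = s / np.linalg.norm(s, ord=p)      # p-norm normalization (closure for p = 1)
        m0 = masses[0]
        features = [m0, s_norm[0]]
        for mi, si in zip(masses[1:], s_norm[1:]):
            features += [mi - m0, si]              # d(m_0, m_i) and normalized intensity
        features.append(m0 - nominal_mass)         # d(m_0, M_0), the mass defect
        return np.array(features)

    # Example with N = 2 (three isotope peaks); the values are illustrative.
    v = closure_feature_vector([263.0636, 264.0667, 265.0607], [100.0, 11.0, 4.8], 263)
    print(v)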

In a second embodiment, each feature vector representing a corresponding isotope pattern includes the mass values of the isotope peaks and transformed intensity values. Preferably, the intensity values of the isotope peaks of the corresponding isotope pattern are transformed by a centered log-ratio (clr) transform or by an isometric log-ratio (ilr) transform.

For the clr transform, the feature vectors are arranged as follows:

[m_0, clr_0, d(m_0, m_i), clr_i, d(m_0, M_0)], where i = 1 … N,

wherein m_0 is the mass value of the monoisotopic peak, clr_0 is the clr-transformed intensity value of the monoisotopic peak, d(m_0, m_i) is the mass difference between the monoisotopic peak and the i-th isotope peak, clr_i is the clr-transformed intensity value of the i-th isotope peak, and d(m_0, M_0) is the difference between the mass value of the monoisotopic peak and the nominal mass, and

wherein the clr transform is defined by:

clr_i = log(s_i / (s_0 · s_1 · … · s_N)^(1/(N+1))), where s_{i=0…N} are the intensity values of the isotope peaks.

N is preferably greater than 1, more preferably greater than 4, in particular equal to 9. For N = 2, the feature vector reads [m_0, clr_0, d(m_0, m_1), clr_1, d(m_0, m_2), clr_2, d(m_0, M_0)].
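
A minimal sketch of the clr transform as defined above, computing the geometric mean in log space for numerical stability; it assumes that all peak intensities are strictly positive.

    import numpy as np

    def clr_transform(intensities):
        """Centered log-ratio transform of the isotope peak intensities s_0..s_N:
        clr_i = log(s_i / geometric_mean(s)). Assumes all intensities > 0."""
        log_s = np.log(np.asarray(intensities, dtype=float))
        return log_s - log_s.mean()        # log(s_i) - log(geometric mean of s)

    print(clr_transform([100.0, 11.0, 4.8]))   # N = 2 example; components sum to 0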

For the ilr transform, the feature vectors are arranged as follows:

[m_0, ilr_0, d(m_0, m_i), ilr_i, d(m_0, m_N), d(m_0, M_0)], where i = 1 … N−1,

wherein m_0 is the mass value of the monoisotopic peak, ilr_i are the ilr-transformed intensity values of the isotope peaks, d(m_0, m_i) is the mass difference between the monoisotopic peak and the i-th isotope peak, and d(m_0, M_0) is the difference between the mass value of the monoisotopic peak and the nominal mass, and

wherein the ilr transform is defined by:

ilr = Bᵀ · clr, wherein the dimension-reducing balance matrix B has dim(B) = (N+1) × N and satisfies Bᵀ · B = I_N.

N is preferably greater than 1, more preferably greater than 4, in particular equal to 9. For N = 2, the feature vector reads [m_0, ilr_0, d(m_0, m_1), ilr_1, d(m_0, m_2), d(m_0, M_0)].
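
A minimal sketch of the ilr transform using a Helmert-type balance matrix with orthonormal columns (dim(B) = (N+1) × N, Bᵀ·B = I_N); the specific choice of B is an assumption, since any orthonormal balance matrix gives an equivalent representation up to rotation.

    import numpy as np

    def helmert_balance_matrix(n_peaks):
        """(n_peaks x (n_peaks - 1)) balance matrix with orthonormal columns
        that are orthogonal to the constant vector (Helmert contrasts)."""
        D = n_peaks
        B = np.zeros((D, D - 1))
        for j in range(1, D):
            B[:j, j - 1] = 1.0 / np.sqrt(j * (j + 1))
            B[j, j - 1] = -j / np.sqrt(j * (j + 1))
        return B

    def ilr_transform(intensities):
        """Isometric log-ratio transform ilr = B^T . clr(s): N values for N+1 peaks."""
        s = np.asarray(intensities, dtype=float)
        clr = np.log(s) - np.log(s).mean()
        return helmert_balance_matrix(len(s)).T @ clr

    print(ilr_transform([100.0, 11.0, 4.8]))   # two ilr values for three isotope peaks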

Preferably, the intrinsic parameters (hyper-parameters) of the supervised element classifier are optimized during its training, e.g. by using one of particle swarm optimization, an evolutionary algorithm, a genetic algorithm, multi-start optimization, simulated annealing and pattern search.

The representation of the isotope patterns as feature vectors may likewise be optimized during training, for example with respect to the dimension of the feature vectors, the normalization or transformation of the measured intensity values, or the arrangement of the components of the feature vectors.

The isotope pattern of the analyte is preferably measured with a mass analyzer coupled to an upstream ion mobility analyzer and/or a gas or liquid chromatograph. Preferably, the mass analyzer is a time-of-flight mass analyzer with orthogonal ion injection (OTOF). More preferably, the OTOF is coupled to an ion mobility analyzer, in particular to a TIMS analyzer (trapped ion mobility spectrometry).

In another aspect, the result of determining whether a chemical element is present according to the present invention is used to reduce or increase the number of chemical elements considered during the annotation of a chemical formula to an analyte, in particular during the calculation of the set of isotope patterns for subsequent comparison with the measured isotope patterns. The isotope patterns of the analyte ions are preferably measured during an LC or GC separation, and more preferably during a coupled LC-IMS or GC-IMS separation.

Drawings

FIG. 1 shows the number of possible chemical formulae within a mass tolerance of 5 mDa in the m/z range between 100 and 600 Da for three groups of chemical elements of interest ({C, H, N, O}, {C, H, N, O, P, S, Na, K, Cl} and {C, H, N, O, P, S, Na, K, Cl, Br, F, I}).

Fig. 2 shows a flow chart of a method according to the invention.

Fig. 3 shows the number of experimentally measured compounds (positives and negatives) for each chemical element of interest, provided in equal amounts for training and validation. The data set was divided into 80%/20% for training and validation of the supervised element classifier.

Fig. 4 shows the results of a soft-margin RBF-kernel SVM trained on experimental data and optimized by particle swarm optimization. The measured intensity values of the isotope patterns are normalized by a p-norm (closure) with p = 1. The results include the accuracy, sensitivity and specificity of the correct classification and the complete confusion matrix.

Fig. 5 shows the results of a soft-margin RBF-kernel SVM trained on experimental data and optimized by particle swarm optimization. The measured intensity values of the isotope patterns are transformed by a centered log-ratio (clr) transform. The results include the accuracy, sensitivity and specificity of the correct classification and the complete confusion matrix.

Fig. 6 shows the results of a soft-margin RBF-kernel SVM trained on experimental data and optimized by particle swarm optimization. The measured intensity values of the isotope patterns are transformed by an isometric log-ratio (ilr) transform. The results include the accuracy, sensitivity and specificity of the correct classification and the complete confusion matrix.

FIG. 7 shows a schematic diagram of a dense feed-forward neural network with bias. The numbers in neurons describe the indices of the neurons, not their values.

FIG. 8 shows the results of a dense feed-forward artificial neural network trained on experimental data and optimized by an evolutionary algorithm. The measured intensity values of the isotope patterns are normalized by a p-norm (closure) with p = 1. The results include the accuracy, sensitivity and specificity of the correct classification and the complete confusion matrix.

FIG. 9 shows the results of a dense feed-forward artificial neural network trained on experimental data and optimized by an evolutionary algorithm. The measured intensity values of the isotope patterns are transformed by a centered log-ratio (clr) transform. The results include the accuracy, sensitivity and specificity of the correct classification and the complete confusion matrix.

FIG. 10 shows the results of a dense feed-forward artificial neural network trained on experimental data and optimized by an evolutionary algorithm. The measured intensity values of the isotope patterns are transformed by an isometric log-ratio (ilr) transform. The results include the accuracy, sensitivity and specificity of the correct classification and the complete confusion matrix.

Detailed Description

While the invention has been shown and described with reference to a number of different embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention as defined by the appended claims.

The elemental composition is at the core of the combinatorial problem of generating possible chemical formulae for a given m/z value. It is within the scope of the invention to predict the chemical elements contained in an analyte from the measured isotope pattern of the analyte and thus to determine the elemental composition of the analyte for the subsequent generation of possible chemical formulae. Providing or excluding certain chemical elements reduces the number of possible chemical formulae to be calculated and compared. According to the present invention, this problem is addressed by machine learning using supervised classifiers.

In addition to reducing the complexity of the annotation process, the method according to the invention makes it possible to specifically select and examine only certain isotope patterns, and thus only compounds of interest, based on the presence of specific chemical elements.

Definitions

The term "mass value" is used interchangeably herein for the mass-to-charge ratio (m/z value) of a molecular ion and for the molecular mass of the corresponding compound. The mass-to-charge ratio of a molecular ion can be converted to the molecular mass of the corresponding compound, for example, by charge deconvolution.

The "nominal mass" of a chemical element is the mass number of its most abundant naturally occurring stable isotope. For molecular ions or molecules, the nominal mass is the sum of the nominal masses of the constituent atoms. For example, carbon has two stable isotopes, 12C at 98.9% natural abundance and 13C at 1.1% natural abundance, so the nominal mass of the carbon is 12.

The mass of a "monoisotopic peak" is the sum of the masses of the atoms in the molecule using the masses of the main (most abundant) isotopes of each chemical element. The difference between the nominal mass and the monoisotopic mass is called the mass deficit.

A "confusion matrix" is a table that allows visualization of the performance of a classifier (typically a supervised classifier). Each row of the confusion matrix represents an instance in the predicted classification, while each column represents an instance in the actual classification:

Support Vector Machine (SVM):

A support vector machine (SVM) is a supervised machine learning method that can be used for classification. During training, the SVM constructs a hyperplane in the high-dimensional data space that separates the labeled training data points according to their classification labels. The parameters of the hyperplane are optimized such that the distance to the nearest training data points of either classification (the so-called margin) is maximized. An important consequence of this geometric description is that the maximum-margin hyperplane is entirely determined by those data points located closest to it. These data points are called support vectors. Unlabeled data points to be classified after training are assigned by determining on which side of the hyperplane they are located. Once properly trained, unlabeled data points can be assigned to a classification quickly and with low computational effort.

The SVM can be extended to situations in which the data cannot be separated linearly, for example by introducing a so-called "soft margin". The soft margin allows training data points not to be separated exactly by the hyperplane. An internal untrained parameter (hyper-parameter) of the SVM determines the trade-off between increasing the margin and ensuring that all training data points lie on the correct side of the margin.

SVMs may also be generalized by applying the so-called kernel trick, by which the data points of the input space are transformed into a transformed feature space. The transformation allows a maximum-margin hyperplane to be fitted in the transformed feature space. The transformation may be non-linear, and the transformed feature space may have a higher dimension than the input space. Although the classifier is based on a separating hyperplane in the transformed feature space, it may be non-linear in the original input space. Non-linear kernel functions may comprise additional hyper-parameters (untrained predetermined parameters). Common kernel functions include, for example, polynomials (homogeneous or inhomogeneous), radial basis functions (RBF) and the hyperbolic tangent function.

Artificial Neural Network (ANN)

Artificial neural networks (ANN) are systems inspired by biological neural networks. An ANN is typically based on many connected nodes (artificial neurons). Each connection (edge) between artificial neurons, analogous to a synapse in a biological neural network, can send a signal from one artificial neuron to another. An artificial neuron receiving a signal may process it and then signal further artificial neurons connected to it. The output of each artificial neuron is computed by some non-linear function (activation function) of the sum of its inputs. An artificial neuron may have a threshold such that a signal is only sent if the sum of the inputs exceeds the threshold.

Typically, the artificial neurons are aggregated into layers. Different layers may perform different types of transformations on their inputs. The signal travels from the first layer (the input layer) to the last layer (the output layer), possibly after passing through multiple hidden layers.

Connections between artificial neurons typically have weights that are adjusted during training. The weights increase or decrease the signal strength at the connection. Many algorithms are available for training neural network models. Most of them can be considered as optimizations that take some form of gradient descent and use back propagation to calculate the actual gradient.

Artificial neural networks typically include a plurality of hyper-parameters, in particular more hyper-parameters than SVMs. The hyper-parameters of an artificial neural network may relate to the structure of the network itself (e.g. the number of hidden layers, the number of nodes, the biases of nodes or layers), as well as to the parameters of the activation functions of the nodes and to regularization parameters that penalize overly complex decision boundaries in order to counteract overfitting.

Example 1

Here, the supervised element classifier is a support vector machine (SVM) with a soft margin and an RBF kernel. The hyper-parameters are associated with the soft margin and the RBF kernel and are optimized by particle swarm optimization during training. Experimentally measured isotope patterns were used to train and validate the SVM.

The experimental data were obtained from measurements on an OTOF mass spectrometer with an electrospray source coupled to an LC system. The compounds with known elemental composition belong to different classes of compounds: coffee metabolites, synthetic molecules, pesticides and toxic substances.

The element determination is here applied only to compounds with molecular masses below 600 Da. The training data set was balanced with equal amounts of compounds containing the element (positives) and compounds not containing the element (negatives). The chemical elements of interest are Br, Cl, S, I, F, P, K and Na. The elements C, H, N and O are almost always present and are therefore not part of the classification. The selection of the elements of interest is based on their presence in the experimental data and in most biomolecules. Fig. 3 shows the number of compounds (positives and negatives) used to train and validate the SVM for each chemical element of interest. The data set was divided into 80% for training and 20% for validation.

isotope patterns are represented in three different ways by using p normalization with p ═ 1 (called closure), central logarithmic ratio transformation (called clr) and equidistant logarithmic ratio transformation (called ilr). For closure and clr representation, the feature vectors are arranged as follows: [ m ] of0,Int0,mi-m0,Inti,mDef]Wherein i is 1 … 9, wherein m0And miIs the isotope peak mass value, mDef is mass loss, Int0And IntiIs derived from measured intensity values siA calculated normalized or transformed intensity value. For the ilr representation, the feature vector does not include Int9And (4) components. The length of the feature vector is 21 (closure and clr) and 20 (ilr). The hyperparameters of the SVM are optimized individually for each representation.

Figs. 4 to 6 show the results of the soft-margin RBF-kernel SVM trained on the experimental data and optimized by particle swarm optimization. The results include the accuracy, sensitivity and specificity of the correct classification and the complete confusion matrix. In Fig. 4, the measured intensity values of the isotope patterns are normalized by a p-norm (closure) with p = 1. In Fig. 5, the measured intensity values of the isotope patterns are transformed by the centered log-ratio (clr) transform. In Fig. 6, the measured intensity values of the isotope patterns are transformed by the isometric log-ratio (ilr) transform.

Example 2

Here, the supervised element classifier is a dense feed-forward artificial neural network (ANN) with biases, as shown in Fig. 7. In a dense network, each layer is fully connected to the next layer. The activation function of the ANN is the rectified linear unit, f(x) = max(0, x).

the prediction of the validation data set is made by a feed forward path through the ANN.

Experimentally measured isotope patterns were used to train and validate the ANN. The experimental data and the representation of the isotope patterns are the same as in Example 1.

During training, the feature vectors are submitted to the ANN in batches. A batch is a subset of all the feature vectors used to train the ANN. Once a batch has passed through the ANN, backpropagation takes place: the errors of the current predictions are propagated back through the ANN, and the weights are updated by adjusting their values in small steps along the gradient that reduces the error. The weights are adjusted for a given set of hyper-parameters.

The hyper-parameters of the ANN are the regularization parameter, the number of hidden layers and the number of artificial neurons in the hidden layers. They are optimized using an evolutionary algorithm.
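
A minimal sketch of a dense feed-forward network with ReLU activations and biases for this binary classification, written with scikit-learn's MLPClassifier; the layer sizes, batch size and L2 regularization value are illustrative stand-ins for the hyper-parameters that the patent tunes with an evolutionary algorithm, and the random data again stand in for the measured feature vectors.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix

    # X: feature vectors (closure/clr: 21 values, ilr: 20), y: element present (1) / absent (0).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 21))
    y = rng.integers(0, 2, size=400)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=1)

    # Dense feed-forward ANN with ReLU activation and biases; the weights are updated
    # batch-wise by backpropagation. hidden_layer_sizes and alpha (L2 regularization)
    # stand in for the hyper-parameters optimized in the patent by an evolutionary algorithm.
    ann = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                      alpha=1e-3, batch_size=32, max_iter=500, random_state=1),
    )
    ann.fit(X_train, y_train)
    print(confusion_matrix(y_val, ann.predict(X_val)))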

Figs. 8 to 10 show the results of the ANN. The results include the accuracy, sensitivity and specificity of the correct classification and the complete confusion matrix. In Fig. 8, the measured intensity values of the isotope patterns are normalized by a p-norm (closure) with p = 1. In Fig. 9, the measured intensity values of the isotope patterns are transformed by the centered log-ratio (clr) transform. In Fig. 10, the measured intensity values of the isotope patterns are transformed by the isometric log-ratio (ilr) transform.

The results of both examples show that the machine learning algorithms used achieve good results for element prediction from mass spectrometric signals. The SVMs are more efficient than the ANNs. Predictions for polyisotopic chemical elements are generally more accurate than predictions for monoisotopic chemical elements.

One use case is reducing the number of chemical elements considered during the annotation of a chemical formula to a measured analyte: an element is removed from consideration if it is predicted to be absent. However, elements that are actually present in the analyte must not be removed from consideration during the annotation, since otherwise the correct match cannot be found. For this use case, the negative predictive value (NPV) of the classifier is important. It is the fraction of negative predictions that are correct, i.e. the proportion of true negatives among all predicted negatives.

The SVM classifiers show an NPV of 89–100% for the polyisotopic chemical elements. The NPV of the ANNs is generally poorer.

The positive predictive value (PPV) is important for the opposite use case, in which elements are proposed (added) during the annotation of a chemical formula to the measured analyte. The PPV is the fraction of positive predictions that are correct, i.e. the proportion of true positives among all predicted positives. Chemical elements that are suggested but are not actually part of the analyte lead to additional false-positive chemical formulae and increase the overall complexity. Therefore, a classifier for this use case needs a high PPV.

The SVM classifiers show a PPV of 89% or more for the polyisotopic chemical elements. The PPV of the ANNs is generally poorer.
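
For reference, a minimal sketch of how NPV and PPV (together with the accuracy, sensitivity and specificity reported in Figs. 4 to 10) follow from a binary confusion matrix; the counts used in the example are illustrative placeholders, not values taken from the figures.

    def classifier_metrics(tp, fp, fn, tn):
        """Binary confusion-matrix metrics; the counts are illustrative placeholders."""
        return {
            "accuracy":    (tp + tn) / (tp + fp + fn + tn),
            "sensitivity": tp / (tp + fn),   # true positive rate
            "specificity": tn / (tn + fp),   # true negative rate
            "PPV":         tp / (tp + fp),   # fraction of positive predictions that are correct
            "NPV":         tn / (tn + fn),   # fraction of negative predictions that are correct
        }

    print(classifier_metrics(tp=45, fp=5, fn=3, tn=47))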

The present invention has been shown and described above with reference to a number of different embodiments thereof. However, it will be apparent to one skilled in the art that aspects or details of the invention may be changed, if practicable, or may be combined in any way in different embodiments without departing from the scope of the invention. In general, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation, the invention being defined solely by the claims that follow, the invention including any equivalents thereof.
