Feature extraction method, related device, equipment and storage medium

Document No.: 70739 Publication date: 2021-10-01

Reading note: This technology, Feature extraction method, related device, equipment and storage medium (一种特征提取方法、相关装置、设备和存储介质), was designed by 毕研广 and 胡志强 on 2021-06-29. Main content: The application discloses a feature extraction method and a related device, equipment and storage medium. The feature extraction method comprises: determining an initialization feature of each amino acid based on obtained amino acid sequence information; obtaining a global feature of each amino acid in the amino acid sequence based on the initialization feature of each amino acid; performing feature fusion on the global feature of each amino acid and the global features of the other amino acids in the amino acid sequence to obtain a pairwise feature map composed of the fused features of each pair of amino acids; and obtaining a protein feature vector based on the spatial features of the pairwise feature map. With this scheme, feature learning can be performed directly between any amino acids in the amino acid sequence, and the spatial structure information of the protein is reflected in the protein feature vector.

1. A method of feature extraction, the method comprising:

determining an initialization feature of each amino acid based on the obtained amino acid sequence information;

obtaining a global feature of each amino acid in the amino acid sequence based on the initialized feature of each amino acid;

performing feature fusion on the global features of each amino acid and the global features of other amino acids in the amino acid sequence to obtain a pairwise feature map consisting of the fusion features of each pair of amino acids;

and obtaining a protein feature vector based on the spatial features of the pairwise feature map.

2. The feature extraction method according to claim 1, wherein the determining the initialization feature of each amino acid based on the obtained amino acid sequence information comprises:

inputting the amino acid sequence information into a one-dimensional convolution network, and extracting the initialization feature of each amino acid, wherein the initialization feature comprises local feature information of the amino acid in the amino acid sequence.

3. The feature extraction method according to any one of claims 1 to 2,

wherein the obtaining a global feature of each amino acid in the amino acid sequence based on the initialization feature of each amino acid comprises:

inputting the initialized features of each amino acid into a recurrent neural network;

obtaining the global feature of each amino acid in the amino acid sequence based on the position information of each amino acid in the amino acid sequence and the initialization feature.

4. The feature extraction method according to claim 3,

the obtaining of the global feature of each amino acid in the amino acid sequence based on the position information of each amino acid in the amino acid sequence and the initialization feature comprises:

traversing the amino acid sequence from left to right based on the initialized feature of each amino acid to obtain a first global feature of each amino acid;

traversing the sequence of amino acids from right to left based on the initialized feature for each amino acid to obtain a second global feature for each amino acid;

and fusing the first global feature and the second global feature of each amino acid to obtain the global feature of each amino acid in the amino acid sequence.

5. The feature extraction method according to any one of claims 1 to 4,

wherein the performing feature fusion on the global feature of each amino acid and the global features of other amino acids in the amino acid sequence to obtain a pairwise feature map consisting of the fusion features of each pair of amino acids comprises:

fusing the global features of each amino acid with the global features of other amino acids to obtain fused features of each pair of amino acids;

and learning the fusion features of each pair of amino acids through a shared perceptron to construct a pairwise feature map in two-dimensional space.

6. The feature extraction method according to any one of claims 1 to 5,

obtaining a protein feature vector based on the spatial features of the pair-wise feature map, including:

inputting the pair-by-pair feature maps into a convolution network, and extracting the spatial features of the pair-by-pair feature maps;

and globally pooling the spatial features of the pair-by-pair feature maps to obtain the protein feature vector.

7. The feature extraction method according to claim 6,

the global pooling of the spatial features of the pair-wise feature map to obtain the protein feature vector comprises:

acquiring the spatial feature with the minimum length in the pairwise feature map;

pooling other spatial features in the pair-by-pair feature map based on the spatial feature of the minimum length to obtain fixed-length spatial features;

and acquiring the protein feature vector from the fixed-length spatial features.

8. The feature extraction method according to any one of claims 1 to 7,

before determining the initialization feature of each amino acid based on the obtained amino acid sequence information, the feature extraction method further includes:

determining a one-hot encoding for each amino acid based on the obtained amino acid sequence information.

9. A feature extraction device characterized by comprising:

a feature extraction module, configured to determine an initialization feature of each amino acid based on the obtained amino acid sequence information;

the feature extraction module being further configured to obtain a global feature of each amino acid in the amino acid sequence based on the initialization feature of each amino acid;

a feature fusion module, configured to perform feature fusion on the global feature of each amino acid and the global features of other amino acids in the amino acid sequence to obtain a pairwise feature map consisting of the fusion features of each pair of amino acids;

and a feature acquisition module, configured to obtain a protein feature vector based on the spatial features of the pairwise feature map.

10. A feature extraction device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the feature extraction method of any one of claims 1 to 8.

11. A computer-readable storage medium having stored thereon program instructions, which when executed by a processor, implement the feature extraction method of any one of claims 1 to 8.

Technical Field

The present application relates to the field of bioinformatics and computer technologies, and in particular, to a feature extraction method, a related apparatus, a device, and a storage medium.

Background

A key step in the development of new drugs is predicting the reaction between a drug and a target protein. Traditional experimental determination usually requires substantial capital, labor and time, and the analysis of the protein is the more difficult part. A protein is formed by different amino acids reacting with and connecting to one another, and it has a complex spatial structure that is costly to determine experimentally. As a result, the current related technology can extract only the primary sequence structure of the amino acids for most proteins, and a large amount of spatial structure information is lost.

Disclosure of Invention

The application provides at least a feature extraction method, a related device, equipment and a storage medium.

A first aspect of the present application provides a feature extraction method, including:

determining an initialization characteristic of each amino acid based on the obtained amino acid sequence information;

obtaining a global feature of each amino acid in the amino acid sequence based on the initialized feature of each amino acid;

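performing feature fusion on the global features of each amino acid and the global features of other amino acids in the amino acid sequence to obtain a pairwise feature map consisting of the fusion features of each pair of amino acids;

and obtaining a protein feature vector based on the spatial features of the pairwise feature map.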

Therefore, the feature extraction method can perform feature learning directly between any amino acids in the amino acid sequence, and the spatial structure information of the protein is reflected in the protein feature vector.

In some embodiments, the determining the initialization feature for each amino acid based on the obtained amino acid sequence information comprises:

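inputting the amino acid sequence information into a one-dimensional convolution network, and extracting the initialization feature of each amino acid, wherein the initialization feature comprises local feature information of the amino acid in the amino acid sequence.

Therefore, the initialization feature of each amino acid is extracted through the one-dimensional convolution network and is used to represent the primary sequence structure of the amino acid sequence.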

In some embodiments, said deriving a global feature for each amino acid in said amino acid sequence based on said initialized feature for each amino acid comprises:

inputting the initialized features of each amino acid into a recurrent neural network;

obtaining the global characteristic of each amino acid in the amino acid sequence based on the position information of each amino acid in the amino acid sequence and the initialized characteristic.

Therefore, through the recurrent neural network, the characteristic relation between any one amino acid and the amino acid sequence is calculated and used for constructing spatial structure information.

In some embodiments, said deriving a global feature of said each amino acid in said amino acid sequence based on said positional information of said each amino acid in said amino acid sequence and an initialization feature comprises:

traversing the amino acid sequence from left to right based on the initialized feature of each amino acid to obtain a first global feature of each amino acid;

traversing the sequence of amino acids from right to left based on the initialized feature for each amino acid to obtain a second global feature for each amino acid;

and fusing the first global characteristic and the second global characteristic of each amino acid to obtain the global characteristic of each amino acid in the amino acid sequence.

Thus, a specific function of the recurrent neural network is provided: the feature relationship between each amino acid and the amino acid sequence is calculated both from left to right and from right to left.

In some embodiments, the feature fusion of the global features of each amino acid with the global features of other amino acids in the amino acid sequence to obtain a pair-wise feature map composed of the fused features of each pair of amino acids comprises:

fusing the global features of each amino acid with the global features of other amino acids to obtain fused features of each pair of amino acids;

and learning the fusion features of each pair of amino acids through a shared perceptron to construct a pairwise feature map in two-dimensional space.

Thus, by global feature multiplication, feature learning between arbitrary amino acids can be performed without regard to sequence distance, potentially modeling spatial structure information.

In some embodiments, the deriving a protein feature vector based on the spatial features of the pair-wise feature map includes:

inputting the pair-by-pair feature maps into a convolution network, and extracting the spatial features of the pair-by-pair feature maps;

and globally pooling the spatial features of the pair-by-pair feature maps to obtain the protein feature vector.

Therefore, the spatial structure information can be constructed by the convolutional network.

In some embodiments, the globally pooling spatial features of the pair-wise feature maps to obtain the protein feature vector includes:

acquiring the spatial feature with the minimum length in the pairwise feature map;

pooling other spatial features in the pair-by-pair feature map based on the spatial feature of the minimum length to obtain fixed-length spatial features;

and acquiring the protein feature vector from the fixed-length spatial features.

Therefore, the spatial features of the pairwise feature map are globally pooled to obtain a protein feature vector of fixed length, which facilitates comparison and reconstruction of feature vectors.

In some embodiments, before determining the initialization feature of each amino acid based on the obtained amino acid sequence information, the feature extraction method further comprises:

determining a one-hot encoding for each amino acid based on the obtained amino acid sequence information.

Therefore, each amino acid is encoded using one-hot encoding to obtain the initial feature of each amino acid, which helps reflect the positional relationship among the amino acids in the amino acid sequence.

A second aspect of the present application provides a feature extraction device including:

a feature extraction module, configured to determine an initialization feature of each amino acid based on the obtained amino acid sequence information;

the feature extraction module being further configured to obtain a global feature of each amino acid in the amino acid sequence based on the initialization feature of each amino acid;

a feature fusion module, configured to perform feature fusion on the global feature of each amino acid and the global features of other amino acids in the amino acid sequence to obtain a pairwise feature map consisting of the fusion features of each pair of amino acids;

and a feature acquisition module, configured to obtain a protein feature vector based on the spatial features of the pairwise feature map.

Therefore, feature learning can be directly performed between arbitrary amino acids in the amino acid sequence, and spatial structure information of the protein can be represented by the protein feature vector.

A third aspect of the present application provides a feature extraction device, which includes a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the feature extraction method in the first aspect.

A fourth aspect of the present application provides a computer-readable storage medium having stored thereon program instructions that, when executed by a processor, implement the feature extraction method of the first aspect described above.

According to the scheme, the feature extraction device determines the initialization feature of each amino acid based on the obtained amino acid sequence information; obtains a global feature of each amino acid in the amino acid sequence based on the initialization feature of each amino acid; performs feature fusion on the global feature of each amino acid and the global features of other amino acids in the amino acid sequence to obtain a pairwise feature map consisting of the fusion features of each pair of amino acids; and obtains a protein feature vector based on the spatial features of the pairwise feature map. In this way, feature learning can be performed directly between any amino acids in the amino acid sequence, and the spatial structure information of the protein is reflected in the protein feature vector.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.

FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a feature extraction method provided herein;

FIG. 2 is a schematic flow chart diagram of another embodiment of a feature extraction method provided herein;

FIG. 3 is a block diagram of a feature extraction process provided herein;

FIG. 4 is a detailed flowchart of step S22 in the feature extraction method shown in FIG. 2;

FIG. 5 is a detailed flowchart of step S25 in the feature extraction method shown in FIG. 2;

FIG. 6 is a block diagram of an embodiment of a feature extraction apparatus provided herein;

FIG. 7 is a block diagram of another embodiment of a feature extraction apparatus provided herein;

FIG. 8 is a block diagram of an embodiment of a computer-readable storage medium provided herein.

Detailed Description

The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.

In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship. Further, the term "plurality" herein means two or more. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B and C may mean including any one or more elements selected from the group consisting of A, B and C.

In general, current research on protein sequences is based on natural language processing technology: the amino acid sequence is treated as a language sequence similar to text, and amino acid features are then extracted by training models such as a recurrent neural network. However, the function of a protein is determined by its spatial structure, so learning from the primary sequence alone has limited effect and cannot capture higher-level structural information, which directly affects subsequent feature extraction and model learning.

In each protein, the order of the amino acids in the polypeptide chain, including the positions of the disulfide bonds, is referred to as the primary structure of the protein, also called the basic structure. The primary structure of a protein is a necessary basis for understanding the structure, mechanism of action and physiological function of homologous proteins, and the order of the amino acids needs to be expressed in a spatial structure. The application therefore provides a spatial-structure-based method for extracting features from the primary sequence of a protein, together with corresponding equipment and devices.

Referring to FIG. 1, FIG. 1 is a schematic flow chart of an embodiment of a feature extraction method provided in the present application.

Specifically, the feature extraction method of the embodiment of the present disclosure may include the following steps:

Step S11: Determine the initialization feature of each amino acid based on the obtained amino acid sequence information.

In the embodiment of the disclosure, the amino acid sequences of a group of proteins are obtained. Each amino acid sequence includes a plurality of amino acids arranged in a certain order, which represents the order in which the amino acids are connected to form the peptide chain (or polypeptide). The types, number and order of the amino acids therefore characterize the properties of the corresponding protein.

The disclosed embodiments can encode each amino acid in the amino acid sequence to obtain the initialization feature of each amino acid. The encoded content includes the type of the amino acid and its connection relationship with the surrounding amino acids, namely the local information of the amino acid. The encoding method may be, but is not limited to, one-hot encoding; other encoding methods such as spatial distance encoding may also be used, and no limitation is imposed here.
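As a minimal sketch of the one-hot option, the following Python encodes each residue as a 20-dimensional indicator vector. The alphabet ordering and the names `AMINO_ACIDS`, `AA_INDEX` and `one_hot_encode` are assumptions made for illustration and do not come from the application. A plain one-hot vector captures only the amino acid type, so the neighbourhood information described above would have to come from a later stage.

```python
# The 20 standard amino acids as one-letter codes (this ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional one-hot row per residue in `sequence`."""
    encoded = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1.0  # raises KeyError for non-standard letters
        encoded.append(row)
    return encoded


# Hypothetical three-residue sequence, used only to show the output shape.
encoded = one_hot_encode("MKV")
```

The result is a list of L rows (here L = 3), one per residue, which is the form the later sketches take as input.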

Step S12: Obtain the global feature of each amino acid in the amino acid sequence based on the initialization feature of each amino acid.

In the embodiment of the present disclosure, since the initialized features obtained by the one-hot coding method only include the feature information of the amino acid itself and the local information of other surrounding amino acids, sufficient information cannot be provided for constructing the spatial structure information. Thus, embodiments of the present disclosure can obtain global features of the initialized features of each amino acid in the amino acid sequence of the protein.

Specifically, the initialization feature of a given amino acid and the initialization features of the other amino acids are computed, and the global feature of that amino acid within the whole amino acid sequence is computed from the relationship between them. Because the global feature of an amino acid is affected by the other amino acids in the amino acid sequence, it includes both the feature information of the amino acid itself and the feature information of the amino acid in the context of the amino acid sequence.

Step S13: Perform feature fusion on the global feature of each amino acid and the global features of the other amino acids in the amino acid sequence to obtain a pairwise feature map composed of the fused features of each pair of amino acids.

In the embodiment of the disclosure, every two amino acids are taken to form an amino acid pair, and the global features of each amino acid pair are fused to extract the pairwise fused feature of that pair. The embodiment of the disclosure further combines all pairwise fused features into a pairwise feature map, namely a spatial two-dimensional feature map. Through the spatial two-dimensional feature map, the feature extraction device can autonomously learn the spatial structure information of the amino acid sequence.

Step S14: Obtain a protein feature vector based on the spatial features of the pairwise feature map.

In the embodiment of the present disclosure, the spatial two-dimensional feature map obtained in step S13 may be regarded as an image. The embodiment of the present disclosure further performs convolution and pooling on the spatial two-dimensional feature map and extracts from it a feature vector of the amino acid sequence, that is, the protein feature vector, which is used in application scenarios such as protein bulk analysis or drug-target protein reaction prediction.

In the scheme, the initialization feature of each amino acid is obtained based on the obtained amino acid sequence information; the global feature of each initialization feature in the amino acid sequence of the protein is acquired; feature fusion is performed on each pair of global features to obtain a pairwise feature map composed of the fused features of each pair of amino acids; and the protein feature vector is obtained from the spatial features of the pairwise feature map. Feature learning can thus be performed directly between any amino acids in the amino acid sequence, and the spatial structure information of the protein is reflected in the protein feature vector.

Referring to FIG. 2 and FIG. 3, FIG. 2 is a schematic flow chart of another embodiment of the feature extraction method provided in the present application, and FIG. 3 is a schematic frame diagram of the feature extraction flow provided in the present application.

Specifically, the feature extraction method of the embodiment of the present disclosure may include the following steps:

Step S21: Input the amino acid sequence information into a one-dimensional convolution network, and extract the initialization feature of each amino acid.

In the embodiment of the disclosure, the primary amino acid sequence structure of the protein is obtained and one-hot encoded to obtain the one-hot encoding of each amino acid. The feature extraction device inputs the one-hot encodings of the primary amino acid sequence structure into a multi-layer one-dimensional convolution network to obtain the initialization feature of each amino acid.
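Step S21 can be illustrated with a small pure-Python sketch of a two-layer one-dimensional convolution over one-hot inputs. The kernel size (3), channel widths (20 to 8 to 8), ReLU activation, zero padding and random untrained weights are all assumptions chosen for illustration; the application does not specify them. The helper names `conv1d_relu` and `random_layer` are hypothetical.

```python
import random


def conv1d_relu(x, weights, bias):
    """Same-padded 1D convolution over the sequence axis, followed by ReLU.

    x: L rows of C_in values; weights: [C_out][K][C_in] with odd K; bias: [C_out].
    """
    kernel = len(weights[0])
    pad = kernel // 2
    zero = [0.0] * len(x[0])
    padded = [zero] * pad + x + [zero] * pad  # zero padding keeps the length L
    out = []
    for i in range(len(x)):
        row = []
        for w, b in zip(weights, bias):
            acc = b
            for k in range(kernel):
                acc += sum(wv * xv for wv, xv in zip(w[k], padded[i + k]))
            row.append(max(0.0, acc))
        out.append(row)
    return out


def random_layer(c_in, c_out, kernel, rng):
    """Untrained placeholder weights; a real model would learn these."""
    weights = [[[rng.uniform(-0.5, 0.5) for _ in range(c_in)]
                for _ in range(kernel)] for _ in range(c_out)]
    return weights, [0.0] * c_out


rng = random.Random(0)
# Toy one-hot input: 5 residues over a 20-letter alphabet.
one_hot = [[1.0 if j == (3 * i) % 20 else 0.0 for j in range(20)] for i in range(5)]
layer1 = random_layer(20, 8, 3, rng)
layer2 = random_layer(8, 8, 3, rng)
init_features = conv1d_relu(conv1d_relu(one_hot, *layer1), *layer2)
```

Each output row mixes a residue with its immediate neighbours, which is one way to obtain the local information that the initialization feature is said to carry; stacking layers widens that neighbourhood.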

Step S22: inputting the initialization characteristic of each amino acid into a recurrent neural network, and obtaining the global characteristic of each amino acid in the amino acid sequence based on the position information of each amino acid in the amino acid sequence and the initialization characteristic.

In the embodiment of the disclosure, the hidden states of each amino acid in two directions are extracted through a recurrent neural network as new features, namely the global features; a global feature is the feature of an amino acid relative to the overall amino acid sequence. A recurrent neural network (RNN) is a class of recursive neural network that takes sequence data as input, recurses along the direction in which the sequence evolves, and has all of its nodes (recurrent units) connected in a chain.

Specifically, referring to FIG. 4, FIG. 4 is a detailed flowchart of step S22 in the feature extraction method shown in FIG. 2. The method for obtaining the global features of amino acids through a recurrent neural network according to the embodiment of the present disclosure may include the following steps:

Step S221: Traverse the amino acid sequence from left to right based on the initialization feature of each amino acid to obtain a first global feature of each amino acid.

In the embodiment of the present disclosure, as shown in the recurrent neural network part of FIG. 3, the recurrent neural network takes each amino acid and the amino acid sequence as input, traverses the amino acid sequence from left to right, and calculates the feature relationship between the input amino acid and all the amino acids to obtain the first global feature of the input amino acid. The first global feature includes the positional relationship and the chemical connection relationship between the input amino acid and the other amino acids in the amino acid sequence.

Step S222: Traverse the amino acid sequence from right to left based on the initialization feature of each amino acid to obtain a second global feature of each amino acid.

In the embodiment of the present disclosure, as shown in the recurrent neural network part of FIG. 3, the recurrent neural network takes each amino acid and the amino acid sequence as input, traverses the amino acid sequence from right to left, and calculates the feature relationship between the input amino acid and all the amino acids to obtain the second global feature of the input amino acid. The second global feature includes the positional relationship, the chemical connection relationship and the like between the input amino acid and the other amino acids in the amino acid sequence.

Step S223: Fuse the first global feature and the second global feature to obtain the global feature of the amino acid in the amino acid sequence of the protein.

In the disclosed embodiment, the first global feature and the second global feature are fused to obtain the global feature of the input amino acid in the whole amino acid sequence. In some possible embodiments, a practitioner may change the evolution direction of the amino acid sequence, that is, the traversal direction, the number of traversals and the like, by configuring the recurrent neural network, which is not described in detail here.
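Steps S221 to S223 can be sketched with a plain tanh recurrent cell run once in each direction. The cell type, the hidden size, the random untrained weights and, in particular, the choice to fuse the two directions by concatenation are assumptions; the application only states that the first and second global features are fused. The names `rnn_pass` and `bidirectional_features` are hypothetical.

```python
import math
import random


def rnn_pass(inputs, w_in, w_hid):
    """Run a tanh RNN over `inputs` in the order given; return one hidden state per step.

    w_in: [H][D] input weights; w_hid: [H][H] recurrent weights.
    """
    hidden = [0.0] * len(w_hid)
    states = []
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * v for w, v in zip(w_in[j], x))
                + sum(w * h for w, h in zip(w_hid[j], hidden))
            )
            for j in range(len(w_hid))
        ]
        states.append(hidden)
    return states


def bidirectional_features(inputs, fwd, bwd):
    """Fuse left-to-right and right-to-left hidden states (here by concatenation)."""
    forward = rnn_pass(inputs, *fwd)                # step S221: left to right
    backward = rnn_pass(inputs[::-1], *bwd)[::-1]   # step S222: right to left, realigned
    return [f + b for f, b in zip(forward, backward)]  # step S223: fusion


rng = random.Random(0)
D, H = 3, 2  # toy input and hidden sizes


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


fwd = (rand_matrix(H, D), rand_matrix(H, H))
bwd = (rand_matrix(H, D), rand_matrix(H, H))
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(4)]
global_feats = bidirectional_features(inputs, fwd, bwd)
```

With this layout, the forward half of a position's feature depends only on residues to its left (and itself) and the backward half only on residues to its right (and itself), so the fused feature sees the whole sequence.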

Step S23: Perform feature fusion on the global feature of each amino acid and the global features of the other amino acids in the amino acid sequence to obtain a pairwise feature map composed of the fused features of each pair of amino acids.

In the embodiment of the present disclosure, the global feature of each amino acid is multiplied by the global features of the other amino acids to obtain the fused feature of each pair of amino acids, that is, of each amino acid pair. The fused features of the amino acid pairs are then learned through a shared perceptron to construct a pairwise feature map in two-dimensional space. The frame diagram of FIG. 3 shows only a partial structural frame of the shared perceptron, or shows the shared perceptron only as an example; the shared perceptron in the embodiment of the present disclosure may be any shared perceptron in the prior art, which is not described in detail here.
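One way to read step S23 is sketched below: multiply the global features of every pair element-wise, then pass each product through the same small dense layer. Treating the multiplication as element-wise, including the self pair (i, i) so that the map is square, and modelling the shared perceptron as a single ReLU layer are all assumptions. The name `pairwise_feature_map` and the toy weights are hypothetical.

```python
def pairwise_feature_map(global_feats, weights, bias):
    """Build an L x L x C map from L global features.

    weights: [C][D] and bias: [C] are shared by every amino acid pair.
    """
    def perceptron(v):
        # The same weights are applied to every pair: this is the "shared" part.
        return [max(0.0, b + sum(w * x for w, x in zip(row, v)))
                for row, b in zip(weights, bias)]

    return [
        [perceptron([a * b for a, b in zip(gi, gj)]) for gj in global_feats]
        for gi in global_feats
    ]


# Toy global features for three residues (D = 2) and a 2 -> 3 shared layer.
global_feats = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
pair_map = pairwise_feature_map(global_feats, weights, bias)
```

Because entry (i, j) is computed directly from residues i and j, the map relates residues regardless of how far apart they are in the sequence. With element-wise multiplication the map is also symmetric, so `pair_map[i][j]` equals `pair_map[j][i]`.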

Step S24: Input the pairwise feature map into a convolution network, and extract the spatial features of the pairwise feature map.

In the embodiment of the present disclosure, the pairwise feature map may be regarded as an image; the pairwise feature map is input into a two-dimensional convolution network, and the spatial features output by the two-dimensional convolution network are obtained. The spatial features output after the pairwise feature map passes through the two-dimensional convolution network can reflect the positional relationship and the connection relationship of two amino acids in two-dimensional space.
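Treating the pairwise feature map as an image, a single two-dimensional convolution layer can be sketched as follows. The 3 x 3 kernel, zero padding, ReLU activation and all-ones toy weights are assumptions for illustration; the application does not give the network's configuration. The name `conv2d_relu` is hypothetical.

```python
def conv2d_relu(feature_map, kernels, bias):
    """Same-padded 2D convolution followed by ReLU.

    feature_map: [H][W][C_in]; kernels: [C_out][K][K][C_in] with odd K; bias: [C_out].
    """
    height, width = len(feature_map), len(feature_map[0])
    k = len(kernels[0])
    pad = k // 2
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            cell = []
            for kern, b in zip(kernels, bias):
                acc = b
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < height and 0 <= x < width:  # zero padding
                            acc += sum(w * v for w, v in
                                       zip(kern[di][dj], feature_map[y][x]))
                cell.append(max(0.0, acc))
            row.append(cell)
        out.append(row)
    return out


# Toy 3 x 3 single-channel map of ones and one all-ones 3 x 3 kernel.
toy_map = [[[1.0] for _ in range(3)] for _ in range(3)]
ones_kernel = [[[[1.0] for _ in range(3)] for _ in range(3)]]
spatial = conv2d_relu(toy_map, ones_kernel, [0.0])
```

With the all-ones kernel each output simply counts the in-bounds neighbours, so the centre cell is 9.0, an edge cell 6.0 and a corner cell 4.0, which makes the padding behaviour easy to check by hand.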

Step S25: Perform global pooling on the spatial features of the pairwise feature map to obtain the protein feature vector.

In the embodiment of the disclosure, the spatial features of the pairwise feature map are globally pooled to finally obtain a fixed-length protein feature vector. Specifically, referring to FIG. 5, FIG. 5 is a detailed flowchart of step S25 in the feature extraction method shown in FIG. 2, and step S25 may include the following steps:

Step S251: Acquire the spatial feature with the minimum length in the pairwise feature map.

In the embodiment of the present disclosure, the spatial feature with the minimum length in the pairwise feature map is obtained, and its length is used as the standard length for global pooling. In some possible embodiments, area may also be used as the measure for global pooling of the spatial features, which is not described in detail here.

Step S252: Pool the other spatial features in the pairwise feature map based on the spatial feature with the minimum length to obtain spatial features of fixed length.

In the embodiment of the present disclosure, the minimum length is set as the pooling parameter of the global pooling layer, and all pairwise feature maps are input into the global pooling layer to extract spatial features of fixed length.

Step S253: Acquire the protein feature vector from the fixed-length spatial features.

In the embodiment of the disclosure, the fixed-length spatial features are combined to obtain the fixed-length protein feature vector. The fixed-length protein feature vector avoids the problem that features from amino acid sequences of different lengths are difficult to align, which benefits scenarios such as protein analysis in new drug research and development.
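The description of steps S251 to S253 leaves the pooling operation open. One possible reading, sketched below, is adaptive max pooling: take the smallest side length among the feature maps as the target size, pool every map down to that size, and flatten the result into a vector. The use of max (rather than average) pooling, the bin-edge rule and the names `adaptive_max_pool` and `flatten` are assumptions and are not taken from the application.

```python
def adaptive_max_pool(feature_map, size):
    """Pool an [L][L][C] map down to [size][size][C]; requires size <= L."""
    length, channels = len(feature_map), len(feature_map[0][0])
    # Bin i covers floor(i * L / size) up to ceil((i + 1) * L / size), exclusive.
    edges = [((i * length) // size, -(-((i + 1) * length) // size))
             for i in range(size)]
    return [
        [
            [max(feature_map[y][x][c]
                 for y in range(r0, r1) for x in range(c0, c1))
             for c in range(channels)]
            for c0, c1 in edges
        ]
        for r0, r1 in edges
    ]


def flatten(pooled):
    """Concatenate all pooled cells into one fixed-length vector."""
    return [v for row in pooled for cell in row for v in cell]


# Two toy single-channel maps from sequences of length 2 and 4.
small = [[[float(y * 2 + x)] for x in range(2)] for y in range(2)]
large = [[[float(y * 4 + x)] for x in range(4)] for y in range(4)]
target = min(len(small), len(large))  # step S251: minimum length
vectors = [flatten(adaptive_max_pool(m, target)) for m in (small, large)]  # S252, S253
```

Both vectors end up with `target * target * channels` entries (here 4), so proteins of different sequence lengths yield feature vectors that can be compared position by position.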

According to the scheme, local amino acid features are extracted through a multi-layer one-dimensional convolution network; compared with using one-hot encoding alone, more effective features can be learned end to end, and the gradient instability caused by direct input into a recurrent neural network is avoided. The feature extraction device also extracts, through a recurrent neural network, the feature of each amino acid relative to the global sequence, so that features at all distances influence one another. The feature extraction device fuses and relearns the features of each pair of amino acids to construct a spatial two-dimensional feature map, so that the state of any pair of amino acids at any distance in the amino acid sequence can be learned directly, potentially expressing the spatial structure information of the amino acid sequence. The feature extraction device extracts spatial features through two-dimensional convolution and pooling, and finally obtains a fixed-length protein feature vector through global pooling, which avoids the problem that features from amino acid sequences of different lengths are difficult to align.

It will be understood by those skilled in the art that, in the method described above, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.

Referring to fig. 6, fig. 6 is a schematic framework diagram of an embodiment of a feature extraction apparatus provided in the present application. In some possible implementations, the feature extraction method may be executed by a feature extraction device; for example, the method may be executed by a terminal device, a server, or another processing device, where the terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the feature extraction method may be implemented by a processor calling computer-readable instructions stored in a memory. In other possible implementations, the method may be executed by an entity of another form, which is not limited herein.

The feature extraction device 60 includes a feature extraction module 61, a feature fusion module 62, and a feature acquisition module 63.

The feature extraction module 61 is configured to obtain an initialization feature of each amino acid based on the obtained amino acid sequence information; the feature extraction module 61 is further configured to obtain a global feature of each amino acid in the amino acid sequence based on the initialized feature of each amino acid; a feature fusion module 62, configured to perform feature fusion on the global feature of each amino acid and the global features of other amino acids in the amino acid sequence to obtain a pair-by-pair feature map composed of the fusion features of each pair of amino acids; and a feature obtaining module 63, configured to obtain a protein feature vector based on the spatial features of the pair-by-pair feature map.

The feature extraction module 61 is further configured to input the amino acid sequence information into a one-dimensional convolutional network and to extract the initialization feature of each amino acid, where the initialization feature includes local feature information of that amino acid within the amino acid sequence.
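As an illustration of how a one-dimensional convolution yields a local feature for every position, the following is a minimal single-layer sketch. The function name `conv1d_same`, the zero padding, the odd kernel width, and the toy weights are all assumptions made for illustration; the disclosure itself describes a multi-layer network and does not specify these details.

```python
def conv1d_same(seq_feats, kernels, bias):
    """Single 1-D convolution layer with zero padding (assumes an odd kernel width).

    seq_feats: L x C_in per-residue inputs
    kernels:   C_out x K x C_in weights
    bias:      C_out offsets
    Returns an L x C_out list, one local feature vector per residue.
    """
    length = len(seq_feats)
    c_in = len(seq_feats[0])
    k = len(kernels[0])
    pad = k // 2
    zero = [0.0] * c_in
    padded = [zero] * pad + seq_feats + [zero] * pad
    out = []
    for pos in range(length):
        # The window is centred on the current residue, so each output
        # only depends on the residue and its immediate neighbours.
        window = padded[pos:pos + k]
        out.append([
            bias[o] + sum(kernels[o][t][c] * window[t][c]
                          for t in range(k) for c in range(c_in))
            for o in range(len(kernels))
        ])
    return out


# Toy input: 3 residues, 1 input channel, one width-3 kernel of all ones.
local_feats = conv1d_same([[1.0], [2.0], [3.0]], [[[1.0], [1.0], [1.0]]], [0.0])
```

Because the output has the same length as the input, every amino acid keeps its own feature vector for the later recurrent step.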

The feature extraction module 61 is further configured to input the initialization feature of each amino acid into a recurrent neural network, and to obtain the global feature of each amino acid in the amino acid sequence based on the position information of each amino acid in the amino acid sequence and its initialization feature.

The feature extraction module 61 is further configured to traverse the amino acid sequence from left to right based on the initialization feature of each amino acid to obtain a first global feature of each amino acid; to traverse the amino acid sequence from right to left based on the initialization feature of each amino acid to obtain a second global feature of each amino acid; and to fuse the first global feature and the second global feature of each amino acid to obtain the global feature of each amino acid in the amino acid sequence.
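The two traversals and their fusion can be sketched as follows. This is a toy illustration only: the element-wise tanh recurrence with scalar weights `w_in` and `w_rec` stands in for whatever recurrent cell is actually used, and fusion by concatenation is an assumption, since the disclosure does not specify how the two global features are fused.

```python
import math


def run_direction(feats, w_in, w_rec, reverse=False):
    """Traverse the sequence once and return the hidden state at every position."""
    order = range(len(feats) - 1, -1, -1) if reverse else range(len(feats))
    hidden = [0.0] * len(feats[0])
    states = [None] * len(feats)
    for t in order:
        # Toy recurrent update: each state mixes the current input with the previous state.
        hidden = [math.tanh(w_in * x + w_rec * h) for x, h in zip(feats[t], hidden)]
        states[t] = hidden
    return states


def bidirectional_global_features(feats, w_in=0.5, w_rec=0.5):
    """Fuse left-to-right and right-to-left states by concatenation (assumed fusion)."""
    forward = run_direction(feats, w_in, w_rec)
    backward = run_direction(feats, w_in, w_rec, reverse=True)
    return [f + b for f, b in zip(forward, backward)]


# A signal at position 0 reaches position 2 only through the left-to-right pass.
global_feats = bidirectional_global_features([[1.0], [0.0], [0.0]])
```

In this toy input the last position receives a non-zero first (left-to-right) feature and a zero second (right-to-left) feature, which shows why both directions are needed for each amino acid to see the whole sequence.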

The feature fusion module 62 is further configured to fuse the global feature of each amino acid with the global features of the other amino acids, respectively, to obtain the fusion feature of each pair of amino acids; and to learn the fusion feature of each pair of amino acids through a shared perceptron to construct a pairwise feature map in a two-dimensional space.
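A minimal sketch of this pairwise construction is given below. It assumes that fusion means concatenating the two global feature vectors, that the perceptron is a single linear layer with a ReLU activation whose weights are shared across all pairs, and that the diagonal (an amino acid paired with itself) is included so the map is square. None of these choices is stated in the disclosure, and the name `pairwise_feature_map` is hypothetical.

```python
def pairwise_feature_map(global_feats, weights, bias):
    """Build an L x L x H map from L global feature vectors of size D.

    weights: H x 2D matrix and bias: H vector of one perceptron layer,
    applied with the same parameters to every (i, j) pair.
    """
    def perceptron(vec):
        # Linear layer followed by ReLU (assumed activation).
        return [max(0.0, b + sum(w * v for w, v in zip(row, vec)))
                for row, b in zip(weights, bias)]

    # gi + gj is list concatenation: the assumed fusion of the pair (i, j).
    return [[perceptron(gi + gj) for gj in global_feats] for gi in global_feats]


# Toy example: D = 1, H = 1, and the perceptron computes relu(g_i - g_j).
pair_map = pairwise_feature_map([[1.0], [2.0]], [[1.0, -1.0]], [0.0])
```

Because every pair (i, j) has its own cell regardless of how far apart i and j are in the sequence, later two-dimensional convolutions can operate on long-range pairs directly.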

The feature obtaining module 63 is further configured to input the pairwise feature map into a convolutional network and extract the spatial features of the pairwise feature map; and to globally pool the spatial features of the pairwise feature map to obtain the protein feature vector.

The feature obtaining module 63 is further configured to obtain the minimum-length spatial feature in the pairwise feature map; to pool the other spatial features in the pairwise feature map based on the minimum-length spatial feature to obtain fixed-length spatial features; and to obtain the protein feature vector from the fixed-length spatial features.

The feature extraction module 61 is further configured to determine the one-hot encoding of each amino acid based on the obtained amino acid sequence information.
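The per-residue encoding can be sketched as a one-hot vector over an amino acid alphabet. The 20-letter alphabet, its ordering, and the all-zero vector for unrecognised symbols are assumptions made for this example; the disclosure does not specify them.

```python
# Assumed alphabet: the 20 standard amino acids in one-letter code, alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional vector per residue; unknown symbols stay all-zero."""
    encoded = []
    for residue in sequence.upper():
        row = [0.0] * len(AMINO_ACIDS)
        if residue in INDEX:
            row[INDEX[residue]] = 1.0
        encoded.append(row)
    return encoded


encoded = one_hot_encode("ACX")
```

Each row has at most one non-zero entry, so the encoding carries the identity of the residue but no information about its neighbours.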

Referring to fig. 7, fig. 7 is a schematic framework diagram of another embodiment of a feature extraction apparatus provided in the present application. The feature extraction device 70 comprises a memory 71 and a processor 72 coupled to each other, the processor 72 being configured to execute program instructions stored in the memory 71 to implement the steps of any of the above-described embodiments of the feature extraction method. In one particular implementation scenario, the feature extraction device 70 may include, but is not limited to, a microcomputer or a server; in addition, the feature extraction device 70 may also include a mobile device such as a notebook computer or a tablet computer, which is not limited herein.

In particular, the processor 72 is configured to control itself and the memory 71 to implement the steps of any of the above-described embodiments of the feature extraction method. The processor 72 may also be referred to as a CPU (Central Processing Unit). The processor 72 may be an integrated circuit chip having signal processing capabilities. The processor 72 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. Additionally, the processor 72 may be implemented jointly by integrated circuit chips.

Referring to fig. 8, fig. 8 is a block diagram illustrating an embodiment of a computer-readable storage medium according to the present application. The computer readable storage medium 80 stores program instructions 801 that can be executed by the processor, the program instructions 801 being for implementing the steps of any of the above-described embodiments of the feature extraction method.

In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.

The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.

In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The above-described apparatus embodiments are merely illustrative. For example, the division into modules or units is merely a logical division, and an actual implementation may use another division: units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be in an electrical, mechanical, or other form.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.
