Information processing apparatus, information processing method, and computer program

Document No.: 1863159    Publication date: 2021-11-19

Reading note: This technology, "Information processing apparatus, information processing method, and computer program", was designed and created by 林慧镔, 石自强, 刘柳 and 刘汝杰 on 2020-05-13. Its main content is as follows: The application discloses an information processing apparatus and an information processing method. The information processing apparatus includes: a basic feature extraction unit configured to extract a basic feature of a sound; a multi-scale feature extraction unit configured to extract multi-scale features of the sound based on the basic features extracted by the basic feature extraction unit; a primary classification unit configured to perform primary classification on the sound based on the basic features extracted by the basic feature extraction unit to obtain a primary classification result; a secondary classification unit configured to secondarily classify the sound based on the primary classification result and the multi-scale features of the sound to obtain a secondary classification result; and a classification result fusion unit configured to fuse the primary classification result and the secondary classification result of the sound to obtain a final classification result of the sound.

1. An information processing apparatus comprising:

a basic feature extraction unit configured to extract a basic feature of a sound;

a multi-scale feature extraction unit configured to extract multi-scale features of the sound based on the basic features extracted by the basic feature extraction unit;

a primary classification unit configured to perform primary classification on the sound based on the basic features extracted by the basic feature extraction unit to obtain a primary classification result;

a secondary classification unit configured to secondarily classify the sound based on the primary classification result and the multi-scale features of the sound to obtain a secondary classification result; and

a classification result fusion unit configured to fuse the primary classification result and the secondary classification result of the sound to obtain a final classification result of the sound.

2. The information processing apparatus according to claim 1, wherein the primary classification result includes a probability that the sound belongs to each of a plurality of root classes,

wherein the secondary classification unit includes a plurality of secondary classifiers in one-to-one correspondence with the plurality of root classes, and

wherein, for each of the plurality of secondary classifiers, the secondary classifier is activated if the probability that the sound belongs to the root class to which the secondary classifier corresponds is equal to or greater than a predetermined threshold.

3. The information processing apparatus according to claim 2, wherein each of the plurality of root categories includes a plurality of subcategories, and a total number of subcategories included in the plurality of root categories is N, where N is a natural number greater than 1,

wherein the secondary classification result comprises at least one N-dimensional probability vector obtained by the activated secondary classifier, each element of each N-dimensional probability vector representing a probability that the sound belongs to a respective sub-category and being preset to 0,

wherein each of the secondary classifiers is configured to obtain, if activated, a respective N-dimensional probability vector by performing an iterative process,

in the first round of processing, the secondary classifier:

obtaining a first probability vector based on the multi-scale features of the sound, wherein each element in the first probability vector represents a probability that the sound belongs to each sub-category included in a root category corresponding to the secondary classifier;

selecting subcategories corresponding to the first m largest elements in the first probability vector as candidate subcategories;

generating m sub-category vectors based on the candidate sub-categories, wherein m is a natural number greater than 1; and

for each of the m sub-category vectors, calculating a score for the sub-category vector based on elements in the first probability vector that correspond to elements that the sub-category vector includes;

in the i ≧ 2 rounds of processing, the secondary classifier:

selecting the top n most-scored subcategory vectors of the subcategory vectors generated by the previous round of processing as candidate subcategory vectors, wherein n is a natural number greater than 1 and n is less than or equal to m; and

for each of the candidate subcategory vectors: obtaining an ith probability vector based on the candidate sub-category vector and the multi-scale features of the sound, wherein each element in the ith probability vector represents the probability that the sound belongs to each sub-category included in the root category corresponding to the secondary classifier; selecting the subcategories corresponding to the first m largest elements in the ith probability vector as candidate subcategories; and adding each of the candidate subcategories to the candidate subcategory vector, respectively, to newly generate a subcategory vector, and calculating a score of the newly generated subcategory vector based on the score of the candidate subcategory vector and elements of the ith probability vector corresponding to the newly added candidate subcategory,

wherein the iterative process terminates if the score of each of the newly generated sub-category vectors is less than the score of each of the candidate sub-category vectors selected from the sub-category vectors generated from the previous round of processing, and wherein the secondary classifier obtains a respective N-dimensional probability vector based on the probabilities corresponding to the elements included in the highest-scoring one of the sub-category vectors generated in the second-to-last round of processing.

4. The information processing apparatus according to any one of claims 1 to 3, wherein the information processing apparatus includes a plurality of the multi-scale feature extraction units,

wherein the information processing apparatus further includes a multi-scale feature fusion unit configured to fuse a plurality of multi-scale features extracted by a plurality of the multi-scale feature extraction units and obtain fused multi-scale features, and

wherein the secondary classification unit is further configured to secondary classify the sound based on the primary classification result of the sound and the fused multi-scale features to obtain the secondary classification result.

5. The information processing apparatus according to claim 4, wherein each of the multi-scale feature extraction units includes:

a plurality of feature extraction subunits, each configured to extract a feature of the sound based on the basic feature extracted by the basic feature extraction unit; and

a first feature fusion subunit configured to fuse the plurality of features of the sound extracted by the plurality of feature extraction subunits, output a fusion result as a multi-scale feature of the sound extracted by the corresponding multi-scale feature extraction unit, and output the fusion result to a next multi-scale feature extraction unit as an input of a next multi-scale feature extraction unit.

6. The information processing apparatus according to claim 5, wherein each multi-scale feature extraction unit further comprises:

a plurality of global pooling sub-units, each global pooling sub-unit corresponding to a feature extraction sub-unit, and each global pooling sub-unit configured to globally pool features of the sounds extracted by the feature extraction sub-unit corresponding to the global pooling sub-unit; and

a second feature fusion subunit configured to fuse a plurality of features of the sound pooled via the global pooling subunit, and output a fusion result as a multi-scale feature of the sound extracted by the corresponding multi-scale feature extraction unit.

7. The information processing apparatus according to claim 6, wherein the feature extraction subunit is a two-dimensional convolution unit, and

wherein each multi-scale feature extraction unit further comprises: a pre-processing subunit configured to process an input to reduce a dimensionality of the input.

8. The information processing apparatus according to claim 6, wherein the information processing apparatus includes three of the multi-scale feature extraction units, and each multi-scale feature extraction unit includes three of the feature extraction sub-units and three of the global pooling sub-units.

9. The information processing apparatus according to claim 6,

wherein the first feature fusion subunit is further configured to fuse, by concatenation, a plurality of features of the sound extracted by the plurality of feature extraction subunits, and

wherein the second feature fusion subunit is further configured to fuse, by concatenation, a plurality of features of the sound pooled via the global pooling subunit.

10. An information processing method comprising:

a basic feature extraction step of extracting a basic feature of the sound;

a multi-scale feature extraction step of extracting multi-scale features of the sound based on the basic features extracted by the basic feature extraction step;

a primary classification step of primary-classifying the sound based on the basic features extracted by the basic feature extraction step to obtain a primary classification result;

a secondary classification step of secondarily classifying the sound based on the primary classification result and the multi-scale features of the sound to obtain a secondary classification result; and

a classification result fusion step of fusing the primary classification result and the secondary classification result of the sound to obtain a final classification result of the sound.

Technical Field

The present disclosure relates to the field of information processing, and in particular, to an information processing apparatus and an information processing method.

Background

Sound carries a large amount of environmental information as well as information about various events in the environment. By analyzing the sounds, events occurring in the environment may be distinguished and/or identified.

Disclosure of Invention

The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. However, it should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.

An object of the present disclosure is to provide an improved information processing apparatus and information processing method.

According to an aspect of the present disclosure, there is provided an information processing apparatus including: a basic feature extraction unit configured to extract a basic feature of a sound; a multi-scale feature extraction unit configured to extract multi-scale features of the sound based on the basic features extracted by the basic feature extraction unit; a primary classification unit configured to perform primary classification on the sound based on the basic features extracted by the basic feature extraction unit to obtain a primary classification result; a secondary classification unit configured to secondarily classify the sound based on the primary classification result and the multi-scale features of the sound to obtain a secondary classification result; and a classification result fusion unit configured to fuse the primary classification result and the secondary classification result of the sound to obtain a final classification result of the sound.

According to another aspect of the present disclosure, there is provided an information processing method including: a basic feature extraction step of extracting a basic feature of a sound; a multi-scale feature extraction step of extracting multi-scale features of the sound based on the basic features extracted by the basic feature extraction step; a primary classification step of primarily classifying the sound based on the basic features extracted by the basic feature extraction step to obtain a primary classification result; a secondary classification step of secondarily classifying the sound based on the primary classification result and the multi-scale features of the sound to obtain a secondary classification result; and a classification result fusion step of fusing the primary classification result and the secondary classification result of the sound to obtain a final classification result of the sound.

According to other aspects of the present disclosure, there are also provided computer program code and a computer program product for implementing the above-described method according to the present disclosure, and a computer readable storage medium having recorded thereon the computer program code for implementing the above-described method according to the present disclosure.

Additional aspects of the disclosed embodiments are set forth in the description section that follows, wherein the detailed description is presented to fully disclose the preferred embodiments of the disclosed embodiments without imposing limitations thereon.

Drawings

The disclosure may be better understood by reference to the following detailed description taken in conjunction with the accompanying drawings, in which like or similar reference numerals are used throughout the figures to designate like or similar components. The accompanying drawings, which are incorporated in and form a part of the specification, further illustrate preferred embodiments of the present disclosure and explain the principles and advantages of the present disclosure. Wherein:

fig. 1 is a block diagram showing a functional configuration example of an information processing apparatus according to an embodiment of the present disclosure;

FIG. 2 is a diagram illustrating an example of hierarchical tags;

fig. 3 is a diagram showing an architecture example of an information processing apparatus according to an embodiment of the present disclosure;

fig. 4 is a diagram showing an example of an architecture of a secondary classifier in an example case where the secondary classifier includes a recurrent neural network;

fig. 5 is a block diagram showing a functional configuration example of a multi-scale feature extraction unit according to an embodiment of the present disclosure;

fig. 6 is a diagram showing an example of an architecture of a multi-scale feature extraction unit according to an embodiment of the present disclosure;

fig. 7 is a flowchart illustrating an example of a flow of an information processing method 700 according to an embodiment of the present disclosure; and

fig. 8 is a block diagram showing an example structure of a personal computer employable in the embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.

Note that in this specification and the drawings, components having substantially the same or similar functional configurations will be denoted by the same or similar reference numerals, and redundant description will be omitted.

Further, in the present specification and the drawings, there are also the following cases: a plurality of components having substantially the same functional configuration are distinguished by attaching different letters after the same reference numeral. For example, a plurality of components having substantially the same functional configuration are distinguished as necessary into the multi-scale feature extraction unit 104a and the multi-scale feature extraction unit 104b. However, in the case where it is not necessary to particularly distinguish each of the plurality of components having substantially the same functional configuration, only the same reference numeral is attached. For example, the multi-scale feature extraction unit 104a and the multi-scale feature extraction unit 104b are simply referred to as the multi-scale feature extraction unit 104 without particularly distinguishing the multi-scale feature extraction unit 104a and the multi-scale feature extraction unit 104b.

Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structures and/or processing steps closely related to the scheme according to the present disclosure are shown in the drawings, and other details not so relevant to the present disclosure are omitted.

Embodiments according to the present disclosure are described in detail below with reference to the accompanying drawings.

First, a functional configuration example of an information processing apparatus according to an embodiment of the present disclosure will be described with reference to fig. 1. Fig. 1 is a block diagram showing a functional configuration example of an information processing apparatus according to an embodiment of the present disclosure. As shown in fig. 1, an information processing apparatus 100 according to an embodiment of the present disclosure may include a base feature extraction unit 102, a multi-scale feature extraction unit 104, a primary classification unit 106, a secondary classification unit 108, and a classification result fusion unit 110.

The base feature extraction unit 102 may be configured to extract base features of the sound, so that the primary classification unit 106 can perform primary classification of the sound based thereon. For example, the base features of the sound may be features that are more conducive to a rough classification of the sound. For example, the base feature extraction unit 102 may be a feature extraction unit based on a neural network (such as a convolutional neural network, a recurrent neural network (RNN), or the like). For example, the base feature extraction unit 102 may include a plurality of stacked two-dimensional convolution units 2D Conv, where each two-dimensional convolution unit 2D Conv may include a two-dimensional convolution layer. In addition, the two-dimensional convolution unit 2D Conv may further include a batch normalization layer, a non-linear activation layer, and/or a max-pooling layer. Note that the specific configurations and parameters of the respective two-dimensional convolution units 2D Conv may differ.
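By way of illustration only, a stack of such two-dimensional convolution units might be sketched as follows (PyTorch is used for all sketches in this text); the kernel sizes, channel counts and class names are assumptions of this sketch and are not specified by the present disclosure:

    import torch
    import torch.nn as nn

    class ConvUnit2D(nn.Module):
        # one two-dimensional convolution unit: convolution + batch normalization
        # + non-linear activation + max pooling
        def __init__(self, in_channels, out_channels):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
            self.bn = nn.BatchNorm2d(out_channels)
            self.act = nn.ReLU()
            self.pool = nn.MaxPool2d(kernel_size=2)

        def forward(self, x):
            return self.pool(self.act(self.bn(self.conv(x))))

    class BaseFeatureExtractor(nn.Module):
        # several stacked two-dimensional convolution units; the input is a
        # (batch, 1, time, frequency) mel spectrogram
        def __init__(self, channels=(1, 32, 64, 128)):
            super().__init__()
            self.blocks = nn.Sequential(*[ConvUnit2D(c_in, c_out)
                                          for c_in, c_out in zip(channels[:-1], channels[1:])])

        def forward(self, mel):
            return self.blocks(mel)   # base features of size (batch, C_1, H_1, W_1)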

As an example, a mel spectrum (fbank) obtained by performing a series of processes (short-time Fourier transform, logarithm, etc.) on the sound may be input to the basic feature extraction unit 102, and the basic feature extraction unit 102 may extract the basic features of the sound from the input mel spectrum.
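As an illustration only, such a log-mel (fbank) spectrogram might be computed as follows; the sampling rate, FFT size, hop length and number of mel bands are assumptions of this sketch, not values prescribed by the present disclosure:

    import numpy as np
    import librosa

    def log_mel_spectrogram(wav_path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
        # load the sound, apply a short-time Fourier transform based mel filter bank,
        # then take the logarithm
        y, sr = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        return np.log(mel + 1e-6)   # shape: (n_mels, num_frames)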

The multi-scale feature extraction unit 104 may be configured to extract multi-scale features of the sound based on the basic features extracted by the basic feature extraction unit 102. The multi-scale features may represent features of the sound extracted from different spatial angles and/or temporal scales. For example, in the case where the multi-scale feature extraction unit 104 includes a convolutional neural network, the multi-scale features may be features obtained by fusing (e.g., concatenating) corresponding features extracted by two or more convolutional layers of the convolutional neural network having different receptive fields. In a convolutional neural network, the receptive field of a convolutional layer represents the size of the region of the input that is mapped to one element in the output of that convolutional layer.

The primary classification unit 106 may be configured to perform a primary classification of the sound based on the base features extracted by the base feature extraction unit 102 to obtain a primary classification result. For example, the primary classification unit 106 may be a neural network (such as a convolutional neural network) based classification unit. For example, the primary classification unit 106 may include, but is not limited to, a global pooling layer, a fully connected layer, and a non-linear activation unit (e.g., an activation unit that includes a sigmoid function).
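A minimal sketch of such a primary classification unit follows, again with assumed layer sizes; the sigmoid output gives, for each root category, the probability that the sound belongs to it:

    import torch
    import torch.nn as nn

    class PrimaryClassifier(nn.Module):
        # global pooling layer + fully connected layer + sigmoid activation
        def __init__(self, in_channels, num_root_categories):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(in_channels, num_root_categories)

        def forward(self, base_features):                 # (batch, C, H, W) base features
            pooled = self.pool(base_features).flatten(1)  # (batch, C)
            return torch.sigmoid(self.fc(pooled))         # (batch, K) root-category probabilities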

The secondary classification unit 108 may be configured to secondarily classify the sound based on the primary classification result and the multi-scale features of the sound to obtain a secondary classification result. For example, the secondary classification unit 108 may be a recurrent neural network-based classification unit, but is not limited thereto.

The classification result fusion unit 110 may be configured to fuse the primary classification result and the secondary classification result of the sound to obtain a final classification result of the sound.

Sound carries a large amount of environmental information as well as information of various events in the environment. By analyzing the sounds, events occurring in the environment may be distinguished and/or identified. Sound has features of different scales, however in conventional classification of sound events such features of different scales of sound are not used. The information processing apparatus 100 according to the embodiment of the present disclosure extracts the basic features and the multi-scale features of the sound, performs primary classification and secondary classification on the sound based on the extracted basic features and the multi-scale features, and acquires a final classification result based on the primary classification result and the secondary classification result, making it possible to distinguish more categories of sound events and/or improve the accuracy of sound event classification.

The tags (i.e., categories) of sound events may be defined as hierarchical tags based on a tree structure description, as shown in fig. 2. For example, in fig. 2, the tags of the first row (e.g., people, animals, things, etc.) may be considered root categories, and the tags other than those of the first row (e.g., people's voice, vehicles, airplanes, etc.) may be considered sub-categories. Further, for example, in the case where it is known that the sound is a human sound, the tags one level lower than the human sound (e.g., human voice, wrestling, etc.) may be considered root categories, and the tags two or more levels lower than the human sound (e.g., talk, etc.) may be considered sub-categories. That is, in the case where the tags of a certain level are regarded as root categories, the tags of levels lower than that level may be regarded as sub-categories.
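For illustration only, such a hierarchical label structure might be represented as a nested mapping; the concrete category names and nesting below are assumptions loosely following the examples named for fig. 2, not the full tree of the disclosure:

    # root categories on the chosen level; every label below that level is a sub-category
    hierarchical_labels = {
        "people":  {"human voice": ["talk"], "wrestling": []},
        "animals": {},
        "things":  {"vehicles": ["jet engine"], "engine": ["jet engine"]},  # overlap is allowed
    }
    root_categories = list(hierarchical_labels.keys())   # ["people", "animals", "things"]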

According to an embodiment of the present disclosure, the primary classification unit 106 may perform primary classification on the sound with respect to a plurality of root classes based on the basic features. In this case, the primary classification may be regarded as a coarse classification. For example, the primary classification result may include a probability that the sound belongs to each of the plurality of root classes. The secondary classification unit 108 may perform secondary classification of the sound with respect to a plurality of sub-categories based on the primary classification result and the multi-scale features. In this case, the secondary classification may be regarded as a fine classification. For example, the secondary classification unit 108 may include a plurality of secondary classifiers corresponding to the plurality of root classes one to one, such as the 1st to Kth secondary classifiers shown in fig. 3 (K is a natural number greater than 1). For each of the plurality of secondary classifiers, the secondary classifier is activated if the probability that the sound belongs to the root class to which the secondary classifier corresponds is equal to or greater than a predetermined threshold. For example, as shown in fig. 3, the information processing apparatus 100 may further include a control unit 112, and the control unit 112 may be configured to compare, for each of the K secondary classifiers, the probability that the sound belongs to the root class to which the secondary classifier corresponds with the predetermined threshold, and to activate the secondary classifier in the case where this probability is equal to or greater than the predetermined threshold. For example, in the case where the probability that the sound belongs to the 1st root class is equal to or greater than the predetermined threshold, the control unit 112 may activate the 1st secondary classifier corresponding to the 1st root class.
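A minimal sketch of this activation logic of the control unit 112 follows; the probability values and the threshold used in the usage line are assumptions of this sketch:

    def select_active_secondary_classifiers(root_probs, threshold):
        # return the indices of root classes whose probability reaches the threshold;
        # only the corresponding secondary classifiers are activated
        return [j for j, p in enumerate(root_probs) if p >= threshold]

    # usage with probabilities from the primary classification unit and an assumed threshold
    active = select_active_secondary_classifiers([0.85, 0.10, 0.42], threshold=0.4)   # -> [0, 2]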

The predetermined threshold value can be set by a person skilled in the art according to actual needs. For example, one skilled in the art may obtain the predetermined threshold through experiments on different scenario tasks, as needed.

By obtaining a primary classification result of the sound in relation to the root class as described above, and further sub-classifying the sound based on the primary classification result, it is possible to distinguish more classes of sound events and/or to improve the accuracy of sound event classification. In addition, for each secondary classifier, the sound only needs to be secondarily classified for the subcategories included in the root category to which the secondary classifier corresponds, and thus the amount of calculation of each secondary classifier can be reduced.

According to one embodiment of the present disclosure, each of the plurality of root categories may include a plurality of sub-categories, and the total number of sub-categories included in the plurality of root categories is N, where N is a natural number greater than 1. For example, the sub-categories included in different root categories may overlap. For example, as shown in FIG. 2, where vehicles and engines are both used as root categories, these two root categories both include the sub-category "jet engines". The secondary classification result may include at least one N-dimensional probability vector obtained by the activated secondary classifier. The respective elements in each N-dimensional probability vector represent the probabilities that the sound belongs to the respective sub-categories and are set to 0 in advance. Each of the secondary classifiers included in the secondary classification unit 108 may be configured to obtain, if activated, a respective N-dimensional probability vector by performing the following iterative process:

in the first round of processing, the secondary classifier may obtain a first probability vector based on the multi-scale features of the sound, where each element in the first probability vector represents a probability that the sound belongs to each subcategory included in the root category corresponding to the secondary classifier; selecting subcategories corresponding to the first m largest elements in the first probability vector as candidate subcategories; generating m sub-category vectors based on the candidate sub-categories, wherein m is a natural number greater than 1; and for each of the m sub-category vectors, calculating a score for the sub-category vector based on elements in the first probability vector that correspond to elements included in the sub-category vector.

In the i-th (i ≥ 2) round of processing, the secondary classifier can select the top n most-scored subcategory vectors of the subcategory vectors generated by the previous round of processing as candidate subcategory vectors, where n is a natural number greater than 1 and n ≤ m; and for each of the candidate subcategory vectors: obtaining an i-th probability vector based on the candidate sub-category vector and the multi-scale features of the sound, wherein each element in the i-th probability vector represents the probability that the sound belongs to each sub-category included in the root category corresponding to the secondary classifier; selecting the subcategories corresponding to the first m largest elements in the i-th probability vector as candidate subcategories; and adding each of the candidate subcategories to the candidate subcategory vector, respectively, to newly generate a subcategory vector, and calculating a score of the newly generated subcategory vector based on the score of the candidate subcategory vector and an element in the i-th probability vector corresponding to the newly added candidate subcategory.

In the above-described iterative process performed by the secondary classifier, the iterative process is terminated in a case where the score of each of the newly generated sub-category vectors is smaller than the score of each of the candidate sub-category vectors selected from the sub-category vectors generated in the previous round of processing. The secondary classifier may obtain a corresponding N-dimensional probability vector based on the probability corresponding to each element included in the sub-category vector with the highest score in the sub-category vectors generated in the second last round of processing.

For example, the classification result fusion unit 110 may perform a weighted average of the secondary classification results according to the primary classification result to obtain the final classification result. For example, the classification result fusion unit 110 may fuse the primary classification result and the secondary classification results according to formula (1) to obtain the final classification result R_fusion.

In formula (1), R_r(j) represents the probability that the sound belongs to the j-th root class, and R_sj represents the N-dimensional probability vector obtained by the secondary classifier corresponding to the j-th root class. Note that R_sj is 0 in the case where R_r(j) is less than the predetermined threshold and the secondary classifier corresponding to the j-th root class is therefore not activated.

In addition, the final classification result obtained according to formula (1) is an N-dimensional probability vector, in which each element represents the probability that the sound belongs to the corresponding sub-category.
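A minimal sketch of the weighted-average fusion described above follows. Since formula (1) itself is not reproduced in this text, the normalisation term in the last line is an assumption of this sketch; only the weighting by R_r(j) and the all-zero vectors of non-activated secondary classifiers follow from the description:

    import numpy as np

    def fuse_results(root_probs, secondary_vectors):
        # root_probs: (K,) probabilities R_r(j) from the primary classification unit
        # secondary_vectors: (K, N) matrix whose j-th row is R_sj; rows belonging to
        # non-activated secondary classifiers are all zeros
        root_probs = np.asarray(root_probs, dtype=float)
        secondary_vectors = np.asarray(secondary_vectors, dtype=float)
        weighted = (root_probs[:, None] * secondary_vectors).sum(axis=0)   # sum_j R_r(j) * R_sj
        return weighted / (root_probs.sum() + 1e-12)                       # normalisation is an assumption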

By performing the iterative process described above to obtain at least one N-dimensional probability vector and further fusing the obtained at least one N-dimensional probability vector with the primary classification result to obtain a final classification result, it is possible to obtain probabilities that sounds belong to respective subcategories, and based thereon, it is possible to classify sound events more accurately.

By way of illustration and not limitation, the secondary classification unit 108 may include a recurrent neural network. For example, where the secondary classification unit 108 includes a plurality of secondary classifiers, each secondary classifier may include a respective recurrent neural network. Fig. 4 shows an architecture example of the secondary classifier in an example case where the secondary classifier includes a recurrent neural network.

As shown in fig. 4, the secondary classifier may include a first transformation unit 1082, a recurrent neural network 1084, a projection unit 1086, a second transformation unit 1088, and an output processing unit 1090. The current label may be represented as a vector v, where each element of the vector v corresponds one-to-one to a sub-category included in the root category to which the secondary classifier corresponds, and each element takes a value of 0 or 1.

The first transformation unit 1082 may perform a dimensionality reduction operation on the input vector v. For example, the operation performed by the first transformation unit 1082 can be represented by the following expression (2):

e = U_1 v    (2)

where e represents a vector in the tag embedding space into which the vector v is converted by the first transformation unit 1082, and U_1 represents the matrix employed by the first transformation unit 1082 for the dimensionality reduction operation.

The recurrent neural network 1084 may be a recurrent neural network based on gated recurrent units (GRUs). For example, the recurrent neural network 1084 may obtain the state vector o based on the vector e.

The projection unit 1086 may perform a dimension conversion on the state vector o and the multi-scale feature y to convert the state vector o and the multi-scale feature y to the same dimensional space.

The second transformation unit 1088 may perform a nonlinear operation on the state vector o and the multi-scale feature y after their dimension conversion by the projection unit 1086. For example, the operations performed by the projection unit 1086 and the second transformation unit 1088 can be represented by the following expression (3):

x = f(U_2 o + U_3 y)    (3)

In formula (3), U_2 and U_3 respectively represent the matrices employed by the projection unit 1086 to perform the dimension conversion on the state vector o and on the multi-scale feature y, f denotes the nonlinear transformation operation performed by the second transformation unit 1088, and x denotes the result obtained by the second transformation unit 1088.

The output processing unit 1090 can perform dimension conversion on the result x obtained by the second transformation unit 1088 according to the following expression (4) to obtain the probability vector P_s. The probability vector P_s includes the same number of elements as the vector v.

P_s = U_4 x    (4)

In formula (4), U_4 represents the matrix used when the output processing unit 1090 performs the dimension conversion.
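Expressions (2) to (4) can be followed almost literally in code. The sketch below is one step of such a GRU-based secondary classifier; the layer sizes are assumptions, f is taken to be tanh as an assumption, and no normalisation of P_s into probabilities is shown because expression (4) does not include one:

    import torch
    import torch.nn as nn

    class SecondaryClassifierStep(nn.Module):
        # label embedding (eq. 2), GRU state update, projection + non-linearity (eq. 3),
        # output dimension conversion (eq. 4)
        def __init__(self, num_sub, emb_dim, hidden_dim, feat_dim, proj_dim):
            super().__init__()
            self.embed = nn.Linear(num_sub, emb_dim, bias=False)            # U_1
            self.gru = nn.GRUCell(emb_dim, hidden_dim)                      # recurrent neural network 1084
            self.proj_state = nn.Linear(hidden_dim, proj_dim, bias=False)   # U_2
            self.proj_feat = nn.Linear(feat_dim, proj_dim, bias=False)      # U_3
            self.out = nn.Linear(proj_dim, num_sub, bias=False)             # U_4

        def forward(self, v, y, h=None):
            # v: current label vector, y: multi-scale features, h: previous GRU state (None at the start)
            e = self.embed(v)                                        # e = U_1 v
            h = self.gru(e, h)                                       # state vector o
            x = torch.tanh(self.proj_state(h) + self.proj_feat(y))   # x = f(U_2 o + U_3 y)
            p = self.out(x)                                          # P_s = U_4 x
            return p, h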

An example manner of obtaining the secondary classification result in the case where the secondary classifier has the example architecture shown in fig. 4 will be described below in connection with a specific example. For convenience of description, the 1st secondary classifier, whose corresponding root category includes, for example, 6 sub-categories S1, S2, S3, S4, S5 and S6, is taken as an example below.

For example, in the 1st round of processing, the 1st secondary classifier may perform the following processing:

A current label (i.e., vector v) is randomly initialized, the initialized current label is input to the first transformation unit 1082, and the multi-scale features of the sound extracted by the multi-scale feature extraction unit 104 are input to the projection unit 1086, thereby obtaining a first probability vector P_S1 = [p_11, p_12, p_13, p_14, p_15, p_16], where p_11, p_12, p_13, p_14, p_15 and p_16 respectively represent the probabilities that the sound belongs to the sub-categories S1, S2, S3, S4, S5 and S6.

The sub-categories corresponding to the top m (m = 2 in this example) largest elements in the obtained first probability vector are selected as candidate sub-categories (assume that the candidate sub-categories selected in this first round of processing are S1 and S2).

m = 2 sub-category vectors [S1] and [S2] are generated based on the candidate sub-categories.

For each of the m = 2 sub-category vectors, a score is calculated based on the element in the first probability vector that corresponds to the element included in the sub-category vector. For example, for each sub-category vector, the element in the first probability vector corresponding to the element included in the sub-category vector may be used as the score of the sub-category vector. For example, for the sub-category vectors [S1] and [S2], the corresponding scores may be calculated as p_11 and p_12, respectively. However, one skilled in the art may also calculate the score of a sub-category vector based on the first probability vector in other ways according to actual needs, which will not be described in detail here.

In the 2nd round of processing, the 1st secondary classifier may perform the following processing:

The top n (n = 1 in this example) highest-scoring sub-category vectors among the sub-category vectors generated by the 1st round of processing are selected as candidate sub-category vectors (assume that the candidate sub-category vector selected in this 2nd round of processing is [S1]).

Then, for the candidate sub-category vector [S1]: the elements of the vector v corresponding to the elements (i.e., the sub-categories) included in the candidate sub-category vector [S1] are set to 1 and the remaining elements of the vector v are set to 0 (for the candidate sub-category vector [S1], the vector v is set to [1,0,0,0,0,0]); the vector v so set is input to the first transformation unit 1082, and a 2nd probability vector P_S2 = [p_21, p_22, p_23, p_24, p_25, p_26] is thereby obtained based on the vector v and the multi-scale features of the sound, where p_21, p_22, p_23, p_24, p_25 and p_26 respectively represent the probabilities that the sound belongs to the sub-categories S1, S2, S3, S4, S5 and S6; the sub-categories corresponding to the top m = 2 largest elements in the 2nd probability vector P_S2 are selected as candidate sub-categories (in this example, assume that the candidate sub-categories selected in the 2nd round of processing are S3 and S4); and each of the candidate sub-categories is added to the candidate sub-category vector [S1], respectively, to newly generate sub-category vectors (in this example, the newly generated sub-category vectors are [S1, S3] and [S1, S4]), and the score of each newly generated sub-category vector is calculated based on the score of the candidate sub-category vector and the element in the 2nd probability vector P_S2 corresponding to the newly added candidate sub-category. For example, the product of the score of the candidate sub-category vector and the element in the 2nd probability vector P_S2 corresponding to the newly added candidate sub-category may be used as the score of the newly generated sub-category vector. For example, for the newly generated sub-category vector [S1, S3], the score may be calculated as the product of the score of the candidate sub-category vector [S1] and the element in the 2nd probability vector corresponding to the newly added candidate sub-category S3 (i.e., p_23).

In subsequent rounds of processing (e.g., the 3rd round, the 4th round, etc.), the 1st secondary classifier may perform processing similar to that in the 2nd round of processing described above, which will not be described again.

Assume that in the 4th round of processing, the score of each of the newly generated sub-category vectors is smaller than the score of each of the candidate sub-category vectors selected from the sub-category vectors generated in the 3rd round of processing; the iterative processing is then terminated. In this case, the 1st secondary classifier may obtain an N-dimensional probability vector based on the probabilities corresponding to the respective elements included in the highest-scoring sub-category vector among the sub-category vectors generated in the 2nd-to-last round of processing (i.e., the 3rd round of processing). For example, the secondary classifier may set the elements of the N-dimensional probability vector corresponding to the respective elements included in the highest-scoring sub-category vector generated in the 3rd round of processing to the probabilities corresponding to those elements, and set the remaining elements of the N-dimensional probability vector to 0, thereby obtaining the N-dimensional probability vector. For example, assume that the highest-scoring sub-category vector generated in the 3rd round of processing is [S1, S3, S5]; the N-dimensional probability vector may then be set to [p_11, 0, p_23, 0, p_35, 0, 0, 0, …], where p_35 is the probability corresponding to S5 obtained in the 3rd round of processing.

Note that two or more identical elements may be included in the same sub-category vector. For example, in the case where the highest-scoring sub-category vector generated in the 2nd-to-last round of processing includes two or more identical elements, the element in the N-dimensional probability vector corresponding to that element may be set to the highest probability corresponding to that element. For example, in the above example, in the case where the highest-scoring sub-category vector generated in the 3rd round of processing includes two or more identical elements (e.g., element S3, when the highest-scoring sub-category vector is [S1, S3, S3]), the element in the N-dimensional probability vector corresponding to S3 is set to the larger of the probability corresponding to the 2nd element S3 of the sub-category vector [S1, S3, S3] (i.e., the probability p_23 corresponding to S3 obtained in the 2nd round of processing) and the probability corresponding to the 3rd element S3 (i.e., the probability p_33 corresponding to S3 obtained in the 3rd round of processing).

In addition, although m = 2 and n = 1 are assumed in the above example for convenience of description, in practice a person skilled in the art may select appropriate values of m and n according to actual needs.
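Putting the above worked example together, the iterative processing of one secondary classifier amounts to a beam-search style decoding. The sketch below assumes a callable step_probs(path) that returns the probability vector over this root category's sub-categories for a given sub-category vector (for example, one SecondaryClassifierStep call conditioned on the multi-scale features, with an empty path meaning a randomly initialized current label); that wrapper and the max_rounds safety cap are assumptions of this sketch, and mapping the returned best sub-category vector into the N-dimensional probability vector is omitted:

    import numpy as np

    def run_iterative_process(step_probs, m=2, n=1, max_rounds=10):
        # round 1: the top-m sub-categories become one-element sub-category vectors
        probs = step_probs([])
        top = np.argsort(probs)[::-1][:m]
        beams = [([int(s)], float(probs[s])) for s in top]      # (sub-category vector, score)

        for _ in range(max_rounds):                              # rounds i >= 2
            candidates = sorted(beams, key=lambda b: b[1], reverse=True)[:n]
            new_beams = []
            for path, score in candidates:
                probs = step_probs(path)
                top = np.argsort(probs)[::-1][:m]
                for s in top:
                    # score of the extended vector = old score * probability of the new element
                    new_beams.append((path + [int(s)], score * float(probs[s])))
            # terminate when every newly generated vector scores below every candidate,
            # returning the best sub-category vector generated in the second-to-last round
            if max(b[1] for b in new_beams) < min(c[1] for c in candidates):
                return max(candidates, key=lambda b: b[1])
            beams = new_beams
        return max(beams, key=lambda b: b[1])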

According to one embodiment of the present disclosure, the information processing apparatus 100 may include a plurality of multi-scale feature extraction units 104. In this case, the information processing apparatus 100 may further include a multi-scale feature fusion unit 114, and the multi-scale feature fusion unit 114 may be configured to fuse the plurality of multi-scale features extracted by the plurality of multi-scale feature extraction units 104 and obtain fused multi-scale features. The secondary classification unit 108 may be further configured to secondarily classify the sound based on the primary classification result of the sound and the fused multi-scale features to obtain the secondary classification result. As an example, the multi-scale feature fusion unit 114 may be further configured to fuse the plurality of multi-scale features extracted by the plurality of multi-scale feature extraction units 104 by concatenation.

Fig. 5 is a block diagram showing a functional configuration example of the multi-scale feature extraction unit 104 according to an embodiment of the present disclosure. As shown in fig. 5, each multi-scale feature extraction unit 104 may include: a plurality of feature extraction subunits 1042, each feature extraction subunit 1042 being configured to extract a feature of the sound based on the basic feature extracted by the basic feature extraction unit 102; and a first feature fusion subunit 1044 which may be configured to fuse the plurality of features of the sound extracted by the plurality of feature extraction subunits 1042, output the fusion result as the multi-scale features of the sound extracted by the respective multi-scale feature extraction units 104, and output the fusion result to the next multi-scale feature extraction unit 104 as an input of the next multi-scale feature extraction unit.

For example, the feature extraction sub-unit 1042 may be a two-dimensional convolution unit 2D Conv, but is not limited thereto. In the example case where the feature extraction sub-unit 1042 is a two-dimensional convolution unit 2D Conv, each multi-scale feature extraction unit 104 may further include a pre-processing sub-unit 1048, which pre-processing sub-unit 1048 may be configured to process the input of the multi-scale feature extraction unit 104 to reduce the dimensionality of the input. For example, the preprocessing sub-unit 1048 may be a 1 × 1 convolution unit (1 × 1Conv), but is not limited thereto.

It is noted that the specific structural parameters of each feature extraction subunit 1042 may be different.

As an example, each multi-scale feature extraction unit 104 may further include: a plurality of global pooling subunits 1046, each of which may correspond to a different one of the plurality of feature extraction subunits 1042, and each global pooling subunit 1046 may be configured to globally pool the features of the sound extracted by the feature extraction subunit 1042 corresponding to that global pooling subunit 1046; and a second feature fusion subunit 1050 that may be configured to fuse the plurality of features of the sound pooled via the global pooling subunits 1046 and output the fusion result as the multi-scale features of the sound extracted by the corresponding multi-scale feature extraction unit 104. By way of illustration and not limitation, the second feature fusion subunit 1050 may be further configured to fuse, by concatenation, the plurality of features of the sound pooled via the global pooling subunits 1046.

The multi-scale feature extraction unit 104 will be described in detail below with reference to an architecture example of the multi-scale feature extraction unit 104 shown in fig. 6. Fig. 6 is a diagram showing an example of the architecture of the multi-scale feature extraction unit 104 included in the information processing apparatus 100.

As shown in fig. 6, the information processing apparatus 100 may include three multi-scale feature extraction units 104a, 104b, and 104c, and the multi-scale feature extraction unit 104a may include three feature extraction sub-units 1042a, 1042b, and 1042c. In addition, the multi-scale feature extraction unit 104a may further include a preprocessing subunit 1048a and/or global pooling subunits 1046a, 1046b, and 1046c corresponding to the feature extraction subunits 1042a, 1042b, and 1042c, respectively.

In fig. 6, for convenience of description, only an example configuration of the multi-scale feature extraction unit 104a is shown. The multi-scale feature extraction units 104b and 104c may have a configuration similar to that of the multi-scale feature extraction unit 104a. Note that the multi-scale feature extraction units 104a, 104b, and 104c may include different numbers of feature extraction sub-units. In addition, the specific structural parameters of the multi-scale feature extraction units 104a, 104b, and 104c may be different.

For example, the base features extracted by the base feature extraction unit 102 are input into the multi-scale feature extraction unit 104a. In the case where the multi-scale feature extraction unit 104a does not include the preprocessing subunit 1048a, the base features are input to the feature extraction subunit 1042a. In the case where the multi-scale feature extraction unit 104a includes the preprocessing subunit 1048a, the base features are input to the preprocessing subunit 1048a, the preprocessing subunit 1048a processes the base features to reduce their dimensionality, and the processed base features are then input to the feature extraction subunit 1042a. For example, the size of the base features may be (C_1, H_1, W_1), where C_1, H_1 and W_1 respectively represent the number of feature maps included in the base features, the length in the time direction, and the length in the frequency direction. The preprocessing subunit 1048a may process the input base features to obtain processed base features of size (CM_1, HM_1, WM_1), where CM_1 represents the number of output channels (i.e., the number of output feature maps) of the preprocessing subunit 1048a, HM_1 and WM_1 respectively represent the length in the time direction and the length in the frequency direction of the processed base features, and HM_1 and WM_1 may be equal to H_1 and W_1, respectively.

For convenience of description, it is assumed here that the inputs and outputs of the feature extraction subunits 1042a, 1042b and 1042c all have the size (CM_1, HM_1, WM_1). However, in practice, the sizes of the inputs of the feature extraction subunits 1042a, 1042b and 1042c may differ, and the sizes of their outputs may also differ.

As shown in fig. 6, the features of the sound extracted by the feature extraction sub-unit 1042a may be input to the next feature extraction sub-unit 1042b, and the features of the sound extracted by the feature extraction sub-unit 1042b may be input to the next feature extraction sub-unit 1042c.

The first feature fusion subunit 1044 (see fig. 5) may fuse (e.g., concatenate by channel) the respective features of the sound extracted via the feature extraction subunits 1042a, 1042b and 1042c, and output the fusion result to the next multi-scale feature extraction unit 104b as the input of the multi-scale feature extraction unit 104b. The size of the features fused by the first feature fusion subunit 1044 may be (3·CM_1, HM_1, WM_1). In the case where the multi-scale feature extraction unit 104a does not include the global pooling subunits 1046a, 1046b and 1046c, the first feature fusion subunit 1044 outputs the fusion result as the multi-scale features extracted by the multi-scale feature extraction unit 104a. On the other hand, in the case where the multi-scale feature extraction unit 104a includes the global pooling subunits 1046a, 1046b and 1046c, the global pooling subunits 1046a, 1046b and 1046c pool the respective features of the sound extracted via the feature extraction subunits 1042a, 1042b and 1042c, respectively, to obtain pooled features each of size (CM_1, 1, 1). The second feature fusion subunit 1050 (see fig. 5) may fuse (e.g., concatenate by channel) the features of the sound pooled via the global pooling subunits 1046a, 1046b and 1046c, and output the fusion result as the multi-scale features of the sound extracted by the multi-scale feature extraction unit 104a. The size of the features fused by the second feature fusion subunit 1050 may be (3·CM_1, 1, 1).

Similarly, in the multi-scale feature extraction units 104b and 104c, similar processing to that in the multi-scale feature extraction unit 104a may be performed, and the description will not be repeated here.

Further, in the case where the multi-scale feature extraction units 104b and 104c do not include the global pooling subunit 1046, the sizes of the multi-scale features extracted by the multi-scale feature extraction units 104b and 104c may be (3·CM_2, HM_2, WM_2) and (3·CM_3, HM_3, WM_3), respectively, where CM_2, HM_2 and WM_2 respectively represent the number of feature maps included in the features extracted by each feature extraction subunit of the multi-scale feature extraction unit 104b, the length in the time direction, and the length in the frequency direction, and CM_3, HM_3 and WM_3 respectively represent the number of feature maps included in the features extracted by each feature extraction subunit of the multi-scale feature extraction unit 104c, the length in the time direction, and the length in the frequency direction. In the case where the multi-scale feature extraction units 104b and 104c include the global pooling subunit 1046, the sizes of the multi-scale features extracted by the multi-scale feature extraction units 104b and 104c may be (3·CM_2, 1, 1) and (3·CM_3, 1, 1), respectively.

The multi-scale feature fusion unit 114 (see fig. 1) may fuse (e.g., concatenate by channel) the multi-scale features extracted by the multi-scale feature extraction units 104a, 104b and 104c and obtain the fused multi-scale features. In the case where the multi-scale feature extraction units 104a, 104b and 104c include global pooling subunits, the size of the fused multi-scale features may be (3·CM_1 + 3·CM_2 + 3·CM_3, 1, 1).
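A compact sketch of one multi-scale feature extraction unit with the structure of fig. 6 (1×1 pre-processing convolution, three stacked two-dimensional convolution sub-units, per-sub-unit global pooling, and channel-wise concatenation) is given below; the channel counts and kernel sizes are assumptions of this sketch:

    import torch
    import torch.nn as nn

    class MultiScaleFeatureExtractionUnit(nn.Module):
        def __init__(self, in_channels, mid_channels):
            super().__init__()
            self.pre = nn.Conv2d(in_channels, mid_channels, kernel_size=1)   # preprocessing subunit (1x1 Conv)
            self.subunits = nn.ModuleList([
                nn.Sequential(nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
                              nn.BatchNorm2d(mid_channels), nn.ReLU())
                for _ in range(3)                                             # feature extraction subunits
            ])
            self.gap = nn.AdaptiveAvgPool2d(1)                                # global pooling subunit

        def forward(self, x):
            x = self.pre(x)
            feats = []
            for sub in self.subunits:
                x = sub(x)                       # each sub-unit feeds the next one
                feats.append(x)
            to_next = torch.cat(feats, dim=1)    # first fusion: (3*CM, HM, WM), input of the next unit
            pooled = torch.cat([self.gap(f).flatten(1) for f in feats], dim=1)   # second fusion: (3*CM,)
            return to_next, pooled

    # Chaining three such units as in fig. 6; the multi-scale feature fusion unit would then
    # concatenate the three pooled outputs into one fused multi-scale feature vector.
    unit_a = MultiScaleFeatureExtractionUnit(in_channels=128, mid_channels=64)
    unit_b = MultiScaleFeatureExtractionUnit(in_channels=3 * 64, mid_channels=64)
    unit_c = MultiScaleFeatureExtractionUnit(in_channels=3 * 64, mid_channels=64)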

It is to be noted that, although in the example shown in fig. 6, the information processing apparatus 100 includes three multi-scale feature extraction units 104a, 104b, and 104c, and each of the multi-scale feature extraction units 104a, 104b, and 104c includes three feature extraction sub-units 1042a, 1042b, and 1042c, a person skilled in the art can select an appropriate number of multi-scale feature extraction units and an appropriate number of feature extraction sub-units according to actual needs.

The information processing apparatus according to the embodiment of the present disclosure has been described above with reference to fig. 1 to 6, and in correspondence with the embodiment of the information processing apparatus described above, the present disclosure also provides an embodiment of the following information processing method.

Fig. 7 is a flowchart illustrating an example of a flow of an information processing method 700 according to an embodiment of the present disclosure. As shown in fig. 7, an information processing method 700 according to an embodiment of the present disclosure may include a base feature extraction step S702, a multi-scale feature extraction step S704, a primary classification step S706, a secondary classification step S708, and a classification result fusion step S710. The information processing method 700 may begin at a start step S701 and end at an end step S712.

In the basic feature extraction step S702, the basic feature of the sound may be extracted. For example, in the basic feature extraction step S702, the basic feature of the sound may be extracted via a neural network (such as a convolutional neural network, a recurrent neural network RNN, or the like). For example, the basic feature extracting step S702 may be implemented by the basic feature extracting unit 102 of the information processing apparatus 100, and details are not described herein again.

In the multi-scale feature extraction step S704, multi-scale features of the sound may be extracted based on the basic features extracted by the basic feature extraction step S702. For example, the multi-scale feature extraction step S704 may be implemented by the multi-scale feature extraction unit 104 of the information processing apparatus 100, and details are not described herein again.

In the primary classification step S706, the sound may be primarily classified based on the basic features extracted by the basic feature extraction step S702 to obtain a primary classification result. For example, the sound may be primarily classified via a neural network (such as a convolutional neural network). For example, the primary classification step S706 may be implemented by the primary classification unit 106 of the information processing apparatus 100, and details thereof are not described herein again.

In the secondary classification step S708, the sound may be secondarily classified based on the primary classification result and the multi-scale features of the sound to obtain a secondary classification result. For example, the sound may be sub-classified via a recurrent neural network to obtain a sub-classification result. For example, the secondary classification step S708 can be implemented by the secondary classification unit 108 of the information processing apparatus 100, and details thereof are not described herein.

In the classification result fusion step S710, the primary classification result and the secondary classification result of the sound may be fused to obtain a final classification result of the sound. For example, the classification result fusion step S710 can be implemented by the classification result fusion unit 110 of the information processing apparatus 100, and details are not described herein again.
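For orientation only, the sketch below chains the five steps S702 to S710 into one function. All callables, the threshold value, and the normalisation in the last line are assumptions of this sketch rather than elements prescribed by the method itself:

    import numpy as np

    def classify_sound(mel, base_extractor, multi_scale_extractor, primary_classifier,
                       secondary_classifiers, n_subcategories, threshold=0.4):
        base = base_extractor(mel)                           # S702: basic feature extraction
        multi_scale = multi_scale_extractor(base)            # S704: multi-scale feature extraction
        root_probs = np.asarray(primary_classifier(base))    # S706: primary classification
        # S708: secondary classification; only classifiers whose root probability reaches
        # the (assumed) threshold are run, the others contribute all-zero vectors
        secondary = np.zeros((len(secondary_classifiers), n_subcategories))
        for j, clf in enumerate(secondary_classifiers):
            if root_probs[j] >= threshold:
                secondary[j] = clf(multi_scale)
        # S710: fuse by weighting the secondary results with the root-class probabilities
        weighted = (root_probs[:, None] * secondary).sum(axis=0)
        return weighted / (root_probs.sum() + 1e-12)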

Sound carries a large amount of environmental information as well as information of various events in the environment. By analyzing the sounds, events occurring in the environment may be distinguished and/or identified. Sound has features of different scales, however in conventional classification of sound events such features of different scales of sound are not used. Similar to the information processing apparatus 100 according to the above-described embodiment of the present disclosure, the information processing method 700 according to this embodiment of the present disclosure extracts basic features and multi-scale features of sound, primary-classifies and secondary-classifies the sound based on the extracted basic features and multi-scale features, and acquires a final classification result based on the primary classification result and the secondary classification result, making it possible to distinguish more classes of sound events and/or improve the accuracy of sound event classification.

According to an embodiment of the present disclosure, in the primary classification step S706, the sound may be primarily classified with respect to a plurality of root classes based on the basic features. In this case, the primary classification may be regarded as a coarse classification. For example, the primary classification result obtained in the primary classification step S706 may include a probability that the sound belongs to each of the plurality of root classes. In the secondary classification step S708, the sound may be secondarily classified with respect to a plurality of subcategories according to the primary classification result and the multi-scale features. In this case, the secondary classification may be regarded as a fine classification. For example, the secondary classification step S708 may include a plurality of secondary classification sub-steps in one-to-one correspondence with the plurality of root classes. For each of the plurality of secondary classification sub-steps, the secondary classification sub-step is performed in the case where the probability that the sound belongs to the root class corresponding to that sub-step is equal to or greater than a predetermined threshold.
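
The gating described above can be summarized by the following sketch; the threshold value, the function names, and the representation of each sub-step as a callable are illustrative assumptions.

```python
# Sketch of the threshold gating: a secondary classification sub-step runs only
# for root classes whose primary probability reaches the predetermined threshold.
from typing import Callable, Dict, List

def run_secondary_steps(
    primary_probs: List[float],                        # one probability per root class
    secondary_steps: List[Callable[[], List[float]]],  # one sub-step per root class
    threshold: float = 0.5,                            # assumed value
) -> Dict[int, List[float]]:
    results = {}
    for root_idx, prob in enumerate(primary_probs):
        if prob >= threshold:              # activate only sufficiently likely root classes
            results[root_idx] = secondary_steps[root_idx]()
    return results
```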

The predetermined threshold may be set by a person skilled in the art according to actual needs. For example, a suitable threshold may be determined empirically for different scenarios and tasks.

By obtaining a primary classification result of the sound in relation to the root class as described above, and further sub-classifying the sound based on the primary classification result, it is possible to distinguish more classes of sound events and/or to improve the accuracy of sound event classification.

According to one embodiment of the present disclosure, each of the plurality of root categories may include a plurality of subcategories, and the total number of subcategories included in the plurality of root categories is N, where N is a natural number greater than 1. For example, the subcategories included in different root categories may overlap. The secondary classification result may comprise at least one N-dimensional probability vector obtained by performing the secondary classification sub-steps. Each element in each N-dimensional probability vector represents the probability that the sound belongs to the respective subcategory and is set to 0 in advance. In each secondary classification sub-step, the corresponding N-dimensional probability vector may be obtained by performing the following iterative process:

In the first round of processing, a first probability vector may be obtained based on the multi-scale features of the sound, wherein each element in the first probability vector represents the probability that the sound belongs to a respective subcategory included in the root category corresponding to the secondary classification sub-step; the subcategories corresponding to the m largest elements in the first probability vector may be selected as candidate subcategories; m subcategory vectors may be generated based on the candidate subcategories, where m is a natural number greater than 1; and, for each of the m subcategory vectors, a score may be calculated based on the elements in the first probability vector that correspond to the elements included in that subcategory vector.

In the i-th round of processing (i ≥ 2), the top n highest-scoring subcategory vectors among the subcategory vectors generated in the previous round of processing may be selected as candidate subcategory vectors, where n is a natural number greater than 1 and n ≤ m; and, for each of the candidate subcategory vectors: an i-th probability vector may be obtained based on the candidate subcategory vector and the multi-scale features of the sound, wherein each element in the i-th probability vector represents the probability that the sound belongs to a respective subcategory included in the root category corresponding to the secondary classification sub-step; the subcategories corresponding to the m largest elements in the i-th probability vector may be selected as candidate subcategories; and each of the candidate subcategories may be added to the candidate subcategory vector, respectively, to newly generate a subcategory vector, the score of which may be calculated based on the score of the candidate subcategory vector and the element in the i-th probability vector corresponding to the newly added candidate subcategory.

In the above iterative process performed in the secondary classification sub-step, the iterative process terminates when the score of every newly generated subcategory vector is less than the score of every candidate subcategory vector selected from the subcategory vectors generated in the previous round of processing. In the secondary classification sub-step, the corresponding N-dimensional probability vector may then be obtained based on the probabilities corresponding to the elements included in the highest-scoring subcategory vector among the subcategory vectors generated in the second-to-last round of processing.
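
The following sketch illustrates one reading of this iterative procedure. The `step_fn` interface (which would wrap the recurrent network that maps the multi-scale features and the current candidate subcategory vector to a probability vector), the product-based scoring rule, and the safety cap on the number of rounds are assumptions, not the exact formulation of the disclosure.

```python
# A sketch of the beam-search-like iterative process of one secondary
# classification sub-step. All names and the scoring rule are assumptions.
from typing import Callable, List, Sequence, Tuple

def secondary_classification_substep(
    step_fn: Callable[[Sequence[int]], Sequence[float]],  # prefix -> probs over this root's subcategories
    sub_to_global: Sequence[int],  # local subcategory index -> position in the N-dim vector
    n_total: int,                  # N, total number of subcategories over all root categories
    m: int = 3,                    # expansions kept per probability vector
    n: int = 2,                    # candidate vectors kept per round (n <= m)
    max_rounds: int = 10,          # safety cap only; the termination test below normally fires first
) -> List[float]:
    # Round 1: one-element subcategory vectors built from the m largest probabilities.
    probs = step_fn([])
    top = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)[:m]
    # Each entry holds (subcategory indices, their probabilities, score).
    beams: List[Tuple[List[int], List[float], float]] = [([k], [probs[k]], probs[k]) for k in top]

    for _ in range(max_rounds):
        candidates = sorted(beams, key=lambda b: b[2], reverse=True)[:n]
        new_beams: List[Tuple[List[int], List[float], float]] = []
        for cats, cat_probs, score in candidates:
            probs_i = step_fn(cats)
            top_i = sorted(range(len(probs_i)), key=lambda k: probs_i[k], reverse=True)[:m]
            for k in top_i:
                # Combine the candidate's score with the newly added probability
                # (a product is assumed here).
                new_beams.append((cats + [k], cat_probs + [probs_i[k]], score * probs_i[k]))
        # Terminate when every newly generated vector scores below every candidate vector.
        if max(b[2] for b in new_beams) < min(b[2] for b in candidates):
            break  # `beams` now holds the vectors of the second-to-last round
        beams = new_beams

    # Highest-scoring subcategory vector generated in the second-to-last round.
    best_cats, best_probs, _ = max(beams, key=lambda b: b[2])
    result = [0.0] * n_total  # N-dimensional probability vector preset to 0
    for local_idx, p in zip(best_cats, best_probs):
        result[sub_to_global[local_idx]] = p
    return result
```

Here `sub_to_global` is a hypothetical mapping from the root category's local subcategory indices to positions in the N-dimensional vector shared by all root categories.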

For example, the secondary classification sub-step may be implemented by a secondary classifier included in the secondary classification unit 108 of the information processing apparatus 100, and details will not be described again.

For example, in the classification result fusion step S710, the secondary classification results may be weighted-averaged based on the primary classification result to obtain the final classification result. For example, the primary classification result and the secondary classification result may be fused according to the above formula (1) to obtain the final classification result R_fusion.
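
Formula (1) itself appears earlier in the disclosure and is not reproduced here; the sketch below shows only one plausible weighted-average fusion, using the root-class probabilities of the activated secondary classifiers as weights, and all names are illustrative.

```python
# A hedged sketch of fusing the primary and secondary classification results by
# weighted averaging. This is not formula (1) of the disclosure, only one
# plausible form of such a fusion.
from typing import Dict, List

def fuse_results(
    primary_probs: List[float],                  # probability per root class
    secondary_results: Dict[int, List[float]],   # root index -> N-dim probability vector
) -> List[float]:
    if not secondary_results:
        return []
    n_total = len(next(iter(secondary_results.values())))
    fused = [0.0] * n_total
    weight_sum = sum(primary_probs[r] for r in secondary_results) or 1.0
    for root_idx, vec in secondary_results.items():
        w = primary_probs[root_idx] / weight_sum   # weight by root-class probability
        for j, p in enumerate(vec):
            fused[j] += w * p
    return fused
```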

By performing the iterative process described above to obtain at least one N-dimensional probability vector and further fusing the obtained at least one N-dimensional probability vector with the primary classification result to obtain a final classification result, it is possible to obtain probabilities that sounds belong to respective subcategories, and based thereon, it is possible to classify sound events more accurately.

According to an embodiment of the present disclosure, the information processing method 700 may further include a multi-scale feature fusion step S707, in which the plurality of multi-scale features extracted by the multi-scale feature extraction step S704 may be fused to obtain fused multi-scale features. In this case, in the secondary classification step S708, the sound may be secondarily classified based on the primary classification result of the sound and the fused multi-scale features to obtain the secondary classification result. For example, in the multi-scale feature fusion step S707, three multi-scale features extracted by the multi-scale feature extraction step S704 may be fused to obtain the fused multi-scale features. For example, the multi-scale feature fusion step S707 may be implemented by the multi-scale feature fusion unit 114 of the information processing apparatus 100, and details will not be described again.
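
A minimal sketch of this fusion step follows, assuming the multi-scale features are fused by concatenation (consistent with the concatenated fusion mentioned for the feature fusion sub-units, but still an assumption for this particular step).

```python
# Assumed concatenation-based fusion of several multi-scale features,
# e.g. the three (batch, dim_i) outputs of three multi-scale feature extraction units.
import torch

def fuse_multiscale_features(features):
    return torch.cat(features, dim=1)
```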

According to an embodiment of the present disclosure, the multi-scale feature extraction step S704 may include: a feature extraction sub-step in which features of the sound may be extracted based on the basic features extracted by the basic feature extraction step S702; and a first feature fusion sub-step in which a plurality of features of the sound extracted by the feature extraction sub-step may be fused, with the fusion result output as a multi-scale feature of the sound.

For example, in the feature extraction sub-step, features of the sound may be extracted via a two-dimensional convolution unit 2D Conv. In this case, the multi-scale feature extraction step S704 may further include a preprocessing sub-step in which the input for the two-dimensional convolution unit 2D Conv is processed to reduce the dimensionality of the input.

As an example, the multi-scale feature extraction step S704 may further include: a global pooling sub-step in which a plurality of features of the sound extracted by the feature extraction sub-step may be globally pooled; and a second feature fusion sub-step in which the globally pooled features of the sound may be fused and the fusion result may be output as a multi-scale feature of the sound. For example, in the second feature fusion sub-step, three globally pooled features of the sound may be fused, and the fusion result may be output as a multi-scale feature of the sound.
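
Putting these sub-steps together, one multi-scale feature extraction unit might be sketched as below; the three branch kernel sizes, the channel counts, and the use of average pooling as the global pooling operator are assumptions.

```python
# A sketch combining the sub-steps above: a pre-processing 1x1 convolution that
# reduces the input dimensionality, three parallel 2D convolutions with different
# receptive fields (the "scales"), global pooling of each branch, and
# concatenation as the fused multi-scale feature. All sizes are illustrative.
import torch
import torch.nn as nn

class MultiScaleFeatureUnit(nn.Module):
    def __init__(self, in_channels: int = 64, branch_channels: int = 32):
        super().__init__()
        self.preprocess = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.branches = nn.ModuleList([
            nn.Conv2d(branch_channels, branch_channels, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)              # three feature extraction sub-units
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)  # global pooling sub-units

    def forward(self, basic_features: torch.Tensor) -> torch.Tensor:
        x = self.preprocess(basic_features)
        pooled = [self.pool(torch.relu(b(x))).flatten(1) for b in self.branches]
        return torch.cat(pooled, dim=1)      # second feature fusion: concatenation
```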

For example, the feature extraction sub-step, the first feature fusion sub-step, the global pooling sub-step, the pre-processing sub-step, and the second feature fusion sub-step may be implemented by the feature extraction sub-unit 1042, the first feature fusion sub-unit 1044, the global pooling sub-unit 1046, the pre-processing sub-unit 1048, and the second feature fusion sub-unit 1050 of the multi-scale feature extraction unit 104 of the information processing apparatus 100, and details thereof will not be described again.

It should be noted that although the functional configurations and operations of the information processing apparatus and the information processing method according to the embodiments of the present disclosure are described above, this is merely an example and not a limitation, and a person skilled in the art may modify the above embodiments according to the principles of the present disclosure, for example, functional modules and operations in the respective embodiments may be added, deleted, or combined, and such modifications fall within the scope of the present disclosure.

In addition, it should be further noted that the method embodiments herein correspond to the apparatus embodiments described above, and therefore, the contents that are not described in detail in the method embodiments may refer to the descriptions of the corresponding parts in the apparatus embodiments, and the description is not repeated here.

In addition, the present disclosure also provides a storage medium and a program product. It should be understood that the machine-executable instructions in the storage medium and the program product according to the embodiments of the present disclosure may also be configured to perform the above-described information processing method, and thus, the contents not described in detail herein may refer to the description of the corresponding parts previously, and the description will not be repeated herein.

Accordingly, storage media for carrying the above-described program products comprising machine-executable instructions are also included in the present disclosure. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.

Further, it should be noted that the above series of processes and means may also be implemented by software and/or firmware. In the case of implementation by software and/or firmware, a program constituting the software is installed from a storage medium or a network to a computer having a dedicated hardware structure, such as a general-purpose personal computer 800 shown in fig. 8, which is capable of executing various functions and the like when various programs are installed.

In fig. 8, a Central Processing Unit (CPU) 801 executes various processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 to a Random Access Memory (RAM) 803. In the RAM 803, data necessary when the CPU 801 executes various processes and the like is also stored as necessary.

The CPU 801, the ROM 802, and the RAM 803 are connected to each other via a bus 804. An input/output interface 805 is also connected to the bus 804.

The following components are connected to the input/output interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, and the like. The communication section 809 performs communication processing via a network such as the internet.

A drive 810 is also connected to the input/output interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed in the storage portion 808 as necessary.

In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 811.

It will be understood by those skilled in the art that such a storage medium is not limited to the removable medium 811 shown in fig. 8 in which the program is stored and which is distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 811 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disk read only memory (CD-ROM) and a Digital Versatile Disk (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 802, a hard disk included in the storage section 808, or the like, in which programs are stored and which are distributed to users together with the apparatus including them.

The preferred embodiments of the present disclosure are described above with reference to the drawings, but the present disclosure is of course not limited to the above examples. Various changes and modifications within the scope of the appended claims may be made by those skilled in the art, and it should be understood that these changes and modifications naturally will fall within the technical scope of the present disclosure.

For example, a plurality of functions included in one unit may be implemented by separate devices in the above embodiments. Alternatively, a plurality of functions implemented by a plurality of units in the above embodiments may be implemented by separate devices, respectively. In addition, one of the above functions may be implemented by a plurality of units. Needless to say, such a configuration is included in the technical scope of the present disclosure.

In this specification, the steps described in the flowcharts include not only the processing performed in time series in the described order but also the processing performed in parallel or individually without necessarily being performed in time series. Further, even in the steps processed in time series, needless to say, the order can be changed as appropriate.

In addition, the technique according to the present disclosure can also be configured as follows.

Supplementary note 1. an information processing apparatus comprising:

a basic feature extraction unit configured to extract a basic feature of a sound;

a multi-scale feature extraction unit configured to extract multi-scale features of the sound based on the basic features extracted by the basic feature extraction unit;

a primary classification unit configured to perform primary classification on the sound based on the basic features extracted by the basic feature extraction unit to obtain a primary classification result;

a secondary classification unit configured to secondarily classify the sound based on the primary classification result and the multi-scale features of the sound to obtain a secondary classification result; and

a classification result fusion unit configured to fuse the primary classification result and the secondary classification result of the sound to obtain a final classification result of the sound.

Supplementary note 2. the information processing apparatus according to supplementary note 1, wherein the preliminary classification result includes a probability that the sound belongs to each of a plurality of root classes,

wherein the secondary classification unit includes a plurality of secondary classifiers in one-to-one correspondence with the plurality of root classes, and

Wherein, for each of the plurality of secondary classifiers, the secondary classifier is activated if the probability that the sound belongs to the root class to which the secondary classifier corresponds is equal to or greater than a predetermined threshold.

Supplementary note 3. the information processing apparatus according to supplementary note 2, wherein each of the plurality of root categories includes a plurality of subcategories, and a total number of subcategories included in the plurality of root categories is N, where N is a natural number greater than 1,

wherein the secondary classification result comprises at least one N-dimensional probability vector obtained by the activated secondary classifier, each element of each N-dimensional probability vector representing a probability that the sound belongs to a respective sub-category and being preset to 0,

wherein each of the secondary classifiers is configured to obtain, if activated, a respective N-dimensional probability vector by performing an iterative process,

in the first round of processing, the secondary classifier:

obtaining a first probability vector based on the multi-scale features of the sound, wherein each element in the first probability vector represents a probability that the sound belongs to each sub-category included in a root category corresponding to the secondary classifier;

selecting subcategories corresponding to the first m largest elements in the first probability vector as candidate subcategories;

generating m sub-category vectors based on the candidate sub-categories, wherein m is a natural number greater than 1; and

for each of the m sub-category vectors, calculating a score for the sub-category vector based on elements in the first probability vector that correspond to elements that the sub-category vector includes;

in the i ≧ 2 rounds of processing, the secondary classifier:

selecting the top n most-scored subcategory vectors of the subcategory vectors generated by the previous round of processing as candidate subcategory vectors, wherein n is a natural number greater than 1 and n is less than or equal to m; and

for each of the candidate subcategory vectors: obtaining an ith probability vector based on the candidate sub-category vector and the multi-scale features of the sound, wherein each element in the ith probability vector represents the probability that the sound belongs to each sub-category included in the root category corresponding to the secondary classifier; selecting the subcategories corresponding to the first m largest elements in the ith probability vector as candidate subcategories; and adding each of the candidate subcategories to the candidate subcategory vector, respectively, to newly generate a subcategory vector, and calculating a score of the newly generated subcategory vector based on the score of the candidate subcategory vector and elements of the ith probability vector corresponding to the newly added candidate subcategory,

wherein the iterative process terminates if the score of each of the newly generated sub-category vectors is less than the score of each of the candidate sub-category vectors selected from the sub-category vectors generated from the previous round of processing, and wherein the secondary classifier obtains a respective N-dimensional probability vector based on the probabilities corresponding to the elements included in the highest-scoring one of the sub-category vectors generated in the second-to-last round of processing.

Supplementary note 4. the information processing apparatus according to any one of supplementary notes 1 to 3, wherein the information processing apparatus includes a plurality of the multi-scale feature extraction units,

wherein the information processing apparatus further includes a multi-scale feature fusion unit configured to fuse a plurality of multi-scale features extracted by a plurality of the multi-scale feature extraction units and obtain fused multi-scale features, and

wherein the secondary classification unit is further configured to secondary classify the sound based on the primary classification result of the sound and the fused multi-scale features to obtain the secondary classification result.

Supplementary note 5. the information processing apparatus according to supplementary note 4, wherein each of the multi-scale feature extraction units includes:

a plurality of feature extraction subunits, each configured to extract a feature of the sound based on the basic feature extracted by the basic feature extraction unit; and

a first feature fusion subunit configured to fuse the plurality of features of the sound extracted by the plurality of feature extraction subunits, output a fusion result as a multi-scale feature of the sound extracted by the corresponding multi-scale feature extraction unit, and output the fusion result to the next multi-scale feature extraction unit as an input of the next multi-scale feature extraction unit.

Supplementary note 6. the information processing apparatus according to supplementary note 5, wherein each multi-scale feature extraction unit further includes:

a plurality of global pooling sub-units, each global pooling sub-unit corresponding to a feature extraction sub-unit, and each global pooling sub-unit configured to globally pool features of the sound extracted by the feature extraction sub-unit corresponding to the global pooling sub-unit; and

a second feature fusion subunit configured to fuse a plurality of features of the sound pooled via the global pooling subunit, and output a fusion result as a multi-scale feature of the sound extracted by the corresponding multi-scale feature extraction unit.

Supplementary note 7. the information processing apparatus according to supplementary note 6, wherein each feature extraction subunit is a two-dimensional convolution unit, and

wherein each multi-scale feature extraction unit further comprises: a pre-processing subunit configured to process an input to reduce a dimensionality of the input.

Supplementary note 8. the information processing apparatus according to supplementary note 6, wherein the information processing apparatus includes three of the multi-scale feature extraction units, and each multi-scale feature extraction unit includes three of the feature extraction sub-units and three of the global pooling sub-units.

Supplementary note 9. the information processing apparatus according to any one of supplementary notes 1 to 3, wherein the basic feature extraction unit includes a convolutional neural network, and/or the secondary classification unit includes a recurrent neural network.

Supplementary note 10. the information processing apparatus according to supplementary note 6, wherein the first feature fusion subunit is further configured to fuse, in a concatenated manner, the plurality of features of the sound extracted by the plurality of feature extraction subunits, and

wherein the second feature fusion subunit is further configured to fuse, in a concatenated manner, a plurality of features of the sound pooled via the global pooling subunit.

Supplementary note 11. an information processing method comprising:

a basic feature extraction step of extracting a basic feature of the sound;

a multi-scale feature extraction step of extracting multi-scale features of the sound based on the basic features extracted by the basic feature extraction step;

a primary classification step of primary-classifying the sound based on the basic features extracted by the basic feature extraction step to obtain a primary classification result;

a secondary classification step of secondarily classifying the sound based on the primary classification result and the multi-scale features of the sound to obtain a secondary classification result; and

a classification result fusion step of fusing the primary classification result and the secondary classification result of the sound to obtain a final classification result of the sound.

Supplementary note 12. the information processing method according to supplementary note 11, wherein the preliminary classification result includes a probability that the sound belongs to each of a plurality of root classes,

wherein the secondary classification step includes a plurality of secondary classification sub-steps in one-to-one correspondence with the plurality of root classes, and

Wherein, for each of the plurality of secondary classification sub-steps, the secondary classification sub-step is performed in case the probability that the sound belongs to the root class to which it corresponds is equal to or greater than a predetermined threshold.

Supplementary note 13. the information processing method according to supplementary note 12, wherein each of the plurality of root categories includes a plurality of subcategories, and a total number of subcategories included in the plurality of root categories is N, where N is a natural number greater than 1,

wherein the secondary classification result comprises at least one N-dimensional probability vector obtained by performing one or more of the plurality of secondary classification sub-steps, each element of each N-dimensional probability vector representing a probability that the sound belongs to a respective sub-category and being preset to 0,

wherein, in each of the plurality of secondary classification sub-steps, a respective N-dimensional probability vector is obtained by performing an iterative process,

in the first round of processing:

obtaining a first probability vector based on the multi-scale features of the sound, wherein each element in the first probability vector represents the probability that the sound belongs to each sub-category included in the root category corresponding to the secondary classification sub-step;

selecting subcategories corresponding to the first m largest elements in the first probability vector as candidate subcategories;

generating m sub-category vectors based on the candidate sub-categories, wherein m is a natural number greater than 1; and

for each of the m sub-category vectors, calculating a score for the sub-category vector based on elements in the first probability vector that correspond to elements that the sub-category vector includes;

in the i ≧ 2 rounds of processing:

selecting the top n most-scored subcategory vectors of the subcategory vectors generated by the previous round of processing as candidate subcategory vectors, wherein n is a natural number greater than 1 and n is less than or equal to m; and

for each of the candidate subcategory vectors: obtaining an ith probability vector based on the candidate sub-category vector and the multi-scale features of the sound, wherein each element in the ith probability vector represents the probability that the sound belongs to each sub-category included in the root category corresponding to the secondary classification sub-step; selecting the subcategories corresponding to the first m largest elements in the ith probability vector as candidate subcategories; and adding each of the candidate subcategories to the candidate subcategory vector, respectively, to newly generate a subcategory vector, and calculating a score of the newly generated subcategory vector based on the score of the candidate subcategory vector and elements of the ith probability vector corresponding to the newly added candidate subcategory,

wherein the iterative process terminates if the score of each of the newly generated sub-category vectors is less than the score of each of the candidate sub-category vectors selected from the sub-category vectors generated from the previous round of processing, and wherein the respective N-dimensional probability vectors are obtained based on the probabilities corresponding to the respective elements included in the sub-category vector having the highest score among the sub-category vectors generated in the second-to-last round of processing.

Supplementary note 14. the information processing method according to any one of supplementary notes 11 to 13, wherein the information processing method further comprises a multi-scale feature fusion step of fusing the plurality of multi-scale features extracted by the multi-scale feature extraction step and obtaining fused multi-scale features, and

wherein, in the secondary classification step, the sound is secondarily classified based on the primary classification result of the sound and the fused multi-scale features to obtain the secondary classification result.

Supplementary note 15. the information processing method according to supplementary note 14, wherein the multi-scale feature extraction step includes:

a feature extraction sub-step of extracting features of the sound based on the basic features extracted by the basic feature extraction step; and

a first feature fusion sub-step of fusing the plurality of features of the sound extracted by the feature extraction sub-step and outputting a fusion result as a multi-scale feature of the sound.

Supplementary note 16. the information processing method according to supplementary note 15, wherein the multi-scale feature extraction step further includes:

a global pooling sub-step for globally pooling the features of the sound extracted by the feature extraction sub-step; and

a second feature fusion sub-step for fusing a plurality of features of the sound pooled via the global pooling sub-step and outputting the fusion result as a multi-scale feature of the sound.

Supplementary note 17. the information processing method according to supplementary note 16, wherein,

in the feature extraction sub-step, a feature of the sound is extracted via a two-dimensional convolution unit; and

wherein the multi-scale feature extraction step further comprises: a pre-processing sub-step for processing an input for the two-dimensional convolution unit to reduce the dimensionality of the input.

Supplementary note 18. the information processing method according to supplementary note 16, wherein, in the multi-scale feature extraction step, three multi-scale features of the sound are extracted, and wherein each of the three multi-scale features is obtained by fusing three features of the sound through global pooling.

Supplementary note 19. the information processing method according to supplementary note 16, wherein,

in the first feature fusion sub-step, a plurality of features of the sound extracted by the feature extraction sub-step are fused in a concatenated manner; and

wherein, in the second feature fusion sub-step, a plurality of features of the sound pooled via the global pooling sub-step are fused in a concatenated manner.

Supplementary note 20. a computer-readable storage medium storing program instructions which, when executed by a computer, cause the computer to perform the method according to any one of supplementary notes 11 to 19.
