Signal processing apparatus, method and program

Document No.: 1836311 | Publication date: 2021-11-12

Abstract: This technology, "Signal processing apparatus, method and program", was created by Naoya Takahashi (高桥直也) on 2020-03-13. The present technology relates to a signal processing device, method, and program that facilitate sound source separation. The signal processing apparatus includes a sound source separation unit that recursively performs sound source separation on an input acoustic signal according to a predetermined sound source separation model learned in advance to separate a predetermined sound source from an acoustic signal for learning including the predetermined sound source. The present technique is applicable to a signal processing apparatus.

1. A signal processing apparatus comprising:

a sound source separation unit that recursively performs sound source separation on an input acoustic signal by using a predetermined sound source separation model learned in advance to separate a predetermined sound source from an acoustic signal for learning including the predetermined sound source.

2. The signal processing apparatus according to claim 1, wherein

the sound source separation unit performs the sound source separation to separate a separation signal of an utterance of a speaker from the acoustic signal.

3. The signal processing apparatus according to claim 2, wherein

the sound source separation unit performs the sound source separation on the acoustic signal of which the number of speakers is unknown.

4. The signal processing apparatus according to claim 2, wherein

the sound source separation model is a speaker separation model that is learned to separate the acoustic signal for learning including utterances of two speakers into a separation signal including an utterance of one speaker and a separation signal including an utterance of the other speaker.

5. The signal processing apparatus according to claim 2, wherein

the sound source separation model is a speaker separation model that is learned to separate the acoustic signal for learning including utterances of three speakers into three separated signals, each including the utterance of a corresponding one of the three speakers.

6. The signal processing apparatus according to claim 2, wherein

the sound source separation model is a speaker separation model that is learned to separate the acoustic signal for learning including utterances of an arbitrary plurality of speakers into a separation signal including an utterance of one speaker and a separation signal including utterances of remaining speakers other than the one speaker among the plurality of speakers.

7. The signal processing apparatus according to claim 2, wherein

the sound source separation unit recursively performs the sound source separation by using a plurality of sound source separation models different from each other as the predetermined sound source separation model.

8. The signal processing apparatus according to claim 2, further comprising:

an end determination unit that determines whether to end recursive sound source separation based on the separation signal obtained by the sound source separation.

9. The signal processing apparatus according to claim 8, wherein

the end determination unit determines to end the recursive sound source separation in a case where one of the separated signals obtained by the sound source separation is an unvoiced signal.

10. The signal processing apparatus according to claim 8, wherein

in a case where it is determined, based on a single speaker determination model for determining whether the number of speakers of utterances included in a separated signal is 1 and on the separated signal, that the number of speakers of utterances included in the separated signal obtained by the sound source separation is 1, the end determination unit determines that the recursive sound source separation is to be ended.

11. The signal processing apparatus according to claim 2, further comprising:

a same-speaker determining unit that performs same-speaker determination as to whether or not a plurality of separated signals obtained by recursive sound source separation are signals of the same speaker, and synthesizes a separated signal from the plurality of separated signals of the same speaker.

12. The signal processing apparatus according to claim 11, wherein

the same speaker determination unit performs the same speaker determination by clustering the separated signals.

13. The signal processing apparatus according to claim 12, wherein

the same speaker determination unit calculates feature values of the separated signals, and determines that two separated signals are signals of the same speaker in a case where a distance between the feature values of the two separated signals is equal to or smaller than a threshold value.

14. The signal processing apparatus according to claim 12, wherein

the same speaker determination unit performs the same speaker determination based on a correlation between temporal energy variations of two separated signals.

15. The signal processing apparatus according to claim 11, wherein

the same speaker determination unit performs the same speaker determination based on language information of the plurality of separated signals.

16. The signal processing apparatus according to claim 11, wherein

the same speaker determination unit performs the same speaker determination based on a same speaker determination model used to determine whether two separated signals are signals of the same speaker.

17. A signal processing method, comprising:

recursively performing, by a signal processing device, sound source separation on an input acoustic signal by using a predetermined sound source separation model learned in advance to separate a predetermined sound source from an acoustic signal for learning including the predetermined sound source.

18. A program for causing a computer to execute a process comprising the steps of:

recursively performing sound source separation on an input acoustic signal by using a predetermined sound source separation model learned in advance to separate a predetermined sound source from an acoustic signal for learning including the predetermined sound source.

Technical Field

The present technology relates to a signal processing device, a signal processing method, and a program, and more particularly, to a signal processing device, a signal processing method, and a program that allow easier sound source separation.

Background

For example, there are many cases where it is desired to separately process the simultaneous utterances of a plurality of speakers, for purposes such as speech recognition (see, for example, patent document 1), subtitling, and clarifying the speech of each of a plurality of speakers.

As a sound source separation technique for separating an acoustic signal of a mixed speech including utterances of a plurality of speakers into an acoustic signal of each speaker, a technique using directional information (for example, see patent document 2) and a technique assuming independence of a sound source have conventionally been proposed.

However, these techniques are difficult to implement with a single microphone and have difficulty coping with cases where sounds from a plurality of sound sources arrive from the same direction.

Therefore, as techniques for separating voices uttered simultaneously even in such cases, deep clustering (see, for example, non-patent document 1) and permutation invariant training (see, for example, non-patent document 2) are known.

Reference list

Patent document

Patent document 1: Japanese Unexamined Patent Application Publication (Translation of PCT Application) No. 2017-515140.

Patent document 2: Japanese Patent Application Publication No. 2010-112995.

Non-patent document

Non-patent document 1: J. R. Hershey, Z. Chen, and J. Le Roux, "Deep clustering: Discriminative embeddings for segmentation and separation."

Non-patent document 2: M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901-1913, 2017.

Disclosure of Invention

Problems to be solved by the invention

However, with the above-described techniques, it is not easy to separate the utterance of each speaker from a mixed speech in which the number of speakers is unknown.

For example, in deep clustering and permutation-invariant training, it is assumed that the number of speakers speaking simultaneously is known.

However, in general, there are many cases where the number of speakers is unknown. In such cases, these techniques additionally require a model for estimating the number of speakers, and it is necessary to prepare a sound source separation model (separation algorithm) for each possible number of speakers and to switch between the algorithms.

Therefore, when these techniques are used to separate a mixed speech in which the number of speakers is unknown into utterances of each speaker, development time increases, and the amount of memory for retaining a sound source separation model increases. Furthermore, in the case where the estimation of the number of speakers is not performed correctly, the performance deteriorates significantly.

The present technology is proposed in view of such a situation, and allows easier sound source separation.

Solution to the problem

One aspect of the present technology provides a signal processing apparatus including: a sound source separation unit recursively performing sound source separation on an input acoustic signal by using a predetermined sound source separation model learned in advance to separate a predetermined sound source from an acoustic signal for learning including the predetermined sound source.

One aspect of the present technology provides a signal processing method or program including the steps of: sound source separation is recursively performed on an input acoustic signal by using a predetermined sound source separation model learned in advance to separate a predetermined sound source from an acoustic signal for learning including the predetermined sound source.

In one aspect of the present technology, sound source separation is recursively performed on an input acoustic signal by using a predetermined sound source separation model learned in advance to separate a predetermined sound source from an acoustic signal for learning including the predetermined sound source.

Drawings

Fig. 1 is a diagram illustrating recursive sound source separation.

Fig. 2 is a diagram showing a configuration example of the signal processing apparatus.

Fig. 3 is a flowchart showing the sound source separation process.

Fig. 4 is a diagram illustrating recursive sound source separation.

Fig. 5 is a diagram showing a configuration example of the signal processing apparatus.

Fig. 6 is a flowchart illustrating the sound source separation process.

Fig. 7 is a diagram showing a configuration example of a computer.

Detailed Description

Embodiments to which the present technology is applied will be described below with reference to the drawings.

< first embodiment >

< present technology >

First, an outline of the present technology will be described. Here, an example will be described in which the utterance (voice) of each speaker is separated, by using a single sound source separation model, from an input acoustic signal obtained by collecting, with one or more microphones, mixed voices uttered by a plurality of speakers at the same time or at different timings.

Specifically, here, the number of speakers included in the mixed speech based on the input acoustic signal is unknown. The present technology recursively performs sound source separation on the input acoustic signal by using a single sound source separation model, so that the utterance (voice) of each of an unknown number of unspecified speakers can be separated more easily from the input acoustic signal.

Note that, in the example described herein, the sound of the sound source is the speech of the speaker, but the sound is not limited thereto, and may be any sound, such as an animal cry or a musical instrument sound.

The sound source separation model used in the present technology is a model, such as a neural network, that is learned to separate input speech on a speaker-by-speaker basis. That is, the sound source separation model has been learned in advance to separate the acoustic signal of the utterance of a speaker from an acoustic signal for learning including a mixed speech of the utterances of speakers as sound sources.

The sound source separation model performs calculations using arithmetic coefficients according to a predetermined sound source separation algorithm to separate an input acoustic signal into an acoustic signal (hereinafter, also referred to as a separated signal) for each sound source (speaker), and is thus implemented by a sound source separation algorithm and its arithmetic coefficients.

In the present technology, sound source separation using the sound source separation model is performed on an input acoustic signal of a mixed voice in which the number of speakers may be either unknown or known.

Then, based on the obtained separation signal, it is determined whether a predetermined end condition is satisfied. Sound source separation is recursively performed on the separated signals using the same sound source separation model until it is determined that the end condition is satisfied, and finally separated signals for each sound source (speaker) are obtained.

Here, as a specific example, a case will be described where a two-speaker separation model is used as the sound source separation model, the two-speaker separation model being learned to separate an acoustic signal for learning including utterances of two speakers as sound sources into a separated signal including the utterance of one speaker and a separated signal including the utterance of the other speaker.

Such a sound source separation model can be obtained by learning using a learning technique such as deep clustering or permutation invariant training.

When an input acoustic signal of a mixed voice of two speakers is input to the two-speaker separation model, the model is expected to output a separated signal of the utterance (voice) of each speaker as the sound source separation result.

Further, when an input acoustic signal of the speech of one speaker is input to the two-speaker separation model, the model is expected to output a separated signal of the utterance of the one speaker and an unvoiced separated signal as the sound source separation result.

On the other hand, in the case where the input acoustic signal is a signal of a mixed voice of three or more speakers, such a mixed voice is an input that did not appear when the two-speaker separation model was learned.

In this case, in response to the input of a mixed speech of three speakers, sound source separation is performed such that the utterances (speech) of two speakers are included in one separated signal, for example, as shown in fig. 1.

In the example shown in fig. 1, the mixed speech based on the input acoustic signal includes utterances of three speakers PS1 to PS3.

As a result of sound source separation, i.e., speaker separation, of such an input acoustic signal using the two-speaker separation model, as indicated by arrow Q11, the mixed speech is separated such that one separated signal includes only the utterance of speaker PS1, while the other separated signal includes only the utterances of speaker PS2 and speaker PS3.

Further, as a result of further sound source separation of the separated signal including only the utterance of speaker PS1 using the two-speaker separation model, as shown by arrow Q12, for example, the speech is separated such that one separated signal includes only the utterance of speaker PS1 and the other separated signal is an unvoiced signal.

In a similar manner, as a result of further sound source separation of the separated signal including only the utterances of speakers PS2 and PS3 using the two-speaker separation model, as indicated by arrow Q13, for example, the mixed speech is separated such that one separated signal includes only the utterance of speaker PS2, and the other separated signal includes only the utterance of speaker PS3.

In this way, when sound source separation is recursively performed on the input acoustic signal by using the same two-speaker separation model, separated signals are obtained, each of which includes only the utterance of a corresponding one of the speakers PS1 to PS3.

In this example, when the first sound source separation indicated by arrow Q11 is performed, each obtained separated signal includes the utterances of at most two speakers. In most cases, the input acoustic signal is not separated into a separated signal containing the utterances of all three speakers and an unvoiced separated signal.

Therefore, once the first sound source separation has been performed, all the separated signals are speech that the two-speaker separation model can handle, that is, a separated signal of each speaker can be obtained with the two-speaker separation model. Then, as shown by arrows Q12 and Q13, recursive sound source separation is performed on these separated signals, so that a separated signal for each speaker can be obtained.

Note that even in the case where the input acoustic signal is a mixed speech of utterances of four or more speakers, the number of times of sound source separation that is recursively performed can be increased, so that a separated signal of each speaker can be finally obtained.

Further, in the case where sound source separation is recursively performed to separate an input acoustic signal into a separated signal for each speaker (to extract the separated signals), an end condition for ending the recursive sound source separation is required when the number of speakers in the mixed speech of the input acoustic signal is unknown.

The end condition is a condition that is satisfied when the separated signal obtained by sound source separation includes only utterances of one speaker, in other words, a condition that is satisfied when the separated signal does not include utterances of two or more speakers.

Here, as an example, in the case where one separated signal obtained by sound source separation is an unvoiced signal, in more detail, in the case where the average level (energy) of one separated signal is equal to or less than a predetermined threshold value, it is determined that the end condition is satisfied, that is, that a separated signal of each speaker has been obtained.
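As a concrete illustration of this recursion and end condition, the following is a minimal Python sketch; the `separate_two` wrapper around the two-speaker separation model and the threshold value are assumptions made for illustration, not details given by the present technology.

```python
import numpy as np

SILENCE_THRESHOLD = 1e-4  # assumed threshold on average energy for "unvoiced"

def is_silent(signal: np.ndarray) -> bool:
    # End condition: the average level (energy) of a separated signal is
    # equal to or less than a predetermined threshold.
    return float(np.mean(signal ** 2)) <= SILENCE_THRESHOLD

def recursive_separation(signal: np.ndarray, separate_two) -> list:
    """Recursively apply the two-speaker separation model until one output
    of each separation is unvoiced, i.e., each pair satisfies the end
    condition; returns one separated signal per speaker."""
    out_a, out_b = separate_two(signal)  # one pass of the separation model
    if is_silent(out_b):
        return [out_a]  # `signal` contained the utterance of a single speaker
    if is_silent(out_a):
        return [out_b]
    # Either output may still mix several speakers, so recurse on both.
    return (recursive_separation(out_a, separate_two)
            + recursive_separation(out_b, separate_two))
```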

According to the present technology as described above, even in a case where the number of speakers in the input acoustic signal is unknown, sound source separation can be easily performed without requiring a model for estimating the number of speakers, a sound source separation model for each possible number of speakers, direction information indicating the direction of a sound source, and the like, and a separated signal for each sound source (speaker) can be obtained.

Therefore, the present technology significantly suppresses an increase in time for developing a sound source separation model or the like and an increase in storage amount for holding the sound source separation model.

That is, in the present technology, a separated signal for each speaker can be obtained by one sound source separation model regardless of the number of speakers in the input acoustic signal, and it is possible to simplify the system, reduce the necessary amount of memory, consolidate the development of sound source separation models, and the like.

Further, in the present technology, sound source separation is recursively performed, so that the problem (task) to be solved by each sound source separation can be simplified, and thus separation performance can be improved.

Note that an example of using a two-speaker separation model as the sound source separation model has been described here. However, this is not restrictive, and recursive sound source separation may be performed by a multi-speaker separation model (such as a three-speaker separation model) that separates an input acoustic signal into a separated signal of each of three or more speakers.

For example, the three-speaker separation model is a speaker separation model that is learned to separate an acoustic signal for learning including utterances of three speakers as sound sources into three separated signals, each including the utterance of a corresponding one of the three speakers, that is, a separated signal of each of the three speakers.

< example of configuration of Signal processing apparatus >

Next, a signal processing apparatus to which the present technique is applied will be described.

For example, a signal processing apparatus to which the present technology is applied is configured as shown in fig. 2.

The signal processing apparatus 11 shown in fig. 2 has a sound source separating unit 21 and an end determining unit 22.

The sound source separating unit 21 receives an input acoustic signal from the outside. Further, the sound source separation unit 21 retains a sound source separation model obtained in advance by learning.

Note that, in this embodiment, a description will be given assuming that the input acoustic signal is an acoustic signal of a mixed speech in which the number of speakers (particularly, the number of speakers who speak simultaneously) is unknown. Further, here, the sound source separation model retained by the sound source separation unit 21 is a two-speaker separation model.

According to the result of the end determination supplied from the end determination unit 22, the sound source separation unit 21 recursively performs sound source separation on the supplied input acoustic signals based on the retained sound source separation model to obtain separated signals, and supplies the resulting separated signals to the end determination unit 22.

The end determination unit 22 performs end determination based on the separation signal supplied from the sound source separation unit 21 to determine whether or not to end the recursive sound source separation, that is, whether or not an end condition is satisfied, and supplies the determination result to the sound source separation unit 21.

Further, if it is determined that the end condition is satisfied, the end determination unit 22 outputs the separated signal obtained by the sound source separation to the subsequent stage as an acoustic signal of the utterance of each speaker.

< description of Sound Source separation Process >

Next, the sound source separation process performed by the signal processing device 11 will be described with reference to a flowchart in fig. 3.

In step S11, the sound source separation unit 21 performs sound source separation on the supplied input acoustic signal based on the retained sound source separation model to obtain a separated signal, and supplies the resulting separated signal to the end determination unit 22.

Specifically, the sound source separation unit 21 performs arithmetic processing according to the sound source separation algorithm corresponding to the sound source separation model, based on the arithmetic coefficients constituting the sound source separation model and the input acoustic signal, and obtains two separated signals as the outputs of the sound source separation model.

In step S12, based on the separation signal supplied from the sound source separation unit 21, the end determination unit 22 performs end determination on each pair (group) of two separation signals obtained by one sound source separation, and determines whether all pairs satisfy an end condition.

Specifically, for example, for a pair, if the average level of one of the two separated signals constituting the pair is equal to or less than a predetermined threshold, the end determination unit 22 determines that the pair satisfies the end condition.

If it is determined in step S12 that no pair satisfies the end condition, the end determining unit 22 supplies information indicating the pair that does not satisfy the end condition to the sound source separating unit 21 as a result of the end determination, and then the process proceeds to step S13.

In step S13, based on the result of the end determination supplied from the end determination unit 22, the sound source separation unit 21 performs sound source separation on each of the separated signals constituting the pair that does not satisfy the end condition using a sound source separation model to obtain a separated signal, and supplies the resultant separated signal to the end determination unit 22.

For example, in step S13, the same sound source separation model as that used in step S11 is used for sound source separation.

Note that sound source separation may be recursively performed using a plurality of sound source separation models different from each other. For example, a three-speaker separation model may be used for the sound source separation in step S11, and a two-speaker separation model may be used for the sound source separation in step S13.

After recursive sound source separation is performed in the process of step S13, the process returns to step S12, and the above-described process is repeated until it is determined that all pairs satisfy the end condition.

For example, in the example shown in fig. 1, since one separated signal is an unvoiced signal in the sound source separation shown by arrow Q12, the pair of separated signals obtained as a result of the sound source separation shown by arrow Q12 satisfies the end condition.

On the other hand, since no unvoiced separated signal is obtained by the sound source separation shown by arrow Q13 in fig. 1, it is determined that the end condition is not satisfied, and recursive sound source separation is performed in step S13 on each of the two separated signals obtained by the sound source separation shown by arrow Q13.

Further, if it is determined in step S12 of fig. 3 that all pairs satisfy the end condition, the input acoustic signal has been separated into a separated signal for each speaker, and thus the process proceeds to step S14.

In step S14, the end determination unit 22 outputs the separated signal for each speaker obtained by the sound source separation that has been performed to the subsequent stage, and the sound source separation process ends.

As described above, the signal processing apparatus 11 recursively performs sound source separation on the input acoustic signal until the end condition is satisfied, and obtains a separated signal for each speaker. In this way, sound source separation can be performed more easily and with sufficient separation performance.

< second embodiment >

< Synthesis based on the separation result >

Meanwhile, in the case where sound source separation is recursively performed on an input acoustic signal by using a speaker separation model as the sound source separation model, the utterances of a certain speaker may be dispersed into different separation results, i.e., different separated signals.

Specifically, for example, as shown in fig. 1, assume a case in which sound source separation is performed on an input acoustic signal of a mixed speech including utterances of speakers PS1 to PS3 by using the two-speaker separation model.

In this case, for example, as a result of the sound source separation shown by arrow Q11 in fig. 1, the utterance of a certain speaker may not appear in only one separated signal but may instead be dispersed across the two separated signals, as shown in fig. 4. Note that in fig. 4, the same reference numerals are given to portions corresponding to the case of fig. 1, and the description thereof will be omitted as appropriate.

In the example shown in fig. 4, sound source separation (speaker separation) is recursively performed on an input acoustic signal of a mixed speech including utterances of speakers PS1 to PS3 by using the two-speaker separation model.

Here, first, as indicated by an arrow Q21, sound source separation is performed on the input acoustic signal.

Thus, a separated signal including the utterance of speaker PS1 and a part of the utterance of speaker PS2 and a separated signal including the utterance of speaker PS3 and a part of the utterance of speaker PS2 are obtained.

That is, although the utterances of speaker PS1 and speaker PS3 each appear in only one separated signal, the utterance of speaker PS2 is dispersed across the two separated signals.

Here, recursive sound source separation using the two-speaker separation model is performed, as shown by arrow Q22, on the separated signal including the utterance of speaker PS1 and a part of the utterance of speaker PS2 obtained as a result of the sound source separation shown by arrow Q21, thereby obtaining a separated signal for each speaker.

That is, in this example, as a result of sound source separation shown by arrow Q22, a separated signal including only the utterance of speaker PS1 and a separated signal including only a part of the utterance of speaker PS2 are obtained.

In a similar manner, the separated signal including the utterance of speaker PS3 and a part of the utterance of speaker PS2 obtained as a result of the sound source separation shown by arrow Q21 is subjected to recursive sound source separation using the two-speaker separation model, as shown by arrow Q23, thereby obtaining a separated signal for each speaker.

That is, in this example, as a result of sound source separation shown by arrow Q23, a separated signal including only the utterance of speaker PS3 and a separated signal including only a part of the utterance of speaker PS2 are obtained.

Even in such an example, each of the resulting separated signals includes only one speaker's utterance. Here, however, the utterance of speaker PS2 is dispersed across two separated signals.

Therefore, separated voices (utterances) of the same speaker that are dispersed across a plurality of separated signals can be combined into one separated signal of that speaker.

In this case, a speaker recognition model that receives a separated signal as an input and outputs a speaker recognition result may be used.

Specifically, for example, a neural network or the like that recognizes an arbitrary plurality of speakers is learned in advance as the speaker recognition model. Here, as long as the number of speakers used when learning the speaker recognition model is large, those speakers do not necessarily need to include the speakers that are the actual targets of sound source separation.

In this way, a speaker recognition model is prepared, and then the speaker recognition model is used to cluster the separated signals obtained by sound source separation (i.e., the speakers corresponding to the separated signals).

At the time of clustering, each separated signal is input to a speaker recognition model, and speaker recognition is performed.

At this time, the output of the speaker recognition model (i.e., the result of speaker recognition) or the activation (output) of an intermediate layer of the speaker recognition model (i.e., an intermediate result of the arithmetic processing for obtaining the speaker recognition result) is obtained as a feature value (speaker embedding) representing the speaker corresponding to the input separated signal.

Note that in calculating the feature value representing the speaker, the unvoiced portion of the separated signal may be ignored in the calculation.

When the feature value of each of the separated signals (separated voices) has been obtained, the distances between the feature values are obtained. Separated signals whose feature values are at a distance equal to or smaller than a threshold value from each other are determined to be separated signals of the same speaker.

Further, as a result of the clustering, one separated signal is synthesized from the plurality of separated signals determined to belong to the same speaker and is obtained as the final separated signal of that speaker.

Thus, for example, in the example of fig. 4, it is assumed that a separated signal including only a part of the utterance of the speaker PS2 obtained by sound source separation shown by arrow Q22 and a separated signal including only a part of the utterance of the speaker PS2 obtained by sound source separation shown by arrow Q23 belong to the same speaker.

Then, the separated signals are added so that one separated signal is synthesized, and the resultant signal is output as the final separated signal including the utterance of speaker PS2.
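The clustering and synthesis described above can be sketched as follows; the `embed` function, which would return a speaker embedding (e.g., an intermediate-layer activation of the speaker recognition model), and the distance threshold are assumptions, and all separated signals are assumed to be time-aligned arrays of equal length.

```python
import numpy as np

DISTANCE_THRESHOLD = 0.5  # assumed threshold on embedding distance

def merge_same_speaker(signals, embed):
    """Determine same-speaker signals by embedding distance and sum each
    cluster into one final separated signal."""
    embeddings = [embed(s) for s in signals]  # one embedding per signal
    merged, used = [], [False] * len(signals)
    for i in range(len(signals)):
        if used[i]:
            continue
        cluster, used[i] = signals[i].copy(), True
        for j in range(i + 1, len(signals)):
            if used[j]:
                continue
            # Separated signals whose feature-value distance is at most the
            # threshold are treated as signals of the same speaker.
            if np.linalg.norm(embeddings[i] - embeddings[j]) <= DISTANCE_THRESHOLD:
                cluster += signals[j]  # add the same speaker's signals
                used[j] = True
        merged.append(cluster)
    return merged
```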

< example of configuration of Signal processing apparatus >

In the case where clustering of separated signals obtained by sound source separation is performed as described above, the signal processing apparatus is configured as shown in fig. 5, for example. Note that in fig. 5, the same reference numerals are given to portions corresponding to the case of fig. 2, and the description thereof will be omitted as appropriate.

The signal processing apparatus 51 shown in fig. 5 has a sound source separating unit 21, an end determining unit 22, and an identical speaker determining unit 61.

The configuration of the signal processing apparatus 51 differs from that of the signal processing apparatus 11 in that the same speaker determining unit 61 is newly provided, but is otherwise the same as that of the signal processing apparatus 11.

The same speaker determining unit 61 performs same speaker determination to determine whether or not a plurality of separated signals obtained by recursive sound source separation are signals of the same speaker, and then synthesizes and generates a final separated signal of the speaker from the plurality of separated signals of the same speaker according to the result of the determination.

More specifically, the same speaker determining unit 61 retains speaker recognition models obtained in advance by learning, and performs clustering based on the retained speaker recognition models and the separated signal of each speaker supplied from the end determining unit 22. That is, the same speaker determining unit 61 performs the same speaker determination by performing clustering.

Further, the same speaker determining unit 61 performs clustering to synthesize a final separated signal of speakers from separated signals determined to belong to the same speaker, and outputs a finally obtained separated signal of each speaker to a subsequent stage.

< description of Sound Source separation Process >

Next, the sound source separation process performed by the signal processing device 51 will be described with reference to a flowchart in fig. 6.

Note that the processing of steps S41 to S43 is similar to that of steps S11 to S13 in fig. 3, and the description thereof will be omitted.

When recursive sound source separation is performed and a separation signal for each speaker is obtained in steps S41 to S43, the separation signal is supplied from the end determination unit 22 to the same speaker determination unit 61, and then the process proceeds to step S44. That is, if it is determined in step S42 that all pairs satisfy the end condition, the process proceeds to step S44.

In step S44, the same speaker determination unit 61 calculates a feature value representing the speaker for each separated signal based on the retained speaker recognition model and the separated signal supplied from the end determination unit 22.

That is, the same speaker determination unit 61 calculates a feature value representing a speaker for each separated signal by performing a calculation using a speaker recognition model with the separated signal as an input.

In step S45, the same-speaker determining unit 61 determines, based on the feature values obtained in step S44, whether there are separated signals of the same speaker. That is, the same speaker determination is performed.

For example, for any two separated signals among all the separated signals, the same speaker determination unit 61 obtains the distance between the feature values of the two separated signals. If the distance is equal to or less than a predetermined threshold, it is determined that the two separated signals are signals of the same speaker.

The same speaker determination unit 61 makes this determination for all possible combinations of two separated signals among all the separated signals.

Then, if determination results indicating that the signals do not belong to the same speaker are obtained for all combinations, the same speaker determining unit 61 determines in step S45 that there is no separated signal of the same speaker.

The same speaker determining unit 61 performs the processing of step S44 and step S45 described above as clustering processing.

If it is determined in step S45 that there is a separated signal of the same speaker, the same speaker determining unit 61 synthesizes a final separated signal of the speaker from the plurality of separated signals determined to belong to the same speaker in step S46.

After synthesizing and obtaining the final separated signal for each speaker from the separated signals for the same speaker, the process proceeds to step S47.

On the other hand, if it is determined in step S45 that there is no separated signal of the same speaker, a separated signal of each speaker has already been obtained, so the process of step S46 is skipped, and the process proceeds to step S47.

If it is determined in step S45 that the separated signal of the same speaker does not exist, or if the process of step S46 is performed, in step S47, the same speaker determining unit 61 outputs the finally obtained separated signal of each speaker to the subsequent stage, and the sound source separating process ends.

As described above, the signal processing apparatus 51 recursively performs sound source separation on the input acoustic signals until the end condition is satisfied, and performs clustering of the separated signals to perform synthesis from the separated signals of the same speaker, and obtains a final separated signal for each speaker.

In this way, sound source separation can be performed more easily and with sufficient separation performance. Specifically, the signal processing device 51 performs synthesis from the separated signals of the same speaker, and this further improves the separation performance as compared with the case of the signal processing device 11.

< third embodiment >

< One-to-many speaker separation model >

Meanwhile, in the above, an example has been described in which sound source separation is performed by using an m-speaker separation model that is learned to separate an acoustic signal of a mixed voice including utterances of m (where m ≧ 2) speakers into m separated signals, one for each speaker.

In particular, when such sound source separation is performed, the utterances of a certain speaker may appear in a dispersed manner in a plurality of separated signals. Therefore, in the second embodiment, an example of performing clustering and appropriately synthesizing separated signals has been described.

However, not only such speaker separation models but also other speaker separation models, such as a speaker separation model obtained by performing learning on an indefinite number of speakers (hereinafter, also referred to as a one-to-many speaker separation model), can be used for sound source separation.

The one-to-many speaker separation model is a speaker separation model (such as a neural network) that is learned to separate an acoustic signal for learning of a mixed speech of an arbitrary, unknown number of speakers into a separated signal including only the utterance (speech) of a predetermined one of the speakers and a separated signal including the utterances of the remaining speakers, other than the predetermined one, included in the mixed speech.

Here, a separation result of sound source separation using the one-to-many speaker separation model (i.e., an output of the one-to-many speaker separation model) is also referred to as a head.

Specifically, here, the side that outputs the separated signal including the utterance of one speaker is also referred to as head 1, and the side that outputs the separated signal including the utterances of the remaining speakers is also referred to as head 2. Further, in the case where it is not particularly necessary to distinguish head 1 from head 2, they are simply referred to as heads.

In learning the one-to-many speaker separation model, learning is performed such that the loss function L is minimized by using acoustic signals for learning with m speakers while randomly changing the number m of speakers of the acoustic signals for learning.

At this time, the number m of speakers is set equal to or smaller than a maximum number M of speakers. Further, the one-to-many speaker separation model is learned such that the separated signal of the utterance of only the one speaker, among the m speakers included in the mixed speech of the acoustic signal for learning, that yields the smallest loss is the output of head 1, and the separated signal including the utterances of the remaining (m − 1) speakers is always the output of head 2.

The loss function L in learning the one-to-many speaker separation model is expressed by, for example, the following equation (1).

$L = \sum_{j} \min_{i} \left( L_{i}^{1j} + L_{i}^{2j} \right)$   (1)

Note that in equation (1), j is an index indicating an acoustic signal used for learning (i.e., a mixed voice used for learning), and i is an index indicating a speaker of an utterance included in the j-th mixed voice.

Further, in equation (1), $L_{i}^{1j}$ is the loss function obtained by comparing the output $s'_{1}(x_{j})$ of head 1 when sound source separation is performed on the acoustic signal for learning $x_{j}$ of the j-th mixed speech with the acoustic signal $s_{i}^{j}$ of the utterance of the i-th speaker. The loss function $L_{i}^{1j}$ can be defined, for example, by the squared error shown in equation (2) below.

$L_{i}^{1j} = \left\| s'_{1}(x_{j}) - s_{i}^{j} \right\|^{2}$   (2)

Further, $L_{i}^{2j}$ in equation (1) is the loss function obtained by comparing the output $s'_{2}(x_{j})$ of head 2 when sound source separation is performed on the acoustic signal for learning $x_{j}$ with the sum of the acoustic signals $s_{k}^{j}$ of the remaining speakers k other than the i-th speaker. The loss function $L_{i}^{2j}$ can be defined, for example, by the squared error shown in equation (3) below.

$L_{i}^{2j} = \left\| s'_{2}(x_{j}) - \sum_{k \neq i} s_{k}^{j} \right\|^{2}$   (3)

In the one-to-many speaker separation model obtained by learning as described above, it is desirable that a separated signal of the utterance of only one speaker is always obtained as the output of head 1 and a separated signal of the utterances of the remaining speakers is obtained as the output of head 2.
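As a rough illustration of equations (1) to (3), the following PyTorch sketch computes the bracketed term of equation (1) for a single training mixture j; the function name and tensor shapes are assumptions made for illustration, not part of the present technology.

```python
import torch

def one_vs_rest_loss(head1, head2, sources):
    """head1, head2: (T,) head outputs s'_1(x_j) and s'_2(x_j);
    sources: (m, T) reference signals s_i^j of the m speakers.
    Returns min_i (L_i^1j + L_i^2j); equation (1) sums this over j."""
    total = sources.sum(dim=0)  # sum of all speakers' reference signals
    losses = []
    for i in range(sources.shape[0]):
        l1 = torch.sum((head1 - sources[i]) ** 2)            # equation (2)
        l2 = torch.sum((head2 - (total - sources[i])) ** 2)  # equation (3)
        losses.append(l1 + l2)
    # The speaker with the smallest loss is assigned to head 1.
    return torch.stack(losses).min()
```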

Thus, for example, in a manner similar to the example shown in fig. 1, it can be expected that separated signals each including only the utterance of one speaker are sequentially extracted simply by recursively performing sound source separation on the input acoustic signal using the one-to-many speaker separation model.

In the case where the one-to-many speaker separation model is used in this manner, for example, the sound source separation unit 21 of the signal processing apparatus 11 retains the one-to-many speaker separation model obtained in advance by learning as the sound source separation model. Then, the signal processing device 11 performs the sound source separation process described with reference to fig. 3 to obtain a separated signal for each speaker.

In this case, however, in step S11 or step S13, the sound source separating unit 21 performs sound source separation based on the one-to-many speaker separation model. At this time, since the output of head 1 is a separated signal of the utterance of one speaker, sound source separation is recursively performed on the output (separated signal) of head 2 using the one-to-many speaker separation model.

Further, in step S12, in a case where the average level of the output (separated signal) of head 2 of the most recently performed sound source separation is equal to or less than the predetermined threshold, it is determined that the end condition is satisfied, and the process proceeds to step S14.
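The resulting procedure, which peels off one speaker per pass and recurses only on head 2, might look like the following sketch; `separate_one_vs_rest` is an assumed wrapper around the one-to-many speaker separation model, and the threshold is illustrative.

```python
import numpy as np

def peel_speakers(signal, separate_one_vs_rest, threshold=1e-4):
    """Extract one speaker per pass with the one-to-many model: head 1 is
    kept as a per-speaker signal, head 2 is fed back until unvoiced."""
    speakers = []
    residual = signal
    while True:
        head1, head2 = separate_one_vs_rest(residual)
        speakers.append(head1)  # separated signal of one speaker
        if float(np.mean(head2 ** 2)) <= threshold:
            break  # head 2 is unvoiced: the end condition is satisfied
        residual = head2  # recurse on the remaining speakers
    return speakers
```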

Note that an example of using a one-to-many speaker separation model in which two outputs of two heads, i.e., the head 1 and the head 2, are obtained by using one input acoustic signal as an input has been described here.

However, this is not restrictive. For example, sound source separation may be performed by using a one-to-many speaker separation model that can obtain outputs of three heads.

In this case, for example, learning is performed such that, among heads 1 to 3, the outputs of heads 1 and 2 are separated signals each including only the utterance of one speaker, and the output of head 3 is a separated signal including the utterances of the remaining speakers.

< fourth embodiment >

< combination of one-to-many speaker separation model and clustering >

Further, even in the case where a one-to-many speaker separation model is used as the sound source separation model, it is not necessarily possible to completely separate sound sources, i.e., utterances of each speaker. That is, for example, an utterance of a speaker that should be output to the head 1 may slightly leak into the output of the head 2.

Therefore, in this case, as described with reference to fig. 4, utterances of the same speaker are dispersed in a plurality of separated signals obtained by recursive sound source separation. However, in this case, the utterance of the speaker included in one separated signal is a slightly leaked component, and has a much lower volume than the volume of the utterance of the speaker included in the other separated signal.

Therefore, even in the case where the one-to-many speaker separation model is used as the sound source separation model, clustering can be performed in a manner similar to the second embodiment.

In this case, for example, the sound source separating unit 21 of the signal processing device 51 retains the one-to-many speaker separation model obtained in advance by learning as the sound source separation model.

Then, the signal processing device 51 performs the sound source separation process described with reference to fig. 6 to obtain a separated signal for each speaker.

In this case, however, as in the case of the third embodiment, in step S41 and step S43, the sound source separating unit 21 performs sound source separation based on the one-to-many speaker separation model.

Further, in step S44, the output of the above-described speaker recognition model or the like is calculated as a feature value representing the speaker, and if the distance between the feature values of the two separated signals is equal to or smaller than a threshold value, it is determined that the two separated signals belong to the same speaker.

In addition, for example, in a case where a temporal energy variation of the separated signals is obtained as a feature value representing a speaker, and a correlation between feature values of two separated signals (i.e., a correlation between energy variations of the separated signals) is equal to or larger than a threshold value, the two separated signals may be determined to belong to the same speaker.
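A sketch of this energy-correlation variant follows; the frame length and correlation threshold are illustrative assumptions, not values given by the present technology, and the two signals are assumed to be time-aligned and of equal length.

```python
import numpy as np

def frame_energy(signal, frame_len=1024):
    # Temporal energy variation: mean energy per fixed-length frame.
    n = len(signal) // frame_len
    return (signal[:n * frame_len].reshape(n, frame_len) ** 2).mean(axis=1)

def same_speaker_by_energy(sig_a, sig_b, corr_threshold=0.7):
    # Signals of the same speaker should have correlated energy envelopes,
    # since a leaked component follows the louder copy in time.
    corr = np.corrcoef(frame_energy(sig_a), frame_energy(sig_b))[0, 1]
    return corr >= corr_threshold
```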

< other modified example 1>

< Use of single speaker determination model >

Meanwhile, in each of the embodiments described above, an example has been described in which if the average level (energy) of the separated signal obtained by sound source separation becomes sufficiently small, that is, if the average level becomes equal to or less than the threshold value, it is determined that the end condition of the recursive sound source separation is satisfied.

In this case, when sound source separation is performed on a separated signal including only utterances of a single speaker, a silence separated signal is obtained, and it is determined that an end condition is satisfied.

Therefore, although the separated signal of each speaker has in fact already been obtained at the point when a separated signal including only the utterance of a single speaker is obtained, sound source separation must still be performed on it once more, and thus the number of sound source separation operations increases accordingly. This is not preferable for applications in which, for example, the processing time is limited.

Thus, the end determination may be performed by using a single speaker determination model that is an acoustic model that receives the separated signal as an input and determines whether the separated signal is an acoustic signal including only an utterance of a single speaker or an acoustic signal of a mixed voice including utterances of a plurality of speakers.

In other words, the single speaker determination model is an acoustic model for determining whether the number of speakers of an utterance included in the input separated signal is 1.

In such an example, for example, a single speaker determination model obtained in advance by learning is retained in the end determination unit 22 of the signal processing apparatus 11 or the signal processing apparatus 51.

Then, for example, in step S12 of fig. 3 or step S42 of fig. 6, the end determination unit 22 performs calculation based on the retained single speaker determination model and the separated signal obtained by sound source separation, and determines whether the number of speakers of the utterance included in the separated signal is 1. In other words, it is determined whether the isolated signal includes only utterances of a single speaker.

Then, if the obtained determination result indicates that the number of speakers of the utterance included in all the separated signals is 1, that is, the separated signals include only the utterance of a single speaker, the end determination unit 22 determines that the end condition is satisfied.

In the determination using such a single speaker determination model, the task is simplified compared to using a speaker number estimation model for estimating the number of speakers of an utterance included in a separate signal. Therefore, there is an advantage that a higher performance acoustic model (single speaker determination model) can be obtained with a smaller model size. That is, sound source separation can be performed more easily than in the case of using the speaker number estimation model.

As described above, by determining whether the end condition is satisfied using the single speaker determination model, the entire processing amount (the number of times of processing) and the processing time of the sound source separation processing described with reference to fig. 3 and 6 can be reduced.
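With a single speaker determination model, the end determination reduces to a check like the following sketch, where `is_single_speaker` stands in for the learned model; the name is illustrative, not from the source.

```python
def end_condition_met(separated_signals, is_single_speaker) -> bool:
    # Recursive separation ends once every separated signal is judged to
    # contain the utterance of exactly one speaker.
    return all(is_single_speaker(s) for s in separated_signals)
```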

Further, for example, in the case where the end determination is performed using a single speaker determination model or the like, in the sound source separation processes described with reference to figs. 3 and 6, it is also possible to perform the end determination first, that is, to determine whether or not the end condition is satisfied, and then perform recursive sound source separation according to the result of the determination.

In this case, for example, when the single speaker determination model is used for the end determination, recursive sound source separation is performed on the separated signal determined to be not the separated signal including only the utterance of the single speaker by using the single speaker determination model.

In addition, the sound source separation unit 21 may select a sound source separation model for recursive sound source separation using a speaker number determination model for determining a rough number of speakers.

Specifically, for example, assume a case where the sound source separation unit 21 retains a speaker number determination model for determining whether the input acoustic signal is a signal including utterances of two or fewer speakers or a signal including utterances of three or more speakers, a two-speaker separation model, and a three-speaker separation model.

In this case, the sound source separation unit 21 determines the number of speakers by inputting the input acoustic signal or a separated signal obtained by sound source separation to the speaker number determination model, and selects the two-speaker separation model or the three-speaker separation model as the sound source separation model to be used for sound source separation.

That is, for example, for an input acoustic signal or a separated signal determined to be a signal including utterances of three or more speakers, the sound source separation unit 21 performs sound source separation using the three-speaker separation model.

On the other hand, for an input acoustic signal or a separated signal determined to be a signal including utterances of two or fewer speakers, the sound source separation unit 21 performs sound source separation using the two-speaker separation model.

In this way, an appropriate sound source separation model can be selectively used for sound source separation.
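Such selective use might be sketched as follows; the three wrapper functions stand in for the retained speaker number determination model, two-speaker separation model, and three-speaker separation model, and their names are assumptions for illustration.

```python
def separate_with_model_selection(signal, is_three_or_more,
                                  separate_two, separate_three):
    # Choose the separation model according to the rough speaker count.
    if is_three_or_more(signal):
        return separate_three(signal)  # three-speaker separation model
    return separate_two(signal)        # two-speaker separation model
```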

< other modified example 2>

< use of language information >

Further, in the second embodiment or the fourth embodiment, the same speaker determination may be performed based on the language information of a plurality of separated signals. Specifically, here, text information indicating the content of speech (an utterance) based on a separated signal will be described as an example of the language information.

In this case, for example, the same speaker determination unit 61 of the signal processing apparatus 51 performs a speech recognition process on the separated signal of each speaker supplied from the end determination unit 22, and converts the speech of the separated signal of each speaker into text. That is, text information indicating the content of an utterance based on a separation signal is generated by a speech recognition process.

Then, in a case where the texts indicated by the text information of any two or more separated signals, that is, the contents of the utterances, form a sentence when merged (integrated), the same speaker determining unit 61 determines that those separated signals belong to the same speaker.

Specifically, for example, in a case where the utterances of two separated signals indicated by the text information are the same in time and content, the two separated signals are determined to belong to the same speaker.

Further, for example, in a case where the utterances indicated by the text information of two separated signals are temporally different but form a meaningful sentence when integrated into one utterance, the two separated signals are determined to belong to the same speaker.

In this way, using language information such as text information improves the accuracy of determining the same speaker, and thus separation performance can be improved.

< other modified example 3>

< use of same speaker determination model >

Further, in the second embodiment or the fourth embodiment, the same speaker determination may be performed based on the same speaker determination model for determining whether each of any two separated signals includes an utterance of the same speaker, that is, whether the two separated signals are signals of the same speaker.

Here, the same speaker determination model is an acoustic model that receives two separated signals as inputs and outputs a determination result as to whether the speakers of the utterances included in the respective separated signals are the same or different.

In this case, for example, the same speaker determination unit 61 of the signal processing apparatus 51 retains the same speaker determination model obtained in advance by learning.

Based on the retained same speaker determination model and the separated signal of each speaker supplied from the end determination unit 22, the same speaker determination unit 61 determines, for all possible combinations of two separated signals, whether the speakers of the utterances included in the two separated signals are the same.

In the same speaker determination using such a same speaker determination model, the task is simplified as compared with the case of the above speaker recognition model. Therefore, there is an advantage that a higher performance acoustic model (same speaker determination model) can be obtained with a smaller model size.

Note that, in determining the same speaker, the separated signals of the same speaker may be specified by combining a plurality of optional methods such as the above-described method using the distance between feature values, the method using language information, and the method using the same speaker determination model.
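A minimal sketch of such a combination, assuming each method is wrapped as a boolean test over a pair of separated signals, is a simple majority vote; the voting rule itself is an illustrative choice, not one prescribed by the present technology.

```python
# Minimal sketch: combine several optional same speaker tests by majority
# vote. Each test is a callable (sig_a, sig_b) -> bool, e.g. the feature
# distance, language information, and pairwise model checks above.

def same_speaker_combined(sig_a, sig_b, tests):
    votes = sum(1 for test in tests if test(sig_a, sig_b))
    return votes * 2 > len(tests)          # a strict majority must agree
```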

< example of configuration of computer >

Meanwhile, the series of processes described above may be executed not only by hardware but also by software. In the case where a series of processes is executed by software, a program constituting the software is installed on a computer. Here, the computer includes, for example, a computer incorporated in dedicated hardware or a general-purpose personal computer capable of executing various functions with various programs installed.

Fig. 7 is a block diagram showing a configuration example of hardware of a computer that executes the above-described series of processing according to a program.

In the computer, a Central Processing Unit (CPU) 501, a Read Only Memory (ROM) 502, and a Random Access Memory (RAM) 503 are connected to each other by a bus 504.

The bus 504 is further connected to an input/output interface 505. The input/output interface 505 is connected to the input unit 506, the output unit 507, the recording unit 508, the communication unit 509, and the drive 510.

The input unit 506 includes a keyboard, a mouse, a microphone, an imaging element, and the like. The output unit 507 includes a display, a speaker, and the like. The recording unit 508 includes a hard disk, a nonvolatile memory, and the like. The communication unit 509 includes a network interface and the like. The drive 510 drives a removable recording medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.

In order to execute the above-described series of processes, the computer having the above-described configuration causes the CPU 501 to load a program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504, for example, and then execute the program.

The program to be executed by the computer (CPU 501) can be provided by, for example, being recorded on a removable recording medium 511 as a package medium or the like. Further, the program may be provided via a wired or wireless transmission medium such as a local area network, the internet, or digital satellite broadcasting.

Inserting the removable recording medium 511 into the drive 510 allows the computer to install a program into the recording unit 508 via the input/output interface 505. Further, the program may be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. In addition, the program may be installed in the ROM 502 or the recording unit 508 in advance.

Note that the program to be executed by the computer may be a program that executes processing in chronological order as described in this specification, or may be a program that executes processing in parallel or when necessary (for example, when processing is called).

Furthermore, the embodiments of the present technology are not limited to the above-described embodiments, but may be modified in various ways within the scope of the present technology.

For example, the present technology may have a cloud computing configuration in which a plurality of devices share one function and cooperate via a network.

Further, each step described in the above-described flowcharts may be executed by one apparatus or may be shared by a plurality of apparatuses.

Further, in the case where a plurality of processes are included in one step, the plurality of processes included in the step may be executed by one apparatus or may be shared by a plurality of apparatuses.

Further, the present technology may also have the following configuration.

(1) A signal processing apparatus comprising:

a sound source separation unit recursively performing sound source separation on an input acoustic signal by using a predetermined sound source separation model learned in advance to separate a predetermined sound source from an acoustic signal for learning including the predetermined sound source.

(2) The signal processing apparatus according to (1), wherein,

the sound source separation unit performs sound source separation to separate a separated signal of an utterance of a speaker from an acoustic signal.

(3) The signal processing apparatus according to (2), wherein,

the sound source separation unit performs sound source separation on an acoustic signal whose number of speakers is unknown.

(4) The signal processing apparatus according to (2) or (3), wherein,

the sound source separation model is a speaker separation model that is learned to separate an acoustic signal for learning including utterances of two speakers into a separation signal including an utterance of one speaker and a separation signal including an utterance of the other speaker.

(5) The signal processing apparatus according to (2) or (3), wherein,

the sound source separation model is a speaker separation model that is learned to separate an acoustic signal for learning including utterances of three speakers into three separated signals, each of which includes an utterance of a corresponding one of the utterances of the three speakers.

(6) The signal processing apparatus according to (2) or (3), wherein,

the sound source separation model is a speaker separation model that is learned to separate an acoustic signal for learning including utterances of an arbitrary plurality of speakers into a separation signal including an utterance of one speaker and a separation signal including utterances of the remaining speakers other than the one speaker among the plurality of speakers.

(7) The signal processing apparatus according to any one of (2) to (6),

the sound source separation unit recursively performs sound source separation by using a plurality of sound source separation models different from each other as a predetermined sound source separation model.

(8) The signal processing apparatus according to any one of (2) to (7), further comprising:

an end determination unit that determines whether to end the recursive sound source separation based on the separation signal obtained by the sound source separation.

(9) The signal processing apparatus according to (8), wherein,

in a case where one of the separated signals obtained by the sound source separation is an unvoiced signal, the end determination unit determines to end the recursive sound source separation.

(10) The signal processing apparatus according to (8), wherein,

the end determination unit determines to end the recursive sound source separation in a case where the number of speakers of an utterance included in a separation signal obtained by the sound source separation is determined to be one based on the separation signal and a single speaker determination model for determining whether the number of speakers of an utterance included in a separation signal is one.

(11) The signal processing apparatus according to any one of (2) to (10), further comprising:

a same speaker determination unit that performs same speaker determination as to whether or not a plurality of separated signals obtained by the recursive sound source separation are signals of the same speaker, and synthesizes a separated signal from the plurality of separated signals of the same speaker.

(12) The signal processing apparatus according to (11), wherein,

the same speaker determination unit performs the same speaker determination by clustering the separated signals.

(13) The signal processing device according to (12), wherein,

the same speaker determination unit calculates feature values of the separated signals, and determines that the two separated signals are signals of the same speaker in a case where a distance between the feature values of the two separated signals is equal to or smaller than a threshold value.

(14) The signal processing device according to (12), wherein,

the same speaker determination unit performs the same speaker determination based on a correlation between temporal energy variations of two separated signals.

(15) The signal processing apparatus according to (11), wherein,

the same speaker determination unit performs the same speaker determination based on the language information of the plurality of separated signals.

(16) The signal processing apparatus according to (11), wherein,

the same speaker determination unit performs the same speaker determination based on the same speaker determination model for determining whether the two separated signals are signals of the same speaker.

(17) A signal processing method, comprising:

recursively performing, by a signal processing apparatus, sound source separation on an input acoustic signal by using a predetermined sound source separation model learned in advance to separate a predetermined sound source from an acoustic signal for learning including the predetermined sound source.

(18) A program for causing a computer to execute a process comprising the steps of:

recursively performing sound source separation on an input acoustic signal by using a predetermined sound source separation model learned in advance to separate a predetermined sound source from an acoustic signal for learning including the predetermined sound source.

List of reference marks

11 Signal processing apparatus

21 Sound source separation unit

22 End determination unit

51 Signal processing apparatus

61 Same speaker determination unit
