Signal processing device and method, and program

Document No.: 863756 | Publication date: 2021-03-16

Description: The technology "Signal processing device and method, and program" was created by 大迫庆一, 光藤祐基, 高桥将文, and 池宫由乐 on 2019-07-30. Abstract: The present technology relates to a signal processing apparatus and method, and a program that easily make it difficult to hear a leakage sound. The signal processing apparatus is provided with a masking sound generating unit that generates a masking sound for masking a sound of the first content and a sound of the second content heard in a region between the first region and the second region when the first content is reproduced in the first region and the second content is reproduced in the second region by wave field synthesis using a speaker array. The present technology can be applied to a content reproduction system.

1. A signal processing apparatus comprising:

a masking sound generating unit that generates a masking sound for masking a sound of first content and a sound of second content heard in a region between a first region and a second region in a case where the first content is reproduced in the first region and the second content is reproduced in the second region by wave field synthesis using a speaker array.

2. The signal processing apparatus of claim 1, further comprising:

an output unit that causes the speaker array to output the masking sound.

3. The signal processing apparatus of claim 1, further comprising:

an output unit that causes a speaker different from the speaker array to output the masking sound.

4. The signal processing apparatus according to claim 1,

wherein a reproduction level of the masking sound is greater than a background noise level.

5. The signal processing apparatus of claim 1, further comprising:

a wave field synthesis filter unit that performs a filtering process on the masking sound generated by the masking sound generation unit to generate a sound of each of a plurality of channels for reproducing the masking sound in a masking region between the first region and the second region by wave field synthesis.

6. The signal processing apparatus according to claim 1,

wherein the masking sound generating unit generates the masking sound based on external information.

7. The signal processing apparatus according to claim 6,

wherein the external information includes information indicating at least one of a time zone, a day of the week, the number of visitors, and weather.

8. The signal processing apparatus of claim 1, further comprising:

a detection unit that detects a person as a subject from an image including at least regions around the first region and the second region;

wherein the masking sound generating unit generates the masking sound based on a result of detection of the person by the detecting unit.

9. The signal processing apparatus of claim 1, further comprising:

an analysis unit that analyzes a feature of background noise of a surrounding environment;

wherein the masking sound generating unit generates the masking sound based on an analysis result of the feature.

10. The signal processing apparatus according to claim 9,

wherein the masking sound generating unit generates the masking sound having a frequency characteristic based on the analysis result of the feature.

11. The signal processing apparatus of claim 9, further comprising:

a reproduction level adjustment unit that adjusts a reproduction level of the masking sound based on the analysis result of the feature.

12. The signal processing apparatus of claim 9, further comprising:

an echo cancellation unit that extracts the background noise by performing echo cancellation based on the sound of the first content and the sound of the second content with respect to the collected ambient sound.

13. The signal processing apparatus according to claim 1,

wherein the masking sound generating unit changes the frequency characteristic of the masking sound according to the frequency characteristics of the first content and the second content.

14. The signal processing apparatus of claim 1, further comprising:

a reproduction level adjustment unit that changes a reproduction level of the masking sound in accordance with reproduction levels of the first content and the second content.

15. A signal processing method performed by a signal processing apparatus, the method comprising:

generating, in a case where first content is reproduced in a first area and second content is reproduced in a second area by wave field synthesis using a speaker array, a masking sound for masking a sound of the first content and a sound of the second content heard in an area between the first area and the second area.

16. A program for causing a computer to execute a process, the process comprising the steps of:

generating, in a case where first content is reproduced in a first area and second content is reproduced in a second area by wave field synthesis using a speaker array, a masking sound for masking a sound of the first content and a sound of the second content heard in an area between the first area and the second area.

Technical Field

The present technology relates to a signal processing apparatus and method, and a program, and more particularly, to a signal processing apparatus and method, and a program that can easily make it difficult to hear a leakage sound.

Background

In recent years, multi-region reproduction using a wave field synthesis technique is known. In multi-region reproduction, spatial division of auditory sound is achieved by arbitrarily changing the reproduction sound pressure of each of a plurality of divided regions (zones).

By using this multi-region reproduction, for example, it is possible to make voice guidance about a painting in a museum heard only in the area in front of that painting, with no sound heard in the other areas.

Further, for example, in a public facility such as an airport or a station, voice information can be presented in different languages to facility users in each of a plurality of areas.

Incidentally, in multi-region reproduction, when a listener hears a leaked sound of another region while hearing a sound presented in a predetermined region, it becomes extremely difficult for the listener to acquire sound information. Therefore, it is important to be able to hear the sound only in the target area. In other words, it is required to prevent sound from leaking around the boundary of the target region.

Therefore, a technique is proposed in which, for example, a pair of speakers are arranged such that the distance between the speakers is one eighth wavelength to one wavelength of a radiated sound wave, and the sound waves from these speakers are made to interfere with each other to cancel sound (for example, see patent document 1).

In patent document 1, a filter for canceling sound is adjusted based on the detection output of a microphone provided in front of the speaker pair. Then, by outputting sound waves from the pair of speakers based on acoustic signals subjected to filtering processing by the obtained filter, cancellation due to interference of the sound waves is achieved at the control point where the microphone is arranged.

Further, there is also proposed a technique in which a movement of a user into a predetermined guide area is recognized using a sensor, and when the user enters the guide area, a voice corresponding to the guide area is reproduced based on an acoustic signal subjected to a filtering process by a predetermined filter (for example, see patent document 2).

In patent document 2, by generating filters that make observation signals at a plurality of control points become desired plane waves, it is possible to generate plane waves with suppressed expansion and realize voice reproduction in a guide area.

Reference list

Patent document

Patent document 1: Japanese Patent Application Laid-Open No. 2000-295697

Patent document 2: Japanese Patent Application Laid-Open No. 2017-161448

Disclosure of Invention

Problems to be solved by the invention

However, with the above-described technologies, it is difficult to easily make a leakage sound that leaks out of the target area inaudible.

For example, in the technique described in patent document 1, sound is canceled at a control point where a microphone is installed, but sound leaks at a position other than the control point. Specifically, the farther from the control point, the greater the sound leakage.

Further, for example, in the technique described in patent document 2, a large number of control points are required to form one guidance area, but the number of control points that can be created is, in principle, at most one less than the number of speakers constituting the speaker array used for voice reproduction. Therefore, a large number of speakers are required to form one guidance area.

The present technology has been made in view of such a situation, and makes it easy to render the leakage sound difficult to hear.

Solution to the problem

A signal processing apparatus according to an aspect of the present technology includes: a masking sound generating unit that generates a masking sound for masking a sound of the first content and a sound of the second content heard in a region between the first region and the second region in a case where the first content is reproduced in the first region and the second content is reproduced in the second region by wave field synthesis using the speaker array.

A signal processing method or program according to an aspect of the present technology includes the steps of: in a case where the first content is reproduced in the first area and the second content is reproduced in the second area by wave field synthesis using the speaker array, a masking sound for masking a sound of the first content and a sound of the second content heard in an area between the first area and the second area is generated.

According to an aspect of the present technology, in a case where first content is reproduced in a first area and second content is reproduced in a second area by wave field synthesis using a speaker array, a masking sound for masking a sound of the first content and a sound of the second content heard in an area between the first area and the second area is generated.

Effects of the invention

According to an aspect of the present technology, it is possible to easily make it difficult to hear the leakage sound.

It should be noted that the effects described herein are not necessarily limited, but may also be any of the effects described in the present disclosure.

Drawings

Fig. 1 is a diagram illustrating multi-region reproduction.

Fig. 2 is a diagram illustrating sound pressure distributions of a content sound and a background noise.

Fig. 3 is a diagram showing a configuration example of the content reproduction system.

Fig. 4 is a diagram illustrating an example of a parameter table.

Fig. 5 is a diagram illustrating a division area and a reproduction range of a masking sound.

Fig. 6 is a diagram illustrating sound pressure distributions of a content sound and a masking sound.

Fig. 7 is a flowchart illustrating the content reproduction process.

Fig. 8 is a diagram showing a configuration example of the content reproduction system.

Fig. 9 is a diagram illustrating a division area and a mask area.

Fig. 10 is a diagram illustrating sound pressure distributions of a content sound and a masking sound.

Fig. 11 is a flowchart illustrating the content reproduction process.

Fig. 12 is a diagram showing a configuration example of the content reproduction system.

Fig. 13 is a flowchart illustrating the content reproduction process.

Fig. 14 is a diagram showing a configuration example of the content reproduction system.

Fig. 15 is a diagram illustrating an example of a parameter table.

Fig. 16 is a flowchart illustrating the content reproduction process.

Fig. 17 is a diagram illustrating an example of a parameter table.

Fig. 18 is a diagram showing a configuration example of the content reproduction system.

Fig. 19 is a flowchart illustrating the content reproduction process.

Fig. 20 is a diagram showing a configuration example of the content reproduction system.

Fig. 21 is a diagram showing a configuration example of a computer.

Detailed Description

Embodiments to which the present technology is applied are described below with reference to the drawings.

< first embodiment >

< present technology >

In the case where the sounds of different pieces of content are reproduced in different regions, the present technology masks the sound of each piece of content heard in the region between those regions with a masking sound of a predetermined level, and can thereby easily make the leakage sound difficult to hear.

First, a leakage sound at the time of multi-region reproduction will be described.

For example, as shown in fig. 1, consider performing multi-zone reproduction using the speaker array SP11. It should be noted that, for the sake of simplicity, control points, i.e., positions where sound is canceled (muted), are not drawn here.

Now, it is assumed that content A (i.e., the sound of content A) is reproduced in an area A and content B is reproduced in an area B by wave field synthesis.

Here, the divided region R11 is the area A, i.e., a listening area in which the content A is heard, and the divided region R12 is the area B, i.e., a listening area in which the content B is heard. Hereinafter, a region where content is heard, that is, a listening region of the content, is also particularly referred to as a divided region.

Further, the content A and the content B are assumed to be different pieces of music or voice. Note that, hereinafter, the sound of given content is also referred to as a content sound.

For example, in the case where the content A and the content B are reproduced simultaneously, the sound pressure distribution along the straight line L11 in the x-axis direction in the drawing is as shown in fig. 2. Note that, in fig. 2, the horizontal axis represents a position in the x-axis direction, and the vertical axis represents the sound pressure at each position.

In the example shown in fig. 2, a curve L21 represents the sound pressure distribution of the sound of the content A, and a curve L22 represents the sound pressure distribution of the sound of the content B.

Further, a straight line L23 represents the sound pressure level of the background noise around the speaker array SP11. The background noise includes sound from ambient sound sources in the vicinity of the speaker array SP11, such as the voices and footsteps of passers-by, air-conditioning sound, and the like.

Generally, in the area A and the area B, the sounds of the content A and the content B are reproduced at a sound pressure higher than the background noise level so that the sounds are easily heard by the listener.

Specifically, for example, in the case where the background noise is about 60 dB, the sounds of the content A and the content B are reproduced at about 80 dB.

At this time, leakage sound from the divided regions is generated in the vicinity of the boundary of the divided regions.

Note that the leakage sound referred to herein refers to a content sound that leaks outside the divided area and is heard. That is, the content sound heard outside the divided area is a leakage sound.

For example, in the example of fig. 2, the sounds of the content A and the content B in the portions surrounded by the broken line CR11 are leakage sounds heard in the region between the area A and the area B, that is, in a region outside the divided regions.

In particular, in the portions surrounded by the broken line CR11, the sound pressures of the content A and the content B are higher than the background noise level indicated by the straight line L23, and the content sounds can be heard by a person outside the divided regions.

Further, in the example of fig. 2, the portion enclosed by the broken line CR12 is a region within the area A, and the sound pressure of the content B in that region is higher than the background noise level. Therefore, not only the sound of the content A but also the sound of the content B is heard by a listener near the boundary on the area B side within the area A. That is, the sound of the content B leaks into the area A.

Similarly, the portion enclosed by the broken line CR13 is a region within the area B, but in this region, the sound pressure of the content A is higher than the background noise level, and the sound of the content A leaks into the area B and is heard there.

When the content sound in a divided region is quiet or in a silent portion while the content sound in the other divided region is loud, the listener hears the leakage sound.

Generally, in the case where the leakage sound is voice or music, human hearing is drawn to such sounds, and hearing the leakage sound against one's will is unpleasant.

In this way, at the time of multi-region reproduction, it is necessary to keep people from perceiving the leakage sound by reducing the leakage of content sound between divided regions and in regions other than the divided regions. To this end, the techniques of the above-mentioned patent documents 1 and 2 have been proposed.

However, with these conventional techniques, it is not easy to make the leakage sound difficult to hear in multi-zone reproduction using a plurality of speakers.

For example, in patent document 1, a content sound (i.e., a leakage sound) in an area other than a control point (i.e., a position where a sound is eliminated) cannot be eliminated, and an undesired content sound is leaked and heard around the control point.

It should be noted that although the number of control points in patent document 1 can be increased, a huge number of speakers and microphones are required to make it difficult to hear the leaking sound outside the divided area. Also, since the microphones must be installed at the control points, an increase in the number of control points becomes an obstacle to layout of the microphones and the like in the course of operating in real space.

Further, also in patent document 2, an enormous number of speakers are required to reduce leakage of content sounds.

Therefore, in the present technology, in the case of performing multi-region reproduction using the wave field synthesis technology, that is, in the case of dividing the reproduction space into a plurality of divided regions and causing different pieces of content to be reproduced in these divided regions, not only the content sounds but also a masking sound is output at the same time. Therefore, without increasing the number of speakers or using an enormous number of microphones, it is possible to easily make the leakage sound difficult to hear.

It should be noted that, for example, in determining the reproduction level of the masking sound, external information or an external sensor can be used.

For example, as the external information, visitor number information indicating the number of visitors (the number of attendees) at a facility or a site where the content is reproduced, time zone information indicating a time zone when the content is reproduced, climate information indicating weather (climate) when the content is reproduced, and the like can be used.

By using the external information, it is possible to output a masking sound of an appropriate level according to the number of persons at the time of content reproduction, time zone, climate, and the like and mask a content sound that has become a leakage sound. That is, the masking sound makes it difficult to hear the leakage sound.

Further, as an external sensor used to determine the reproduction level of the masking sound, for example, a microphone or a video camera can be employed.

For example, since the level of background noise of the surrounding environment can be evaluated by using a microphone, the reproduction level of a masking sound can be appropriately determined according to the background noise level.

Similarly, since the number of people in the vicinity can be evaluated by using the camera, the reproduction level of the masking sound can be appropriately determined according to the evaluation result.

It should be noted that the external information and the external sensor can be used not only to determine the reproduction level of the masking sound but also to determine characteristics such as the frequency characteristics of the masking sound.

Further, the reproduction level of the masking sound may be changed in accordance with a change in the reproduction level of the content sound in the divided regions.

For example, in the case of using a microphone as the external sensor, the level of background noise of the surrounding environment can be detected by using the microphone. Therefore, it is sufficient if the reproduction level of the content sound is changed according to the detection result of the background noise level and the reproduction level of the masking sound is determined according to the change of the reproduction level of the content sound.

Specifically, for example, it is assumed that when the reproduction level of the content sound is higher, the reproduction level of the masking sound is increased, and conversely, when the reproduction level of the content sound is lower, the reproduction level of the masking sound is decreased.

Further, it is also conceivable to increase the reproduction level of the masking sound when the difference between the reproduction level of the content sound and the background noise level is large, and conversely, decrease the reproduction level of the masking sound when the difference between the reproduction level of the content sound and the background noise level is small.

This is because the content sound is heard more loudly when the difference between the reproduction level of the content sound and the background noise level is large, so the reproduction level of the masking sound is increased by a corresponding amount so that the leakage sound becomes harder to hear.

Also, the levels of the content sound and the background noise may be compared for each frequency band to evaluate at which level the content sound in each frequency band leaks out of the background noise, and according to the evaluation result, the reproduction level of the masking sound may be determined for each frequency band to enable masking of the leaking sound in terms of auditory characteristics.
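The per-band rule described above might be sketched as follows. This is an illustrative Python sketch rather than anything specified in the present disclosure; the function name, the fixed 3 dB margin, and the exact rule (mask only the bands where the content rises above the noise) are all assumptions:

```python
import numpy as np

def band_masking_levels(content_db, noise_db, margin_db=3.0):
    """Per frequency band: where the content level exceeds the background
    noise level (i.e., the band leaks audibly), set the masking-sound level
    slightly above the content level; otherwise no masking is needed (0)."""
    content = np.asarray(content_db, dtype=float)
    noise = np.asarray(noise_db, dtype=float)
    excess = content - noise                  # leakage above noise, per band
    return np.where(excess > 0.0, content + margin_db, 0.0)

# Three bands: the content exceeds the noise only in the first two.
levels = band_masking_levels([70.0, 65.0, 50.0], [60.0, 60.0, 60.0])
```

With this rule, only the bands that audibly leak receive masking energy, which is what makes the masking perceptually efficient.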

Further, the level of background noise of the surrounding environment may be detected using a microphone as an external sensor, and the reproduction level of the masking sound may be determined based on the detection result. In this case, the reproduction level of the content sound can be kept unchanged specifically.

Specifically, for example, it is conceived to decrease the reproduction level of the masking sound when the background noise level is high and it is difficult to hear the leakage sound, and conversely, increase the reproduction level of the masking sound when the background noise level is low.

It should be noted that the reproduction level or characteristic of the masking sound may be determined by any combination of masking sound reproduction level control using the above-described external information, masking sound reproduction level control using an external sensor, masking sound reproduction level control according to the reproduction level of the content sound, and the like.

In this case, for example, a parameter table may be prepared in advance in which combinations of external information or information obtained by using an external sensor are associated with the content sound reproduction level and the masking sound reproduction level of the respective combinations. Then, for example, the reproduction level of the content sound and the reproduction level of the masking sound can be easily and appropriately determined by using the parameter table.
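Such a parameter table can be as simple as a keyed lookup. The following Python sketch is purely illustrative; the keys, the dB values, and the fallback default are hypothetical, not taken from the disclosure:

```python
# Hypothetical parameter table: each key is a combination of external
# information (time zone of day, crowding class); each value gives the
# content and masking-sound reproduction levels in dB.
PARAMETER_TABLE = {
    ("daytime", "crowded"): {"content_db": 85, "masking_db": 78},
    ("daytime", "quiet"):   {"content_db": 80, "masking_db": 70},
    ("evening", "crowded"): {"content_db": 82, "masking_db": 75},
    ("evening", "quiet"):   {"content_db": 76, "masking_db": 66},
}

def lookup_levels(time_zone, crowding):
    # Unknown combinations fall back to a conservative default.
    return PARAMETER_TABLE.get((time_zone, crowding),
                               {"content_db": 80, "masking_db": 70})
```

Preparing the table in advance makes level selection a constant-time lookup at reproduction time.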

Further, in order to determine the reproduction levels of the content sound and the masking sound, for example, external information or information obtained by using an external sensor may be input, and a predictor that outputs a reproduction level suitable for the input content sound and a reproduction level of the masking sound may be used. For example, it is sufficient if the predictor is generated in advance by machine learning represented by a neural network.

Also, in a case where the leakage sound is so small that a person cannot perceive it, the masking sound may be kept from being reproduced (output).

Further, by taking the background noise level into consideration, the reproduction level of the masking sound can be determined more appropriately. This is because, in order to mask the leakage sound, it is preferable to set the reproduction level of the masking sound higher than the background noise level.

For example, in the case where the reproduction level of the masking sound is determined using the external information, the background noise level can be estimated from the external information. Therefore, for example, when a reproduction level preliminarily determined for given external information is used as the reproduction level of the masking sound, the masking sound can be reproduced at a level higher than the background noise level estimated for that external information.

Further, in the case of using a camera as an external sensor, for example, face recognition or person detection can be performed on an image captured by the camera to estimate the number of people in the vicinity of the speakers that reproduce the content, and a level preliminarily determined for the estimation result can be set as the background noise level. In this case, the reproduction level of the masking sound with respect to the background noise level can be appropriately determined.
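The mapping from an evaluated person count to a preliminarily determined background noise level could look like the following sketch; the thresholds and dB values are hypothetical placeholders, and the person count itself is assumed to come from a separate face-recognition or person-detection step:

```python
def noise_level_from_person_count(count):
    """Map a detected person count to an assumed background noise level
    (dB); the thresholds and levels here are illustrative placeholders."""
    if count >= 20:
        return 70.0   # crowded
    if count >= 5:
        return 62.0   # moderately busy
    return 55.0       # nearly empty
```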

Also, in the case of using a microphone as the external sensor, the sound of the surrounding environment can be collected by the microphone. The sound obtained by this sound collection includes at least background noise, and also includes content sound depending on the timing of sound collection.

Therefore, sound collection using a microphone is performed only in a silent part of the content sound, and the sound obtained by the sound collection is used as background noise, so that the background noise level can be evaluated more accurately.
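Silence-gated noise measurement as described here might be sketched as follows; the frame size, the -50 dB silence threshold, and the RMS criterion are assumptions for illustration only:

```python
import numpy as np

def estimate_noise_rms(mic_signal, content_signal, frame=256, silence_db=-50.0):
    """Estimate the background-noise RMS using only frames in which the
    content signal is effectively silent, so the microphone is assumed to
    pick up background noise alone."""
    mic = np.asarray(mic_signal, dtype=float)
    content = np.asarray(content_signal, dtype=float)
    noise_rms = []
    for start in range(0, min(len(mic), len(content)) - frame + 1, frame):
        c = content[start:start + frame]
        c_rms = np.sqrt(np.mean(c * c)) + 1e-12
        if 20.0 * np.log10(c_rms) < silence_db:   # content is silent here
            m = mic[start:start + frame]
            noise_rms.append(np.sqrt(np.mean(m * m)))
    return float(np.mean(noise_rms)) if noise_rms else None

# First half of the content is silent; the microphone sees constant noise.
content = np.concatenate([np.zeros(1024), np.ones(1024)])
mic = 0.1 * np.ones(2048)
estimate = estimate_noise_rms(mic, content)
```

Because only silent-content frames contribute, the estimate is not biased upward by the reproduced content.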

Further, in the case of using a microphone as an external sensor, it is possible to cancel content sound reproduced from a collected sound and extract only background noise using echo cancellation. Even in this case, the background noise level can be accurately estimated.
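One standard way to realize such echo cancellation is a normalized-LMS (NLMS) adaptive filter. The sketch below is a generic textbook NLMS canceller, not the specific implementation of the disclosure; the filter length and step size are arbitrary choices:

```python
import numpy as np

def nlms_echo_cancel(mic, reference, taps=16, mu=0.5):
    """Normalized-LMS echo canceller: adapts a short FIR filter so that the
    filtered reference (the reproduced content) matches its echo in the
    microphone signal; the residual approximates the background noise."""
    w = np.zeros(taps)
    residual = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = reference[n - taps:n][::-1]       # most recent reference samples
        e = mic[n] - w @ x                    # cancellation error (residual)
        w += mu * e * x / (x @ x + 1e-8)      # normalized LMS update
        residual[n] = e
    return residual

# The microphone hears the content delayed and attenuated; with no other
# noise present, the residual converges toward zero.
rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)
mic = np.convolve(ref, [0.0, 0.0, 0.8])[:4000]
residual = nlms_echo_cancel(mic, ref)
```

In practice the residual contains the background noise plus a small adaptation error, which is then analyzed for its level and spectrum.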

Also, the characteristics, such as the frequency characteristics, of a masking sound for masking a leakage sound can be the same as those of crowd noise, babble noise, pink noise, or the like. In this way, the leakage sound can be masked without causing discomfort.

Further, the frequency characteristic of the masking sound may be the same as that of the sound collected by a microphone serving as the external sensor. Alternatively, the characteristic of the masking sound may be one that obscures the characteristics of the content sound, that is, one that makes the content sound difficult to hear.
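Matching the masking sound's frequency characteristic to that of a collected sound can be illustrated by shaping random-phase noise with the collected signal's magnitude spectrum. This sketch, including its normalization, is an assumption for illustration, not part of the disclosure:

```python
import numpy as np

def shaped_masking_noise(collected, length, seed=0):
    """Generate masking noise whose magnitude spectrum follows that of the
    collected ambient sound; the phase is randomized, so the result is
    noise-like rather than a copy of the recording."""
    rng = np.random.default_rng(seed)
    target_mag = np.abs(np.fft.rfft(collected, n=length))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=target_mag.shape)
    noise = np.fft.irfft(target_mag * np.exp(1j * phase), n=length)
    return noise / (np.max(np.abs(noise)) + 1e-12)   # normalize to +/-1

# Shape noise to a recording dominated by a single frequency (FFT bin 8).
t = np.arange(256)
collected = np.sin(2.0 * np.pi * 8.0 * t / 256.0)
noise = shaped_masking_noise(collected, 256)
```

Randomizing the phase keeps the masker from being intelligible even when the collected sound contains speech.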

Also, the region in which the masking sound is reproduced may be the entire region in which the content reproduction system can reproduce sound.

Further, by using wave field synthesis to generate a directional sound beam (wave surface) for the masking sound, the masking sound may be reproduced only in the region between the plurality of divided regions, or in that region together with regions near the boundary of each divided region.

Also, in the case of using a microphone as the external sensor, the microphone can be basically installed anywhere, but, for example, when the microphone is installed outside the divided region formed by the wave field synthesis, the levels of the background noise and the leakage sound can be directly evaluated.

< example of configuration of content reproduction System >

Next, more specific embodiments of the present technology described above will be described.

Fig. 3 is a diagram showing a configuration example of an embodiment of a content reproduction system to which the present technology is applied.

The content reproduction system shown in fig. 3 includes a signal processing apparatus 11 and a speaker array 12.

The signal processing device 11 generates an output sound for reproducing the content sound and the masking sound, more specifically, an acoustic signal of the output sound, and supplies the acoustic signal to the speaker array 12.

For example, the speaker array 12 is a linear speaker array including a plurality of speakers, an annular speaker array, a spherical speaker array, or the like. It should be noted that the speaker array 12 may be a speaker array having an arbitrary shape.

By outputting the output sound supplied from the signal processing device 11, the speaker array 12 reproduces the plurality of content sounds and the masking sound, and the masking sound masks unwanted content sounds, i.e., leakage sounds, at the respective positions.

Accordingly, by wave field synthesis, the content sound corresponding to each divided region is reproduced in that divided region within the reproduction space (multi-region reproduction), and the masking sound is reproduced across the entire reproduction space. In the multi-region reproduction, the speaker array 12 outputs the output sound so that the wave surface of each content sound is formed in its divided region of the reproduction space.

Note that, hereinafter, the entire space where sound is reproduced by the speaker array 12 is referred to as a reproduction space. Further, here, it is assumed that the positions and sizes of the contents reproduced in the reproduction space and the divided regions where the respective contents are sound-reproduced are preliminarily determined. However, the content to be reproduced and the position and size of the divided area may not be preliminarily determined but may be dynamically changed.

The signal processing apparatus 11 includes a masking sound generating unit 21, a wave field synthesis filter unit 22, a reproduction level adjusting unit 23, an amplifying unit 24, an amplifying unit 25, an adding unit 26, and a digital-to-analog (DA) converting unit 27.

The masking sound generating unit 21 generates a masking sound based on external information supplied from the outside and supplies it to the amplifying unit 24.

The wave field synthesis filter unit 22 holds, in advance, wave field synthesis filters that cause each piece of content to be reproduced only in its divided region by wave field synthesis.

It should be noted that the wave field synthesis filter unit 22 may itself generate the wave field synthesis filters, for example, by calculation based on coordinate information indicating the positions and sizes of the divided regions in which the content sounds are to be reproduced, the arrangement positions of the respective speakers constituting the speaker array 12, and the like.

In the wave field synthesis using this wave field synthesis filter, the amplitude and phase of the output sound output from the respective speakers constituting the speaker array 12 are controlled by the filtering process of the wave field synthesis filter so that the wave surface of the sound in the reproduction space is physically reproduced. That is, the wave surface of the content sound is formed only in the divided region.

Wave field synthesis is described in detail, for example, in Japanese Patent Application Laid-Open No. 2013-102389 and in Berkhout, Augustinus J., Diemer de Vries, and Peter Vogel, "Acoustic control by wave field synthesis," The Journal of the Acoustical Society of America 93.5 (1993): 2764-2778. In the wave field synthesis filter unit 22, it is sufficient to use the techniques described in these documents and the like.

The wave field synthesis filter unit 22 performs a filtering process on the supplied content sound data (i.e., acoustic signals for causing the content sound to be reproduced) using a wave field synthesis filter, and supplies synthesized output sounds (synthesized outputs) of the respective channels corresponding to the respective speakers constituting the speaker array 12 to the amplification unit 25.

More specifically, the wave field synthesis filter unit 22 has a wave field synthesis filter for each content and performs the filtering process on the content sound data using the wave field synthesis filter for each content. Then, for each channel, the wave field synthesis filter unit 22 adds the content sounds of the respective contents obtained by the filtering process to obtain an output sound containing the sounds of all the contents. That is, the acoustic signals of the content sounds obtained for the same channel are added to obtain the acoustic signal of the output sound of that channel.
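The per-content filtering and channel-wise summation described above can be sketched as follows. This is a minimal plain-Python illustration: the names `convolve` and `render_output` and the tiny FIR banks are assumptions for illustration only, not the actual filters of the wave field synthesis filter unit 22, which would be derived from the speaker geometry and the region coordinates.

```python
def convolve(x, h):
    """Direct-form FIR convolution of signal x with filter taps h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def render_output(contents, filter_banks):
    """Apply each content's wave field synthesis filter bank and sum per channel.

    contents:     list of mono content signals (lists of floats, equal length)
    filter_banks: one bank per content; each bank holds one FIR tap list
                  per loudspeaker channel
    """
    num_channels = len(filter_banks[0])
    out_len = len(contents[0]) + len(filter_banks[0][0]) - 1
    output = [[0.0] * out_len for _ in range(num_channels)]
    for signal, bank in zip(contents, filter_banks):
        for ch, taps in enumerate(bank):
            # The per-channel filtering controls amplitude and phase so that
            # the wave surface of this content forms only in its divided region.
            for k, v in enumerate(convolve(signal, taps)):
                output[ch][k] += v
    return output
```

Summing the filtered signals channel by channel is exactly the "same channel" addition described above: each loudspeaker receives one signal carrying all contents.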

The reproduction level adjustment unit 23 controls adjustment of the reproduction levels of the masking sound and the content sound based on at least one of external information supplied from the outside and supplied content sound data.

That is, the reproduction level adjustment unit 23 determines the reproduction level of the masking sound based on at least one of the external information and the content sound data and supplies a masking sound gain coefficient for reproducing the masking sound at the determined reproduction level to the amplification unit 24.

Similarly, the reproduction level adjustment unit 23 determines the reproduction level of the content sound based on at least one of the external information and the content sound data and supplies a content sound gain coefficient for reproducing the content sound at the determined reproduction level to the amplification unit 25.

The amplifying unit 24 includes an amplifier and adjusts the level of the masking sound by multiplying the masking sound supplied from the masking sound generating unit 21 by the masking sound gain coefficient supplied from the reproduction level adjusting unit 23 (gain adjustment). The amplifying unit 24 supplies the level-adjusted masking sound to the adding unit 26.

For example, the amplification unit 25 includes amplifiers provided for respective channels corresponding to respective speakers constituting the speaker array 12.

The amplification unit 25 adjusts the level of the output sound (i.e., the content sound) by multiplying the output sound of each channel supplied from the wave field synthesis filter unit 22 by the content sound gain coefficient supplied from the reproduction level adjustment unit 23 (gain adjustment). The amplification unit 25 supplies the level-adjusted output sound to the addition unit 26.

For example, the addition unit 26 includes adders provided for respective channels corresponding to respective speakers constituting the speaker array 12.

The addition unit 26 adds the masking sound supplied from the amplification unit 24 and the output sound of each channel supplied from the amplification unit 25 to generate a final output sound of each channel and supplies it to the DA conversion unit 27.
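The gain adjustment in the amplification units 24 and 25 and the channel-wise addition in the addition unit 26 amount to the following sketch. The names `apply_gain` and `mix_channels` are hypothetical; note that in this first embodiment the masking sound is a single signal added to every channel.

```python
def apply_gain(signal, gain):
    """Multiply every sample by a gain coefficient (amplification units 24/25)."""
    return [s * gain for s in signal]

def mix_channels(masking, output_chs, masking_gain, content_gain):
    """Level-adjust the masking sound and each channel's output sound, then
    add the (single-channel) masking sound to every channel (addition unit 26)."""
    adjusted_masking = apply_gain(masking, masking_gain)
    return [
        [m + o for m, o in zip(adjusted_masking, apply_gain(ch, content_gain))]
        for ch in output_chs
    ]
```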

The DA conversion unit 27 performs DA conversion on the output sounds of the respective channels supplied from the addition unit 26 and supplies acoustic signals (i.e., analog signals) of the synthesized output sounds of the respective channels to speakers corresponding to the respective channels constituting the speaker array 12, and causes the speakers to output (reproduce) the output sounds. Thus, it can be said that the DA conversion unit 27 functions as an output unit that causes the speaker array 12 to output the masking sound together with the content sound.

< adjustment of the level of reproduction and Generation of masking Sound >

Here, generation of a masking sound and adjustment of a reproduction level in the signal processing apparatus 11 will be described.

For example, as described above, the external information supplied to the masking sound generating unit 21 and the reproduction level adjusting unit 23 may be at least one of visitor number information, time zone information, day of the week information indicating the day of the week on which the content is reproduced, and weather information.

The masking sound generating unit 21 generates a masking sound according to the supplied external information, and the reproduction level adjusting unit 23 adjusts the reproduction level of the masking sound and the content sound according to the external information.

Specifically, for example, in the case of using day of the week information and time zone information as external information, the masking sound generating unit 21 and the reproduction level adjusting unit 23 hold the parameter table shown in fig. 4 in advance.

In fig. 4, the columns "day of the week" and "time zone" represent the day of the week information and the time zone information, respectively.

Further, the columns "content sound reproduction level" and "masking sound reproduction level" represent the reproduction level of the content sound and the reproduction level of the masking sound, that is, the content sound gain coefficient and the masking sound gain coefficient, respectively. Further, the column "masking sound parameter" represents the masking sound parameter, which is information indicating the frequency characteristic of the masking sound.

The masking sound generating unit 21 refers to the parameter table and generates a masking sound indicated by a masking sound parameter preliminarily determined for a combination of day of the week information and time zone information as supplied external information.

For example, the masking sound parameter "air conditioning" indicates the frequency characteristic of air-conditioning sound, and a masking sound generated based on the masking sound parameter "air conditioning" is a sound having a frequency characteristic similar to that of air-conditioning sound. Therefore, in the case where this masking sound is reproduced, a person listening to it perceives it as air-conditioning sound.

Further, the masking sound parameter "crowd + air conditioner" indicates the frequency characteristic of a mixture of crowd sound and air-conditioning sound. Therefore, when a masking sound generated based on the masking sound parameter "crowd + air conditioner" is reproduced, a person listening to it perceives crowd sound and air-conditioning sound.

The masking sound generating unit 21 holds the respective masking sound parameters in advance together with the parameter table.

The reproduction level adjustment unit 23 refers to the parameter table and supplies the content sound gain coefficient and the masking sound gain coefficient, which are preliminarily determined for the combination of the day of the week information and the time zone information supplied as external information, to the amplification unit 25 and the amplification unit 24, respectively.

The reproduction level adjustment unit 23 holds the content sound gain coefficient and the masking sound gain coefficient in advance together with the parameter table.

For example, in the case where the day of the week information "Sunday" and the time zone information "8:00-12:00" are supplied as external information, the content sound is reproduced at 25 dB, and a masking sound similar to air-conditioning sound is reproduced at 3 dB.
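A parameter table of this kind can be sketched as a simple lookup keyed by the external information. The Sunday-morning entry uses the 25 dB / 3 dB values quoted above; the afternoon entry and all other values are illustrative placeholders, not taken from fig. 4.

```python
# Hypothetical parameter table keyed by (day of the week, time zone).
PARAMETER_TABLE = {
    ("Sunday", "8:00-12:00"): {
        "content_level_db": 25,   # value from the example in the text
        "masking_level_db": 3,    # value from the example in the text
        "masking_param": "air conditioning",
    },
    ("Sunday", "12:00-18:00"): {
        "content_level_db": 35,   # illustrative placeholder
        "masking_level_db": 10,   # illustrative placeholder
        "masking_param": "crowd + air conditioner",
    },
}

def look_up(day_of_week, time_zone):
    """Return the gain coefficients and masking sound parameter that were
    preliminarily determined for this combination of external information."""
    return PARAMETER_TABLE[(day_of_week, time_zone)]
```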

In the embodiment shown in fig. 4, since the reproduction space is expected to be relatively quiet on days and at times when there are few visitors, such as Sunday morning, the reproduction level of the content sound is relatively low, and therefore the reproduction level of the masking sound is also low. Further, since the reproduction space is expected to be relatively quiet in this case, the masking sound is air-conditioning sound only.

On the other hand, for example, on Sunday afternoons, since many visitors are expected, the reproduction level of the content sound is increased, the reproduction level of the masking sound is also increased accordingly, and the masking sound is a mixture of crowd sound and air-conditioning sound.

Further, in the case of using the visitor number information as the external information, for example, when the number of visitors is large, the reproduction levels of the content sound and the masking sound may be increased, and the masking sound may be a mixture of crowd sound and air-conditioning sound.

Further, for example, in the case where the content reproduction system is used in a roofed outdoor location and weather information is used as the external information, when the weather indicated by the weather information is rainy, the leakage sound is masked to some extent by the rain sound included in the background noise.

Therefore, in this case, the reproduction level of the content sound can be increased and the reproduction level of the masking sound can be lowered. Note that, in this case, the masking sound may not be reproduced at all. Further, the masking sound may be a rain sound.

As described above, the signal processing device 11 generates the masking sound and controls the reproduction levels of the content sound and the masking sound using a parameter table in which the external information, the content sound gain coefficient, the masking sound gain coefficient, and the masking sound parameter are associated with one another.

It can be said that the control is a control in which the reproduction level of the content sound is changed in accordance with the external information, and further, the reproduction level of the masking sound is changed in accordance with the change in the reproduction level of the content sound.

Specifically, in the embodiment shown in fig. 4, when the reproduction level of the content sound is higher, the reproduction level of the masking sound is increased, and conversely, when the reproduction level of the content sound is lower, the reproduction level of the masking sound is decreased.

Further, the masking sound parameters held in advance in the masking sound generating unit 21 are generated by measuring in advance the background noises at the operation place of the content reproduction system, such as air-conditioning sound and the sound of the crowd of people entering and exiting. For example, the masking sound generating unit 21 generates, as the masking sound, Gaussian noise having the frequency characteristic of the background noise indicated by the masking sound parameter.

It should be noted that the masking sound is not limited to Gaussian noise, and may be any other noise such as pink noise, white noise, crowd noise, babble noise, or other common noises.
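Generating a masking sound as Gaussian noise with a prescribed frequency characteristic can be sketched as white Gaussian noise passed through an FIR filter whose response approximates the measured background-noise spectrum. The function names and the tap values are illustrative assumptions; a real masking sound parameter would encode the measured spectrum with many more taps.

```python
import random

def gaussian_noise(n, seed=0):
    """White Gaussian noise, the raw material for the masking sound."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def shape_noise(noise, taps):
    """Shape white noise with an FIR filter whose frequency response
    approximates the background-noise spectrum indicated by the masking
    sound parameter (the taps here are placeholders)."""
    shaped = [0.0] * len(noise)
    for i in range(len(noise)):
        for j, h in enumerate(taps):
            if i - j >= 0:
                shaped[i] += h * noise[i - j]
    return shaped
```

Even a crude two-tap low-pass such as `taps = [0.5, 0.5]` pushes the noise toward the dull, rumbling character of air-conditioning sound.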

Further, the masking sound generating unit 21 may generate a masking sound having a characteristic of removing a characteristic of the content sound using the content sound. That is, the frequency characteristic of the masking sound may be changed according to the frequency characteristic of the content sound. In this case, the content sound data is supplied to the masking sound generating unit 21.

Specifically, for example, in the case where the content sound is a voice, the masking sound generating unit 21 analyzes the formants of the voice serving as the content sound and generates a masking sound having a frequency characteristic that fills the valleys in the frequency spectrum of the content sound. That is, a masking sound is generated whose level is increased at the frequencies at which the level of the content sound is small.

By reproducing this masking sound together with the content sound, the features peculiar to the voice serving as the content sound that leaks from the divided region can be removed, and the masking effect can be improved. That is, it can be made difficult to perceive that the leakage sound is a human voice.
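The valley-filling idea can be sketched per frequency band: wherever the content sound's band level dips below some floor, the masking sound's band gain is raised to fill the gap. The band levels and the floor value below are illustrative assumptions; a real implementation would derive them from a formant analysis of the voice.

```python
def valley_filling_gains(content_band_levels, floor_level):
    """Per-band masking gains that fill the spectral valleys of the content
    sound: the masking sound is raised only where the content sound's band
    level falls below the floor, removing the voice's characteristic dips."""
    return [max(floor_level - level, 0.0) for level in content_band_levels]
```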

Further, the reproduction level adjustment unit 23 may perform frequency analysis on the supplied content sound and determine the reproduction level of the masking sound for each frequency based on the analysis result. In this case, the level by which the leaked content sound exceeds the background noise is evaluated for each frequency band, and the reproduction level of the masking sound is determined for each frequency band so that the leakage sound is masked in terms of auditory characteristics.
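That per-band decision can be sketched as follows: in each band, a masking level is emitted only where the leaked content sound exceeds the background noise, with a small margin so that the leak is masked in terms of auditory characteristics. The `margin_db` value and the function name are assumptions for illustration.

```python
def masking_levels_per_band(leak_levels_db, background_levels_db, margin_db=3.0):
    """Per-band reproduction level of the masking sound: a band needs masking
    only where the leaked content sound exceeds the background noise; there
    the masking sound is placed slightly above the leak."""
    return [
        leak + margin_db if leak > background else 0.0
        for leak, background in zip(leak_levels_db, background_levels_db)
    ]
```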

Instead of the parameter table, a predictor such as a neural network generated in advance by machine learning may be used.

In this case, for example, the masking sound generating unit 21 obtains a masking sound parameter as an output by inputting external information into a predictor held in advance and performing calculation, and generates a masking sound based on the obtained masking sound parameter.

Further, in this case, the reproduction level adjustment unit 23 inputs the external information and the content sound into a predictor held in advance and performs calculation to obtain a content sound gain coefficient and a masking sound gain coefficient as outputs.

< reproduction range with respect to masking sound >

Also, in the signal processing apparatus 11 having the configuration shown in fig. 3, for example, the entire reproduction space can be set as the reproduction range of the masking sound.

Specifically, for example, as shown in fig. 5, it is assumed that a region R21 in front of the speaker array 12 is a reproduction space.

In the present embodiment, two divided regions R22 and R23 in which the content sounds are reproduced are formed in the region R21 (the entire region of the reproduction space). That is, the content A is reproduced in the divided region R22 corresponding to the above-described region A, and the content B is reproduced in the divided region R23 corresponding to the region B.

It should be noted that although the case where there are two divided regions is described here for the sake of simplifying the description, of course, the number of divided regions may be three or more.

In the embodiment of fig. 5, the masking sound is caused to be reproduced in the entire region R21 including the divided region R22 and the divided region R23.

In this case, the sound pressure distribution on the x-axis indicated by the straight line L31 in fig. 5 is as shown in fig. 6. Note that, in fig. 6, the horizontal axis represents a position in the x-axis direction, and the vertical axis represents a sound pressure at each position.

In the embodiment shown in fig. 6, a curve L41 represents the sound pressure distribution of the sound of the content a, and a curve L42 represents the sound pressure distribution of the sound of the content B.

Further, a straight line L43 represents the sound pressure distribution of the masking sound, and a straight line L44 represents the sound pressure distribution of the background noise around the speaker array 12.

In the present embodiment, the masking sound has the same sound pressure (reproduction level) at each position in the reproduction space, and the reproduction level of the masking sound is set higher than the background noise level.

Therefore, it is possible to make it difficult to hear leakage sounds that are louder than the background noise at various positions in the reproduction space. Specifically, here, not only in the region between the divided region R22 and the divided region R23 but also in the boundary portions within the divided regions R22 and R23, the level of the masking sound is higher than that of the content sound, that is, the leakage sound, and it can be seen that the leakage sound is masked.

For example, when external information is used, the background noise level can be roughly estimated from the external information without actually measuring the background noise level. Therefore, in the method of determining the reproduction level of the masking sound based on the external information and the parameter table, the masking sound can be reproduced at a reproduction level higher than the background noise level by setting the reproduction level of the masking sound to the reproduction level set with respect to the external information. The background noise also masks a small leakage sound, but by reproducing the masking sound at a higher reproduction level than the background noise level, the masking sound also masks a larger leakage sound, and can make it difficult to hear the leakage sound.

< description of content reproduction processing >

Next, a content reproduction process performed by the content reproduction system will be described. That is, the content reproduction processing of the content reproduction system will be described below with reference to the flowchart of fig. 7. When a plurality of pieces of content are specified and reproduction of the pieces of content is instructed, the content reproduction processing is started.

In step S11, the masking sound generating unit 21 generates a masking sound based on the external information supplied from the outside and the parameter table saved in advance and supplies the masking sound to the amplifying unit 24.

For example, in step S11, the parameter table is referred to, and the masking sound is generated based on the masking sound parameter associated with the external information.

In step S12, the wave field synthesis filter unit 22 performs filter processing on the supplied content sound data using a wave field synthesis filter, and supplies synthesized output sounds of the respective channels to the amplification unit 25.

For example, as in the embodiment shown in fig. 5, when the content A and the content B are specified as the contents to be reproduced, the output sound is generated by wave field synthesis such that the content A is reproduced only in the divided region R22 and the content B is reproduced only in the divided region R23.

In step S13, the reproduction level adjustment unit 23 determines the reproduction levels of the masking sound and the content sound based on at least one of the supplied external information and content sound data and the saved parameter table.

For example, in step S13, the parameter table is referred to, and the reproduction levels of the content sound and the masking sound are determined by specifying the gain coefficient associated with the external information. The reproduction level adjustment unit 23 supplies the determined masking sound gain coefficient to the amplification unit 24 and supplies the content sound gain coefficient to the amplification unit 25.

In step S14, the amplification unit 24 and the amplification unit 25 perform level adjustment.

That is, the amplifying unit 24 performs level adjustment by multiplying the masking sound supplied from the masking sound generating unit 21 by the masking sound gain coefficient supplied from the reproduction level adjusting unit 23, and supplies the masking sound after the level adjustment to the adding unit 26.

Further, the amplification unit 25 performs level adjustment by multiplying the output sound of each channel supplied from the wave field synthesis filter unit 22 by the content sound gain coefficient supplied from the reproduction level adjustment unit 23, and supplies the output sound of each channel after the level adjustment to the addition unit 26.

In step S15, the addition unit 26 performs an addition process of adding the masking sound supplied from the amplification unit 24 and the output sound of each channel supplied from the amplification unit 25, and supplies the final synthesized output sound of each channel to the DA conversion unit 27.

In step S16, the DA conversion unit 27 performs DA conversion on the output sounds of the respective channels supplied from the addition unit 26, and supplies the synthesized output sounds of the respective channels to speakers corresponding to the respective channels of the speaker array 12, and causes the speakers to reproduce the content sound.

Each speaker of the speaker array 12 outputs the output sound supplied from the DA conversion unit 27, so that the content sound and the masking sound are reproduced simultaneously.

Thus, for example, multi-region reproduction in which the content A is reproduced in the divided region R22 of fig. 5 and the content B is reproduced in the divided region R23 is realized by wave field synthesis. Meanwhile, in the region R21, that is, the entire reproduction space, the masking sound is reproduced at a uniform sound pressure (reproduction level) at every position.

When the content sound is reproduced in this way, the content reproduction processing ends.

As described above, the content reproduction system generates a masking sound based on external information and causes the masking sound to be reproduced together with the content sound. As such, without increasing the number of speakers or using an enormous number of microphones, it is possible to easily make it difficult to hear the leakage sound.

< second embodiment >

< example of configuration of content reproduction System >

Note that, in the above, an embodiment has been described in which the masking sound is reproduced at a uniform sound pressure (level) in the entire reproduction space. However, the present technology is not limited thereto, and the masking sound can be reproduced only in a specific region by using wave field synthesis.

In this case, for example, the configuration of the content reproduction system is as shown in fig. 8. Note that portions in fig. 8 corresponding to those in fig. 3 are denoted by the same reference numerals, and the description is omitted as necessary.

The content reproduction system shown in fig. 8 includes a signal processing device 11 and a speaker array 12. Further, the configuration of the signal processing apparatus 11 shown in fig. 8 differs from the configuration of the signal processing apparatus 11 shown in fig. 3 in that a wave field synthesis filter unit 51 is newly provided and an amplification unit 52 is provided in place of the amplification unit 24 shown in fig. 3, and otherwise has the same configuration as the signal processing apparatus 11 of fig. 3.

For example, the wave field synthesis filter unit 51 has in advance a wave field synthesis filter that causes a masking sound to be reproduced only in a predetermined masking region by wave field synthesis. It should be noted that the wave field synthesis filter unit 51 may generate a wave field synthesis filter.

Here, the masking region is a region for masking a content sound (i.e., a leakage sound), and, for example, a region between a plurality of divided regions is defined as a masking region.

The wave field synthesis filter unit 51 performs a filtering process on the masking sounds (more specifically, the acoustic signals of the masking sounds) supplied from the masking sound generating unit 21 using a wave field synthesis filter, and supplies the synthesized masking sounds of the respective channels to the amplifying unit 52.

When the masking sounds of the respective channels obtained in this way are output from the respective speakers in the speaker array 12, the wave surfaces of the masking sounds are formed by wave field synthesis so that the masking sounds are reproduced only in the target masking region.

In other words, when the masking sounds of the respective channels are output from the respective speakers in the speaker array 12, a directional sound beam of the masking sound is generated by wave field synthesis, and therefore the masking sound is reproduced only in the masking region.

By generating the masking sounds of the respective channels for wave field synthesis in this way, the reproduction range of the masking sound can be confined to the masking region, and the leakage sound of the content can be masked only in the target masking region. In other words, by setting the region where the leakage sound occurs as the masking region, only the leakage sound can be masked.

For example, the amplification unit 52 includes amplifiers provided for the respective channels corresponding to the respective speakers constituting the speaker array 12.

The amplification unit 52 performs level adjustment of the masking sound by multiplying the masking sound of each channel supplied from the wave field synthesis filter unit 51 by the masking sound gain coefficient supplied from the reproduction level adjustment unit 23, and supplies the masking sound after the level adjustment to the addition unit 26.

The addition unit 26 adds the masking sounds of the respective channels supplied from the amplification unit 52 and the output sounds of the respective channels supplied from the amplification unit 25 to generate final output sounds of the respective channels, and supplies them to the DA conversion unit 27. In the addition unit 26, the masking sound of the same channel is added to the output sound.
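The channel-wise addition in this second embodiment, where the masking sound itself is multichannel, can be sketched as follows (the name `mix_multichannel` is hypothetical):

```python
def mix_multichannel(masking_chs, output_chs):
    """Addition unit 26 in the configuration of fig. 8: the masking sound is
    itself multichannel here, so the masking sound and the output sound of
    the SAME channel are added sample by sample."""
    return [
        [m + o for m, o in zip(masking_ch, output_ch)]
        for masking_ch, output_ch in zip(masking_chs, output_chs)
    ]
```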

< about masking region >

In the signal processing apparatus 11 having the configuration shown in fig. 8, a masking region, not the entire reproduction space, is defined as a reproduction range of a masking sound.

Specifically, for example, as shown in fig. 9, it is assumed that the area in front of the speaker array 12 is a reproduction space. Note that portions in fig. 9 corresponding to portions in fig. 5 are denoted by the same reference numerals, and the description is omitted as necessary.

In the embodiment of fig. 9, two divided regions R22 and R23 are formed in the reproduction space, and the region between the divided region R22 and the divided region R23 is designated as a masking region R31. Therefore, in the present embodiment, the masking sound is reproduced only in the masking region R31 to mask the leakage sound, and outside the masking region R31 the masking sound is not reproduced and the leakage sound is not masked.

In this case, the sound pressure distribution on the x-axis indicated by the straight line L31 in fig. 9 is as shown in fig. 10. Note that, in fig. 10, the horizontal axis represents a position in the x-axis direction, and the vertical axis represents a sound pressure at each position. Further, portions in fig. 10 corresponding to portions in fig. 6 are denoted by the same reference numerals, and the description is omitted as necessary.

In the embodiment shown in fig. 10, a curve L51 represents the sound pressure distribution of the masking sound.

As shown by the curve L51, the sound pressure of the masking sound is greater than the background noise level only in the region between the divided region R22 and the divided region R23 (i.e., the masking region R31), and it can be seen that the sound pressure is less than the background noise level outside the masking region R31. In other words, it can be seen that the leakage sound is masked using the wave field synthesis only by the masking sound in the masking region R31.

< description of content reproduction processing >

In the content reproduction system shown in fig. 8 described above, the content reproduction processing shown in fig. 11 is executed. That is, the content reproduction processing of the content reproduction system shown in fig. 8 will be described below with reference to the flowchart in fig. 11.

When the content reproduction process is started, the process of step S41 is performed, but since the process of step S41 is similar to the process of step S11 in fig. 7, the description thereof will be omitted.

In step S42, the wave field synthesis filter unit 51 performs filter processing on the masking sounds supplied from the masking sound generation unit 21 using a wave field synthesis filter, and supplies synthesized masking sounds of the respective channels to the amplification unit 52.

For example, as in the embodiment shown in fig. 9, when the content A and the content B are specified as the contents to be reproduced, the masking sounds of the respective channels are generated by wave field synthesis so that the masking sound is reproduced only in the masking region R31.

When the process of step S42 is executed, the processes of steps S43 to S47 are then executed and the content reproduction process ends. This processing is similar to that of steps S12 to S16 in fig. 7, and the description thereof will be omitted.

However, in step S45, the amplification unit 52 adjusts the level of the masking sound of each channel, and the amplification unit 25 adjusts the level of the output sound of each channel. In step S46, the masking sound and the output sound are added for each channel.

For example, as shown in fig. 9, when the output sounds of the respective channels are output from the speaker array 12, the content A is reproduced in the divided region R22, the content B is reproduced in the divided region R23, and the masking sound is reproduced in the masking region R31.

As described above, the content reproduction system generates a masking sound based on external information, and reproduces the masking sound together with the content sound by wave field synthesis. As such, it is possible to easily make it difficult to hear the leakage sound. Also, it is possible to cause the leakage sound to be masked only in a desired masking region.

< third embodiment >

< example of configuration of content reproduction System >

Also, although the embodiment in which the masking sound is generated using the external information has been described above, the masking sound may be generated using the output of an external sensor.

For example, in the case of using a video camera as an external sensor, the configuration of the content reproduction system is as shown in fig. 12. Note that portions in fig. 12 corresponding to those of fig. 8 are denoted by the same reference numerals, and the description is omitted as necessary.

The content reproduction system shown in fig. 12 includes a video camera 81, a signal processing device 11, and a speaker array 12.

Further, the configuration of the signal processing apparatus 11 shown in fig. 12 has a configuration in which the identification unit 91 is newly provided in addition to the configuration of the signal processing apparatus 11 shown in fig. 8, and otherwise has the same configuration as the signal processing apparatus 11 of fig. 8.

For example, the camera 81 provided as the external sensor is arranged in the reproduction space, captures, as a subject, the entire reproduction space or the area around a divided region, and supplies the resultant captured image to the recognition unit 91. For example, the captured image includes at least the area around a divided region as a subject.

The recognition unit 91 performs face recognition and person recognition on the captured image supplied from the camera 81 to detect a person from the captured image, thereby estimating the number of persons existing around the content reproduction system (i.e., around the entire reproduction space or divided area). In other words, the recognition unit 91 functions as a detection unit that detects a person from a captured image. The recognition unit 91 supplies person number information indicating the number of persons obtained as a result of estimation of the number of persons existing around the content reproduction system to the masking sound generation unit 21 and the reproduction level adjustment unit 23.

The masking sound generating unit 21 generates a masking sound based on the person number information supplied from the recognition unit 91 and supplies it to the wave field synthesis filter unit 51.

Specifically, for example, in a case where the number of persons indicated by the person number information is equal to or larger than a predetermined threshold value, that is, in a case where many persons exist in the surrounding environment, many noise sources also exist in the surrounding environment, and so the masking sound generating unit 21 generates Gaussian noise as the masking sound. This is because the larger the number of noise sources, the closer the background noise, which includes the sounds of these noise sources, is to Gaussian noise.

On the other hand, for example, in a case where the number of persons indicated by the person number information is smaller than the predetermined threshold value, that is, in a case where only a small number of persons are present in the surrounding environment, the masking sound generating unit 21 generates super-Gaussian noise as the masking sound. This is because, when the number of noise sources is small, the kurtosis of the frequency characteristics of the background noise including the sounds of these noise sources becomes large.

Note that noise having a frequency characteristic whose kurtosis corresponds to the number of persons indicated by the person number information may be generated as the masking sound. Further, the person number information may be input to a predictor such as a neural network, and a masking sound having the frequency characteristics obtained as its output may be generated; alternatively, external information and the person number information may be combined to generate the masking sound.

By generating the masking sound according to the number of surrounding persons in this way, it is possible to generate a masking sound whose characteristic is close to that of the background noise.
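The selection of the noise type from the person number information described above can be sketched as follows; the threshold value of five persons, the unit scale, and the use of a Laplacian source as the super-Gaussian noise are illustrative assumptions, not values specified in this description.

```python
import math
import random

def generate_masking_noise(num_persons, length=1000, threshold=5, seed=0):
    """Sketch: choose the masking-noise distribution from the person count.

    Many people around -> background noise is close to Gaussian;
    few people -> higher kurtosis, so a super-Gaussian source is used.
    """
    rng = random.Random(seed)
    if num_persons >= threshold:
        # Many surrounding noise sources: their sum approaches Gaussian
        # noise (central limit theorem), so emit Gaussian samples.
        return [rng.gauss(0.0, 1.0) for _ in range(length)]
    # Few noise sources: background noise is more peaked (higher kurtosis).
    # Sample a Laplace distribution by the inverse-CDF transform.
    samples = []
    for _ in range(length):
        u = rng.random() - 0.5
        samples.append(-math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u)))
    return samples
```

A Laplacian distribution is used here only because its kurtosis exceeds that of a Gaussian distribution; any heavier-tailed source would serve as the super-Gaussian noise.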

The reproduction level adjustment unit 23 determines a content sound gain coefficient and a masking sound gain coefficient based on the information on the number of persons supplied from the recognition unit 91, and supplies the gain coefficients to the amplification unit 25 and the amplification unit 52.

Specifically, for example, the content sound gain coefficient is determined such that the reproduction level of the content sound increases as the number of persons indicated by the person number information increases. This is because the more people around, the more difficult it becomes to hear the content sound.

On the other hand, for example, the masking sound gain coefficient is determined such that the reproduction level of the masking sound decreases as the number of persons indicated by the person number information increases. This is because the more people around, the higher the background noise level and the more difficult it becomes to hear the leakage sound. It should be noted that, in a case where the number of persons indicated by the person number information is equal to or greater than a predetermined number and the estimated background noise level is accordingly high, the masking sound may not be reproduced (generated).

Further, in the case where only one person exists in the masking region, the reproduction level of the content sound may be set to a normal level, and the reproduction level of the masking sound may be increased.

It can be said that the adjustment of the masking sound reproduction level using the information of the number of people is a process of determining an appropriate masking sound reproduction level with respect to the background noise level estimated from the information of the number of people.

It should be noted that, similarly to the case of the first embodiment, in the reproduction level adjustment unit 23, a content sound gain coefficient or a masking sound gain coefficient can be determined by using not only the information of the number of persons but also the content sound data.

Further, the person number information or the content sound data may be input into a predictor such as a neural network, and calculation may be performed to obtain the determination results of the reproduction levels of the content sound and the masking sound as output; alternatively, the external information and the person number information may be combined to determine the reproduction levels of the content sound and the masking sound.
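As a minimal sketch of the person-count-based level adjustment described above; the gain range of 0 to 1, the 20-person ceiling, and the 15-person cutoff for suppressing the masking sound are hypothetical values chosen only for illustration.

```python
def determine_gains(num_persons, max_persons=20, no_masking_threshold=15):
    """Sketch: more people -> louder content sound, quieter masking sound."""
    ratio = min(num_persons, max_persons) / max_persons
    content_gain = 0.5 + 0.5 * ratio   # content level rises with the crowd
    masking_gain = 1.0 - ratio         # masking level falls with the crowd
    if num_persons >= no_masking_threshold:
        # Many people: background noise already masks the leakage sound.
        masking_gain = 0.0
    return content_gain, masking_gain
```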

< description of content reproduction processing >

In the content reproduction system shown in fig. 12 described above, the content reproduction processing shown in fig. 13 is executed. That is, the content reproduction processing of the content reproduction system shown in fig. 12 will be described below with reference to the flowchart in fig. 13.

In step S71, the recognition unit 91 performs recognition processing based on the captured image supplied from the camera 81, and supplies the resultant information of the number of persons to the masking sound generating unit 21 and the reproduction level adjusting unit 23.

In step S72, the masking sound generating unit 21 generates a masking sound based on the person number information supplied from the recognition unit 91 and supplies it to the wave field synthesis filter unit 51.

For example, in step S72, Gaussian noise or super-Gaussian noise is generated as the masking sound according to the person number information.

When the masking sound is generated, the masking sound is subsequently subjected to the filtering process in step S73, and the content sound is subjected to the filtering process in step S74. Note that since the processing is similar to that of steps S42 and S43 in fig. 11, the description thereof will be omitted.

In step S75, the reproduction level adjustment unit 23 determines the reproduction level of the content sound and the reproduction level of the masking sound based on the information on the number of persons supplied from the recognition unit 91.

For example, in step S75, the content sound gain coefficient and the masking sound gain coefficient are determined such that the larger the number of persons indicated by the person count information, the higher the reproduction level of the content sound and the smaller the reproduction level of the masking sound.

When the reproduction levels (i.e., gain coefficients) of the content sound and the masking sound are determined, the processes of steps S76 to S78 are then performed and the content reproduction process ends. This process is similar to that of steps S45 to S47 in fig. 11, and a description thereof will be omitted.

As described above, the content reproduction system generates a masking sound based on the information of the number of persons, adjusts the reproduction levels of the content sound and the masking sound, and reproduces the content sound and the masking sound by wave field synthesis. As such, it is possible to easily make it difficult to hear the leakage sound.

< fourth embodiment >

< example of configuration of content reproduction System >

Further, a microphone may be used as the external sensor. In this case, for example, the configuration of the content reproduction system is as shown in fig. 14. Note that portions in fig. 14 corresponding to those of fig. 8 are denoted by the same reference numerals, and the description is omitted as necessary.

The content reproduction system shown in fig. 14 includes a microphone 121, a signal processing device 11, and a speaker array 12.

Further, the configuration of the signal processing apparatus 11 shown in fig. 14 has a configuration in which an analog-to-digital (AD) conversion unit 131 and a background noise analysis unit 132 are newly provided in addition to the configuration of the signal processing apparatus 11 shown in fig. 8, and otherwise has the same configuration as the signal processing apparatus 11 in fig. 8.

For example, a microphone 121 as an external sensor is arranged at an arbitrary position in the reproduction space and acquires background noise in the reproduction space (for example, an area around the divided area). That is, the microphone 121 collects ambient sound (hereinafter, referred to as recording sound) and supplies it to the AD conversion unit 131. It should be noted that the number of the microphones 121 may be one, but of course, a plurality of microphones 121 may be arranged.

The AD conversion unit 131 performs AD conversion on the recording sound supplied from the microphone 121, and supplies the resulting digital recording sound to the background noise analysis unit 132.

The background noise analyzing unit 132 analyzes the level of the recording sound supplied from the AD converting unit 131, that is, the characteristics of the background noise of the surrounding environment, based on the content sound data supplied from the outside, and supplies the analysis result to the masking sound generating unit 21 and the reproduction level adjusting unit 23.

For example, in a state where the output sound is output through the speaker array 12, the recorded sound obtained through the microphone 121 includes not only the background noise of the surrounding environment but also the content sound and the masking sound.

Therefore, based on the supplied content sound data, the background noise analysis unit 132 treats the recording sound collected in a mute section (i.e., a section in which the content sound is not reproduced) as background noise. Then, the background noise analysis unit 132 performs analysis processing on the recording sound in the section regarded as background noise. Note that it is assumed that the masking sound is not reproduced in a section where the content sound is muted.

Specifically, for example, the background noise level (that is, the level of the background noise) is calculated by the analysis processing, and the frequency characteristic or the amplitude characteristic of the background noise is obtained by frequency analysis. The background noise level and the frequency characteristics obtained in this way are output from the background noise analysis unit 132 as the background noise analysis result.
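The analysis processing described above can be sketched as follows; the full-scale reference level of 1.0 and the grouping of DFT bins into four bands are illustrative assumptions, not values from this description.

```python
import cmath
import math

def analyze_background_noise(samples, num_bands=4):
    """Sketch: return the RMS level in dB (re full scale 1.0) and a coarse
    frequency characteristic obtained by grouping DFT bin energies."""
    n = len(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    level_db = 20.0 * math.log10(max(rms, 1e-12))
    # Magnitudes of the DFT bins up to Nyquist (naive O(n^2) DFT,
    # sufficient for a short analysis frame in this sketch).
    half = n // 2
    spectrum = [abs(sum(samples[k] * cmath.exp(-2j * math.pi * b * k / n)
                        for k in range(n))) for b in range(half)]
    band = half // num_bands
    bands = [sum(m * m for m in spectrum[i * band:(i + 1) * band])
             for i in range(num_bands)]
    return level_db, bands
```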

The masking sound generating unit 21 generates a masking sound based on the analysis result supplied from the background noise analyzing unit 132 and supplies it to the wave field synthesis filter unit 51.

For example, the masking sound generating unit 21 may generate a masking sound based on a parameter table similar to that of the first embodiment, or may generate a masking sound using a predictor such as a neural network.

The reproduction level adjustment unit 23 controls adjustment of the reproduction levels of the masking sound and the content sound based on at least one of the analysis result supplied from the background noise analysis unit 132 and the supplied content sound data.

That is, the reproduction level adjustment unit 23 determines the reproduction level of the masking sound (i.e., the masking sound gain coefficient) based on at least one of the analysis result and the content sound data and supplies the determined gain coefficient to the amplification unit 52.

Similarly, the reproduction level adjustment unit 23 determines the reproduction level of the content sound (i.e., content sound gain coefficient) based on at least one of the analysis result and the content sound data and supplies the determined gain coefficient to the amplification unit 25.

For example, the reproduction level adjustment unit 23 may determine the gain coefficient based on a parameter table similar to that of the first embodiment, or may determine the gain coefficient using a predictor such as a neural network.

Here, as a specific embodiment, a case where a masking sound is generated and a gain coefficient is determined based on a parameter table will be described. In this case, for example, the masking sound generating unit 21 and the reproduction level adjusting unit 23 hold the parameter table shown in fig. 15 in advance.

In fig. 15, the characteristic "background noise sound pressure" represents a background noise level obtained as a result of analysis by the background noise analyzing unit 132, that is, a measured background noise sound pressure.

Further, the features "content sound reproduction level" and "masking sound reproduction level" respectively represent the reproduction level of a content sound and the reproduction level of a masking sound, that is, a content sound gain coefficient and a masking sound gain coefficient. Further, the feature "masking sound parameter" represents a masking sound parameter.

For example, the masking sound parameter "air conditioner" represents the frequency characteristic of an air conditioning sound similar to the case in fig. 4, and the masking sound parameter "frequency characteristic of a microphone acquisition sound" represents the frequency characteristic of a recording sound as background noise.

It should be noted that the masking sound generating unit 21 does not hold the masking sound parameter "frequency characteristic of the microphone acquisition sound" in advance; instead, it uses the frequency characteristic of the background noise supplied as the analysis result of the background noise analysis unit 132 as this masking sound parameter.

In this case, gaussian noise corresponding to the frequency characteristics of background noise as a masking sound parameter may be generated as a masking sound.

When a masking sound is generated based on the masking sound parameter "frequency characteristic of a microphone acquisition sound", a masking sound having the same frequency characteristic as an actual background noise can be obtained, and a leakage sound can be naturally masked without causing discomfort.

Further, as for the reproduction levels of the content sound and the masking sound, the reproduction level of the content sound and the reproduction level of the masking sound increase as the background noise level increases.

In the embodiment shown in fig. 15, for example, in the case where a background noise sound pressure of "60 dBA" is obtained as the analysis result of the background noise, the content sound is reproduced at 10 dB, and a masking sound similar to an air conditioning sound is reproduced at 3 dB.

In the case of using this parameter table shown in fig. 15, control is performed such that the reproduction level of the content sound is changed in accordance with the background noise level, and further, the reproduction level of the masking sound is determined in accordance with the change in the reproduction level of the content sound.

It should be noted that in the case where the masking sound parameter and the gain coefficient (reproduction level) are determined using the parameter table, not only information such as the analysis result of the background noise obtained from the output of the external sensor but also external information may be used in combination.

In this case, for example, for a combination of the analysis result of the background noise and the external information, a parameter table that associates the reproduction level (gain coefficient) of the content sound or the masking sound with the masking sound parameter can be used. In other words, the gain coefficients and the masking sound parameters of the content sound and the masking sound can be determined based on the analysis result of the background noise and the external information.
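Assuming a table-driven implementation as described above, the lookup can be sketched as follows. Only the 60 dBA row (content sound 10 dB, masking sound 3 dB, air conditioning sound) comes from the description of fig. 15; the other rows are hypothetical placeholders added to make the sketch runnable.

```python
PARAM_TABLE = [
    # (background noise sound pressure [dBA], content sound level [dB],
    #  masking sound level [dB], masking sound parameter)
    (40, 6, 1, "air conditioner"),    # hypothetical row
    (60, 10, 3, "air conditioner"),   # row described for fig. 15
    (80, 14, 5, "frequency characteristic of microphone acquisition sound"),  # hypothetical row
]

def lookup_parameters(noise_dba):
    """Pick the row whose background noise sound pressure is closest to
    the measured background noise level."""
    row = min(PARAM_TABLE, key=lambda r: abs(r[0] - noise_dba))
    return row[1], row[2], row[3]
```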

< description of content reproduction processing >

In the content reproduction system shown in fig. 14 described above, the content reproduction processing shown in fig. 16 is executed. That is, the content reproduction processing of the content reproduction system shown in fig. 14 will be described below with reference to the flowchart of fig. 16.

In step S101, the background noise analysis unit 132 performs analysis processing on the recording sound supplied from the AD conversion unit 131 (i.e., the background noise in the mute section where the content sound is muted) based on the supplied content sound data, and supplies the analysis result to the masking sound generation unit 21 and the reproduction level adjustment unit 23. Here, as the background noise analysis result, for example, a background noise level, a frequency characteristic, an amplitude characteristic, and the like can be obtained.

In step S102, the masking sound generating unit 21 generates a masking sound based on the analysis result supplied from the background noise analyzing unit 132 and the parameter table held in advance, and supplies the masking sound to the wave field synthesis filter unit 51.

For example, in step S102, the parameter table is referred to, and a masking sound is generated based on a masking sound parameter associated with the background noise analysis result.

When the masking sound is generated, the masking sound is subsequently subjected to filter processing in step S103, and the content sound is subjected to filter processing in step S104. Note that since the processing is similar to that of steps S42 and S43 in fig. 11, the description thereof will be omitted.

In step S105, the reproduction level adjustment unit 23 determines the reproduction levels of the masking sound and the content sound based on at least one of the supplied content sound data and the analysis result supplied from the background noise analysis unit 132, as well as the held parameter table.

For example, in step S105, the parameter table is referred to, and the reproduction levels of the content sound and the masking sound, that is, the gain coefficients are determined by specifying the gain coefficients associated with the background noise analysis results.

The reproduction level adjustment unit 23 supplies the determined masking sound gain coefficient to the amplification unit 52 and supplies the content sound gain coefficient to the amplification unit 25.

When the reproduction level is determined, the processing of steps S106 to S108 is performed thereafter and the content reproduction processing ends. This processing is similar to that of steps S45 to S47 in fig. 11, and the description thereof will be omitted.

As described above, the content reproduction system generates a masking sound based on the background noise analysis result, adjusts the reproduction levels of the content sound and the masking sound, and reproduces the content sound and the masking sound by wave field synthesis. As such, it is possible to easily make it difficult to hear the leakage sound.

< modification 1 of the fourth embodiment >

< other embodiments of the parameter Table >

Further, in the case of using the microphone 121 as an external sensor, the microphone 121 can be arranged in a region between the plurality of divided regions. Then, a mixed sound including the content sound, the background noise, and the masking sound reproduced in the respective divided areas can be obtained as a recording sound by the microphone 121.

In this case, by analyzing the recorded sound, it is possible to determine how much the masking sound should be increased, that is, to calculate how much the reproduction level of the masking sound should be increased in order to reliably mask the leakage sound.

Specifically, for example, the background noise analysis unit 132 sets the content sound to S (signal) and sets the mixed sound of the background noise and the masking sound to N (noise). That is, the background noise analysis unit 132 obtains, as the SN ratio, the difference between the sound pressure of the recording sound when the content sound is reproduced and the sound pressure of the recording sound when the content sound is not reproduced.

Then, in a case where the obtained SN ratio is larger than 0 dB, the background noise analysis unit 132 determines that the level of the content sound is dominant, that is, that a leakage sound is being produced, and the reproduction level of the masking sound is therefore increased.

On the other hand, in a case where the obtained SN ratio is smaller than 0 dB, the background noise analysis unit 132 determines that the level of the mixed sound of the masking sound and the background noise is dominant, that is, that the leakage sound is already inaudible, and the reproduction level of the masking sound is lowered.

By dynamically changing the reproduction level of the masking sound in this way, the masking sound can be reproduced at an appropriate reproduction level according to the surrounding environment or the like.

The adjustment control of the reproduction level of the masking sound described above can be realized by using the parameter table shown in fig. 17, for example.

In fig. 17, the characteristic "SN ratio" represents the SN ratio described above calculated based on the sound pressure of the recorded sound obtained as a result of the analysis by the background noise analysis unit 132.

Further, the feature "content sound reproduction level" represents a reproduction level of the content sound, that is, a content sound gain coefficient.

Further, the feature "masking sound reproduction level variation" represents an increase/decrease value of the reproduction level of the masking sound, and the feature "masking sound parameter" represents a masking sound parameter.

For example, a masking sound reproduction level variation of "-6 dB" indicates that the reproduction level of the masking sound is lowered by 6 dB from the current level. In the embodiment shown in fig. 17, the reproduction level of the masking sound is increased or decreased according to the SN ratio, and in the case where the SN ratio is 0 dB, the reproduction level of the masking sound at the current point in time is assumed to be appropriate and is maintained. That is, the increase/decrease value is 0 dB.

Accordingly, the reproduction level adjustment unit 23 increases or decreases the reproduction level of the masking sound by an increase/decrease value corresponding to the SN ratio supplied from the background noise analysis unit 132 with reference to the parameter table. That is, the reproduction level adjustment unit 23 determines a new masking sound gain coefficient according to the increase/decrease value of the reproduction level of the masking sound, and supplies the new gain coefficient to the amplification unit 52.
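The SN-ratio computation and the resulting increase/decrease of the masking sound reproduction level can be sketched as follows; a symmetric 6 dB step is an assumption based on the single "-6 dB" value mentioned for fig. 17.

```python
def sn_ratio_db(level_with_content_db, level_without_content_db):
    """SN ratio: recorded sound pressure while the content is reproduced
    (content sound + noise) minus that while it is not (noise only)."""
    return level_with_content_db - level_without_content_db

def masking_level_step(sn_db, step_db=6.0):
    """Increase/decrease value for the masking sound reproduction level."""
    if sn_db > 0.0:
        return +step_db   # content dominant: leakage audible, add masking
    if sn_db < 0.0:
        return -step_db   # masking/noise mix dominant: relax masking
    return 0.0            # current masking level is appropriate
```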

In the case where the parameter table shown in fig. 17 is held in the masking sound generating unit 21 and the reproduction level adjusting unit 23, the SN ratio is calculated in step S101 of the content reproduction processing described with reference to fig. 16.

That is, the background noise analyzing unit 132 calculates an SN ratio as a background noise analysis based on the recorded sound obtained at the timing of not reproducing the content sound and the recorded sound obtained at the timing of reproducing the content sound and supplies the obtained SN ratio to the masking sound generating unit 21 and the reproduction level adjusting unit 23.

Then, in step S102, the masking sound generating unit 21 determines a masking sound parameter based on the SN ratio supplied from the background noise analyzing unit 132 and the saved parameter table, and generates a masking sound according to the determination result.

Further, in step S105, the reproduction level adjustment unit 23 determines the reproduction levels of the content sound and the masking sound, that is, the gain coefficients, based on the SN ratio supplied from the background noise analysis unit 132 and the saved parameter table.

For example, in the embodiment shown in fig. 17, the content sound gain coefficient is determined so that the reproduction level of the content sound is always 20 dB. Further, for the masking sound, a gain coefficient is determined that corresponds to the reproduction level obtained by applying, to the reproduction level of the masking sound at the current point in time, the increase/decrease value corresponding to the SN ratio.

By changing the reproduction level of the masking sound according to the SN ratio (i.e., the relationship between the sound pressure of the content sound and that of the mixed sound of the background noise and the masking sound), the masking sound can be reproduced at a more appropriate reproduction level, and the leakage sound can be reliably masked.

The control of the reproduction level of the masking sound based on the above-described SN ratio can be regarded as control of increasing and decreasing the reproduction level of the masking sound according to the difference between the background noise level (more specifically, the level of the background noise and the masking sound) and the reproduction level of the content sound.

It should be noted that, here, an embodiment has been described in which the SN ratio and the parameter table are used to determine the masking sound parameter and the reproduction level of the masking sound, but a predictor such as a neural network generated in advance by machine learning may be used.

Further, in the case where the background noise level can be obtained by analyzing the recording sound, the background noise analyzing unit 132 may compare the content sound with the background noise level for each frequency band, and the reproduction level adjusting unit 23 may determine the reproduction level of the masking sound for each frequency band according to the comparison result. In this case, since it is possible to estimate at which level the content sound leaks out of the background noise for each frequency band, it is possible to more reliably mask the leaking sound in terms of auditory characteristics.
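The band-wise comparison described above can be sketched as follows; the 3 dB margin added where the content sound exceeds the background noise is an illustrative assumption.

```python
def per_band_masking_level(content_db, noise_db, margin_db=3.0):
    """Sketch: for each frequency band, where the content sound level
    exceeds the background noise level, the leakage in that band is
    audible, so the masking level covers the excess plus a margin."""
    levels = []
    for c, n in zip(content_db, noise_db):
        levels.append(max(0.0, c - n + margin_db))
    return levels
```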

< fifth embodiment >

< example of configuration of content reproduction System >

Incidentally, in the fourth embodiment and modification 1 of the fourth embodiment described above, the background noise level is calculated from the recording sound obtained in a mute section, or the analysis is performed using a recording sound in which the content sound is mixed.

However, in a case where the content sound continues and there is no mute section or there is a small number of mute sections, for example, in a case where the content is music, it is difficult to acquire only background noise as the recording sound. Further, it is also assumed that the microphone 121 cannot be installed at a position between the divided regions.

Therefore, by performing echo cancellation on a recording sound that includes the content sound, it is possible to acquire only the recording sound that does not include the content sound, that is, the background noise.

In this case, for example, the configuration of the content reproduction system is as shown in fig. 18. Note that portions in fig. 18 corresponding to those of fig. 14 are denoted by the same reference numerals, and the description is omitted as necessary.

The content reproduction system shown in fig. 18 includes a microphone 121, a signal processing device 11, and a speaker array 12.

Further, the configuration of the signal processing device 11 shown in fig. 18 has a configuration in which an echo cancellation unit 161 is newly provided in addition to the configuration of the signal processing device 11 shown in fig. 14, and otherwise has the same configuration as the signal processing device 11 of fig. 14.

In the signal processing apparatus 11 shown in fig. 18, the echo cancellation unit 161 is provided between the AD conversion unit 131 and the background noise analysis unit 132.

The echo cancellation unit 161 performs echo cancellation on the recording sound supplied from the AD conversion unit 131 based on the supplied content sound data, and supplies the recording sound after the echo cancellation to the background noise analysis unit 132.

In the echo canceling unit 161, echo canceling for canceling the content sound from the recording sound is realized by performing filter processing on the recording sound using an echo canceling filter.

At this time, the echo cancellation unit 161 updates the echo cancellation filter therein so as to receive the recording sound picked up by the microphone 121 and the content sound as inputs, cancel (remove) the content sound from the recording sound, and output only the background noise.

For example, the update algorithm of the echo cancellation filter may be the general least mean squares (LMS) algorithm or normalized LMS (NLMS).
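A minimal NLMS update of the kind referred to above can be sketched as follows; the tap count, the step size, and the zero-delay scaled echo path in the usage example are illustrative assumptions.

```python
import random

def nlms_echo_cancel(reference, mic, taps=4, mu=0.5, eps=1e-8):
    """Sketch: filter the content sound (reference) adaptively and subtract
    it from the microphone signal; the residual approximates the background
    noise."""
    w = [0.0] * taps        # adaptive filter coefficients
    buf = [0.0] * taps      # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        buf = [x] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))   # echo estimate
        e = d - y                                    # residual = mic - echo
        norm = sum(xi * xi for xi in buf) + eps      # input power + guard
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
        residual.append(e)
    return residual

# Usage: the microphone picks up a scaled copy of the content sound; after
# adaptation, the residual shrinks toward the (here absent) background noise.
rng = random.Random(1)
ref = [rng.gauss(0.0, 1.0) for _ in range(500)]
mic = [0.5 * x for x in ref]
res = nlms_echo_cancel(ref, mic)
```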

The background noise analyzing unit 132 analyzes the level and the like of the recording sound supplied from the echo canceling unit 161 and supplies the analysis result to the masking sound generating unit 21 and the reproduction level adjusting unit 23.

The masking sound generating unit 21 generates a masking sound based on the analysis result supplied from the background noise analyzing unit 132 and supplies it to the wave field synthesis filter unit 51. For example, the masking sound generating unit 21 generates a masking sound by using the parameter table shown in fig. 15 or by using a predictor obtained by learning in advance.

The reproduction level adjustment unit 23 controls adjustment of the reproduction levels of the masking sound and the content sound based on at least one of the analysis result supplied from the background noise analysis unit 132 and the supplied content sound data.

For example, the reproduction level adjustment unit 23 determines the reproduction level of the content sound and the reproduction level of the masking sound based on the background noise level as the analysis result supplied from the background noise analysis unit 132 and the parameter table shown in fig. 15 held in advance.

< description of content reproduction processing >

In the content reproduction system shown in fig. 18 described above, the content reproduction processing shown in fig. 19 is executed. That is, the content reproduction processing of the content reproduction system shown in fig. 18 will be described below with reference to the flowchart of fig. 19.

In step S131, the echo cancellation unit 161 performs echo cancellation on the recording sound supplied from the AD conversion unit 131 based on the supplied content sound data and supplies the resultant recording sound after the echo cancellation to the background noise analysis unit 132.

In step S131, echo cancellation is performed on the recording sound picked up by the microphone 121 at an arbitrary timing. Thus, the content sound is eliminated from the recording sound, and the background noise is acquired (extracted).

When the background noise is obtained in this way, the processing of steps S132 to S139 is performed thereafter and the content reproduction processing ends. This processing is similar to that of steps S101 to S108 in fig. 16, and the description thereof will be omitted.

As described above, the content reproduction system acquires background noise by performing echo cancellation, generates a masking sound based on the background noise analysis result, and adjusts the reproduction levels of the content sound and the masking sound. Further, the content reproduction system causes reproduction of a content sound and a masking sound whose levels have been appropriately adjusted by wave field synthesis. As such, it is possible to easily make it difficult to hear the leakage sound.

< other modification >

< example of configuration of content reproduction System >

Also, in the above-described first to fifth embodiments, the example in which the content sound and the masking sound are reproduced by one speaker array 12 has been described. However, the masking sound and the content sound may be reproduced by different speakers or speaker arrays, and a speaker or speaker array that reproduces only the masking sound may be provided.

For example, in the embodiment shown in fig. 3, in a case where a speaker for reproducing only a masking sound is newly provided in addition to the speaker array 12, the configuration of the content reproduction system is as shown in fig. 20. Note that portions in fig. 20 corresponding to portions in fig. 3 are denoted by the same reference numerals, and the description is omitted as necessary.

The content reproduction system shown in fig. 20 includes the signal processing apparatus 11, the speaker array 12, and the speaker 191.

The content reproduction system has a configuration in which the speaker 191 is newly provided in addition to the configuration of the content reproduction system shown in fig. 3.

Further, the signal processing apparatus 11 shown in fig. 20 has a configuration in which a low pass filter (LPF) 201 and a DA conversion unit 202 are newly provided in addition to the configuration of the signal processing apparatus 11 shown in fig. 3, and is otherwise the same as the signal processing apparatus 11 of fig. 3.

In the signal processing apparatus 11 shown in fig. 20, the masking sound output from the amplifying unit 24 is supplied not only to the adding unit 26 but also to the LPF 201.

The LPF 201 extracts only the low-frequency component of the masking sound by performing a low-pass filtering process on the masking sound supplied from the amplification unit 24, and supplies the extracted component to the DA conversion unit 202.

The DA conversion unit 202 performs DA conversion on the masking sound (more specifically, the low-frequency component of the masking sound) supplied from the LPF 201, supplies the resulting analog signal to the speaker 191, and causes the speaker 191 to reproduce the masking sound. In this case, the DA conversion unit 202 functions as an output unit that causes the masking sound to be output from the speaker 191.

For example, the speaker 191 is a speaker for low-frequency reproduction having a larger diameter than the speakers constituting the speaker array 12, and outputs (reproduces) the masking sound supplied from the DA conversion unit 202.

Specifically, in the present embodiment, the diameter of each speaker constituting the speaker array 12 is smaller than the diameter of the speaker 191, so it is difficult for the speaker array 12 to reproduce the low-frequency components of the masking sound at a sufficient sound pressure. For this reason, in the content reproduction system, the mid-to-high-frequency components of the masking sound are reproduced by the speaker array 12, and the low-frequency components of the masking sound are reproduced by the speaker 191.

Note that, of course, the speaker array 12 need not reproduce the masking sound at all, and the speaker 191 alone may reproduce it. By reproducing at least the low-frequency component of the masking sound from the separate speaker 191, distinct from the speaker array 12 that reproduces the content sound, the masking sound can be reproduced with a desired frequency characteristic.
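The band split between the speaker 191 and the speaker array 12 can be sketched as a simple complementary crossover: a low-pass filter produces the low band for the speaker 191, and the remainder (input minus low band) becomes the mid-to-high band for the speaker array 12. The one-pole filter and the 200 Hz crossover frequency below are illustrative assumptions; the actual LPF 201 is not specified to this level of detail.

```python
import numpy as np

def crossover(masker, fs, fc=200.0):
    """Split the masking sound into a low band (for the large-diameter
    speaker 191) and a complementary mid/high band (for the speaker
    array 12), using a one-pole low-pass filter at crossover frequency fc."""
    a = np.exp(-2.0 * np.pi * fc / fs)   # one-pole filter coefficient
    low = np.empty_like(masker, dtype=float)
    y = 0.0
    for n, x in enumerate(masker):
        y = (1.0 - a) * x + a * y        # recursive low-pass
        low[n] = y
    high = masker - low                   # complementary mid/high band
    return low, high
```

Because the high band is formed by subtraction, the two bands always sum back to the original masking sound, so no signal energy is lost at the crossover point.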

As described above, according to the present technology described in the first embodiment through the other modifications, it is possible to easily make the leakage sound difficult to hear by reproducing the masking sound at an appropriate reproduction level.

Also, with the present technology, since the leakage sound from other divided regions becomes difficult to hear in each divided region, the degree of auditory separation of the content sounds can be improved. Therefore, the substance of the content, i.e., the information it provides, can be grasped more easily.

In general, human hearing reacts sensitively even to quiet sounds such as speech or music. Therefore, in a case where the leakage sound is speech or music, a listener in a divided region, or a person near one, hears the leakage sound without intending to, which causes discomfort. According to the present technology, by masking the leakage sound, hearing no longer reacts to the content sound that leaks, so such discomfort does not arise.

Further, with the prior art it is necessary to increase the number of speakers to reduce the leakage sound, whereas with the present technology the leakage sound can be made difficult to hear even with a small number of speakers; therefore, the number of speakers, and hence the cost, can be reduced.

Also, according to the present technology, it is not necessary to install microphones at control points for canceling sound as in the prior art, and even in a case where microphones are used as external sensors, the number of microphones can be reduced. Therefore, not only does the layout of the location where the content reproduction system operates gain a degree of freedom, but the cost of equipment such as microphones can also be reduced.

Further, according to the present technology, even in a case where a dip occurs in the radiation characteristic of the sound beam of the speakers due to manufacturing variations of the speakers that reproduce the content sound, deterioration over time, or sound reflection and absorption in the reproduction environment, the effect of such a dip can be suppressed (masked) by reproducing the masking sound. Therefore, the maintenance time and cost of the content reproduction system can be reduced.

< example of computer configuration >

Incidentally, the series of processes described above can be executed by hardware and also by software. In a case where the series of processes is executed by software, a program constituting the software is installed in a computer. Here, the computer includes a computer incorporated in dedicated hardware and, for example, a general-purpose personal computer capable of executing various functions when various programs are installed in it.

Fig. 21 is a block diagram showing an example of the configuration of hardware of a computer that executes the series of processes described above by a program.

In the computer, a Central Processing Unit (CPU)501, a Read Only Memory (ROM)502, and a Random Access Memory (RAM)503 are interconnected by a bus 504.

An input/output interface 505 is further connected to the bus 504. An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a driver 510 are connected to the input/output interface 505.

The input unit 506 includes a keyboard, a mouse, a microphone, an image sensor, and the like. The output unit 507 includes a display, a speaker array, and the like. The recording unit 508 includes a hard disk, a nonvolatile memory, and the like. The communication unit 509 includes a network interface and the like. The drive 510 drives a removable recording medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.

In the computer configured as described above, the CPU 501 loads a program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executes the program, whereby the above-described series of processes is performed.

For example, a program executed by a computer (CPU 501) can be provided by being recorded on a removable recording medium 511 which is a package medium or the like. Further, the program can be provided via a wired or wireless transmission medium such as a local area network, the internet, or digital satellite broadcasting.

In the computer, the program can be installed in the recording unit 508 via the input/output interface 505 by mounting the removable recording medium 511 on the drive 510. Further, the communication unit 509 can receive the program via a wired or wireless transmission medium and install it in the recording unit 508. Alternatively, the program can be preinstalled in the ROM 502 or the recording unit 508.

Note that the program executed by the computer may be a program in which processing is performed chronologically in the order described in the present specification, or may be a program in which processing is performed in parallel or at a necessary timing, for example, when a call is made.

Further, the embodiments of the present technology are not limited to the above-described embodiments, but various changes may be made within a range not departing from the spirit of the present technology.

For example, the present technology can employ a configuration of cloud computing in which one function is shared and joint processing is performed by a plurality of devices via a network.

Further, the respective steps described in the above flowcharts can be executed by a single apparatus or shared by a plurality of apparatuses.

Also, in the case where a single step includes a plurality of processing pieces, the plurality of processing pieces included in the single step can be executed by a single apparatus or can be divided and executed by a plurality of apparatuses.

Also, the present technology can be configured as follows.

(1) A signal processing apparatus comprising:

a masking sound generating unit that generates a masking sound for masking a sound of the first content and a sound of the second content heard in a region between a first region and a second region in a case where the first content is reproduced in the first region and the second content is reproduced in the second region by wave field synthesis using a speaker array.

(2) The signal processing apparatus according to (1), further comprising:

an output unit that causes the speaker array to output the masking sound.

(3) The signal processing apparatus according to (1) or (2), further comprising:

an output unit that causes a speaker different from the speaker array to output the masking sound.

(4) The signal processing apparatus according to any one of (1) to (3),

wherein the reproduction level of the masking sound is greater than the background noise level.

(5) The signal processing apparatus according to any one of (1) to (4), further comprising:

a wave field synthesis filter unit that performs a filtering process on the masking sound generated by the masking sound generation unit to generate a sound of each of a plurality of channels for reproducing the masking sound in a masking region between the first region and the second region by wave field synthesis.

(6) The signal processing apparatus according to any one of (1) to (5),

wherein the masking sound generating unit generates the masking sound based on external information.

(7) The signal processing device according to (6),

wherein the external information includes information indicating at least one of a time zone, a day of the week, the number of visitors, and weather.

(8) The signal processing apparatus according to any one of (1) to (5), further comprising:

a detection unit that detects a person as a subject from an image including at least a region around the first region and the second region;

wherein the masking sound generating unit generates the masking sound based on a result of the detection of the person by the detecting unit.

(9) The signal processing apparatus according to any one of (1) to (5), further comprising:

an analysis unit that analyzes a feature of background noise of a surrounding environment;

wherein the masking sound generating unit generates the masking sound based on the analysis result of the feature.

(10) The signal processing device according to (9),

wherein the masking sound generating unit generates a masking sound having a frequency characteristic corresponding to the analysis result of the feature.

(11) The signal processing apparatus according to (9) or (10), further comprising:

a reproduction level adjustment unit that adjusts a reproduction level of the masking sound based on the analysis result of the feature.

(12) The signal processing apparatus according to any one of (9) to (11), further comprising:

an echo cancellation unit that extracts the background noise by performing echo cancellation on the collected ambient sound based on the sound of the first content and the sound of the second content.

(13) The signal processing apparatus according to any one of (1) to (12),

wherein the masking sound generating unit changes the frequency characteristic of the masking sound according to the frequency characteristics of the first content and the second content.

(14) The signal processing apparatus according to any one of (1) to (13), further comprising:

a reproduction level adjustment unit that changes a reproduction level of the masking sound in accordance with reproduction levels of the first content and the second content.

(15) A signal processing method, wherein a signal processing apparatus

generates, in a case where first content is reproduced in a first area and second content is reproduced in a second area by wave field synthesis using a speaker array, a masking sound for masking a sound of the first content and a sound of the second content heard in an area between the first area and the second area.

(16) A program for causing a computer to execute processing comprising the step of:

generating, in a case where first content is reproduced in a first area and second content is reproduced in a second area by wave field synthesis using a speaker array, a masking sound for masking a sound of the first content and a sound of the second content heard in an area between the first area and the second area.

List of reference numerals

11 Signal processing device

12 loudspeaker array

21 masking sound generating unit

22 wave field synthesis filter unit

23 reproduction level adjustment unit

24 amplifying unit

25 amplifying unit

51 wave field synthesis filter unit

91 identification cell

121 microphone

132 background noise analysis unit

161 echo cancellation unit.
