Monitoring method and monitoring system using multi-dimensional sensor data

Document No.: 1617238  Publication date: 2020-01-10

Abstract: This invention, "Monitoring method and monitoring system using multi-dimensional sensor data," was created by 郭方文 and 陈志明 on 2018-08-01. The invention provides a monitoring method and a monitoring system using multi-dimensional sensor data; the monitoring method is used in a monitoring system. The monitoring system comprises a plurality of sensors arranged in a scene, and the sensors are divided into a plurality of types. The monitoring method comprises the following steps: detecting the scene by using the plurality of sensors to obtain sensing data of each type; performing local object processing on the sensing data of each type to generate local object feature information; performing global object processing according to the local object feature information to generate global object feature information; and performing global object recognition on the global object feature information to generate a global recognition result. The scheme can use different types of sensors to obtain sensing data of a scene; detect, correlate, and recognize local objects of the same type; and correlate local objects of different types to produce global objects of the global sensing data, which carry global fusion features.

1. A monitoring method using multi-dimensional sensor data, for use in a monitoring system, wherein the monitoring system comprises a plurality of sensors arranged in a scene and the plurality of sensors are divided into a plurality of types, the monitoring method comprising the following steps:

detecting the scene by using the plurality of sensors to obtain sensing data of each type;

respectively carrying out local object processing on the sensing data of each type to generate local object feature information;

performing global object processing according to the local object feature information to generate global object feature information; and

performing global object identification on the global object feature information to generate a global identification result.

2. The method as claimed in claim 1, wherein the plurality of sensors of the plurality of types comprises: a plurality of cameras, a plurality of microphones, a plurality of taste sensors, a plurality of odor sensors, a plurality of tactile sensors, or a combination thereof.

3. The method as claimed in claim 1, wherein the step of performing the local object processing on each type of the sensing data to generate the local object feature information comprises:

respectively carrying out local object detection and corresponding processing on the sensing data of each type to obtain a local object identification code list and a corresponding local rough feature set of the sensing data of each type;

according to the local object identification code list corresponding to each type of the sensing data and the corresponding local rough feature set, respectively performing local object feature extraction and fusion processing on each type of the sensing data to obtain a plurality of local fine feature sets of each type of the sensing data and fuse the plurality of local fine feature sets to generate a local fusion feature; and

inputting the local fusion feature corresponding to each type of the sensing data into a local object identification model to obtain a local identification list of each type of the sensing data.

4. The method as claimed in claim 3, wherein the local object id list of each type of the sensing data comprises a local object or a plurality of local objects in each type of the sensing data, and each local object has a corresponding local object id;

wherein the local rough feature set includes a direction, a distance, and an outline corresponding to each type of the sensing data.

5. The method as claimed in claim 4, wherein the local object feature extraction and fusion process comprises:

extracting a plurality of local fine features of each type of the sensing data to establish a corresponding local fine feature set according to the local object identification code list corresponding to each type of the sensing data and the corresponding local rough feature set; and

fusing the local fine feature set of each type of the sensing data into the local fusion feature of each local object.

6. The method as claimed in claim 3, wherein the local object feature information includes the local object id list, the local rough feature set, the local fusion feature, and the local id list of each type of the sensing data, and the step of performing the global object processing to generate the global object feature information according to the local object feature information corresponding to each type of the sensing data comprises:

performing global object corresponding processing according to the local object feature information to generate a global object identification code list and a corresponding global rough feature set; and

according to the local object feature information, the global object identification code list, and the corresponding global rough feature set, performing a global feature set creation process to generate a global fine feature set corresponding to a global object or each of a plurality of global objects in the global object identification code list.

7. The method as claimed in claim 6, wherein the step of performing a global object recognition on the global object feature information to generate the global recognition result comprises:

performing a local context analysis process on the sensing data of each type to generate a local context analysis result, and combining the local context analysis results to generate a local context combination result;

selecting a neighboring-distinguishability weighting coefficient or an adaptive weighting coefficient according to the local context merging result; and

performing global fine feature fusion processing according to the selected neighboring-distinguishability weighting coefficient or adaptive weighting coefficient to generate a global fusion feature corresponding to each global object.

8. The method as claimed in claim 7, wherein the global identification result includes a confidence level, and the method further comprises:

decomposing the global fusion feature into a plurality of local detailed features corresponding to the sensing data of each type; and

feeding back the confidence level and the local fine features corresponding to each type of the sensing data to the local object identification model corresponding to that type of the sensing data.

9. The method as claimed in claim 7, wherein the local context analysis process comprises:

performing a context acquisition process on the sensing data of each type to obtain a corresponding context area;

performing context fusion processing on each context region corresponding to each type of the sensing data to obtain a fusion context region corresponding to each type of the sensing data;

performing the local context analysis processing on the fusion context areas corresponding to the sensing data of each type respectively to generate a local context analysis result; and

merging the local context analysis results to generate the local context merging result.

10. The method of claim 3, further comprising:

when the time stamps, the local rough feature sets, and the world coordinate positioning information of a plurality of local objects in different types of the sensing data all match, determining that the plurality of local objects are successfully correlated; and

assigning a global object identification code to the plurality of local objects that are successfully correlated.

11. A monitoring system, comprising:

a plurality of sensors, wherein the plurality of sensors are divided into a plurality of types and are used for detecting a scene to obtain sensing data of each type; and

a computing device, configured to perform local object processing on the sensing data of each type to generate local object feature information;

wherein the computing device further performs global object processing according to the local object feature information to generate global object feature information, and performs global object identification on the global object feature information to generate a global identification result.

12. The monitoring system of claim 11, wherein the plurality of sensors of the plurality of types comprise: a plurality of cameras, a plurality of microphones, a plurality of taste sensors, a plurality of odor sensors, a plurality of tactile sensors, or a combination thereof.

13. The monitoring system of claim 11, wherein the computing device performs a local object detection and corresponding processing on each type of the sensing data to obtain a local object id list and a corresponding local rough feature set of each type of the sensing data;

the computing device performs local object feature extraction and fusion processing on each type of the sensing data respectively according to the local object identification code list corresponding to each type of the sensing data and the corresponding local rough feature set, to obtain a plurality of local fine feature sets of each type of the sensing data and fuse the plurality of local fine feature sets to generate a local fusion feature;

the computing device inputs the local fusion feature corresponding to each type of the sensing data into a local object identification model to obtain a local identification list of each type of the sensing data.

14. The monitoring system of claim 13, wherein the local object id list of each type of the sensed data includes a local object or a plurality of local objects in each type of the sensed data, and each local object has a corresponding local object id;

wherein the local rough feature set includes a direction, a distance, and an outline corresponding to each type of the sensing data.

15. The monitoring system of claim 14, wherein the computing device extracts a plurality of local fine features of each type of the sensing data to establish a corresponding local fine feature set according to the local object identification code list corresponding to each type of the sensing data and the corresponding local rough feature set, and fuses the local fine feature set of each type of the sensing data into the local fusion feature of each local object.

16. The monitoring system of claim 13, wherein the local object feature information includes the local object id list, the local rough feature set, the local fusion feature, and the local id list of each type of the sensing data, and the computing device further performs a global object mapping process according to the local object feature information to generate a global object id list and a corresponding global rough feature set, and performs a global feature set creation process according to the local object feature information, the global object id list, and the corresponding global rough feature set to generate a global fine feature set corresponding to each of a global object or a plurality of global objects in the global object list.

17. The monitoring system of claim 16, wherein the computing device further performs a context analysis on each type of the sensed data to generate a context analysis result, combines the context analysis results to generate a context combination result, and selects a neighboring-distinguishability weighting coefficient or an adaptive weighting coefficient according to the context combination result;

the computing device further performs a global fine feature fusion process according to the selected neighboring-distinguishability weighting coefficient or adaptive weighting coefficient to generate a global fusion feature corresponding to each global object.

18. The monitoring system of claim 17, wherein the global identification result includes a confidence level, and the computing device further disassembles the global fusion feature into a plurality of local fine features corresponding to each type of the sensing data, and feeds the confidence level and the plurality of local fine features corresponding to each type of the sensing data back to the local object identification model corresponding to each type of the sensing data.

19. The monitoring system of claim 17, wherein the computing device further performs a context obtaining process on the sensing data of each type to obtain a corresponding context area, and performs a context fusion process on the context areas corresponding to the sensing data of each type to obtain a fusion context area corresponding to the sensing data of each type;

the computing device further performs the context analysis on the fused context areas corresponding to the sensing data of each type to generate context analysis results, and combines the context analysis results to generate the context combination result.

20. The monitoring system of claim 13, wherein when the time stamps, the local rough feature sets, and the world coordinate positioning information of a plurality of local objects in different types of the sensing data all match, the computing device determines that the plurality of local objects are successfully correlated and assigns a global object identification code to the successfully correlated local objects.

Technical Field

The present invention relates to a monitoring system, and more particularly, to a monitoring method and a monitoring system using multi-dimensional sensor data.

Background

In order to protect property, maintain traffic safety in private and public spaces, and deter crime, video surveillance systems and video cameras have been widely installed in homes, private areas, public places, and traffic lanes for real-time monitoring or for recording and preserving evidence. However, conventional video surveillance systems are only capable of continuous recording, and because they have been installed in large quantities, the amount of video data grows extremely large over time. When a special event occurs, it is not always discovered and handled in time, and a great deal of manpower and time is consumed retrieving and reviewing the recorded content.

Moreover, in a real-world environment, a video camera alone cannot collect complete information. For example, when a fire or an oil or gas leak starts in a concealed location that a camera cannot see, the incident could still be detected early, before it spreads, by sensing the abnormal odor drifting in the air. A single type of sensor is therefore inherently limited for safety monitoring and maintenance.

Disclosure of Invention

The embodiment of the invention provides a monitoring method utilizing multi-dimensional sensor data, which is used for a monitoring system, wherein the monitoring system comprises a plurality of sensors arranged in a scene, and the plurality of sensors are divided into a plurality of types, and the monitoring method comprises the following steps: detecting the scene by using the plurality of sensors to obtain sensing data of each type; respectively carrying out local object processing on the sensing data of each type to generate local object characteristic information; performing global object processing according to the local object feature information to generate global object feature information; and performing global object recognition on the global object feature information to generate a global recognition result.

An embodiment of the present invention further provides a monitoring system, including: a plurality of sensors, wherein the plurality of sensors are divided into a plurality of types and are used for detecting a scene to obtain sensing data of each type; and a computing device, configured to perform local object processing on the sensing data of each type to generate local object feature information, wherein the computing device further performs global object processing according to the local object feature information to generate global object feature information, and performs global object identification on the global object feature information to generate a global identification result.

In the embodiments of the invention, the monitoring method and monitoring system using multi-dimensional sensor data can use different types of sensors to obtain sensing data of a scene; detect, correlate, and recognize local objects of the same type; and correlate local objects of different types to produce global objects of the global sensing data that carry global fusion features. In addition, global object recognition can be performed, so that objects in the scene are monitored with higher reliability and accuracy.

Drawings

FIG. 1 is a block diagram of a monitoring system according to an embodiment of the invention.

FIG. 2 is a block diagram of a monitoring process according to an embodiment of the present invention.

Fig. 3A and 3B are flow charts illustrating local object mapping, local fine feature fusion, and local object identification for a video object according to an embodiment of the present invention.

Fig. 4A and 4B are flow charts illustrating local object mapping, local fine feature fusion, and local object identification for audio objects according to an embodiment of the present invention.

FIG. 5 is a flowchart illustrating global object mapping and establishing a global fine feature set according to an embodiment of the invention.

FIG. 6A is a schematic diagram illustrating capturing video data of a scene by a plurality of cameras according to an embodiment of the present invention.

FIG. 6B is a schematic diagram illustrating capturing audio data for a scene using multiple microphones according to an embodiment of the present invention.

FIG. 7A is a diagram illustrating different spatial partitions in a video frame according to an embodiment of the invention.

Fig. 7B is a diagram illustrating different time divisions within an audio segment according to an embodiment of the present invention.

FIG. 8A is a flowchart illustrating selection of coefficients for global fine feature fusion based on context analysis processing according to an embodiment of the present invention.

FIG. 8B is a flowchart illustrating the global context analysis process and weight determination step in the embodiment of FIG. 8A according to the present invention.

FIGS. 8C-1 and 8C-2 illustrate a flowchart of global fine feature fusion and global object recognition according to an embodiment of the present invention.

FIG. 8D is a diagram illustrating a data pipeline for global fine feature fusion and global object recognition according to an embodiment of the present invention.

FIG. 8E is a flowchart illustrating recognition result feedback and enhanced global feedback according to an embodiment of the invention.

Fig. 8F is a flow chart illustrating identification result feedback and enhanced local feedback according to an embodiment of the invention.

FIGS. 9A-1 and 9A-2 are block diagrams illustrating a monitoring method according to an embodiment of the invention.

FIGS. 9B-1 and 9B-2 show detailed block diagrams of a global context analysis process according to the embodiments of FIGS. 9A-1 and 9A-2.

FIG. 9C is a flowchart illustrating a monitoring method using multi-dimensional sensor data according to an embodiment of the invention.

FIG. 10 is a diagram of a scenario and monitoring system according to an embodiment of the present invention.

FIG. 11 is a flow chart illustrating a monitoring method using multi-dimensional sensor data according to an embodiment of the invention.

Reference numerals:

100-monitoring system;

110-sensor;

120-computing device;

121-storage unit;

130-monitoring program;

110A-camera;

110B-microphone;

110C-taste sensor;

110D-odor sensor;

110E-a tactile sensor;

110A-1 to 110A-4 cameras;

110B-1 to 110B-3 microphones;

O1, O2, O3-objects;

131-local object identification module;

132-feature fusion module;

133-global identification module;

1311-local object detection and correspondence module;

1312-local object feature extraction and fusion module;

1313-local object identification model;

1314-feedback path;

1321-global object and feature set correspondence module;

1322-context area analysis module;

1323-weighted parameter selection module;

1331-feedback path;

1324-global fine feature fusion module;

700-video frame;

710-region of interest;

715-video object;

720-searching area;

730-context area;

750-audio segment;

755-audio object;

760-region of interest;

770-exploring area;

780-context area;

S302-S324-steps;

S402-S424, S502-S514-steps;

S802-S812-steps;

S8021-S8025, S8041-S8043-steps;

S8201-S8217-steps;

S832-S840, S850-S856-steps;

TF1-TF7-time frames;

ROI1a-ROI7a-regions of interest;

Context1-Context7-context areas;

902-1 to 902-N-blocks;

904-1 to 904-N-blocks;

904-1A, 904-1B, 904-1C, 904-1D, 904-1E-blocks;

904-2A, 904-2B, 904-2C, 904-2D, 904-2E-blocks;

904-NA, 904-NB, 904-NC, 904-ND, 904-NE-blocks;

906-1 to 906-N, 912-1 to 912-N-blocks;

9120, 908, 910, 912, 916, 918, 920, 924, 926;

914, 922-selectors;

952, 966-blocks;

959, 967-paths;

Video1, Video2, Video3-video data;

Audio1, Audio2, Audio3-audio data;

Smell1, Smell2, Smell3-odor data;

600, 1000-scenes;

1001-bank gate;

1002-sofa;

1010-;

1020-;

1041-1043-persons;

1031-first region;

1032-second region;

1033-overlapping area;

S1110-S1140-steps.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

FIG. 1 is a block diagram of a monitoring system according to an embodiment of the invention. As shown in fig. 1, the monitoring system 100 includes a plurality of sensors 110 and one or more computing devices 120.

In one embodiment, the sensors 110 include a variety of different types of sensors, such as the camera 110A, the microphone 110B, the taste sensor 110C, the odor sensor 110D, the tactile sensor 110E, or a combination thereof; however, the embodiments of the present invention are not limited to the above types or attributes of sensors, and the number of sensors of each type or attribute may be determined as the case requires. The different types of sensors can respectively correspond to the five senses of a human. For example, the camera 110A corresponds to the eye and is used for capturing video images; the microphone 110B corresponds to the ear and is used for receiving audio; the taste sensor 110C (e.g. an electronic tongue) corresponds to the tongue and is used for detecting the sour, sweet, bitter, and salty tastes of an object; the odor sensor 110D (e.g. an electronic nose) corresponds to the nose and is used for detecting odors in the air; and the tactile sensor 110E (e.g. an electronic skin) corresponds to the body and is used for detecting contact pressure, temperature, and the like. The sensing data of each type of sensor can be regarded as one dimension, and the monitoring system 100 of the embodiment of the present invention can utilize sensing data of multiple dimensions.

In some embodiments, the cameras 110A corresponding to the eyes may include different types of cameras, for example, the cameras 110A may be color cameras that capture color images (e.g., RGB images), which may capture color images of a scene; the camera 110A may also be a depth camera that captures depth images of a scene (e.g., a grayscale image); the camera 110A may also be an infrared sensor (infrared sensor) that detects radiation energy in the scene, converts the detected radiation energy into electrical signals, and displays different temperature distributions in different colors, such as represented by an infrared thermal image. For convenience of explanation, in the following embodiments, the camera 110A is described by taking a color camera as an example.

The computing device 120 is, for example, one or more personal computers or servers, or a central data processing center, and is used to execute a monitoring program 130 that uses multi-dimensional sensor data, wherein the monitoring program 130 can be stored in a storage unit 121 of the computing device 120. The storage unit 121 is a non-volatile memory, such as a hard disk drive, a solid-state disk, or a read-only memory (ROM), but the embodiment of the invention is not limited thereto. The computing device 120 executes the monitoring program 130 to receive the corresponding sensing data from each of the different types of sensors 110, and performs the functions of local object identification processing, local object feature extraction and fusion processing, feature information feedback and enhancement processing, and global identification processing, which will be described in detail later.

In one embodiment, the camera 110A, the microphone 110B, the taste sensor 110C, the odor sensor 110D, and the tactile sensor 110E respectively perform detection and recognition operations in the eye, ear, nose, tongue, and body sensing modes. However, the object features extracted in the object-detection stage and in the object-recognition stage are different: the detection stage uses rough features, while the recognition stage uses fine (detail) features. For example, the rough features may be direction, distance, outline, structure, and the like, but the embodiments of the present invention are not limited thereto. Fine features can also be classified per sensor type. For example, fine video features include color, texture, and shape; fine audio features include volume, pitch, and timbre; fine odor features include fragrant, putrid, ethereal, pungent, burnt, and resinous notes; fine taste features include sweet, salty, sour, bitter, spicy, and umami (fresh); and fine tactile features include pressure, temperature, and so on, but the embodiments of the invention are not limited thereto.

For example, the structural rough features of an object describe its outline, such as a cylindrical barrel, a rectangular standing signboard, a person, an automobile, or a motorcycle. The structure of a sound is its voiceprint. Because a human voice is a mixture of sounds of multiple frequencies, a voiceprint reflects differences in the vocal organs: each person's voiceprint differs with the shape of the vocal organs (vocal cords, oral and nasal cavities), lips, and tongue. The three elements of sound are timbre, volume, and frequency, where timbre depends on the volume and structure of the oral and nasal cavities. Therefore, rough features such as the speaker's age, sex, face shape, and even height can be roughly inferred from the voiceprint.

In one embodiment, the computing device 120 may calculate a color histogram of the video images captured by each camera 110A to obtain color distribution information, and then calculate a Probability Mass Function (PMF) to obtain the approximate color feature. The computing device 120 may analyze the audio signal captured by the microphone 110B to obtain an audio spectrum, calculate frequency distribution information of the audio signal, and calculate a general frequency characteristic by using a Probability Density Function (PDF).
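
As a minimal illustrative sketch (assuming video frames arrive as NumPy uint8 arrays and using arbitrary bin counts, which are not prescribed by this disclosure), the rough color and frequency features described above could be approximated as follows:

```python
import numpy as np

def rough_color_feature(frame: np.ndarray, bins_per_channel: int = 8) -> np.ndarray:
    """Quantize each color channel, build a joint histogram, and normalize it to a PMF."""
    quantized = (frame.astype(np.uint32) * bins_per_channel) // 256      # 0..bins-1 per channel
    codes = (quantized[..., 0] * bins_per_channel + quantized[..., 1]) * bins_per_channel + quantized[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()                                             # probability mass function

def rough_frequency_feature(samples: np.ndarray, sample_rate: int, bands: int = 16) -> np.ndarray:
    """Estimate a coarse spectral distribution of an audio clip, normalized like a PDF."""
    spectrum = np.abs(np.fft.rfft(samples)) ** 2
    band_energy = np.array([band.sum() for band in np.array_split(spectrum, bands)])
    return band_energy / band_energy.sum()
```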

Chemical perception can be divided into detecting the taste of liquid chemicals and detecting the smell of substances in a gas. For example, the computing device 120 can obtain the distribution over seven odor-perception categories (e.g. camphoraceous, musky, floral, minty, ethereal, pungent, and putrid) from the sensing data captured by the odor sensor 110D, and then apply a probability mass function to obtain the approximate olfactory characteristics. The computing device 120 can likewise obtain the distribution over six tastes (e.g. sweet, salty, sour, bitter, umami, and fatty) from the sensing data captured by the taste sensor 110C, and then apply a probability density function to obtain the approximate taste characteristics.

Tactile is the sensation caused by mechanical stimulation of the skin. The distribution density of the contact points on the skin surface and the area of the corresponding sensing area of the cerebral cortex are positively correlated with the sensitivity of the part to touch. For example, the computing device 120 can obtain distribution information of three physical properties (e.g., type, intensity, size) from the sensing data captured by the touch sensor 110E, and then calculate the probability mass function to obtain the approximate touch characteristic.

In detail, in the monitoring system 100, directional remote chemoreceptors (e.g. the odor sensor 110D), directional contact chemoreceptors (e.g. the taste sensor 110C), and directional mechanoreceptors (e.g. the tactile sensor 110E) can be disposed in the scene. Using the positioning techniques disclosed in the above embodiments, the rough features of the direction and distance of an object in the scene can be calculated, and the spatial position and movement vector of the object can be roughly determined.

A probability density function (PDF) or probability mass function (PMF) describes the likelihood that a random variable takes a value near a given point. By taking the few values with the highest probabilities, together with the ratios between those probabilities, the approximate characteristics of an object can be obtained.
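
Continuing the sketch above, the approximate feature could be read off a PMF by keeping only the few most probable values and their probability ratios (the choice of k is an assumption for illustration):

```python
import numpy as np

def approximate_feature(pmf: np.ndarray, k: int = 3):
    """Return the k most probable bins, their probabilities, and their size ratios."""
    top = np.argsort(pmf)[::-1][:k]          # indices of the k highest-probability values
    probs = pmf[top]
    ratios = probs / probs[0]                # ratio relative to the most likely value
    return list(zip(top.tolist(), probs.tolist(), ratios.tolist()))
```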

In one embodiment, the odor sensor 110D or the taste sensor 110C may be mounted on a mobile carrier or a mobile electronic police dog that, like a real police dog with a sharp sense of smell, patrols back and forth in the field. The carrier is equipped with a positioning system and transmits the detected odor or taste information wirelessly to the data processing center, where it is correlated with the feature data of other sensor types.

In another embodiment, assume that, in a sector-shaped area whose center line points 30 degrees east of north, one child is within one meter and two adults are beyond five meters, one wearing white clothes and the other wearing black clothes. The adult in white moves slowly northward at less than 3.0 km/h (0.83 m/s), while the adult in black moves rapidly westward at more than 12 km/h (3.33 m/s) and emits screams whose amplitude varies at a rate between 30 and 150 times per second. The computing device 120 can use a three-dimensional or depth camera, or directional-microphone triangulation, to calculate rough features such as the directions and distances of the three persons and roughly determine their spatial positions and movement vectors. The computing device 120 can then analyze the video images and audio signals of the scene to obtain the color histograms and timbres of the three persons, and thereby obtain rough features such as their outlines and structures.

It should be noted that the difference between the coarse and fine features of each type of sensing data lies in the sampling density and the required precision and accuracy. A coarse feature only needs to place the data within a range of intervals. For color, a coarse feature can roughly determine the color from a color histogram, whereas obtaining fine features requires comparing and computing many image features, which demands far more computation but yields higher accuracy; fine features can therefore be used for identity recognition, while coarse features are suited to simple classification. For texture, the computing device 120 likewise counts the number of line patterns and the numbers of vertical and horizontal lines. Depending on the sampling density, precision, and accuracy, a larger amount of data means that finer features are extracted, but it also implies a larger amount of computation and a longer computation time.

For example, audio data can be heard at the ear in a range of 0 decibels (dB) to 140 dB. Decibels are units used to represent the intensity or volume of a sound. Sound of 40 db to 50 db disturbs sleep, sound of 60 db to 70 db disturbs learning, and noise of more than 120 db can cause earache and even permanent hearing loss.

Some typical sound levels are: rustling leaves, 20 dB; a suburb late at night, 30 dB; a suburban residential area late at night, 40 dB; a quiet office, 50 dB; normal conversation, 60 dB; the interior of a car or a ringing telephone, 70 dB; a passing bus, 80 dB; a barking dog, 90 dB; an electric train crossing an iron bridge, 100 dB; a car horn, a siren, or music in a karaoke dance hall, 110 dB; a jackhammer during road repair, 120 dB; and a jet engine at takeoff, 130 dB.

The human ear is a very particular organ: to convert the noise signal measured by a sound level meter into the loudness perceived by the human ear, the signals in different frequency bands must be frequency-weighted. The frequency range in which the human ear can hear sound is roughly 20 Hz to 20 kHz, and the ear applies different weighting curves to different sound intensities. The most common weighting curves are A-weighting, C-weighting, D-weighting, and G-weighting. Of these, C-weighting is generally used to measure loud mechanical noise, D-weighting is generally used to measure aircraft noise, and G-weighting is used to measure ultra-low-frequency noise, most of which is structure-borne noise caused by low-frequency vibration.
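
For illustration, the A-weighting gain can be expressed in code as follows; this is the published standard form of the A-weighting curve, not a formula taken from the present disclosure, and the C-, D-, and G-weightings would use different transfer functions:

```python
import math

def a_weighting_db(freq_hz: float) -> float:
    """A-weighting gain (in dB) applied to a frequency component; about 0 dB at 1 kHz."""
    f2 = freq_hz ** 2
    ra = (12194.0 ** 2) * f2 ** 2 / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.00

# Example: weight a measured band level before summing into an overall A-weighted level.
# weighted_level_db = band_level_db + a_weighting_db(band_center_hz)
```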

Timbre is the difference in sound produced by different proportions of harmonic (overtone) components in a sounding body. Any natural sound is a complex waveform that contains, in addition to the waveform at the fundamental frequency, a series of resonant frequencies, the so-called overtones (harmonics), which stand in a fixed harmonic relationship to the fundamental tone. For example, when the fundamental vibration frequency of an object is 240 Hz, frequencies such as 480 Hz (the second harmonic) and 720 Hz (the third harmonic) also occur. Each object mixes these harmonic components in different proportions, and these different mixtures are perceived as different timbres.
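
A small synthesis sketch illustrates this: the two tones below share the 240 Hz fundamental (same pitch) but mix the 480 Hz and 720 Hz harmonics in different proportions, so they differ only in timbre. The amplitude ratios are arbitrary and are used only for illustration.

```python
import numpy as np

def tone(fundamental_hz: float, harmonic_amplitudes, duration_s: float = 1.0, sr: int = 16000):
    """Synthesize a tone as a weighted sum of the fundamental and its harmonics."""
    t = np.arange(int(duration_s * sr)) / sr
    wave = sum(a * np.sin(2 * np.pi * fundamental_hz * (i + 1) * t)
               for i, a in enumerate(harmonic_amplitudes))
    return wave / np.max(np.abs(wave))

timbre_a = tone(240.0, [1.0, 0.5, 0.2])   # strong 2nd harmonic (480 Hz), weak 3rd (720 Hz)
timbre_b = tone(240.0, [1.0, 0.1, 0.6])   # weak 2nd harmonic, strong 3rd harmonic
```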

FIG. 2 is a block diagram of the monitoring program 130 according to an embodiment of the present invention. The monitoring program 130 includes, for example, a local object recognition module 131, a feature fusion module 132, and a global recognition module 133. The local object recognition module 131 is configured to perform local object processing on the sensing data of each type to generate local object feature information.

For example, the local object recognition module 131 includes a local object detection and correspondence module 1311, a local object feature extraction and fusion module 1312, and a local object recognition model 1313. The local object processing includes, for example, various processing performed by the local object detection and correspondence module 1311, a local object feature extraction and fusion module 1312, and a local object recognition model 1313 with respect to the local object.

The local object detection and correspondence module 1311 receives the sensing data from the camera 110A, the microphone 110B, the taste sensor 110C, the smell sensor 110D, and the touch sensor 110E, respectively, and performs a local object detection and correspondence process of a corresponding sensing type to generate a local object ID list (local object ID list) and a Local Rough Feature Set (LRFS), details of which will be described later.

The local object feature extraction and fusion module 1312 performs a local feature extraction and fusion process, which includes a local fine feature extraction (LDFE) process and a local fine feature fusion (LDFF) process. For example, the computing device 120 extracts local fine features of each of the different types of sensing data according to the local object list and the local rough feature set generated by the local object identification module 131, and establishes a local fine feature set (LDFS) corresponding to each type of sensing data. Then, the computing device 120 fuses the local fine feature set corresponding to each type of sensing data according to each type of local object list into a Local Fusion Feature (LFF) of each local object. In some embodiments, the local object feature extraction and fusion module 1312 further performs context acquisition and fusion for each type of sensing data to generate a fused context region.
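
As an illustrative sketch of the fusion step only (the averaging rule and the data layout are assumptions, not the fusion method of the present disclosure), the per-sensor fine feature vectors of one local object could be merged into its local fusion feature as follows:

```python
import numpy as np

def fuse_local_fine_features(local_fine_feature_set):
    """Fuse the per-sensor fine feature vectors of one local object (its LDFS) into its LFF.

    `local_fine_feature_set` holds one equal-length vector per sensor of the same type
    that observed the object; plain averaging stands in for the actual fusion rule.
    """
    stacked = np.stack([np.asarray(v, dtype=float) for v in local_fine_feature_set])
    return stacked.mean(axis=0)

# Example: two cameras observed the same local video object and each produced a short
# fine feature vector (e.g. color/texture/shape descriptors).
local_fusion_feature = fuse_local_fine_features([[0.8, 0.1, 0.3], [0.7, 0.2, 0.4]])
```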

The local object identification model 1313 performs local object identification to generate a local identification list corresponding to each type of sensing data. For example, the computing device 120 inputs each type of local fusion feature generated by the local object feature extraction and fusion module 1312 into the local object identification model 1313 to perform local object identification, marks each recognition result with a local identity identification code (LIID), and then collects the local identity identification codes to generate a local identification list (LIID list). In one embodiment, the local object identification model 1313 generates local recognition results and corresponding confidence levels after performing the local object identification, and may feed back the generated local recognition results of each type and the corresponding confidence levels to the local object detection and correspondence module 1311 through a feedback path 1314, so that the local object detection and correspondence module 1311 can perform self-learning according to the local object recognition results of the corresponding type.

Therefore, the local object feature information generated by the local object recognition module 131 includes a local object list, a local rough feature set, a local fusion feature, and a local identification list of each type of sensing data.

The feature fusion module 132 is configured to perform a global object processing according to the local object feature information to generate global object feature information. For example, the feature fusion module 132 includes a global object and feature set correspondence module 1321, a context area analysis module 1322, a weighting parameter selection module 1323, and a global fine feature fusion module 1324. The global object processing includes, for example, various processes performed on the global object by the global object and feature set correspondence module 1321, the context area analysis module 1322, the weighting parameter selection module 1323, and the global fine feature fusion module 1324.

The global object and feature set mapping module 1321 performs a Global Object Correlation (GOC) process and a global fine feature mapping (GDFC) process to generate a global object list and a corresponding global fine feature set. The context area analysis module 1322 performs local context analysis on the fused context areas of the types of sensing data generated by the local object feature extraction and fusion module 1312, and combines the local context analysis results of the types of sensing data to generate a local context combination result.

The weighting parameter selection module 1323 determines to use a neighboring distinguishable weighting coefficient or an adaptive weighting coefficient to perform a Global fine Feature Fusion (GDFF) according to the local context merging result generated by the context region analysis module 1322. The global fine feature fusion module 1324 performs a global fine feature fusion (GDFF) process according to the weighting parameters outputted from the weighting parameter selection module 1323, for example, to fuse the global fine feature sets generated by the global object and feature set correspondence module 1321 into a global fusion feature.
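
A simplified sketch of such a weighted fusion is shown below; the weight values and the scale-and-concatenate rule are assumptions used only to illustrate how the selected coefficients would enter the GDFF process:

```python
import numpy as np

def global_fine_feature_fusion(global_fine_feature_set, weights):
    """Scale each type's fine feature vector by its normalized weight, then concatenate."""
    total = sum(weights[k] for k in global_fine_feature_set)
    return np.concatenate([np.asarray(v, dtype=float) * (weights[k] / total)
                           for k, v in global_fine_feature_set.items()])

# Two candidate weight sets for a video+audio global object; which one is used would
# follow the local context merging result produced by the context area analysis module.
neighboring_distinguishability_weights = {"video": 0.7, "audio": 0.3}
adaptive_weights = {"video": 0.5, "audio": 0.5}

gdfs = {"video": [0.8, 0.1, 0.3], "audio": [0.2, 0.9, 0.4, 0.6]}
global_fusion_feature = global_fine_feature_fusion(gdfs, neighboring_distinguishability_weights)
```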

Thus, the global object feature information generated by the feature fusion module 132 includes: a global object list and corresponding global fine feature set, and global fusion features.

The global recognition module 133 performs a global object recognition on the global object feature information generated by the feature fusion module 132 to generate a global recognition result. For example, the global identification module 133 inputs the global fusion features generated by the global fine feature fusion module 1324 into a global object identification model to identify the global identity of each global fusion feature, such as creating a global identity list recording the global identity of each global fusion feature. In addition, the global identification list further records the global identification result and the confidence of the global object identification model.

The global recognition module 133 further feeds back the generated recognition result and the confidence thereof to the local object recognition model 1313 through a feedback path 1331. In addition, the global identification module 133 further disassembles the global fused features generated by the global fine feature fusion module 1324 to obtain object fine features of each type of sensing data, and feeds the obtained object fine features of each type of sensing data back to the corresponding type of local object identification model in the local object identification models 1313, so as to improve the accuracy of local object identification performed by each local object identification model 1313, wherein the feedback path 1331 may be referred to as co-learning (co-learning).
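
If the global fusion feature were formed by weighted concatenation as in the earlier sketch, the disassembly and feedback step could be illustrated as follows (the per-type feature dimensions are assumptions):

```python
import numpy as np

def decompose_global_fusion_feature(global_fusion_feature, type_dims):
    """Split a concatenated global fusion feature back into per-type fine feature slices."""
    parts, offset = {}, 0
    for sensor_type, dim in type_dims.items():
        parts[sensor_type] = np.asarray(global_fusion_feature[offset:offset + dim])
        offset += dim
    return parts

# The per-type slices, together with the global confidence level, would then be returned
# to the corresponding local object identification models over the feedback path 1331.
feedback = decompose_global_fusion_feature(np.arange(7.0), {"video": 3, "audio": 4})
```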

For convenience of illustration, the following embodiments are mainly examples of configurations of the camera 110A and the microphone 110B, and other types of sensors can operate in a similar manner and be used with the camera 110A and the microphone 110B.

Fig. 3A and 3B are flow charts illustrating local object mapping, local fine feature fusion, and local object identification for a video object according to an embodiment of the present invention.

In step S302, a plurality of video cameras 110A are used to capture a plurality of video data, respectively.

In step S304, Local Object Detection (LOD) is performed to determine whether there is a video object needing attention in each video data, and in step S306, it is determined whether a video object worth attention is found. If a video object of interest is found, step S308 is executed to record the corresponding video object. If no video object of interest is found, go back to step S302.

For example, the computing device 120 may detect whether there is a video object that needs to be focused on in a spatial exploration area within a video frame of each video data. In some embodiments, the computing device 120 detects a specific object, such as a person, a face, a hand, a car, a gun, a knife, a stick, etc., from each video data, but the embodiments of the invention are not limited thereto. The computing device 120 may also detect specific behaviors from each video data, such as gathering, chase, grab, fight, and fall, but the embodiments of the present invention are not limited thereto. That is, the computing device 120 can determine that the specific object or the specific behavior belongs to the video object that needs attention.

In one embodiment, the computing device 120 may determine different specific behaviors from each video data. Taking gathering behavior as an example, the computing device 120 determines, from the video data captured by the camera 110A, whether the average density of people in a certain spatial region exceeds a predetermined value and whether this situation lasts longer than a predetermined time; for example, it may determine that there are 3 to 5 people per square meter within a 5-square-meter region, that the situation lasts for 10 to 30 minutes, and that the people show no tendency to move apart.

Taking chasing behavior as an example, the computing device 120 determines the motion trajectories and speeds of two people from the video data captured by the camera 110A; when the two trajectories are similar and the speed stays above a predetermined speed, the computing device 120 can determine that a chase occurs in the video data. For falling behavior, the computing device 120 determines, from the video data captured by the camera 110A, whether the angular velocity of the person's limb positions exceeds a predetermined angular velocity and whether this state persists for a predetermined time. In addition, the monitoring system 100 can obtain sensing data detected by a wearable device worn by the user to help determine whether a fall occurs in the video data. The computing device 120 may also use known behavior-determination algorithms to analyze whether robbery or fighting behavior occurs in the video data.
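
A rule-based sketch of the gathering case is given below; the area, density, and duration thresholds simply mirror the example numbers above and are not normative:

```python
def detect_gathering(person_counts_per_second, area_m2=5.0,
                     min_density=3.0, max_density=5.0, min_duration_s=600):
    """person_counts_per_second: one head count per second for the monitored region."""
    consecutive = 0
    for count in person_counts_per_second:
        density = count / area_m2                      # people per square meter
        consecutive = consecutive + 1 if min_density <= density <= max_density else 0
        if consecutive >= min_duration_s:
            return True                                # density sustained long enough
    return False

# Example: 20 people in a 5 square-meter region (4 people per square meter) for 15 minutes.
assert detect_gathering([20] * (15 * 60)) is True
```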

In detail, in the fields of artificial intelligence and computer vision, behavior detection is a very high-level application: in addition to object identification, it makes use of relational information such as temporal dynamics, object movement trajectories, object interactions, object distribution, and density. The present invention can integrate various types of sensors to achieve complementary and global object recognition, and video data is only one of the types of sensing data in the monitoring system 100 of the present invention. The present invention is not limited to the techniques for performing the various behavior detections from video data disclosed in the above embodiments.

In addition, the computing device 120 may further calculate a world coordinate location for each detected video object. For example, the computing device 120 may obtain information such as the installation position (e.g. GPS coordinates), shooting angle, and viewing angle of each camera 110A, and calculate the world coordinate positioning information of each video object in the video data shot by each camera 110A. Each camera 110A may also attach a corresponding time stamp when capturing video images of the scene, to facilitate the subsequent Local Object Correlation (LOC) and Global Object Correlation (GOC) processes.
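
A simplified planar sketch of this positioning step is shown below; the camera pose parameters and the flat-ground assumption are illustrative only and not part of the disclosed method:

```python
import math

def detection_to_world(cam_xy_m, cam_heading_deg, bearing_in_view_deg, distance_m):
    """Very simplified planar conversion of a detection into world coordinates.

    cam_xy_m            -- camera installation position, already projected to planar x/y meters
    cam_heading_deg     -- direction the camera faces (0 deg = +y axis, clockwise positive)
    bearing_in_view_deg -- the object's angular offset within the field of view
    distance_m          -- estimated distance to the object (e.g. from a depth camera)
    """
    x0, y0 = cam_xy_m
    heading = math.radians(cam_heading_deg + bearing_in_view_deg)
    return (x0 + distance_m * math.sin(heading), y0 + distance_m * math.cos(heading))

# Each detection record would then carry (world_x, world_y, time stamp) for the later
# Local Object Correlation (LOC) and Global Object Correlation (GOC) processes.
object_xy = detection_to_world((100.0, 50.0), cam_heading_deg=30.0,
                               bearing_in_view_deg=-5.0, distance_m=12.0)
```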

In step S310, it is determined whether all video objects in each video data have been detected. If yes, go to step S312; if not, go back to step S304.

In step S312, all detected video objects are collected and analyzed. For example, a video object or a plurality of video objects may be detected in each video data, and each video object also carries corresponding world coordinate positioning information and a time stamp. Therefore, the computing device 120 can determine whether each video object in different video data is related to each other according to the corresponding world coordinate positioning information and the time stamp of each video object.

In step S314, a Local Object Correlation (LOC) process is performed according to the detected world coordinate positioning information (and/or time stamp) corresponding to each video object, so as to correlate and join the video objects related to the same local video object, and mark the video objects related to the same local video object with a corresponding local object identification code (local object ID). For example, the local object ID is called Local Image Object ID (LIOID).

In step S316, a local video object list (local list) is created, wherein the local video object list records a local video object or a plurality of local video objects with different local video object identification codes.

In step S318, a local rough feature set (LRFS) of each local video object is collected and established according to the local video object list, wherein the local rough feature set includes information such as the direction, distance, outline, and structure of each local video object.

In step S320, according to the local video object list, a local object detail feature extraction process is sequentially performed on each video data related to each local video object to establish a Local Detail Feature Set (LDFS) of each video data related to each local video object.

In step S322, according to the local video object list, a Local Detail Feature Fusion (LDFF) process is performed to sequentially fuse the local detail feature set corresponding to each video data associated with each local video object into a Local Fusion Feature (LFF) of each local video object.

In step S324, the local fusion features of the local video objects are input into a Local Object Recognition (LOR) model to perform a local object identification process, the recognition result is labeled with a local identification code, and local identification (LIID) codes are collected to generate a local identification list (LIID list) L1. Each local ID generated in the process of fig. 3B is labeled in the corresponding video object in the video data, and may also be referred to as a Local Video Identity ID (LVIID), and the local ID list L1 may also be referred to as a local video ID list.
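
The local object correlation of steps S312-S316 can be sketched as follows; the tolerance values and the greedy grouping rule are assumptions used only for illustration:

```python
# Detections of the same type whose time stamps and world coordinates agree within
# tolerances are linked under one local object identification code (here, an LIOID).
def local_object_correlation(detections, max_time_diff_s=1.0, max_dist_m=2.0):
    """detections: list of dicts with 'timestamp', 'x', 'y' (world coordinates)."""
    local_objects = []                      # each entry: {'id': LIOID, 'members': [...]}
    for det in detections:
        for obj in local_objects:
            ref = obj["members"][0]
            if (abs(det["timestamp"] - ref["timestamp"]) <= max_time_diff_s and
                    ((det["x"] - ref["x"]) ** 2 + (det["y"] - ref["y"]) ** 2) ** 0.5 <= max_dist_m):
                obj["members"].append(det)
                break
        else:
            local_objects.append({"id": f"LIOID-{len(local_objects) + 1}", "members": [det]})
    return local_objects
```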

Fig. 4A and 4B are flow charts illustrating local object mapping, local fine feature fusion, and local object identification for audio objects according to an embodiment of the present invention.

In step S402, a plurality of microphones 110B are used to capture a plurality of audio data respectively.

In step S404, Local Object Detection (LOD) is performed to determine whether there is an audio object (audio object) needing attention in each audio data, and in step S406, it is determined whether an audio object of interest is found. If an audio object of interest is found, step S408 is performed to record the corresponding audio object. If no audio object of interest is found, the process returns to step S402.

For example, the computing device 120 may detect whether there is an audio object that needs attention within a temporal exploration region of an audio segment of each audio data. In some embodiments, the computing device 120 detects a specific object sound or event sound from the audio data, such as a gunshot, an explosion, crying, screaming, or an impact, but the embodiments of the invention are not limited thereto. That is, the computing device 120 can determine that such a specific object sound or event sound is abnormal in the real environment and therefore belongs to the audio objects that need attention. The above abnormal sounds may be characterized using, for example, conventional speech-signal-processing features such as Mel-Frequency Cepstrum Coefficients (MFCC) or Linear Prediction Cepstrum Coefficients (LPCC).

However, many other types of sound exist in a real environment, such as vehicle horns, footsteps, and other low-frequency ambient noise, which conventional speech-signal-processing methods cannot handle. In one embodiment, the computing device 120 converts the audio signals captured by the microphones into a spectrogram of the abnormal sound and describes the time-frequency features of the spectrogram with a 2D-Gabor filter. Next, the computing device 120 extracts spectrogram features of the abnormal sound using Stochastic Non-negative Independent Component Analysis (SNICA), and performs classification and recognition using a Sparse Representation Classification (SRC) method, thereby identifying these other types of abnormal sound in the real environment.
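
A sketch of these two feature paths is shown below, assuming the SciPy and librosa libraries are available; the 2D-Gabor, SNICA, and SRC stages are indicated only as comments:

```python
import numpy as np
from scipy import signal
import librosa

def abnormal_sound_features(samples: np.ndarray, sr: int):
    # Spectrogram of the (possibly abnormal) sound; 2D-Gabor filtering and SNICA-based
    # feature extraction would operate on this time-frequency representation.
    freqs, times, spectrogram = signal.spectrogram(samples, fs=sr)

    # Conventional speech-style features (MFCC) for object/event sounds such as
    # gunshots, screams, or impacts; SRC-style classification would follow.
    mfcc = librosa.feature.mfcc(y=samples.astype(float), sr=sr, n_mfcc=13)
    return spectrogram, mfcc
```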

In addition, the computing device 120 may further calculate a world coordinate location of each detected audio object. For example, the computing device 120 can obtain information such as the installation position (e.g., world coordinates), the sound reception angle, the separation distance, etc. of each microphone 110B, and calculate the world coordinate positioning information of each audio object in the audio data received by each microphone 110B. Each microphone 110B may also add a corresponding time stamp (time stamp) when capturing the audio data of the scene, so as to facilitate subsequent Local Object Correlation (LOC) and Global Object Correlation (GOC) processes.

In step S410, it is determined whether all audio objects in each audio data have been detected. If yes, go to step S412; if not, the process returns to step S404.

In step S412, all the detected audio objects are collected and analyzed. For example, an audio object or multiple audio objects may be detected in each audio data, and each audio object also carries corresponding world coordinate positioning information and time stamp. Therefore, the computing device 120 can determine whether each audio object in different audio data is related to each other according to the corresponding world coordinate positioning information and the time stamp of each audio object.

In step S414, a Local Object Correlation (LOC) process is performed according to the world coordinate positioning information (and/or time stamp) corresponding to each detected audio object, so as to correlate and link the audio objects related to the same local audio object, and to mark the audio objects related to the same local audio object with a corresponding local object identification code (local object ID), which in this case may be called a local audio object ID.

In step S416, a local audio object list is created, wherein the local audio object list records a local audio object or a plurality of local audio objects with different local audio object identification codes.

In step S418, a local rough feature set (LRFS) of each local audio object is collected and established according to the local audio object list, wherein the local rough feature set includes information such as the direction, distance, outline, and structure of each local audio object.

In step S420, according to the local audio object list, a local object detail feature extraction process is sequentially performed on each audio data associated with each local audio object to establish a Local Detail Feature Set (LDFS) of each audio data associated with each local audio object. The local fine feature set of each local audio object includes, for example, audio fine features such as volume, pitch, and timbre of each local audio object.

In step S422, according to the local audio object list, a local fine feature fusion process is performed to sequentially fuse the local fine feature set corresponding to each audio data associated with each local audio object into a Local Fusion Feature (LFF) of each local audio object.

In step S424, the local fusion features of the local audio objects are input into a Local Object Recognition (LOR) model to perform a local object identification process, the recognition result is labeled with a Local Identity ID (LIID), and the local identity IDs are collected to generate a local identity list (LIID list) L2. Each local id generated in the process of fig. 4B is labeled in the corresponding audio object in the audio data, and thus may also be referred to as a local audio id, and the local id list L2 may also be referred to as a local audio id list.

FIG. 5 is a flowchart illustrating global object mapping and establishing a global fine feature set according to an embodiment of the invention. In one embodiment, the process of global object mapping and establishing the global fine feature set in fig. 5 uses the information and object lists generated by the various processes related to the video object and the audio object in fig. 3A and 3B and fig. 4A and 4B.

In step S502, the time stamps of the local video objects in the local video object list and the local audio objects in the local audio object list are compared one by one.

In step S504, it is determined whether the timestamps of the local video object and the local audio object match. If the time stamps are matched, the step S506 is executed; if the time stamps do not match, step S508 is executed.

In step S506, the first local rough feature set and the first world coordinate positioning information of the local video object are compared with the second local rough feature set and the second world coordinate positioning information of the local audio object.

In step S508, it is determined whether the comparison between each local video object in the local video object list and each local audio object in the local audio object list is completed. If yes, go to step S514; if not, go back to step S502.

In step S510, it is determined whether the first local rough feature set, the second local rough feature set, and the world coordinate positioning information match. If yes, go to step S512; if not, go to step S508. For example, if the determination result in the step S510 is "yes", it indicates that the time stamps of the local video object selected in the local video object list and the local audio object selected in the local audio object list are consistent, and the corresponding local rough feature set and the world coordinate positioning information of the local video object and the local audio object are consistent. Therefore, the computing device 120 can determine that the local video object and the local audio object are related to the same object.

In step S512, the successfully matched local video object and local audio object are recorded, and a global object identity list and a global rough feature set are established. For example, the successfully matched local video object and local audio object can be associated with each other and regarded as one global object, and the computing device 120 assigns a Global Object Identifier (GOID) to the global object. The computing device 120 then records each global object and its corresponding global object identifier in the global object list. In addition, because the successfully matched local video object and local audio object each have a corresponding local rough feature set, the computing device 120 also links the local rough feature set of the local video object and the local rough feature set of the local audio object to each other to form a Global Rough Feature Set (GRFS) of the global object.

In step S514, the different types of local objects corresponding to each global object in the global object list, together with their corresponding local fusion features, are merged into a global fine feature set (GDFS) of that global object. For example, the successfully matched local video object and local audio object are already included in the global object list, and each has a corresponding Local Fusion Feature (LFF), so the computing device 120 links the local fusion feature of the local video object and the local fusion feature of the local audio object to each other to generate the global fine feature set corresponding to the global object.
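
The following Python sketch illustrates the correspondence flow of FIG. 5 under simplifying assumptions: each local object is represented as a dictionary with hypothetical keys (id, timestamp, position, LRFS, LFF), and the rough-feature comparison of step S510 is omitted for brevity.

```python
def rough_match(lvo, lao, max_dt=0.1, max_dist=1.0):
    """Return True when a local video object and a local audio object agree on
    time stamp and world coordinate positioning information (thresholds are
    hypothetical)."""
    dt = abs(lvo["timestamp"] - lao["timestamp"])
    dist = sum((a - b) ** 2 for a, b in zip(lvo["position"], lao["position"])) ** 0.5
    return dt <= max_dt and dist <= max_dist

def build_global_objects(local_video_objects, local_audio_objects):
    """Steps S502-S514 in simplified form: matched pairs become global objects,
    their rough feature sets are linked into a GRFS and their local fusion
    features into a global fine feature set (GDFS)."""
    global_object_list = []
    for lvo in local_video_objects:
        for lao in local_audio_objects:
            if rough_match(lvo, lao):
                global_object_list.append({
                    "GOID": len(global_object_list) + 1,
                    "members": {"video": lvo["id"], "audio": lao["id"]},
                    "GRFS": {"video": lvo["LRFS"], "audio": lao["LRFS"]},
                    "GDFS": {"video": lvo["LFF"], "audio": lao["LFF"]},
                })
    return global_object_list
```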

FIG. 6A is a schematic diagram illustrating capturing video data of a scene by a plurality of cameras according to an embodiment of the present invention.

For example, the monitoring system 100 is configured with four cameras 110A in the scene 600 for respectively capturing video data in the scene 600. In the scene 600 there are three objects, such as character 1 (i.e., object O1), character 2 (i.e., object O2), and character 3 (i.e., object O3), which are an adult male, a male child, and an adult female, respectively, as shown in FIG. 6A. The four cameras 110A are, for example, cameras 110A-1, 110A-2, 110A-3, and 110A-4, installed at different positions. In this scenario, the three objects O1, O2, and O3 may only be captured by some of the cameras because of occlusion or capture angle. For example, object O1 is captured only by cameras 110A-1, 110A-2, and 110A-3, object O2 is captured only by cameras 110A-1 and 110A-2, and object O3 is captured only by cameras 110A-1, 110A-2, and 110A-4. The computing device 120 performs object detection on the content captured by each camera and detects, for example, video objects VO1, VO2, and VO3. The computing device 120 assigns an object identifier (OID) to each of the video objects VO1, VO2, and VO3 and marks them; for example, the object identifiers corresponding to the video objects VO1, VO2, and VO3 are VOID1, VOID2, and VOID3, respectively. The object identifiers VOID1, VOID2, and VOID3 corresponding to the video objects VO1, VO2, and VO3 are called local object identifiers (LOID) of the video data.

For the object O1, since the object O1 is only captured by the cameras 110A-1, 110A-2, and 110A-3, the computing device 120 then performs local fine feature extraction on the video data of the cameras 110A-1, 110A-2, and 110A-3 to obtain local video fine features (such as color, texture, and shape of the object, etc.), such as the local video fine feature sets VidF1_O1, VidF2_O1, and VidF3_O1 related to the object O1, respectively. The computing device 120 then performs a local fine feature fusion process on the local video fine feature sets VidF1_O1, VidF2_O1, and VidF3_O1 to obtain a fused video fine feature VidFF_O1 corresponding to the object O1. In short, the fused video fine feature VidFF_O1 may represent different video features of the same object O1 captured by the cameras 110A at different angles.

Similarly, for the object O2, since the object O2 is only captured by the cameras 110A-1 and 110A-2, the computing device 120 performs local fine feature extraction on the video data of the cameras 110A-1 and 110A-2 to obtain the video fine features (such as the color, texture, shape, etc. of the object) thereof, such as the local video fine feature sets VidF1_O2 and VidF2_O2 related to the object O2, respectively. The computing device 120 then performs a local fine feature fusion process on the local video fine feature sets VidF1_O2 and VidF2_O2 to obtain a fused video fine feature VidFF_O2 related to the object O2. The fused video fine feature VidFF_O2 may represent different video features of the same object O2 captured by the cameras 110A at different angles.

Similarly, for the object O3, since the object O3 is only captured by the cameras 110A-1, 110A-2, and 110A-4, the computing device 120 performs local fine feature extraction on the video data of the cameras 110A-1, 110A-2, and 110A-4 to obtain the video fine features (such as the color, texture, and shape of the object, etc.) thereof, such as the local video fine feature sets VidF1_O3, VidF2_O3, and VidF4_O3 related to the object O3, respectively. The computing device 120 then performs local fine feature fusion processing on the local video fine feature sets VidF1_O3, VidF2_O3, and VidF4_O3 to obtain a fused video fine feature VidFF_O3 for the object O3. The fused video fine feature VidFF_O3 may represent different video features of the same object O3 captured by the cameras 110A at different angles.
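
A minimal sketch of the local fine feature fusion described above: the per-camera fine feature sets of one object are combined into a single local fusion feature. Averaging equal-length feature vectors is only one possible fusion rule, chosen here for illustration; the feature dimensions are hypothetical.

```python
import numpy as np

def fuse_local_fine_features(feature_sets):
    """Fuse the per-camera fine feature sets of one object into a single local
    fusion feature; averaging equal-length vectors is one possible rule."""
    stacked = np.stack([np.asarray(f, dtype=float) for f in feature_sets])
    return stacked.mean(axis=0)

# hypothetical fine features of object O1 extracted from cameras 110A-1/110A-2/110A-3
VidF1_O1, VidF2_O1, VidF3_O1 = (np.random.rand(64) for _ in range(3))
VidFF_O1 = fuse_local_fine_features([VidF1_O1, VidF2_O1, VidF3_O1])
```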

It is noted that the cameras 110A-1, 110A-2, 110A-3, and 110A-4 add corresponding time stamps when capturing video images of the scene, so the fused video fine features VidFF_O1, VidFF_O2, and VidFF_O3 also have corresponding time stamps. For example, the local video object list includes the fused video fine features VidFF_O1, VidFF_O2, and VidFF_O3 and the corresponding time stamps.

The computing device 120 inputs the fused video fine features VidFF_O1, VidFF_O2, and VidFF_O3 into a local video object recognition model to recognize the identity corresponding to each of the fused video fine features VidFF_O1, VidFF_O2, and VidFF_O3. For example, the computing device 120 may assign a local video identity identifier (LVIID) to each of the fused video fine features VidFF_O1, VidFF_O2, and VidFF_O3, such as the local video identity identifiers LVIID1, LVIID2, and LVIID3, respectively. The computing device 120 records the local video identity identifiers LVIID1, LVIID2, and LVIID3 in a local video identity list (e.g., local identity list L1).
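
As a hedged illustration of the identity assignment step, the snippet below stands in for the local video object recognition model with a simple nearest-neighbour match against enrolled identity templates and collects the resulting local video identity identifiers; the enrolled templates, feature sizes, and threshold are assumptions.

```python
import numpy as np

def recognize_local_identities(fused_features, enrolled_templates, threshold=0.8):
    """Stand-in for the local video object recognition model: nearest-neighbour
    matching of fused fine features against enrolled identity templates."""
    identity_list = []
    for object_id, feature in fused_features.items():
        best_id, best_score = None, -1.0
        for identity_id, template in enrolled_templates.items():
            # cosine similarity as a simple matching score
            score = float(np.dot(feature, template) /
                          (np.linalg.norm(feature) * np.linalg.norm(template) + 1e-9))
            if score > best_score:
                best_id, best_score = identity_id, score
        if best_score >= threshold:
            identity_list.append({"object": object_id, "LVIID": best_id,
                                  "confidence": best_score})
    return identity_list   # local video identity list (e.g., local identity list L1)

# hypothetical fused features and enrolled templates
fused = {"VO1": np.random.rand(128), "VO2": np.random.rand(128)}
enrolled = {"LVIID1": np.random.rand(128), "LVIID2": np.random.rand(128)}
print(recognize_local_identities(fused, enrolled, threshold=0.0))
```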

FIG. 6B is a schematic diagram illustrating capturing audio data for a scene using multiple microphones according to an embodiment of the present invention.

As shown in FIG. 6B, the monitoring system 100 further configures three microphones 110B in the scene 600 to respectively capture the audio data in the scene 600. The three microphones 110B are, for example, microphones 110B-1, 110B-2, and 110B-3, respectively installed at different positions. The microphones 110B-1, 110B-2, and 110B-3 may be attached to the cameras 110A-1, 110A-2, and 110A-3, respectively, to capture audio data, or may be disposed at different positions in the scene 600 to capture audio data, respectively.

In this scenario, the sounds emitted by the three objects O1, O2, and O3 may only be received by some of the microphones 110B because of occlusion, volume, or distance. For example, the sound of the object O1 is captured only by the microphones 110B-2 and 110B-3, the sound of the object O2 is captured only by the microphones 110B-1 and 110B-3, and the sound of the object O3 is captured only by the microphone 110B-3.

For the object O1, since the object O1 is only captured by the microphones 110B-2 and 110B-3, the computing device 120 then performs local fine feature extraction on the audio data captured by the microphones 110B-2 and 110B-3 to obtain the audio fine features (such as volume, pitch, timbre, etc.), such as the local audio fine feature sets AudF2_O1 and AudF3_O1 related to the object O1. Performing local fine feature fusion on the local audio fine feature sets AudF2_O1 and AudF3_O1 results in a fused audio fine feature AudFF_O1 with respect to the object O1. In short, the fused audio fine feature AudFF_O1 may represent different audio features of the same object O1 captured by microphones 110B at different positions.

Similarly, for the object O2, since the object O2 is only captured by the microphones 110B-1 and 110B-3, the computing device 120 then performs local fine feature extraction on the audio data captured by the microphones 110B-1 and 110B-3 to obtain the audio fine features (such as volume, pitch, timbre, etc.), such as the local audio fine feature sets AudF1_O2 and AudF3_O2 related to the object O2. Local fine feature fusion is performed on the local audio fine feature sets AudF1_O2 and AudF3_O2 to obtain a fused audio fine feature AudFF_O2. The fused audio fine feature AudFF_O2 may represent different audio features of the same object O2 received by the microphones 110B at different locations.

Similarly, for the object O3, since the object O3 is only captured by the microphone 110B-3, the computing device 120 then performs local fine feature extraction on the audio data captured by the microphone 110B-3 to obtain its audio fine features (e.g., volume, pitch, timbre, etc.), such as the local audio fine feature set AudF3_O3 related to the object O3. After local fine feature fusion is performed on the local audio fine feature set AudF3_O3, a fused audio fine feature AudFF_O3 is obtained. In this embodiment, the fused audio fine feature AudFF_O3 is equal to the audio fine feature set AudF3_O3. The fused audio fine feature AudFF_O3 may represent the audio features of the same object O3 received by the microphone 110B.

It should be noted that the microphones 110B-1, 110B-2, and 110B-3 also add corresponding time stamps when capturing the audio signals of the scene 600, and the fused audio fine features AudFF_O1, AudFF_O2, and AudFF_O3 all have corresponding time stamps. That is, the local audio object list includes the fused audio fine features AudFF_O1, AudFF_O2, and AudFF_O3 and the corresponding time stamps.

The computing device 120 inputs the fused audio fine features AudFF_O1, AudFF_O2, and AudFF_O3 into a local audio object recognition model to recognize the identity of each of the fused audio fine features AudFF_O1, AudFF_O2, and AudFF_O3. For example, the computing device 120 may assign a local audio identity identifier (LAIID) to each of the fused audio fine features AudFF_O1, AudFF_O2, and AudFF_O3, such as LAIID1, LAIID2, and LAIID3, respectively. The computing device 120 records the local audio identity identifiers LAIID1, LAIID2, and LAIID3 in a local audio identity list (e.g., local identity list L2).

FIG. 7A is a diagram illustrating different spatial partitions in a video frame according to an embodiment of the invention.

As shown in FIG. 7A, each video frame can be divided into different spatial partitions to facilitate the computing device 120 performing different image detection, image recognition, and image analysis processes. For example, the video frame 700 may include different regions, such as a region of interest (ROI) 710, an exploration area 720, and a context area 730. The region of interest 710 is, for example, the spatial extent of the video object 715 in the video frame 700. The exploration area 720 represents the surroundings of the region of interest 710 to which the video object 715 belongs, i.e., the area operated on when tracking the video object 715 in computer vision. The context area 730 is larger than the exploration area 720, and the context area 730 is the spatial range used for the context analysis of the video object 715.

Fig. 7B is a diagram illustrating different time divisions within an audio segment according to an embodiment of the present invention.

As shown in FIG. 7B, each audio segment can be divided into different time divisions to facilitate the computing device 120 performing different audio detection, audio recognition, and audio analysis processes. For example, the audio segment 750 may include different regions, such as a region of interest (ROI) 760, an exploration area 770, and a context area 780. The region of interest 760 is, for example, the time range of the audio object 755 in the audio segment 750. The exploration area 770 represents the adjacent time range of the region of interest 760 to which the audio object 755 belongs, i.e., the range operated on by the computing device 120 when tracking the audio object 755. The context area 780 is larger than the exploration area 770, and the context area 780 is the time range used for the context analysis of the audio object 755.

From the embodiments of FIGS. 7A and 7B, it can be seen that the region of interest can be a spatial region or a temporal region where the object is located. When the computing device 120 is to track an object, it employs an exploration area (e.g., exploration areas 720 and 770) that is spatially or temporally larger than the region of interest (e.g., regions of interest 710 and 760). In addition, when the computing device 120 is to perform the context analysis, it employs a context area that is spatially or temporally larger than the exploration area (e.g., exploration areas 720 and 770), such as the context area 730 of FIG. 7A or the context area 780 of FIG. 7B.

In detail, the context area (Context) defines an exploration boundary as the maximum exploration area (Exploration Region), and the minimum exploration area is the region of interest (ROI). When the computing device 120 performs object tracking, a predicted region of interest (Predicted ROI) is defined by a user or by the computing device 120, and the object is then searched for within the exploration area by a recognition model (e.g., the local object recognition model 1313). In an embodiment, the computing device 120 may, for example, set the exploration area to be twice the size of the region of interest (but not limited thereto). In another embodiment, the computing device 120 can also automatically adjust the size of the exploration area according to the moving speed and direction of the object. However, in order to balance computation and response efficiency, the computing device 120 usually does not set an excessively large exploration area, and the user can set a context area (Context) to limit it.
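
A small sketch of the adaptive exploration-area sizing mentioned above, assuming rectangular regions given as (x, y, width, height): the predicted region of interest is enlarged according to the object's moving speed and clipped so that it never exceeds the context area. The scaling constants are hypothetical.

```python
def exploration_area(roi, speed, context, base_scale=2.0, speed_gain=0.05):
    """Grow a predicted ROI (x, y, w, h) into an exploration area whose size
    follows the object's moving speed, clipped to the context area."""
    x, y, w, h = roi
    scale = base_scale + speed_gain * speed           # faster object -> larger area
    ew, eh = w * scale, h * scale
    ex, ey = x - (ew - w) / 2.0, y - (eh - h) / 2.0   # keep the ROI centred
    cx, cy, cw, ch = context
    ex, ey = max(ex, cx), max(ey, cy)                 # never exceed the context area
    ew, eh = min(ew, cx + cw - ex), min(eh, cy + ch - ey)
    return (ex, ey, ew, eh)

# hypothetical values: ROI of a slow-moving object inside a 640x480 context area
print(exploration_area((100, 120, 40, 80), speed=10, context=(0, 0, 640, 480)))
```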

In one embodiment, before the computing device 120 performs the global fine feature fusion, the computing device 120 performs a context analysis (e.g., by the context area analysis module 1322) to calculate a weight distribution for each feature. If the computing device 120 determines from the result of the context analysis that the weight distribution of the Local Fusion Features (LFF) of the various types is biased toward some local fusion features (e.g., the local fusion features differ greatly), which may cause a deviation in the result of global object recognition, an Adaptive Weighting Coefficient (AWC) is used for feature weighting. On the contrary, if the computing device 120 determines from the result of the context analysis that the weight differences among the local features of the various types are not large, the weights of the local features of the various types may be used as Adjacent Distinguishable Weighting Coefficients (ADWC). If the computing device 120 performs the context analysis on the predicted region of interest, the calculation range of the computing device 120 when calculating the differences between the feature values of the different types is the predicted region of interest, which may also be referred to as an Interested Context. In addition, after the recognition model searches and recognizes the exploration area of a region of interest to confirm the range where the object exists, that region of interest may be referred to as a Recognized ROI.

FIG. 8A is a flowchart illustrating selection of coefficients for global fine feature fusion based on context analysis processing according to an embodiment of the present invention.

In step S802, a global context analysis process is performed. For example, the global context analysis process analyzes each video object and a context area corresponding to each audio object.

In step S804, it is determined whether it is appropriate to employ the Adjacent Distinguishable Weighting Coefficient (ADWC). Step S804 may also be referred to as a weight determination step. If it is determined that the ADWC is suitable for use (e.g., the local fusion features do not differ much), step S806 is executed to perform the global fine feature fusion processing using the ADWC; if it is determined that the ADWC is not suitable for use (e.g., the local fusion features differ greatly), step S808 is executed to perform the global fine feature fusion processing using the Adaptive Weighting Coefficient (AWC).

For example, in step S806, the computing device 120 performs the global fine feature fusion process using the Adjacent Distinguishable Weighting Coefficient (ADWC). For example, the computing device 120 may update the ADWC and then perform global fine feature fusion using the updated ADWC, so as to fuse the global fine feature sets of the different types of objects in the global object list and generate a global fusion feature.

In step S808, the computing device 120 directly performs the global fine feature fusion process using the Adaptive Weighting Coefficient (AWC). When the computing device 120 determines from the analysis result of the context area that it is not suitable to use the Adjacent Distinguishable Weighting Coefficient (ADWC) (e.g., certain local features are dominant), the computing device 120 uses a feedback path to take the adaptive weighting coefficient as the input weighting coefficient for the global fine feature fusion process, i.e., to fuse the global fine feature sets of the different types of objects included in each global object in the global object list to generate a global fusion feature.

In step S810, the global fusion feature corresponding to each global object is input into a global object recognition model for identity recognition of the global object, and a global identity list (global identity ID list) is generated. For example, the computing device 120 may assign a Global Identity Identifier (GIID) to each global fusion feature, and the computing device 120 records the global identity identifiers in the global identity list. In addition, the global identity list further records the global recognition result and the confidence of the global object recognition model.

In step S812, the local fine features, the global recognition result, and its confidence are fed back to each local object recognition model. For example, in addition to being input into the global object recognition model, the global fusion feature is further decomposed into the original local fine features, which are fed back to the corresponding local object recognition models.

FIG. 8B is a flowchart illustrating the global context analysis process and weight determination step in the embodiment of FIG. 8A according to the present invention. For example, fig. 8B shows a detailed flow of steps S802 and S804 in fig. 8A, wherein step S802 performs, for example, a global context analysis process, and step S804 performs a weight determination step.

In step S8021, a Predicted region of interest (Predicted ROI) is defined. For example, the predicted region of interest may also be referred to as a Context of interest (Interested Context) and may be defined by a user or by the computing device 120.

In step S8022, a Local Context Analysis (LCA) process of each type is performed, and the difference values of the feature values of the local fine features of each type are calculated and normalized. For example, the Local Context Analysis (LCA) operates on the Local Fusion Feature (LFF) of each type of sensing data, which is obtained by performing feature extraction and fusion on the current different types of sensing data (e.g., video, audio, odor, taste, touch, etc.).

The computing device 120 performs a specific set of calculations and analyses on the Local Fusion Features (LFF) of the current types of sensing data to obtain the weight values corresponding to the Local Fusion Features (LFF) of the current types of sensing data. The weight values may be calculated, for example, for video fine features such as color, texture, and shape, and for audio fine features such as volume, pitch, and timbre. The local fusion features of the other types of sensing data can be calculated in a similar manner to obtain the corresponding weight values.

Taking the video fine features as an example, the color feature may include differences between feature values such as density, saturation, and brightness; the texture feature may include differences between feature values of a pattern; and the shape feature may include differences between feature values such as lines, relative positions, relative lengths, and relative directions. For the audio fine features, for example, the volume feature may include a difference in sound energy, the pitch feature may include a difference in sound frequency, and the timbre feature may include a difference in the proportion of harmonic or overtone components of the sound. Since the characteristics of the local fine features differ from one another, the differences between the local fine features need to be normalized so that the normalized differences can be compared with one another. The normalized difference value represents the relative importance of each selected Local Fusion Feature (LFF) in the overall evaluation; it is not restricted to a natural number and may, for example, be negative or zero.

In some embodiments, the computing device 120 may utilize Local Binary Patterns (LBPs) to calculate the disparity value of each local fine feature of the different types.
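
As one possible concretization of the texture case, the sketch below computes an 8-neighbour LBP histogram per view, uses the L1 distance between histograms as the texture difference value, and min-max normalizes a set of heterogeneous difference values so that they can be compared; the choice of distance and normalization is an assumption, not mandated by the disclosure.

```python
import numpy as np

def lbp_histogram(gray):
    """8-neighbour local binary pattern histogram of a grayscale patch."""
    g = np.asarray(gray, dtype=float)
    center = g[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.int64)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes |= (neighbour >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def texture_difference(patch_a, patch_b):
    """Difference value between the texture features of two patches (L1 distance)."""
    return float(np.abs(lbp_histogram(patch_a) - lbp_histogram(patch_b)).sum())

def normalize_differences(diffs):
    """Min-max normalization so that differences of heterogeneous features become comparable."""
    d = np.asarray(diffs, dtype=float)
    span = d.max() - d.min()
    return (d - d.min()) / span if span > 0 else np.zeros_like(d)
```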

In step S8023, the local context analysis results of all local detailed features of each type are merged. For example, after obtaining the normalized difference value of each local fine feature of each type, the computing device 120 merges the local context analysis results of all the local fine features of each type.

In step S8024, a weight value corresponding to each local fine feature is assigned according to the normalized difference value of each local fine feature. If the normalized difference value corresponding to a local fine feature has a larger value, the weight value of the local fine feature will also be larger. If the normalized difference value corresponding to a local fine feature is smaller, the weight value of the local fine feature is also smaller.

In step S8025, an upper threshold and a lower threshold of a predetermined interval are obtained. For example, the computing device 120 may obtain an upper/lower threshold (e.g., user-defined or self-defined by the computing device 120) for a predetermined interval of weight values defined by the actual conditions of its type or application.

In step S8041, it is determined whether the weight values corresponding to the local detailed features are all within the predetermined interval. When the weight values corresponding to the local detailed features are all within the predetermined interval, it indicates that the difference between the weight values corresponding to the local detailed features is not large, so step S8042 may be executed to determine that the Adjacent Distinguishable Weighting Coefficient (ADWC) is suitable for use. When the weight value corresponding to any local fine feature is not within the predetermined interval (i.e., the normalized difference value of any local fine feature exceeds the upper threshold or the lower threshold), it can be determined that the local fine feature is too biased, so step S8043 can be executed to determine that the Adjacent Distinguishable Weighting Coefficient (ADWC) is not suitable for use, i.e., that the Adaptive Weighting Coefficient (AWC) is suitable for use.
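
The weight determination of steps S8041 to S8043 can be summarized by the following sketch; the weight values and the interval bounds in the usage lines are hypothetical.

```python
def choose_weighting_coefficient(weights, lower, upper):
    """Weight determination of steps S8041 to S8043: the ADWC is suitable only
    when every weight value lies within the predetermined interval."""
    if all(lower <= w <= upper for w in weights.values()):
        return "ADWC"
    return "AWC"   # at least one feature weight falls outside the interval

# hypothetical weight values for five local fusion features
print(choose_weighting_coefficient({"A": 4, "B": 5, "C": 3, "D": 6, "E": 4}, lower=3, upper=6))  # ADWC
print(choose_weighting_coefficient({"A": 7, "B": 5, "C": 3, "D": 6, "E": 4}, lower=3, upper=6))  # AWC
```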

In one embodiment, it is assumed that three valid video Local Fusion Features (LFF) are selected and referred to as features A, B, and C, and two valid audio Local Fusion Features (LFF) are referred to as features D and E. After the local context analyses are performed, normalized, and combined, the weight values WA to WE corresponding to the features A to E are obtained, such as WA=5, WB=2, WC=4, WD=6, and WE=3. If the predetermined interval is 3 to 6, indicating that the lower threshold is 3 and the upper threshold is 6, there is no bias toward a particular feature in this case, so the weight values WA to WE can be set as the Adjacent Distinguishable Weighting Coefficients (ADWC) and applied to each feature to change the relative importance of each feature. However, if the weight values WA to WE corresponding to the features A to E are WA=7, WB=2, WC=4, WD=6, and WE=3, respectively, then under a predetermined interval whose lower threshold is 3 and whose upper threshold is 6, because WA=7 exceeds the upper threshold, the determination result is biased toward the feature A, and the computing device 120 uses an Adaptive Weighting Coefficient (AWC).

In detail, in order to enhance object recognition capability, accuracy, and prediction capability, the computing device 120 may adopt different types of valid feature information according to the user's requirements, determine which weighting coefficient to adopt through the global context analysis, and then fuse and use the selected features. In addition, the computing device 120 may feed the global object recognition result back to the local object recognition models. For example, before the computing device 120 fuses the global fine feature set (GDFS) into the global fusion feature (GFF), the computing device 120 needs to select valid local fusion features. Valid features are features that contribute to the accuracy of recognition; for example, skin texture is a valid feature for predicting age, whereas skin color is an invalid feature for predicting age. That is, for different types of sensing data, the user can set which local fusion features are valid features when performing global object recognition or local object recognition.

FIGS. 8C-1 and 8C-2 illustrate a flowchart of global fine feature fusion and global object recognition according to an embodiment of the present invention.

In step S8201, initial weight values of the Adjacent Distinguishable Weighting Coefficients (ADWC) and the Adaptive Weighting Coefficients (AWC) are set, respectively.

In step S8202, a tracking condition is obtained, and a prediction region of interest is defined according to the tracking condition. For example, the tracking condition may be defined by the user, such as a character wearing a specific color or pattern, or a character moving the fastest. Then, the computing device 120 defines the predicted regions of interest from the different types of sensing data.

In step S8203, all local contexts are fused and regions of interest are predicted. For example, the context and regions of interest of the various types of sensed data are merged. If step S8203 is performed for the first time, the combined region of interest generated after combination can be referred to as an initial region of interest (initial ROI).

In step S8204, a global context analysis process is performed. The details of the global context analysis process may refer to the flows of fig. 8A-8B.

In step S8205, it is determined whether it is appropriate to employ the Adjacent Distinguishable Weighting Coefficient (ADWC). The details of step S8205 can refer to the process in FIG. 8B and are therefore not repeated. When it is determined that the ADWC is suitable for use, step S8206 is performed; when it is determined that the ADWC is not suitable for use, step S8210 is performed.

In step S8206, the weight values obtained by the context analysis are set as the Adjacent Distinguishable Weighting Coefficients (ADWC). For example, when the weight values corresponding to the local fusion features of each type of sensing data are all within the predetermined interval, the result of the global object recognition is not biased toward a certain feature, so the weight values obtained by the context analysis can be set as the ADWC.

In step S8207, the Adjacent Distinguishable Weighting Coefficients (ADWC) are applied to perform a global fine feature fusion (GDFF) process to establish a global fusion feature. For example, since it has been determined that the ADWC is suitable for use and the weight values obtained by the context analysis have been set as the ADWC, the computing device 120 can perform the global fine feature fusion process according to the weight value corresponding to each type of local fusion feature to generate the global fusion feature.
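
A minimal sketch of applying the ADWC in the global fine feature fusion, assuming one weighting coefficient per fine feature and weighted concatenation as the fusion rule (one possible choice); the feature names, dimensions, and weight values are hypothetical.

```python
import numpy as np

def global_fine_feature_fusion(fine_features, weights):
    """Fuse the fine features of one global object into a global fusion feature
    by weighted concatenation, one weighting coefficient per fine feature."""
    parts = [weights.get(name, 1.0) * np.asarray(feat, dtype=float)
             for name, feat in fine_features.items()]
    return np.concatenate(parts)

# hypothetical fine features (video: color/texture/shape, audio: volume/pitch/timbre)
gdfs = {"color": np.random.rand(16), "texture": np.random.rand(16), "shape": np.random.rand(16),
        "volume": np.random.rand(8), "pitch": np.random.rand(8), "timbre": np.random.rand(8)}
adwc = {"color": 5, "texture": 2, "shape": 4, "volume": 6, "pitch": 3, "timbre": 4}
gff = global_fine_feature_fusion(gdfs, adwc)
```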

In step S8208, the global blend features are input into a global object recognition model for global object recognition. The global object recognition model can be, for example, the global object recognition model 920 shown in FIG. 9A-2.

In step S8209, a boundary (boundary) of the identified region of interest is generated according to the identification result of the global object identification. It should be noted that, when the step S8209 is completed, the global object identification process for the current time frame (time frame) is completed, and the global object identification process for the next time frame can be performed.

In step S8210, the Adaptive Weighting Coefficients (AWC) are applied to perform global fine feature fusion (GDFF) to establish a global fusion feature. For example, because it has been determined that Adaptive Weighting Coefficients (AWC) are suitable, and the weight values obtained from the global context analysis may bias certain features, the computing device 120 applies the Adaptive Weighting Coefficients (AWC) to perform global fine feature fusion (GDFF) to create a global fused feature.

In step S8211, the global blend feature is input into a global object recognition model for global object recognition. The global object recognition model can be, for example, the global object recognition model 920 shown in FIG. 9A-2.

In step S8212, a boundary (boundary) of the identified region of interest is generated according to the identification result of the global object identification. It should be noted that, after the step S8212 is completed, the step S8213 is further executed to determine whether to execute the global context analysis for the first time, and if the determination result of the step S8213 is "yes", the step S8214 is executed; if the determination result in step S8213 is "no", step S8215 is performed.

In step S8214, feature approximation degree evaluation of the front region of interest and the rear region of interest is performed. In step S8214, since the global context analysis is performed for the first time, the front region of interest refers to the region of interest before the global object recognition is performed on the current time frame, and the rear region of interest refers to the region of interest after the global object recognition is performed on the current time frame.

In step S8215, feature approximation degree evaluation of the front region of interest and the rear region of interest is performed. In step S8215, since the global context analysis is not performed for the first time, the front region of interest refers to the region of interest after performing the global object recognition for the previous time frame (previous time frame), and the rear region of interest refers to the region of interest before performing the global object recognition for the current time frame.

The feature similarity evaluation of the front and rear regions of interest performed in steps S8214 and S8215 is generally calculated using the Bhattacharyya distance. Because each feature has different characteristics, its feature values are normalized so that they can be compared with one another. If the similarity of the same feature value between the front region of interest and the rear region of interest is higher, the weight value of the corresponding feature is increased accordingly. The weight values of the features are then normalized to obtain the Adaptive Weighting Coefficients (AWC).
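
The snippet below sketches this step: the Bhattacharyya distance is computed between the normalized histograms of the same feature in the front and rear regions of interest, distances are mapped to similarities, and the similarities are normalized into adaptive weighting coefficients. The distance-to-weight mapping is an assumption made for illustration.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two normalized feature histograms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    bc = float(np.sum(np.sqrt(p * q)))        # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))

def update_awc(front_roi_features, rear_roi_features):
    """Map the similarity of each feature between the front and rear regions of
    interest to a weight and normalize the weights into the AWC."""
    similarities = {}
    for name, front_hist in front_roi_features.items():
        d = bhattacharyya_distance(front_hist, rear_roi_features[name])
        similarities[name] = np.exp(-d)       # equals the Bhattacharyya coefficient
    total = sum(similarities.values())
    return {name: s / total for name, s in similarities.items()}
```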

In step S8216, the Adaptive Weighting Coefficient (AWC) is updated. Note that the updated Adaptive Weighting Coefficients (AWC) are data for processing the next time frame.

In step S8217, the boundary of the identified region of interest obtained from the current time frame is applied to the next time frame to serve as the predicted region of interest of the next time frame, and then the process of steps S8203 to S8217 is repeated.

FIG. 8D is a diagram illustrating a data pipeline for global fine feature fusion and global object recognition according to an embodiment of the present invention. Please refer to fig. 8C-1, fig. 8C-2 and fig. 8D.

In FIG. 8D, stages 1 through 9 are shown on the left for different stages of the data pipeline for global fine feature fusion and global object recognition. TF1 to TF7 represent time frames. The ROIs 1 to 7 respectively represent predicted regions of interest at time frames TF1 to TF7, and the contexts 1 to 7 represent Context regions at time frames TF1 to TF 7.

In stage 1, local context fusion and predicted ROI fusion are performed. At time frame TF1, which is the first time frame, the stage-2 Global Context Analysis (GCA) can be performed directly using the fused context region and the fused region of interest. The Adjacent Distinguishable Weighting Coefficient (ADWC) or the Adaptive Weighting Coefficient (AWC) is set to a preset (default) value in the first time frame.

In stage 3, which Weighting Coefficient (WC) to use is decided according to the result of the Global Context Analysis (GCA). If it is determined that an Adaptive Weighting Coefficient (AWC) is used, "A" is indicated at stage 3 in FIG. 8D; if it is determined that the Adjacent Distinguishable Weighting Coefficient (ADWC) is used, "AD" is indicated at stage 3 in FIG. 8D.

In stage 4, global fine feature fusion (GDFF) is performed according to the selected weighting coefficients, and a Global Fusion Feature (GFF) of stage 5 is generated.

At stage 6, Global Object Recognition (GOR) is performed according to the global fusion features generated at stage 5. In stage 7, the boundary of the recognized region of interest is generated according to the recognition result of the global object recognition.

In stage 8, a feature approximation evaluation of the front and rear regions of interest is performed. If the global context analysis is performed for the first time (i.e., time frame TF1), the front ROI refers to the ROI before performing global object recognition for the current time frame, and the back ROI refers to the ROI after performing global object recognition for the current time frame. If the global context analysis is not performed for the first time, the front interested area refers to the interested area after the global object identification is performed on the previous time frame, and the rear interested area refers to the interested area before the global object identification is performed on the current time frame.

At stage 9, the Adaptive Weighting Coefficients (AWC) are updated. For example, if the label at stage 3 is "A", it indicates that the Adaptive Weighting Coefficient (AWC) is used, so the AWC needs to be updated at stage 9. If the label at stage 3 is "AD", it indicates that the Adjacent Distinguishable Weighting Coefficient (ADWC) is used, so stage 8 and stage 9 can be omitted. For example, the Adaptive Weighting Coefficients (AWC) are adopted for the time frames TF1, TF2, TF4, and TF5, so the feature similarity evaluation of the front and rear regions of interest of stage 8 is performed for those time frames.

In addition, the updated Adaptive Weighting Coefficient (AWC) of the current time frame in stage 9 is also used in stage 3 of the next time frame. For example, the Adaptive Weighting Coefficient (AWC) generated in phase 9 of the time frame TF1 is updated to AWC1, so the Adaptive Weighting Coefficient (AWC) in phase 3 of the time frame TF2 is AWC1, and so on.

If it is decided in stage 3 to use the Adjacent Distinguishable Weighting Coefficients (ADWC), the values of the ADWC are updated in the current time frame. For example, in stage 3 of the time frame TF3, it is determined to use the ADWC, so the ADWC is updated to the adjacent distinguishable weighting coefficient ADWC3 of the current time frame, and so on.

It is noted that the boundaries of the recognized regions of interest obtained in stage 7, e.g., ROI1a to ROI7a, are applied to the next time frame; for example, the boundary of the recognized region of interest ROI1a obtained in time frame TF1 is applied as the boundary of the predicted region of interest ROI2 in time frame TF2, and so on.

FIG. 8E is a flowchart illustrating recognition result feedback and enhanced global feedback according to an embodiment of the invention.

In step S832, the Global Fusion Feature (GFF) is transmitted to the global object recognition model for global object recognition, and a global recognition result and a corresponding confidence are generated. For example, the global recognition result output by the global object recognition model represents the person (i.e., the global object) detected from the different types of sensing data, and a higher confidence indicates a more reliable global recognition result.

In step S834, a confidence threshold is defined. For example, the user may set the required confidence threshold, or the confidence threshold may be determined by the computing device 120 itself, wherein the confidence threshold may represent, for example, the minimum confidence required for the global recognition result.

In step S836, it is determined whether the confidence is lower than the confidence threshold. If yes, ending the process; if not, step S838 is executed. For example, if the confidence of the global recognition result is lower than the confidence threshold, it indicates that the current global recognition result is not highly reliable, and the data of the sensor may need to be updated or the global recognition result may need to be updated after the object in the scene moves. If the confidence of the global identification result is not lower than the confidence threshold, the current global identification result has certain confidence.

In step S838, the global fusion feature (GFF) is decomposed into local fusion features. For example, because the current global recognition result has a certain level of confidence, the global fusion feature used for global object recognition in the global object recognition model can be decomposed into the local fusion features of the various types.

In step S840, the global recognition result and the corresponding confidence level, and the local fusion features are fed back to the local object recognition models (e.g., the local object recognition model 1313 shown in fig. 2).

Through the process of fig. 8E, each local object recognition model can utilize the feedback path to perform co-learning (co-learning) of the Local Object Recognition (LOR) model and the Global Object Recognition (GOR) model, so that the capability and accuracy of local object recognition can be globally and automatically enhanced.
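
A hedged sketch of this feedback path, assuming the global fusion feature was formed by concatenating per-type local fusion features so that it can be split back by length; the split_points layout and the threshold handling are assumptions.

```python
def feed_back_global_result(gff, giid, confidence, threshold, split_points):
    """Feedback path of FIG. 8E in sketch form: when the global recognition
    result is confident enough, split the global fusion feature back into the
    per-type local fusion features for the local object recognition models."""
    if confidence < threshold:
        return None                                     # result not reliable enough
    local_fusion_features, start = {}, 0
    for sensor_type, length in split_points.items():    # e.g. {"video": 64, "audio": 32}
        local_fusion_features[sensor_type] = gff[start:start + length]
        start += length
    return {"GIID": giid, "confidence": confidence,
            "local_fusion_features": local_fusion_features}
```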

Fig. 8F is a flow chart illustrating identification result feedback and enhanced local feedback according to an embodiment of the invention. In the process of fig. 8E, the global feedback is mainly used and the cooperative learning of the local object recognition model and the global object recognition model can be performed. In addition, the local end can also perform similar feedback, which is called local feedback.

In step S850, the Local Fusion Feature (LFF) is transmitted to the local object recognition model for local object recognition, and a local recognition result and a corresponding confidence (for example, included in the local identity list) are generated. For example, the local recognition result output by the local object recognition model represents the person (i.e., the local object) detected from the same type of sensing data, and a higher confidence indicates a more reliable local recognition result.

In step S852, a confidence threshold is defined. For example, the user may set the required confidence threshold, or the confidence threshold may be determined by the computing device 120 itself, wherein the confidence threshold may represent, for example, the minimum confidence required for the local recognition result. The confidence threshold for local object recognition may be the same as or different from the confidence threshold for global object recognition.

In step S854, it is determined whether the confidence level is below a confidence level threshold. If yes, ending the process; if not, step S856 is executed. For example, if the confidence of the local recognition result is lower than the confidence threshold, it indicates that the current local recognition result is not highly reliable, and the data of the sensor may need to be updated or the local recognition result may need to be updated after the object in the scene moves. If the confidence of the local identification result is not lower than the confidence threshold, the current local identification result has certain confidence.

In step S856, the local recognition result and the corresponding confidence level, and the local fusion features are fed back to the local object detection models (e.g., the local object detection and correspondence module 1311 shown in fig. 2).

Through the process of fig. 8F, each local object detection model can utilize the feedback path to perform self-learning (self-learning) of the Local Object Detection (LOD) model and the Local Object Recognition (LOR) model, so that the capability and accuracy of local object detection and recognition can be automatically enhanced. In some embodiments, the local object detection model and the local object identification model may selectively refer to the feedback information, and may also determine how to apply the feedback information according to actual conditions and requirements.

FIGS. 9A-1 and 9A-2 are block diagrams illustrating a monitoring method according to an embodiment of the invention.

In one embodiment, the computing device 120 performs a local object detection and correspondence process, including a local object detection process and a local object correspondence process, at blocks 902-1 to 902-N. Each of the blocks 902-1 to 902-N receives sensing data from a different type of sensor; for example, block 902-1 receives video data from one camera 110A or multiple cameras 110A (e.g., cameras 110A-1 to 110A-4), block 902-2 receives audio data from one microphone 110B or multiple microphones 110B (e.g., microphones 110B-1 to 110B-3), and block 902-N receives odor data from one odor sensor 110D or multiple odor sensors 110D (e.g., odor sensors 110D-1 to 110D-3). Each block 902 generates a local object list and a corresponding local rough feature set for its type of sensing data. For example, the local rough feature sets corresponding to the different types of sensing data include information such as the direction, distance, profile, and structure of the local objects of the different types of sensing data. In addition, the process of establishing the local object list and the corresponding local rough feature set for video objects and audio objects from the video data and the audio data can refer to the embodiments of FIGS. 3A and 3B and FIGS. 4A and 4B.

At blocks 904-1 to 904-N, the computing device 120 performs a local object fine feature extraction and fusion process for each type of sensing data, including a local object fine feature extraction process and a local fine feature fusion process. For example, the computing device 120 sequentially performs the local object fine feature extraction process on the sensing data related to each local object according to the local object list, to establish a local fine feature set of each sensing data related to each local object. The computing device 120 further performs the local fine feature fusion process according to the local object list, to sequentially fuse the local fine feature sets corresponding to the sensing data associated with each local object into a local fusion feature of that local object.

At blocks 906-1 to 906-N, the computing device 120 inputs the local fusion features of the local objects of each type into a Local Object Recognition (LOR) model to perform a local object identification process, marks the recognition results with local identity identifiers, and collects the local identity identifiers (LIID) to generate a local identity list.

At block 908, the computing device 120 performs a global object mapping process to generate a global object list and a global rough feature set corresponding to each global object. For example, the flow of the global object processing may refer to the embodiment of FIG. 5; here the correspondence is not limited to video objects and audio objects, but the flow is similar. For example, the computing device 120 compares the time stamps of the local objects in the local object lists of the different types of sensing data one by one. When the time stamps match, the computing device 120 compares the world coordinate positioning information of the local objects whose time stamps match. When the world coordinate positioning information also matches, the computing device 120 further determines whether the local rough feature sets of the different types of sensing data match. When the local rough feature sets also match, the computing device 120 may associate the successfully matched local objects of the different types of sensing data and their corresponding local rough feature sets with each other, and establish a corresponding global object and a corresponding global rough feature set, wherein each global object has a corresponding global object identifier.

At block 910, the computing device 120 performs a global fine feature set mapping process. For example, the computing device 120 merges the different types of local objects corresponding to each global object in the global object list and their corresponding local fusion features into the global fine feature set of that global object. Since the successfully matched local objects of different types are already included in the global object list and also have corresponding local fusion features, the computing device 120 links the local fusion features corresponding to the different types of local objects to each other to generate the global fine feature set corresponding to the global object.

At block 912, the computing device 120 performs a global context analysis process. For example, the computing device 120 analyzes a context area corresponding to a local object of different types (cross-types) among the global objects in the global object list. Taking the video object as an example, referring to fig. 7A, the computing device 120 performs context analysis on the spatial range of the context area 730 in the video frame 700. If the audio object is taken as an example, referring to FIG. 7B, the computing device 120 performs context analysis on the time range of the context area 780 in the audio segment 750.

The computing device 120 further determines whether any fine feature is dominant in the context area of each local object. For example, the video fine features include color, texture, shape, etc., and the audio fine features include volume, timbre, pitch, etc., and the result of the context analysis process determines whether the Adjacent Distinguishable Weighting Coefficients or the Adaptive Weighting Coefficients are used for the subsequent global fine feature fusion. In addition, each weighting coefficient corresponds to one fine feature. Therefore, taking video data and audio data as an example, there are 6 weighting coefficients in total.

If it is determined at block 912 that the Adjacent Distinguishable Weighting Coefficients are to be used for the global fine feature fusion, then at the selector 914 the local identity lists, the global object list, and the corresponding global fine feature sets generated at block 910 are input into block 916, and the computing device 120 updates the Adjacent Distinguishable Weighting Coefficients (ADWC).

If it is determined at block 912 that the Adaptive Weighting Coefficients (AWC) are to be used for the global fine feature fusion, then at the selector 914 the local identity lists, the global object list, and the corresponding global fine feature sets generated at block 910 are directly input into block 918 for global fine feature fusion. In addition, the updated adaptive weighting coefficients generated at block 924 are also input into block 918 for the global fine feature fusion. The adaptive weighting coefficients generated at block 924 are, for example, determined and updated according to the recognition result of the global object recognition model at the previous time frame.

At block 918, the computing device 120 performs the global fine feature fusion process. As described above, the input weighting coefficients for the global fine feature fusion process may be the Adjacent Distinguishable Weighting Coefficients (ADWC) or the Adaptive Weighting Coefficients (AWC), depending on the result of the global context analysis of block 912. In detail, the computing device 120 performs another feature fusion on the global fine feature set corresponding to each global object to obtain a Global Fusion Feature (GFF) corresponding to each global object.

At block 920, the computing device 120 inputs the global fusion features into a global object recognition model for global object identification, and generates a global identity list (global identity ID list). For example, the computing device 120 may assign a Global Identity Identifier (GIID) to each global fusion feature, and the computing device 120 records the global identity identifiers in the global identity list. In addition, the global identity list further records the recognition result and the confidence of the global object recognition model.

At the selector 922, if the determination result at block 912 is to perform the global fine feature fusion using the Adjacent Distinguishable Weighting Coefficients, the recognition result and the corresponding confidence (both may be referred to as feedback information) output at block 920 are directly output to the local object recognition models of the different types of sensing data, for example, blocks 906-1 to 906-N. In some embodiments, if the confidence corresponding to the recognition result of a specific type of sensing data is less than a predetermined percentage (e.g., 80%), the selector 922 does not feed the recognition result and the confidence of the specific type of sensing data back to the local object recognition models in blocks 906-1 to 906-N. In other embodiments, the local object recognition models in blocks 906-1 to 906-N may also determine whether to use the feedback information.

At block 924, the computing device 120 updates the adaptive weighting coefficients, for example, according to the previous recognition result of the global object recognition model.

At block 926, the computing device 120 decomposes the global fine features to obtain different types of fine features. It is noted that the different types of detailed features obtained by decomposition in block 926 are input into the local object recognition models of blocks 906-1 through 906-N, respectively.

Therefore, each local object recognition model in blocks 906-1 to 906-N can adjust or update the current local object recognition model according to the recognition result of the corresponding type and its confidence (from block 920 and via the selector 922), and the fine features of the corresponding type (from block 926), so that the next object recognition can obtain a more accurate result.

FIGS. 9B-1 and 9B-2 show detailed block diagrams of a global context analysis process according to the embodiments of FIGS. 9A-1 and 9A-2. In one embodiment, the global context analysis process performed in block 912 of FIGS. 9A-1 and 9A-2 can refer to the contents of FIGS. 9B-1 and 9B-2. Each of the blocks 904-1 to 904-N in FIGS. 9A-1 and 9A-2 performs a context acquisition process and a context fusion process for the corresponding type of sensing data in addition to the local object fine feature extraction process and the local fine feature fusion process for the corresponding type of sensing data. The computing device 120 performs the context acquisition process while performing the local object fine feature extraction process, as shown in blocks 904-1-904-N of FIGS. 9B-1 and 9B-2.

In detail, taking block 904-1 as an example, after the video data Video1, Video2, and Video3 captured by the cameras 110A-1 to 110A-3 pass through block 902-1 in FIG. 9A-1, the Video1, Video2, and Video3 are also input into block 904-1; that is, the video data captured by the different cameras 110A are input into block 904-1 for local object fine feature extraction and context acquisition, respectively, as shown in blocks 904-1A, 904-1B, and 904-1C. The local object fine feature sets obtained in blocks 904-1A to 904-1C are input into block 904-1D for local fine feature fusion to generate a local object list and local fusion features for the video data, and the local object list and the local fusion features are input into block 910 for global fine feature set creation.

In addition, the context acquisition processes performed in blocks 904-1A, 904-1B, and 904-1C may, for example, refer to the detected video object and obtain the context region and the predicted region of interest in the corresponding video frame. At block 904-1E, the computing device 120 performs a local context fusion process and an ROI fusion process to fuse the context regions and the predicted regions of interest from blocks 904-1A, 904-1B, and 904-1C, obtaining, for example, a fused context region and a fused region of interest. Blocks 904-2 (e.g., for Audio1, Audio2, and Audio3) through 904-N (e.g., for scent data Smell1, Smell2, and Smell3) in FIGS. 9B-1 and 9B-2 may each process the corresponding type of sensing data and obtain a fused context region and a fused region of interest of that type.
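
By way of a hedged example, the local context fusion and ROI fusion in block 904-1E could be realized as a simple union of bounding boxes expressed in a shared coordinate frame; the union rule is only an illustrative assumption, as the embodiments do not fix a particular fusion operator.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def fuse_regions(regions: List[Box]) -> Box:
    """Fuse the context regions (or predicted regions of interest) reported
    by several cameras into one region by taking the union of their bounding
    boxes. The regions are assumed to share one coordinate frame."""
    x_min = min(r[0] for r in regions)
    y_min = min(r[1] for r in regions)
    x_max = max(r[2] for r in regions)
    y_max = max(r[3] for r in regions)
    return (x_min, y_min, x_max, y_max)

# Example: fuse the predicted ROIs from blocks 904-1A, 904-1B, and 904-1C.
fused_roi = fuse_regions([(10, 20, 110, 220), (15, 25, 120, 230), (5, 18, 100, 210)])
```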

The fused context regions and the fused regions of interest obtained in blocks 904-1 to 904-N are input to the corresponding blocks 912-1 to 912-N in block 912 for a context analysis process, and the local context analysis results of blocks 912-1 to 912-N are transmitted to block 9120 for a context analysis result merging process and a global region of interest (ROI) merging process. According to the merged context result generated in block 9120, the computing device 120 determines whether to use the adjacent distinguishable weighting coefficients or the adaptive weighting coefficients for the global fine feature fusion.
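
The selection between the two kinds of weighting coefficients can be sketched as follows; the `reliable` flag and the any-unreliable-type rule are assumptions used only to illustrate how the merged context result in block 9120 might drive the choice.

```python
def select_weighting_scheme(local_context_results):
    """Pick the weighting scheme from the merged per-type context analysis
    results. If any type is judged unreliable (e.g., video in a dark scene),
    the adjacent distinguishable weighting coefficients (ADWC) are chosen;
    otherwise the adaptive weighting coefficients (AWC) are used."""
    if any(not result.get("reliable", True) for result in local_context_results):
        return "ADWC"
    return "AWC"

# Example: the video context analysis flags a dark scene as unreliable.
scheme = select_weighting_scheme([{"type": "video", "reliable": False},
                                  {"type": "audio", "reliable": True}])
```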

FIG. 9C is a flowchart illustrating a monitoring method using multi-dimensional sensor data according to an embodiment of the invention. The flow diagrams and block diagrams of FIGS. 9A-1-9A-2 and 9B-1-9B-2 can be combined and simplified to the flow of FIG. 9C. Please refer to fig. 9C and fig. 2.

At block 952, sensing data is obtained using each group of sensors of the same type. For example, the sensing data obtained by the sensors of the same type is transmitted to the corresponding local object detection and correspondence module 1311.

At block 954, local object detection and correspondence (LOD and LOC) is performed. For example, the local object detection and correspondence module 1311 receives the sensing data from the camera 110A, the microphone 110B, the taste sensor 110C, the odor sensor 110D, and the tactile sensor 110E, respectively, and performs a local object detection and correspondence process (including Local Object Detection (LOD) and Local Object Correspondence (LOC)) for the corresponding sensing type to generate a local object identifier list (LOID list) and a Local Rough Feature Set (LRFS).
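
A minimal sketch of the local object correspondence part of block 954 is shown below; the greedy cosine matching and the threshold value are assumptions for illustration and are not prescribed by the embodiments.

```python
import numpy as np
from itertools import count

def local_object_correspondence(detections_per_sensor, match_threshold=0.8):
    """Sketch of local object correspondence (LOC): detections from several
    sensors of the same type are matched by rough-feature similarity and
    grouped under one local object identifier (LOID)."""
    next_id = count(1)
    loid_list = []               # local object identifier list (LOID list)
    rough_features = {}          # local rough feature set (LRFS), keyed by LOID
    for detections in detections_per_sensor:         # one list per sensor
        for feature in detections:                   # rough feature vector of one detection
            feature = np.asarray(feature, dtype=float)
            best_loid, best_sim = None, match_threshold
            for loid, ref in rough_features.items():
                sim = float(feature @ ref / (np.linalg.norm(feature) * np.linalg.norm(ref)))
                if sim > best_sim:
                    best_loid, best_sim = loid, sim
            if best_loid is None:                     # no match -> new local object
                best_loid = f"LOID{next(next_id):02d}"
                loid_list.append(best_loid)
            rough_features[best_loid] = feature       # keep the latest rough feature
    return loid_list, rough_features
```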

At block 956, local fine feature extraction and fusion is performed. For example, the local object feature extraction and fusion module 1312 performs a local feature extraction and fusion process, which includes a local fine feature extraction process (LDFE) and a local fine feature fusion process (LDFF). For example, the computing device 120 extracts the local fine features of each type of sensing data according to the local object list and the local rough feature set generated by the local object recognition module 131, and establishes a local fine feature set corresponding to each type of sensing data. Then, according to each type's local object list, the computing device 120 fuses the local fine feature sets of each type of sensing data into a local fusion feature for each local object. In some embodiments, the local object feature extraction and fusion module 1312 further performs context acquisition and fusion for each type of sensing data to generate a fused context region. In addition, the local object feature extraction and fusion module 1312 may also fuse the regions of interest of each type of sensing data to generate a fused region of interest.
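
As an illustration, the local fine feature fusion of block 956 could be sketched as follows, assuming each sensor's fine features are stored as vectors keyed by the local object identifier; the concatenation operator is an assumption, since the text does not specify how the sets are fused.

```python
import numpy as np

def fuse_local_fine_features(local_object_list, fine_feature_sets):
    """For each local object in the local object list, gather the fine
    feature vectors extracted from the different sensors of the same type
    and fuse them into one local fusion feature (here, by concatenation)."""
    local_fusion_features = {}
    for loid in local_object_list:
        vectors = [np.asarray(fs[loid], dtype=float)
                   for fs in fine_feature_sets if loid in fs]
        if vectors:
            local_fusion_features[loid] = np.concatenate(vectors)
    return local_fusion_features
```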

At block 958, local object recognition (LOR) is performed. For example, the local object recognition model 1313 performs local object recognition to generate a local identification list corresponding to each type of sensing data. The computing device 120 inputs the local fusion features of each type from block 956 into the local object recognition model 1313 to perform a local object recognition process, labels each recognition result with a local identification code, and assembles the local identification codes into a local identification list (LIID list). In one embodiment, the local object recognition model 1313 may feed the generated local object recognition results of each type back to the local object detection and correspondence module 1311 through a feedback path (e.g., arrow 959), so that the local object detection and correspondence module 1311 can perform self-learning according to the local object recognition results of the corresponding type.

At block 960, Global Object Correspondence (GOC) is performed. For example, the global object and feature set correspondence module 1321 may perform Global Object Correspondence (GOC) according to the local object identifier list (LOID list) and the Local Rough Feature Set (LRFS) from block 954 and the local identification list (LIID list) from block 958 to generate a global object identifier list (GOID list) and a Global Rough Feature Set (GRFS).

At block 962, global fine feature correspondence (GDFC) is performed. For example, the global object and feature set correspondence module 1321 can perform the global fine feature correspondence process to generate a global fine feature set (GDFS) according to the global object identifier list (GOID list) and the global rough feature set (GRFS) from block 960, the local identification list (LIID list) from block 958, and the fused context regions and fused regions of interest of the respective types from block 956.

At block 964, global fine feature fusion (GDFF) is performed. For example, the global fine feature fusion module 1324 performs the global fine feature fusion process according to the weighting parameters output by the weighting parameter selection module 1323, such as fusing the global fine feature sets generated by the global object and feature set correspondence module 1321 into a global fusion feature. The weighting parameter can be, for example, an adaptive weighting coefficient (AWC) or an adjacent distinguishable weighting coefficient (ADWC), depending on the result of the global context analysis; details can be found in the embodiments of FIGS. 8C-1, 8C-2, and 9B-1 to 9B-2.
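
A hedged sketch of the weighted global fine feature fusion is given below; the weighted concatenation is an illustrative choice, and the `weights` dictionary stands in for the AWC or ADWC values supplied by the weighting parameter selection module 1323.

```python
import numpy as np

def global_fine_feature_fusion(global_fine_feature_set, weights):
    """Fuse the per-type fine feature vectors of one global object into a
    global fusion feature: each type's vector is scaled by its weighting
    coefficient (a coefficient of 0 removes an unhelpful type) and the
    scaled vectors are concatenated."""
    parts = []
    for sensor_type, vector in global_fine_feature_set.items():
        w = weights.get(sensor_type, 1.0)
        parts.append(w * np.asarray(vector, dtype=float))
    return np.concatenate(parts)
```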

At block 966, Global Object Recognition (GOR) is performed. For example, the global identification module 133 inputs the global fusion features generated by the global fine feature fusion module 1324 into a global object recognition model to identify the global identity of each global fusion feature, for example, by creating a global identity list that records the global identity of each global fusion feature. In addition, the global identity list further records the recognition result and the confidence of the global object recognition model.

In some embodiments, for example in a dark environment lacking light, the person who is speaking can be identified by hearing alone (e.g., using the microphone 110B), that is, by listening to the tone of the person's speech. In some embodiments, a dog bark, a cat meow, or another animal sound can be recognized by listening alone; for an animal that is frequently encountered, such as a particular neighbor's dog, the specific animal can even be identified solely from its sound.

In some embodiments, the surrounding environment can be assessed by smell (e.g., using the odor sensor 110D) to predict potential danger, such as a burnt smell, a gas odor, or a gasoline odor.

In some embodiments, during a conversation or negotiation, the other party's behavior can be judged by listening to his or her tone of speech, and even by the scent given off by the other party, such as alcohol or perfume.

The monitoring system 100 can integrate the sensing information collected from different types of sensors (analogous to different human senses) and then make an appropriate response. In detail, some sensing data in the scene monitored by the monitoring system 100 may contain no object needing attention; for example, in a dark or low-light environment, the video data captured by the camera 110A is generally not helpful for identifying the global object. In this case, the computing device 120 may determine that the video data contains no object of interest, but may still determine that an object of interest is present from the audio data or other types of sensing data. Because such video fine features do not help identify the global object, the context analysis performed by the computing device 120 selects the adjacent distinguishable weighting coefficients for the global fine feature fusion process; that is, the weighting coefficients of the video fine features (including, for example, color, texture, and shape) are all set to 0 before the global fine feature fusion process is performed.

Similarly, in another embodiment, the scene monitored by the monitoring system 100 may be a noisy environment, and various environmental noises may be mixed into the audio data received by the microphone 110B. In this case, although the computing device 120 may determine that the audio data contains an object of interest, the determination may be affected by the noise, so the confidence (or accuracy) of the audio object determination is reduced. The context analysis performed by the computing device 120 therefore uses the adjacent distinguishable weighting coefficients for the global fine feature fusion process; that is, the weighting coefficients of the audio fine features (including, for example, volume, pitch, and timbre) are all set to 0 before the global fine feature fusion process is performed.
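
The 0/1 weighting described in the two examples above can be sketched as follows; the per-type reliability flags are assumptions used to illustrate how the adjacent distinguishable weighting coefficients zero out an unhelpful type.

```python
def adjacent_distinguishable_weights(reliability):
    """Build ADWC-style weights from per-type reliability flags: types judged
    unhelpful by the context analysis (video in a dark scene, audio in a
    noisy scene) get weight 0 and the remaining types keep weight 1."""
    return {t: (1.0 if reliable else 0.0) for t, reliable in reliability.items()}

# Dark scene: the video fine features (color, texture, shape) are zeroed out.
weights_dark = adjacent_distinguishable_weights({"video": False, "audio": True, "smell": True})
# Noisy scene: the audio fine features (volume, pitch, timbre) are zeroed out.
weights_noisy = adjacent_distinguishable_weights({"video": True, "audio": False, "smell": True})
```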

FIG. 10 is a diagram of a scenario and monitoring system according to an embodiment of the present invention.

In a conventional video surveillance system, each camera independently and continuously captures video images and stores them on the hard disk of the surveillance system. After the cameras are installed, the surveillance system can display the captured images on a monitoring screen in real time so that security personnel can watch them at any time. If an event occurs, the image files on the hard disk must be retrieved and reviewed manually, and because of factors such as shooting angle and installation position, the images captured by different cameras must also be tracked and linked manually. Because information from the independent cameras cannot be fused and exchanged in real time, such a system lacks cross-type sensing fusion and analysis capability; it is easily affected by light interference, occlusion, overlapping objects, and similar conditions, cannot acquire complete information, and therefore its recognition is only approximate and its recognition results are unstable.

The monitoring system of the present invention solves the above problems. As shown in FIG. 10, the scene 1000 is, for example, the area near the entrance of a bank. A camera 1010 and a directional microphone 1020 are installed there to monitor the area around the bank gate 1001, which is defined, for example, as a first area 1031. After entering the bank through the bank gate 1001 there is a customer waiting area with, for example, a sofa 1002. A camera 1011 and a directional microphone 1021 are installed in the customer waiting area to monitor the area from the bank gate into the bank lobby and the customer waiting area, and this monitored area is defined as a second area 1032. The first area 1031 and the second area 1032 have an overlapping area, defined, for example, as a third area 1033, and the cameras 1010 to 1011 and the directional microphones 1020 to 1021 can all monitor the third area 1033.

The installation location of each type of sensor in the monitoring system 100, and the scene space it monitors for capturing images and receiving sound, can be described using world coordinates, so the spatial location of every detected object can be converted to a world-coordinate position. In addition, the information collected by all sensors can be transmitted to a central data processing center (not shown), which can execute the method of the foregoing embodiments of the present invention to fuse the sensing data of each type of sensor through the AI recognition system and generate feedback, thereby achieving self-training, self-reinforcing detection and recognition capability and accuracy.
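
As an illustration of the world-coordinate conversion, a minimal sketch assuming a rigid transform (rotation plus translation) per installed sensor is given below; the embodiments only state that detected object locations are converted to world coordinates, so the exact transform model is an assumption.

```python
import numpy as np

def to_world_coordinates(local_position, sensor_rotation, sensor_translation):
    """Convert an object position detected in a sensor's local coordinate
    frame into world coordinates using the sensor's installed pose
    (a 3x3 rotation matrix and a translation vector)."""
    p = np.asarray(local_position, dtype=float)
    return sensor_rotation @ p + sensor_translation

# Example: a camera mounted 3 m above the floor, axes aligned with the world frame.
R = np.eye(3)
t = np.array([0.0, 0.0, 3.0])
world_pos = to_world_coordinates([1.2, 0.5, 0.0], R, t)
```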

For example, if three people (e.g., persons 1041, 1042, and 1043) arrive at the gate of the bank and enter the first area 1031, they are detected by the camera 1010 and the directional microphone 1020, and the computing device 120 can establish the tags ID #01, ID #02, and ID #03 for the persons 1041, 1042, and 1043 using the procedure of the above embodiments. Furthermore, the computing device 120 performs fine feature extraction on the video data and the recorded audio data captured of the persons 1041, 1042, and 1043, respectively; for example, the video fine features include color, texture, and shape, and the audio fine features include volume, pitch, and timbre.

The video fine features corresponding to tag ID #01 of person 1041 are, for example: black, no stripes, adult, female; and the audio fine features are, for example: loud, sharp, and clear. The video fine features corresponding to tag ID #02 of person 1042 are, for example: blue, no stripes, adult, male; and the audio fine features are, for example: moderate volume, muddy, deep, and full. The video fine features corresponding to tag ID #03 of person 1043 are, for example: black, horizontal stripes, child; and the audio fine features are, for example: loud, bright, clear, and lively.

Referring to FIGS. 9A-1, 9A-2, and 10, in detail, when the persons 1041, 1042, and 1043 are located in the first area 1031, the computing device 120 may extract the video fine features and the audio fine features corresponding to their tags ID #01, ID #02, and ID #03, input them into the respective local object recognition models (e.g., blocks 906-1 and 906-2), and generate recognition results for the respective video objects and audio objects, recorded, for example, in a local identification list L1 for the video objects and a local identification list L2 for the audio objects.

Next, global object correspondence and global fine feature set correspondence are performed via blocks 908 and 910 to generate a global object identifier list, a global rough feature set, and a corresponding global fine feature set. Assuming that the adaptive weighting coefficients (AWC) are selected in block 912, the global object identifier list, the global rough feature set, and the corresponding global fine feature set undergo global fine feature fusion in block 918 to generate global fusion features, and recognition results are generated by the global object recognition model in block 920, for example, recognizing global objects P1, P2, and P3 and assigning the corresponding global IDs GIID1, GIID2, and GIID3 to the recognized global objects P1, P2, and P3.

Briefly, the global objects corresponding to the global IDs GIID1, GIID2, and GIID3 have all of the video fine features and audio fine features of tags ID #01, ID #02, and ID #03, respectively.

Therefore, when the persons 1041, 1042, and 1043 enter the first area 1031, the computing device 120 establishes the tags ID #01, ID #02, and ID #03 and the global IDs GIID1, GIID2, and GIID3 of the persons 1041, 1042, and 1043, together with all of their video and audio fine features.

When the persons 1041, 1042, and 1043 move from the first area 1031 into the overlapping area 1033, the area 1033 can be monitored by the cameras 1010 to 1011 and the directional microphones 1020 to 1021 at the same time. While the persons 1041, 1042, and 1043 move from the first area 1031 into the third area 1033, the camera 1010 and the directional microphone 1020 continue to collect video and audio data and perform object tracking and object recognition, but feature information may be missed because of the position and angle of the sensors, the lighting and background sound of the environment, or occlusion caused by overlapping persons. Therefore, when the persons 1041, 1042, and 1043 enter the overlapping area 1033 from the first area 1031, the computing device 120 can use not only the video and audio data captured by the camera 1011 and the directional microphone 1021 but also the video and audio data captured by the camera 1010 and the directional microphone 1020 at different positions and angles to establish the global fusion feature of each object according to the above steps. The computing device 120 may also integrate the feature data collected by the camera 1010 and the directional microphone 1020. Then, the global object recognition model in block 920 determines whether each new global fusion feature is the same as the global fusion feature of the previously recognized global objects corresponding to the global IDs GIID1, GIID2, and GIID3. If the global fusion features are the same, the objects can be judged to be the same person; if they are different, the objects can be judged to be different persons.

When the persons 1041, 1042, and 1043 leave the overlapping area 1033 and enter the second area 1032, the computing device 120 can only use the video and audio data captured by the camera 1011 and the directional microphone 1021 to establish the global fusion feature of each object according to the above steps. Then, the global object recognition model in block 920 determines whether each global fusion feature is the same as the global fusion feature of the previously recognized global objects corresponding to the global IDs GIID1, GIID2, and GIID3. If the global fusion features are the same, the computing device 120 may determine that it is the same person; if they are different, the computing device 120 may determine that they are different persons. Therefore, the monitoring system of the invention uses richer and more complete information for object recognition than a conventional monitoring system and has a feedback reinforcement mechanism, so its tracking and recognition capability and accuracy can be greatly improved.
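
A hedged sketch of the comparison of global fusion features is given below; cosine similarity with a threshold is an illustrative rule, as the text only states that matching global fusion features indicate the same person.

```python
import numpy as np

def is_same_person(feature_a, feature_b, threshold=0.9):
    """Compare two global fusion features and decide whether they belong to
    the same person, using cosine similarity against a threshold."""
    a = np.asarray(feature_a, dtype=float)
    b = np.asarray(feature_b, dtype=float)
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sim >= threshold
```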

FIG. 11 is a flow chart illustrating a monitoring method using multi-dimensional sensor data according to an embodiment of the invention. Please refer to fig. 2 and fig. 11.

In step S1110, the scene is detected by the sensors (e.g., the sensors 110A to 110E) to obtain sensing data of each type. The sensors 110 include a variety of different types, such as the camera 110A, the microphone 110B, the taste sensor 110C, the odor sensor 110D, and the tactile sensor 110E, or combinations thereof, although embodiments of the invention are not limited to sensors of the above types or attributes.

In step S1120, a local object process is performed on each type of sensing data to generate local object feature information. For example, the local object processing includes the processes performed on the local objects by the local object detection and correspondence module 1311, the local object feature extraction and fusion module 1312, and the local object recognition model 1313. In addition, the local object feature information generated by the local object recognition module 131 includes a local object list, a local rough feature set, local fusion features, and a local identification list for each type of sensing data.

In step S1130, a global object process is performed according to the local object feature information to generate global object feature information. For example, the global object processing includes the processes performed on the global objects by the global object and feature set correspondence module 1321, the context area analysis module 1322, the weighting parameter selection module 1323, and the global fine feature fusion module 1324. In addition, the global object feature information generated by the feature fusion module 132 includes a global object list, a corresponding global fine feature set, and global fusion features.

In step S1140, global object recognition is performed on the global object feature information to generate a global recognition result. For example, the global identification module 133 inputs the global fusion features generated by the global fine feature fusion module 1324 into a global object recognition model to identify the global identity of each global fusion feature, for example, by creating a global identity list that records the global identity of each global fusion feature.
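
The four steps S1110 to S1140 can be summarized in the following sketch; the callables and the sensors' `read()` method are assumptions standing in for the modules described above.

```python
def monitor_scene(sensors_by_type, local_object_process, global_object_process,
                  global_object_recognition):
    """High-level sketch of steps S1110 to S1140. The three callables stand
    in for the local object processing, global object processing, and global
    object recognition described above."""
    # S1110: detect the scene with every type of sensor.
    sensing_data = {t: [sensor.read() for sensor in group]
                    for t, group in sensors_by_type.items()}
    # S1120: local object processing per type -> local object feature information.
    local_info = {t: local_object_process(t, data) for t, data in sensing_data.items()}
    # S1130: global object processing -> global object feature information.
    global_info = global_object_process(local_info)
    # S1140: global object recognition -> global recognition result.
    return global_object_recognition(global_info)
```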

In summary, embodiments of the present invention provide a monitoring method and a monitoring system using multi-dimensional sensor data. Different types of sensors are used to obtain sensing data of a scene; local objects of the same type are detected, corresponded, and recognized; and local objects of different types are corresponded to generate global objects of the global sensing data, each having a global fusion feature. In addition, the monitoring system and monitoring method using multi-dimensional sensor data can perform global object recognition, so the reliability and accuracy of identifying objects in the monitored scene are higher.

Although the present invention has been described with reference to particular embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
