Method, apparatus, electronic device, medium, and program product for determining attributes

Document No.: 664141    Publication date: 2021-04-27

Reading note: This technology, "Method, apparatus, electronic device, medium, and program product for determining attributes" (确定属性的方法、装置、电子设备、介质和程序产品), was designed and created by 庞磊, 聂卫国, 李晨曦, 王珊, and 张塘昆 on 2020-12-25. Its main content is as follows: The present disclosure provides a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for determining attributes of a person, which can be used in the fields of video classification, person recognition, and video recommendation. The method comprises the following steps: determining behavior classification information of the person from the behavior of the person for a video segment including the person; determining an audio segment including the sound of the person from the video segment; determining sound classification information of the person from the sound for the audio segment; and determining attributes of the person based on the behavior classification information and the sound classification information. With this method, the attributes of a person in a video can be determined accurately and efficiently, so that the person can be described accurately and the video can be classified accurately, thereby improving the accuracy of video recommendation and the user experience.

1. A method of determining attributes of a person, comprising:

determining behavior classification information of the person through the behavior of the person based on a video segment including the person;

determining an audio segment including the sound of the person from the video segment;

determining sound classification information of the person through the sound based on the audio segment; and

determining the attribute of the person based on the behavior classification information and the sound classification information.

2. The method of claim 1, further comprising:

dividing the video into a plurality of video segments of equal length; and

determining, among the plurality of video segments, the video segment in which the person appears for the longest time as the video segment of the person.

3. The method of claim 1, wherein determining the segment of audio comprises:

acquiring a combination of a plurality of audio clips including the sound of the person in the video segment;

if the time length of the combination is greater than a threshold time length, truncating audio of the threshold time length from the combination as the audio segment;

if the time length of the combination is equal to the threshold time length, determining the combination as the audio segment; and

if the time length of the combination is less than the threshold time length and greater than a second threshold time length, adding blank audio to the combination such that the time length of the combination after the addition is equal to the threshold time length, and determining the combination after the addition as the audio segment.

4. The method of claim 1, wherein the behavior classification information comprises a behavior classification multidimensional vector, the sound classification information comprises a sound classification multidimensional vector, and determining the attribute comprises:

determining an attribute multi-dimensional vector of the person based on the behavior classification multi-dimensional vector and the sound classification multi-dimensional vector; and

determining the attribute based on the attribute multi-dimensional vector.

5. The method of claim 1, wherein the behavior classification information and the sound classification information each include at least one candidate classification and at least one classification probability associated with the at least one candidate classification, and determining the attribute comprises:

determining at least one candidate attribute and at least one attribute probability associated with the at least one candidate attribute based on candidate classifications for the behavior classification information and the sound classification information and classification probabilities associated with the candidate classifications; and

determining the attribute based on the at least one candidate attribute and the at least one attribute probability.

6. The method of claim 1, wherein the behavior classification information and the sound classification information each comprise at least one candidate classification, and determining the attribute comprises:

excluding mutually exclusive candidate classifications from the candidate classifications for the behavioral classification information and the sound classification information; and

determining the attribute based on a candidate classification that excludes the mutually exclusive candidate classification.

7. The method of claim 1, wherein determining the attribute comprises:

determining the attribute of the person using a multi-modal fusion method based on the behavior classification information and the sound classification information.

8. The method of claim 1, further comprising:

determining an image including the person from the video segment;

determining static classification information of the person from the image;

wherein determining the attribute of the person comprises:

determining the attribute of the person based on the behavior classification information, the sound classification information, and the static classification information.

9. The method of claim 8, wherein determining the image comprises determining a plurality of images including the person, and determining the static classification information comprises:

determining a plurality of static classification information of the person from the plurality of images; and

determining the static classification information by a voting method based on the plurality of static classification information.

10. The method of claim 9, wherein determining the static classification information comprises:

determining gender information for the person based on at least one of the plurality of static classification information;

verifying the plurality of static classification information using the gender information; and

for static classification information that is verified as erroneous, performing one of:

reducing its voting weight; and

discarding it.

11. An apparatus to determine attributes of a person, comprising:

a behavior classification information determination module configured to determine behavior classification information of the person by a behavior of the person based on a video segment including the person;

an audio segment determination module configured to determine an audio segment including the sound of the person from the video segment;

a sound classification information determination module configured to determine sound classification information of the person by the sound based on the audio segment; and

an attribute determination module configured to determine the attribute of the person based on the behavior classification information and the sound classification information.

12. The apparatus of claim 11, further comprising:

a segmentation module configured to segment the video into a plurality of video segments of equal length; and

a video segment determination module configured to determine, among the plurality of video segments, a video segment having the longest time of occurrence of the person as the video segment of the person.

13. The apparatus of claim 11, wherein the audio segment determination module comprises:

a combination acquiring module configured to acquire a combination of a plurality of audio clips including the sound of the person in the video segment;

a first audio segment determination module configured to truncate audio of a threshold time length from the combination as the audio segment if the time length of the combination is greater than the threshold time length;

a second audio segment determination module configured to determine the combination as the audio segment if the time length of the combination is equal to the threshold time length; and

a third audio segment determination module configured to, if the time length of the combination is less than the threshold time length and greater than a second threshold time length, add blank audio to the combination such that the time length of the combination after the addition is equal to the threshold time length, and determine the combination after the addition as the audio segment.

14. The apparatus of claim 11, wherein the behavior classification information comprises a behavior classification multidimensional vector, the sound classification information comprises a sound classification multidimensional vector, and the attribute determination module comprises:

an attribute multidimensional vector determination module configured to determine an attribute multidimensional vector of the person based on the behavior classification multidimensional vector and the sound classification multidimensional vector; and

a first attribute determination module configured to determine the attribute based on the attribute multi-dimensional vector.

15. The apparatus of claim 11, wherein the behavior classification information and the sound classification information each include at least one candidate classification and at least one classification probability associated with the at least one candidate classification, and the attribute determination module comprises:

a candidate attribute and attribute probability determination module configured to determine at least one candidate attribute and at least one attribute probability associated with the at least one candidate attribute based on candidate classifications for the behavior classification information and the sound classification information and classification probabilities associated with the candidate classifications; and

a second attribute determination module configured to determine the attribute based on the at least one candidate attribute and the at least one attribute probability.

16. The apparatus of claim 11, wherein the behavior classification information and the sound classification information each comprise at least one candidate classification, and the attribute determination module comprises:

a candidate classification exclusion module configured to exclude mutually exclusive candidate classifications from candidate classifications for the behavior classification information and the sound classification information; and

a third attribute determination module configured to determine the attribute based on a candidate classification after excluding the mutually exclusive candidate classification.

17. The apparatus of claim 11, wherein the attribute determination module comprises:

a fourth attribute determination module configured to determine the attribute of the person using a multi-modal fusion method based on the behavior classification information and the sound classification information.

18. The apparatus of claim 11, further comprising:

an image determination module configured to determine an image including the person from the video segment;

a static classification information determination module configured to determine static classification information of the person from the image;

wherein the attribute determination module comprises:

a fifth attribute determination module configured to determine the attribute of the person based on the behavior classification information, the sound classification information, and the static classification information.

19. The apparatus of claim 18, wherein determining the image comprises determining a plurality of images including the person, and the static classification information determination module comprises:

a first static classification information determination module configured to determine a plurality of static classification information of the person through the plurality of images; and

a second static classification information determination module configured to determine the static classification information by a voting method based on the plurality of static classification information.

20. The apparatus of claim 19, wherein the second static classification information determination module comprises:

a gender information determination module configured to determine gender information for the person based on at least one of the plurality of static classification information;

a static classification information verification module configured to verify the plurality of static classification information using the gender information; and

an erroneous static classification information processing module configured to, for static classification information verified as erroneous, perform one of:

reducing its voting weight; and

discarding it.

21. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.

22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-10.

23. A computer program product comprising a computer program which, when executed by a processor, performs the method of any one of claims 1-10.

Technical Field

The present disclosure relates to computer technology, and more particularly, to a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for determining attributes of persons, which may be used in the field of video classification, the field of person recognition, and the field of video recommendation.

Background

Users of video viewing applications often wish to learn about the style of a video before watching it, particularly when the video is hosted by a character such as a real person, a virtual person, an animated person, or an animated animal, and to learn in advance about the attributes of the host so that they can select videos they like according to their personal preferences. A character's style is judged by combining multiple factors: under normal circumstances, factors such as manner of speech, behavior, dressing style, and stature together form the subjective impression the character leaves on others. To meet these needs of the user, a video viewing application may typically provide the user with a description of the anchor style or the video category.

Descriptions of the anchor style or video classification may be provided by the provider or uploader of the video, but such descriptions are likely to have issues with correctness, accuracy, and standardization. Thus, a video viewing application may proactively analyze videos provided by the provider or uploader to determine the anchor style or video classification. In this manner, a more accurate and standardized description of the anchor style or video category may be provided to users of the video viewing application, helping them select videos they prefer to watch.

However, conventional techniques for identifying videos cannot comprehensively and accurately describe the anchor style or video classification.

Disclosure of Invention

According to an embodiment of the present disclosure, a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for determining attributes of a person are provided.

In a first aspect of the present disclosure, there is provided a method of determining attributes of a person, comprising: determining behavior classification information of a person through the behavior of the person based on a video segment including the person; determining an audio segment including a sound of a person from the video segment; determining sound classification information of a person through sound based on the audio segment; and determining attributes of the person based on the behavior classification information and the sound classification information.

In a second aspect of the present disclosure, there is provided an apparatus for determining attributes of a person, comprising: a behavior classification information determination module configured to determine behavior classification information of a person by a behavior of the person based on a video segment including the person; an audio segment determination module configured to determine an audio segment including a sound of a person from the video segment; a sound classification information determination module configured to determine sound classification information of a person by sound based on the audio segment; and an attribute determination module configured to determine an attribute of the person based on the behavior classification information and the sound classification information.

In a third aspect of the present disclosure, an electronic device is provided, comprising at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to implement a method according to the first aspect of the disclosure.

In a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to implement a method according to the first aspect of the present disclosure.

In a fifth aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the method according to the first aspect of the present disclosure.

With the technology according to the present application, a multimodal person attribute determination method is provided that can determine the attributes of a person by combining video recognition, audio recognition, and image recognition, and further adopting a vector combination approach. With this method, the attributes of a person in a video can be determined accurately and efficiently, so that the person can be described accurately and the video can be classified accurately, thereby improving the accuracy of video recommendation and the user experience.

It should be understood that the statements herein reciting aspects are not intended to limit the critical or essential features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The foregoing and other objects, features, and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings, wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure. It should be understood that the drawings are provided for a better understanding of the present solution and do not constitute a limitation of the present disclosure. In the drawings:

FIG. 1 illustrates a schematic block diagram of an environment 100 for determining attributes of a person, in which methods of determining attributes of a person according to certain embodiments of the present disclosure may be implemented;

FIG. 2 illustrates a flow diagram of a method 200 of determining attributes of a person according to an embodiment of the present disclosure;

FIG. 3 illustrates a flow diagram of a method 300 of determining attributes of a person according to an embodiment of the present disclosure;

FIG. 4 illustrates a schematic diagram of phase data 400 from video segment to attribute in accordance with an embodiment of the disclosure;

FIG. 5 shows a schematic diagram of a feature vector topology 500 according to an embodiment of the present disclosure;

FIG. 6 shows a schematic block diagram of an apparatus 600 to determine attributes of a person according to an embodiment of the present disclosure; and

FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure.

Like or corresponding reference characters designate like or corresponding parts throughout the several views.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.

As described above in the Background, although a video viewing application may proactively analyze videos provided by a provider or an uploader to determine an anchor attribute or a video classification, conventional techniques for identifying videos cannot comprehensively and accurately describe the anchor attribute or the video classification.

Specifically, in conventional schemes, existing character style recognition approaches include: (1) image style recognition, which mainly defines person types at the image level and recognizes the corresponding styles using image classification techniques; and (2) audio style classification, which mainly defines sound styles and performs corresponding style identification on timbres using audio classification techniques.

However, the above approaches in conventional schemes have the following disadvantages. Approach (1) can only define character styles at the image level by category and perform image-level style classification using existing deep learning techniques; objectively, however, a character's style is formed by a combination of factors such as stature, manner of speech, behavior, and dressing style, and image style classification based on a single modality cannot describe a character's style comprehensively. Approach (2) classifies using the timbre of the speaker's voice; each person's voice has unique characteristics and changes with age, but timbre alone is not a character attribute that fully conforms to natural law, and a single-modality voice style determined from sound alone cannot fully describe a character's style.

To address, at least in part, one or more of the above problems and other potential problems, embodiments of the present disclosure propose a scheme for determining attributes of a person. This scheme provides a multimodal person attribute determination method, which can determine the attributes of a person by combining video recognition, audio recognition, and image recognition, and further adopting a vector combination approach.

FIG. 1 illustrates a schematic block diagram of an environment 100 for determining attributes of a person, in which methods of determining attributes of a person according to certain embodiments of the present disclosure may be implemented. According to one or more embodiments of the present disclosure, the environment 100 may be a cloud environment. As shown in FIG. 1, the environment 100 includes a computing device 110. In the environment 100, a video segment 120 including a person is supplied to the computing device 110 as input, and the computing device 110 produces the attribute 130 of the person as output. In accordance with one or more embodiments of the present disclosure, the attribute 130 can include styles of, for example, an anchor character included in the video segment 120, covering appearance, language, and behavior, such as steady, passionate, serious, lovely, bony, strong, or thin.

It should be understood that the environment 100 for determining attributes of a person is merely exemplary and not limiting, and it is scalable: more computing devices 110 can be included, more video segments 120 including persons can be provided as input, and the computing devices 110 can output more attributes 130, so that more users can simultaneously use more computing devices 110, and even more video segments 120, to obtain attributes 130 of persons simultaneously or at different times. Further, the computing device 110 need not actually output the attribute 130 of the person; it may simply obtain the attribute 130 through processing.

According to one or more embodiments of the present disclosure, in the environment 100, after the computing device 110 acquires the video segment 120 including the person, it may generate from the video segment 120 an audio segment including the person's sound and an image including the person. It may then perform corresponding recognition on the video segment 120, the audio segment, and the image, respectively, to determine the behavior classification information, sound classification information, and static classification information of the person, determine the attribute 130 of the person based on these classification information, and further output the attribute 130.

In the environment 100 for determining attributes of a person shown in FIG. 1, the input of the video segment 120 to the computing device 110 and the output of the attribute 130 of the person from the computing device 110 can be performed through a network.

FIG. 2 shows a flow diagram of a method 200 of determining attributes of a person according to an embodiment of the present disclosure. In particular, the method 200 of determining attributes of a person may be performed by the computing device 110 in the environment 100 of determining attributes of a person shown in FIG. 1. It should be understood that the method 200 of determining attributes of a person may also include additional operations not shown and/or may omit the operations shown, as the scope of the present disclosure is not limited in this respect.

At block 202, the computing device 110 determines behavior classification information of a person from the behavior of the person for a video segment 120 that includes the person. According to one or more embodiments of the present disclosure, for the video segment 120, the computing device 110 determines the behavior of the person using changes between frames, and determines the behavior classification information from that behavior. The behavior of the person may include the magnitude of changes in the person's motion, the making of particular motions, and the like. For example, if the person's motion changes little between frames, it may be determined that the behavior classification information includes steadiness; if the person's motion changes greatly between frames, it may be determined that the behavior classification information includes passion. For another example, if the change in motion between frames indicates that the person has made a heart gesture, it may be determined that the behavior classification information includes loveliness; if the change in motion indicates that the person has made a fist-pumping motion, it may be determined that the behavior classification information includes motivation or passion.
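As a rough illustration of the inter-frame analysis just described, the sketch below quantifies motion as a mean absolute difference between consecutive grayscale frames and maps its magnitude to a candidate behavior label. The function names, threshold value, and label set are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def motion_magnitude(frames: np.ndarray) -> float:
    """Mean absolute pixel change between consecutive frames.

    frames: array of shape (num_frames, height, width), grayscale.
    """
    if len(frames) < 2:
        return 0.0
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return float(diffs.mean())

def behavior_candidate(frames: np.ndarray, threshold: float = 8.0) -> str:
    # Small inter-frame change suggests steadiness; large change suggests passion.
    # The threshold of 8.0 is a hypothetical tuning value.
    return "passion" if motion_magnitude(frames) > threshold else "steadiness"
```

A production system would use a learned action recognition model rather than a raw pixel difference; the sketch only mirrors the small-change/large-change intuition in the text.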

In accordance with one or more embodiments of the present disclosure, the behavior of the person determined by the computing device 110 using changes between frames may also include changes in the shape of the person's mouth. From changes in mouth shape, a method such as lip reading can be used to recognize the speech uttered by the person. It should be understood that when the sound made by a person is determined from changes in mouth shape, information such as the volume and tone of the sound cannot be provided, since the determination is not based on an actual sound recording.

In accordance with one or more embodiments of the present disclosure, the video segment 120 may not be provided to the computing device 110 directly; instead, raw video, such as live or recorded video, may be provided. In this case, the computing device 110 can generate the video segment 120 from the video. For example, the computing device 110 may divide the video into a plurality of video segments of equal length, such as 45 seconds, and determine, among them, the video segment in which the person appears for the longest time as the video segment 120. For another example, if the person appears continuously in the video, the computing device 110 may extract a video segment of a certain length, such as 45 seconds, anywhere in the video as the video segment 120. For another example, the computing device 110 can extract a video segment of a certain length, such as 45 seconds, in which the person's motion changes greatly, as the video segment 120.
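A minimal sketch of the segment-selection logic described above, assuming a hypothetical `person_present(frame)` detector that reports whether the person is visible in a frame; the 45-second length is the example value from the text.

```python
from typing import Callable, List, Sequence

def split_equal_segments(frames: Sequence, fps: int,
                         seconds: int = 45) -> List[Sequence]:
    """Divide a frame sequence into equal-length segments (45 s by default)."""
    step = fps * seconds
    return [frames[i:i + step] for i in range(0, len(frames), step)]

def pick_person_segment(segments: List[Sequence],
                        person_present: Callable) -> Sequence:
    """Return the segment in which the person appears in the most frames."""
    return max(segments, key=lambda seg: sum(person_present(f) for f in seg))
```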

At block 204, the computing device 110 determines an audio segment including the sound of the person from the video segment 120. In accordance with one or more embodiments of the present disclosure, so that the computing device can later determine the person's sound classification information more easily and accurately from the audio segment, it is preferable that the audio segment include only the sound of the person. Thus, the computing device 110 can first extract the full audio from the video segment 120 and then perform two operations to determine the audio segment including the sound of the person: removing noise such as background sounds, and removing portions that do not include the sound of the person. It should be appreciated that these two operations may be performed in either order, or in parallel.

In accordance with one or more embodiments of the present disclosure, the sound of the person may include the person's speech as well as non-speech sounds made by the person, such as whistling, applauding, and snapping fingers.

According to one or more embodiments of the disclosure, the computing device 110 can determine the audio segment by obtaining a combination of multiple audio clips that include the sound of the person in the video segment 120. To enable the computing device 110 to determine the person's sound classification information more accurately from the audio segment, the audio segment may be set to a fixed length of time, such as 20 seconds. Thus, a threshold time length such as 20 seconds may be set: if the time length of the combination of audio clips is greater than 20 seconds, 20 seconds of audio is truncated from the combination as the audio segment; if the time length of the combination is equal to 20 seconds, the combination itself is determined as the audio segment. If the time length of the combination is less than the threshold time length, an audio segment of the fixed length, such as 20 seconds, may be obtained by adding blank audio to the combination. Note that for the computing device 110 to determine the sound classification information reliably, the audio segment must reach a minimum time length, such as a second threshold time length of 16 seconds. Thus, if the time length of the combination of audio clips is, say, 17 seconds, which is less than 20 seconds and greater than 16 seconds, 3 seconds of blank audio may be added to the combination so that the padded combination is 20 seconds long, and the padded combination is determined as the audio segment.
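The truncate/pad rules above can be stated compactly in code. This is a minimal sketch assuming raw waveforms as NumPy arrays; the 20-second and 16-second thresholds are the example values from the text.

```python
from typing import List, Optional
import numpy as np

THRESHOLD_S = 20.0      # target audio-segment length (example value from the text)
MIN_THRESHOLD_S = 16.0  # second (minimum) threshold (example value from the text)

def build_audio_segment(clips: List[np.ndarray], sr: int) -> Optional[np.ndarray]:
    """Combine the person's audio clips and normalize the result to 20 s."""
    combo = np.concatenate(clips)
    length_s = len(combo) / sr
    target = int(THRESHOLD_S * sr)
    if length_s >= THRESHOLD_S:
        return combo[:target]                # truncate to exactly 20 s
                                             # (an exactly 20 s combo passes through)
    if length_s > MIN_THRESHOLD_S:
        pad = np.zeros(target - len(combo), dtype=combo.dtype)
        return np.concatenate([combo, pad])  # pad with blank (silent) audio
    return None                              # too short to classify reliably
```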

At block 206, the computing device 110 determines sound classification information of the person from the sound for the audio segment. According to one or more embodiments of the present disclosure, the person's sound may be characterized by intonation, speech rate, volume, or whether certain specific expressions are spoken. For example, if the person speaks at a gentle pace with measured wording, it may be determined that the sound classification information includes seriousness; if the person speaks quickly, with lively intonation and high volume, it may be determined that the sound classification information includes passion. For another example, if expressions such as "family" or "maotai" often appear in the person's speech, it may be determined that the sound classification information includes liveliness or loveliness; if expressions such as "olympic" often appear, it may be determined that the sound classification information includes passion or motivation.

According to one or more embodiments of the present disclosure, the sound classification information may include classifications divided by attributes of the person's sound, such as intonation and timbre, for example a young-girl voice classification, an elderly voice classification, and the like. These classification categories may be defined according to actual requirements, and the scope of the disclosure is not limited in this respect.

According to one or more embodiments of the present disclosure, when the person's sound includes non-speech sounds, such as whistling, applauding, and snapping fingers, the computing device 110 may determine from these non-speech sounds that the sound classification information includes, for example, passion or motivation.

In connection with blocks 204 and 206, in accordance with one or more embodiments of the present disclosure, the computing device 110 may use a multimodal video classification method, such as a multimodal video classification algorithm, to identify segments (e.g., chatting or speaking) in which the audio includes the person's sound, and then extract the audio and employ a sound source separation method, such as a sound source separation algorithm. This is because the environment and other factors cause an audio aliasing problem in the video segment 120, where various sounds are mixed together; to solve this aliasing problem, which arises particularly in video and live-streaming scenes, the introduced sound source separation method can effectively extract and separate the person's sound from background sounds. Thereafter, the computing device 110 may apply an audio classification method, such as an audio classification algorithm, to the extracted sound. For example, the computing device 110 may convert the audio segment into matrix features that can be input into a convolutional neural network, and classify the audio segment to recognize the timbre of the person's voice and ultimately extract the attributes of the person's voice.
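As one possible realization of the "matrix features into a convolutional neural network" step, the following sketch converts a waveform into a log-mel spectrogram and classifies it with a small CNN. The architecture, mel parameters, and class count are assumptions for illustration; the disclosure does not specify a network.

```python
import torch
import torch.nn as nn
import torchaudio

class TimbreClassifier(nn.Module):
    """Toy CNN over a log-mel spectrogram 'matrix feature' (illustrative only)."""

    def __init__(self, num_classes: int, sample_rate: int = 16_000):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=64)
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_classes))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> (batch, 1, n_mels, time) for Conv2d.
        feats = torch.log1p(self.melspec(waveform)).unsqueeze(1)
        return self.net(feats)  # per-class logits for timbre categories
```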

At block 208, the computing device 110 determines the attribute of the person based on the behavior classification information determined at block 202 and the sound classification information determined at block 206. According to one or more embodiments of the present disclosure, the attribute of the person may be a combination of descriptions from the behavior classification information and the sound classification information, such as seriousness and motivation; it may be one of those descriptions; or it may be a more general description derived from several of them.

According to some embodiments of the present disclosure, the behavior classification information includes a behavior classification multidimensional vector and the sound classification information includes a sound classification multidimensional vector. Determining the attribute of the person may then include determining an attribute multidimensional vector of the person based on the behavior classification multidimensional vector and the sound classification multidimensional vector, and determining the attribute of the person based on the attribute multidimensional vector, where each dimension of the attribute multidimensional vector may correspond to one classification, such as the aforementioned seriousness or motivation.

According to further embodiments of the present disclosure, the behavior classification information and the sound classification information each include at least one candidate classification and at least one classification probability associated with the at least one candidate classification. Determining the attribute of the person may then include determining at least one candidate attribute and at least one associated attribute probability based on the candidate classifications and their classification probabilities, and determining the attribute of the person from them. For example, suppose the behavior classification information includes two candidate classifications, seriousness with a probability of 60% and liveliness with a probability of 40%, and the sound classification information also includes two candidate classifications, seriousness with a probability of 70% and liveliness with a probability of 30%. The corresponding probabilities may be added and divided by two to obtain two candidate attributes: seriousness with a probability of 65% and liveliness with a probability of 35%. The attribute of the person may then be determined as seriousness with a probability of 65% and liveliness with a probability of 35%, or simply as seriousness by taking the candidate with the larger probability.
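The worked example above reduces to a few lines of code. A minimal sketch; the label names and probabilities are taken directly from the example, and simple averaging stands in for whatever fusion rule an implementation actually uses.

```python
def fuse_probabilities(behavior: dict, sound: dict) -> dict:
    """Average per-class probabilities from the two modalities."""
    classes = set(behavior) | set(sound)
    return {c: (behavior.get(c, 0.0) + sound.get(c, 0.0)) / 2 for c in classes}

behavior = {"seriousness": 0.60, "liveliness": 0.40}
sound = {"seriousness": 0.70, "liveliness": 0.30}

fused = fuse_probabilities(behavior, sound)
# fused == {'seriousness': 0.65, 'liveliness': 0.35}
top = max(fused, key=fused.get)  # 'seriousness' (take the larger probability)
```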

According to still further embodiments of the present disclosure, the behavior classification information and the sound classification information each include at least one candidate classification, and determining the attribute of the person may include excluding mutually exclusive candidate classifications from the candidate classifications of the behavior classification information and the sound classification information, and determining the attribute of the person based on the candidate classifications that remain. For example, if the candidate classifications include seriousness, sadness, and happiness, then, because sadness and happiness are mutually exclusive, at least one of those two candidate classifications may be removed, so that the attribute of the person is determined to include seriousness, or seriousness and sadness, or seriousness and happiness.
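A small sketch of the exclusion rule, assuming a hand-maintained table of mutually exclusive label pairs; which side of a conflicting pair to keep is a policy choice the text leaves open.

```python
from typing import Set

# Illustrative exclusion table; the disclosure does not enumerate the pairs.
MUTUALLY_EXCLUSIVE = [("sadness", "happiness")]

def drop_exclusive(candidates: Set[str]) -> Set[str]:
    """Remove at least one side of every mutually exclusive pair present."""
    result = set(candidates)
    for a, b in MUTUALLY_EXCLUSIVE:
        if a in result and b in result:
            result.discard(b)  # policy: keep the first-listed label
    return result

print(drop_exclusive({"seriousness", "sadness", "happiness"}))
# {'seriousness', 'sadness'}
```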

According to one or more embodiments of the present disclosure, after obtaining the behavior classification information determined at block 202 and the sound classification information determined at block 206, the computing device 110 may also determine the attribute of the person using a multimodal fusion method, such as a multimodal fusion algorithm, based on the behavior classification information and the sound classification information.

FIG. 3 shows a flow diagram of a method 300 of determining attributes of a person according to an embodiment of the present disclosure. In particular, the method 300 of determining attributes of a person may also be performed by the computing device 110 in the environment 100 of determining attributes of a person shown in FIG. 1. It should be understood that the method 300 of determining attributes of a person may also include additional operations not shown and/or may omit the operations shown, as the scope of the present disclosure is not limited in this respect. In the method 300 of determining attributes of people, static classification information determined from individual images in the video segment 120 is additionally considered.

At block 302, the computing device 110 determines behavior classification information of a person from the behavior of the person based on the video segment 120 including the person. The specific content of this step is the same as that of the step of block 202 and is not repeated here.

At block 304, the computing device 110 determines an audio segment including the sound of the person from the video segment 120. The specific content of this step is the same as that of the step of block 204 and is not repeated here.

At block 306, the computing device 110 determines sound classification information of the person from the sound based on the audio segment. The specific content of this step is the same as that of the step of block 206 and is not repeated here.

At block 308, the computing device 110 determines an image including the person from the video segment 120. In accordance with one or more embodiments of the present disclosure, individual images included in the video segment 120, considered without regard to their interrelationships, may also be used to determine attributes of the person. Therefore, an image including the person is first extracted from the video segment 120.

At block 310, the computing device 110 determines static classification information of the person from the image including the person determined at block 308. According to one or more embodiments of the present disclosure, the image of the person may reflect the person's gender, height, stature, color value (facial attractiveness), expression, and the like. For example, if the image shows that the person is tall, it may be determined that the static classification information of the person includes tallness; if the image shows that the person's muscles are well developed, it may be determined that the static classification information includes robustness. For another example, if the image shows that the person is crying, it may be determined that the static classification information includes sadness; if the image shows that the person is laughing, it may be determined that the static classification information includes happiness.

In accordance with one or more embodiments of the present disclosure, the computing device 110 may determine a plurality of images including the person at block 308. For example, the computing device 110 can capture one image per second from the video segment 120. As another example, the computing device 110 can capture at least 20 images including the person from the video segment 120, with intervals greater than 0.5 seconds. The computing device 110 may then determine a plurality of static classification information of the person from the plurality of images, and determine the final static classification information by a voting method, such as a voting algorithm, based on the plurality of static classification information.

According to one or more embodiments of the present disclosure, after the computing device 110 determines the plurality of images at block 308, it may determine gender information of the person based on at least one of the plurality of static classification information determined at block 310, and use the gender information to verify whether any of the static classification information is erroneous. For example, since a man generally cannot have static classification information such as charming, if the person's gender information is determined to be male and some static classification information includes charming, that static classification information may be verified as erroneous. The negative influence of the erroneous static classification information on the finally determined attribute can then be avoided by reducing the voting weight of the charming static classification information or by directly discarding it. For another example, since a man generally cannot have static classification information such as girlish, if the person's gender information is male and some static classification information includes girlish, that static classification information may likewise be verified as erroneous, and its negative influence avoided by reducing its voting weight or directly discarding it. The above verification may be understood as using the gender classification to constrain the appearance classification and the stature classification. A combined sketch of the voting and verification steps follows.
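Putting the voting and gender-constraint steps together, the sketch below implements a weighted vote in which labels deemed implausible for the verified gender are down-weighted or discarded. The implausibility table and penalty value are assumptions for illustration.

```python
from collections import Counter
from typing import List

# Labels treated as implausible for a given gender (illustrative assumption).
IMPLAUSIBLE = {"male": {"charming", "girlish"}}

def vote_static_classification(labels: List[str], gender: str,
                               penalty: float = 0.2) -> str:
    """Weighted vote over per-image static classifications.

    Labels conflicting with the verified gender get weight `penalty`;
    set penalty=0.0 to discard them outright.
    """
    weights = Counter()
    implausible = IMPLAUSIBLE.get(gender, set())
    for label in labels:
        weights[label] += penalty if label in implausible else 1.0
    return weights.most_common(1)[0][0]

print(vote_static_classification(["tall", "tall", "charming"], "male"))  # tall
```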

At block 312, the computing device 110 determines the attribute of the person based on the behavior classification information determined at block 302, the sound classification information determined at block 306, and the static classification information determined at block 310. According to one or more embodiments of the present disclosure, the attribute of the person may be a combination of descriptions from these classification information, such as seriousness and motivation; it may be one of those descriptions; or it may be a more general description derived from several of them.

Further, in accordance with one or more embodiments of the present disclosure, the computing device 110 may also determine the attribute of the person using, as described above with respect to block 208, a multidimensional-vector-based approach, a probability-based approach, an approach based on excluding mutually exclusive candidate classifications, or a multimodal fusion method such as a multimodal fusion algorithm. For example, using a multimodal video classification method, such as a multimodal video classification algorithm, the computing device 110 can concatenate the audio features embodied in the sound classification information, the temporal information embodied in the behavior classification information, and the image features embodied in the static classification information of the video segment 120, as sketched below. The computing device 110 may also use a behavior recognition method, such as a behavior recognition algorithm, that takes the video segment as input with a segment structure suited to multimodal video classification: learning sound classification information by applying 2D convolution to the audio segment extracted from the video segment 120, learning behavior classification information by applying a two-stream network model directly to the video segment 120, and learning static classification information by applying 3D convolution to the images extracted from the video segment 120, finally obtaining the attribute of the person.
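One straightforward way to realize the feature splicing described above is late fusion: concatenate the per-modality feature vectors and pass them through a small classification head. This is a sketch under the assumption that each modality has already been encoded into a fixed-size vector; the hidden size and head structure are illustrative.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Concatenate per-modality features and map them to attribute logits."""

    def __init__(self, audio_dim: int, behavior_dim: int, static_dim: int,
                 num_attributes: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + behavior_dim + static_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_attributes))

    def forward(self, audio: torch.Tensor, behavior: torch.Tensor,
                static: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio, behavior, static], dim=-1)  # feature splicing
        return self.head(fused)
```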

FIG. 4 shows a schematic diagram of phase data 400 from video segment to attribute according to an embodiment of the present disclosure. According to one or more embodiments of the disclosure, each item of the phase data 400 corresponds to data acquired or determined in the method 300 of determining attributes of a person shown in FIG. 3.

As illustrated in FIG. 4, at the top is the video segment 120 including the person. As indicated by the arrows drawn from the video segment 120, the behavior classification information 440 of the person can be obtained from the video segment 120, and the audio segment 410 including the sound of the person and the image 420 including the person can be determined from the video segment 120. Thereafter, as indicated by the arrows drawn from the audio segment 410 and the image 420, the sound classification information 430 and the static classification information 450 of the person may be determined from the audio segment 410 and the image 420, respectively. Finally, as indicated by the arrows drawn from the sound classification information 430, the behavior classification information 440, and the static classification information 450, the attribute 130 of the person may be determined based on these three kinds of classification information. The specific contents of the video segment 120, the audio segment 410, the image 420, the sound classification information 430, the behavior classification information 440, the static classification information 450, and the attribute 130 in FIG. 4 are the same as those described with reference to FIGS. 1 to 3, and are not repeated here.

FIG. 5 shows a schematic diagram of a feature vector topology 500 according to an embodiment of the present disclosure. In accordance with one or more embodiments of the present disclosure, the feature vector topology 500 may serve as an input to the multimodal vector fusion method described with reference to FIGS. 2 and 3. For example, the multimodal vector fusion method determines attributes, such as the style of a person, using multimodal feature vectors such as behavior, gender, age, color value, stature, portrait style, and voice style as inputs.

As shown in FIG. 5, the feature vector topology 500 includes seven nodes: attribute 130, behavior 510, sound 520, appearance 530, age 540, color value 550, and stature 560. As can be seen from the feature vector topology 500, behavior 510 is associated with attribute 130; sound 520 is associated with attribute 130 and age 540; and appearance 530 is associated with attribute 130, age 540, color value 550, and stature 560. Thus attribute 130 is directly associated with behavior 510, sound 520, and appearance 530, and may be indirectly associated with age 540, color value 550, and stature 560. In accordance with one or more embodiments of the present disclosure, behavior 510, sound 520, and appearance 530 may correspond to the behavior classification information, the sound classification information, and the static classification information described with reference to FIGS. 2 and 3, respectively. It should be understood that the specific contents of these nodes are only used to exemplify the feature vector topology and are not intended to limit the scope of the present disclosure.

According to one or more embodiments of the present disclosure, after the feature vector topology 500, in complete form or in a partial form including at least some of the nodes, is taken as input, the information of each node and its neighboring nodes may be aggregated by a graph neural network. Using the shallow-structure advantage of the graph neural network, the latent spatial topological connections of each modality can be learned without destroying the feature vectors of the individual modalities, thereby determining an attribute multidimensional vector for the attribute 130 of the person.
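To make the aggregation concrete, the sketch below encodes the FIG. 5 topology as an adjacency matrix and performs simple mean-aggregation message passing: in each round, every node's feature becomes the average of itself and its neighbors. This is a toy stand-in for a graph neural network layer; the actual model, feature dimensions, and number of rounds are not specified by the disclosure.

```python
import numpy as np

# Nodes of the FIG. 5 topology; edges follow the associations in the text.
NODES = ["attribute", "behavior", "sound", "appearance",
         "age", "color_value", "stature"]
EDGES = [("behavior", "attribute"), ("sound", "attribute"), ("sound", "age"),
         ("appearance", "attribute"), ("appearance", "age"),
         ("appearance", "color_value"), ("appearance", "stature")]

idx = {name: i for i, name in enumerate(NODES)}
adj = np.eye(len(NODES))  # self-loops keep each node's own feature
for a, b in EDGES:
    adj[idx[a], idx[b]] = adj[idx[b], idx[a]] = 1.0

def aggregate(features: np.ndarray, rounds: int = 2) -> np.ndarray:
    """Average each node's feature with its neighbors' for `rounds` layers.

    features: (7, d) matrix, one row of modality features per node.
    """
    norm = adj / adj.sum(axis=1, keepdims=True)  # row-normalized adjacency
    h = features
    for _ in range(rounds):
        h = norm @ h
    return h  # the row for 'attribute' can feed the final attribute head
```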

The environment 100 for determining attributes of a person in which methods of determining attributes of a person according to embodiments of the present disclosure may be implemented, the method 200 of determining attributes of a person, the method 300 of determining attributes of a person, the phase data 400 from video segment to attribute, and the feature vector topology 500 have been described above with reference to FIGS. 1 to 5. It should be understood that the above description is intended to better illustrate what is recited in the present disclosure, and is not intended to be limiting in any way.

It should be understood that the number of various elements and the size of physical quantities employed in the various drawings of the present disclosure are by way of example only and are not limiting upon the scope of the present disclosure. The above numbers and sizes may be arbitrarily set as needed without affecting the normal implementation of the embodiments of the present disclosure.

Details of the method 200 of determining attributes of a person and the method 300 of determining attributes of a person according to embodiments of the present disclosure have been described above with reference to fig. 1 to 5. Hereinafter, respective modules in the apparatus for determining attributes of a person will be described with reference to fig. 6.

FIG. 6 is a schematic block diagram of an apparatus 600 for determining attributes of a person according to an embodiment of the present disclosure. As shown in FIG. 6, the apparatus 600 for determining attributes of a person may include: a behavior classification information determination module 610 configured to determine behavior classification information of a person by a behavior of the person based on a video segment including the person; an audio segment determination module 620 configured to determine an audio segment including a sound of the person from the video segment; a sound classification information determination module 630 configured to determine sound classification information of the person by the sound based on the audio segment; and an attribute determination module 640 configured to determine the attribute of the person based on the behavior classification information and the sound classification information.

In one or more embodiments, the apparatus 600 for determining attributes of a person further comprises: a segmentation module (not shown) configured to segment the video into a plurality of video segments of equal length; and a video segment determination module (not shown) configured to determine, among the plurality of video segments, a video segment having the longest person appearance time as the video segment of the person.

In one or more embodiments, the audio segment determination module 620 includes: a combination acquiring module (not shown) configured to acquire a combination of a plurality of audio clips including the sound of the person in the video segment; a first audio segment determination module (not shown) configured to truncate audio of a threshold time length from the combination as the audio segment if the time length of the combination is greater than the threshold time length; a second audio segment determination module (not shown) configured to determine the combination as the audio segment if the time length of the combination is equal to the threshold time length; and a third audio segment determination module (not shown) configured to, if the time length of the combination is less than the threshold time length and greater than a second threshold time length, add blank audio to the combination such that the time length of the combination after the addition is equal to the threshold time length, and determine the combination after the addition as the audio segment.

In one or more embodiments, the behavior classification information comprises a behavior classification multidimensional vector, the sound classification information comprises a sound classification multidimensional vector, and the attribute determination module 640 comprises: an attribute multidimensional vector determination module (not shown) configured to determine an attribute multidimensional vector of the person based on the behavior classification multidimensional vector and the sound classification multidimensional vector; and a first attribute determination module (not shown) configured to determine the attributes based on the attribute multidimensional vector.
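
One natural reading of this vector-level fusion is concatenation followed by a learned projection, as sketched below; the projection matrix W stands in for whatever fusion network is actually trained and is purely illustrative.

```python
import numpy as np

def attribute_vector(behavior_vec: np.ndarray,
                     sound_vec: np.ndarray,
                     W: np.ndarray) -> np.ndarray:
    """Fuse the two classification vectors into an attribute multidimensional
    vector; W is assumed to have shape
    (attribute_dim, len(behavior_vec) + len(sound_vec))."""
    fused = np.concatenate([behavior_vec, sound_vec])  # implicit-vector fusion by concatenation
    return W @ fused                                   # attribute multidimensional vector

# The attribute itself can then be read off the vector, e.g. as the index of
# its largest component: int(np.argmax(attribute_vector(b, s, W)))
```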

In one or more embodiments, the behavior classification information and the sound classification information each include at least one candidate classification and at least one classification probability associated with the at least one candidate classification, and the attribute determination module 640 includes: a candidate attribute and attribute probability determination module (not shown) configured to determine at least one candidate attribute and at least one attribute probability associated with the at least one candidate attribute, based on the candidate classifications of the behavior classification information and the sound classification information and the classification probabilities associated with those candidate classifications; and a second attribute determination module (not shown) configured to determine the attribute based on the at least one candidate attribute and the at least one attribute probability.
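
As an illustration of how candidate attributes and attribute probabilities might be derived, the sketch below forms an attribute candidate from each (behavior classification, sound classification) pair and assigns it the product of the two classification probabilities; this product rule is an assumption made for the example, not a rule stated in the disclosure.

```python
from itertools import product

def candidate_attributes(behavior: dict, sound: dict) -> dict:
    """Map each (behavior class, sound class) pair to a candidate attribute
    whose attribute probability is the product of the two classification
    probabilities."""
    return {(b, s): pb * ps
            for (b, pb), (s, ps) in product(behavior.items(), sound.items())}

def determine_attribute(behavior: dict, sound: dict):
    candidates = candidate_attributes(behavior, sound)
    return max(candidates, key=candidates.get)  # highest attribute probability wins
```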

In one or more embodiments, the behavior classification information and the sound classification information each comprise at least one candidate classification, and the attribute determination module 640 comprises: a candidate classification exclusion module (not shown) configured to exclude mutually exclusive candidate classifications from the candidate classifications of the behavior classification information and the sound classification information; and a third attribute determination module (not shown) configured to determine the attribute based on the candidate classifications that remain after the mutually exclusive candidate classifications have been excluded.
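
A minimal sketch of this exclusion step follows; the table of mutually exclusive pairs is purely illustrative, since the disclosure does not enumerate such pairs here.

```python
# Hypothetical table of candidate classifications that cannot both hold.
EXCLUSIVE_PAIRS = [("male", "female"), ("child", "elderly")]

def exclude_mutually_exclusive(candidates: dict) -> dict:
    """Drop the lower-probability member of each mutually exclusive pair."""
    kept = dict(candidates)
    for a, b in EXCLUSIVE_PAIRS:
        if a in kept and b in kept:
            weaker = a if kept[a] < kept[b] else b  # keep the higher-probability member
            del kept[weaker]
    return kept
```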

In one or more embodiments, the attribute determination module 640 comprises: a fourth attribute determination module (not shown) configured to determine attributes of the person using a multi-modal fusion method based on the behavior classification information and the sound classification information.
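
Multi-modal fusion can take many forms; besides the vector-level fusion sketched earlier, one simple probability-level alternative is a weighted sum of the per-modality classification probabilities, as below. The modality weights here are hypothetical hyperparameters, not values from the disclosure.

```python
def late_fusion(behavior_probs: dict,
                sound_probs: dict,
                w_behavior: float = 0.6,
                w_sound: float = 0.4):
    """Score each label by a weighted sum of its behavior and sound
    probabilities and return the best-scoring label."""
    labels = set(behavior_probs) | set(sound_probs)
    scores = {c: w_behavior * behavior_probs.get(c, 0.0)
                 + w_sound * sound_probs.get(c, 0.0)
              for c in labels}
    return max(scores, key=scores.get)
```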

In one or more embodiments, the apparatus 600 for determining attributes of a person further comprises: an image determination module (not shown) configured to determine an image including the person from the video segment; a static classification information determination module (not shown) configured to determine static classification information of the person through the image; and the attribute determination module 640 includes: a fifth attribute determination module (not shown) configured to determine the attribute of the person based on the behavior classification information, the sound classification information, and the static classification information.
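
As a rough sketch of the image determination and static classification steps, the code below keeps only frames in which the person is detected and classifies each one; person_detector and classify_frame are hypothetical callables standing in for the respective models.

```python
def static_classification(frames: list, person_detector, classify_frame) -> list:
    """Return one static classification label per frame that includes the person."""
    person_frames = [f for f in frames if person_detector(f)]  # images including the person
    return [classify_frame(f) for f in person_frames]          # one static label per image
```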

In one or more embodiments, determining the image comprises determining a plurality of images including the person, and the static classification information determination module comprises: a first static classification information determination module (not shown) configured to determine a plurality of pieces of static classification information of the person through the plurality of images; and a second static classification information determination module (not shown) configured to determine the static classification information by a voting method based on the plurality of pieces of static classification information.
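
The voting method can be as simple as a majority vote over the per-image labels, as in this sketch.

```python
from collections import Counter

def vote_static_classification(per_image_labels: list):
    """Return the label that the most images agree on."""
    return Counter(per_image_labels).most_common(1)[0][0]
```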

In one or more embodiments, the second static classification information determination module includes: a gender information determination module (not shown) configured to determine gender information of the person based on at least one of the plurality of pieces of static classification information; a static classification information verification module (not shown) configured to verify the plurality of pieces of static classification information using the gender information; and an erroneous static information processing module (not shown) configured to, for any static classification information verified as erroneous, perform one of: decreasing its voting weight, and discarding it.
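
Below is a sketch combining the verification and processing steps: per-image classifications whose implied gender disagrees with the determined gender either have their voting weight decreased or are discarded. The infer_gender callable and the reduced weight value are assumptions made for illustration.

```python
def weighted_vote(per_image_labels: list,
                  gender: str,
                  infer_gender,
                  reduced_weight: float = 0.3,
                  discard: bool = False):
    """Weighted vote over per-image labels, verified against the determined gender."""
    scores = {}
    for label in per_image_labels:
        if infer_gender(label) == gender:
            w = 1.0                      # verified as consistent: full vote
        elif discard:
            continue                     # verified as erroneous: discard
        else:
            w = reduced_weight           # verified as erroneous: decreased voting weight
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)
```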

Through the above description with reference to fig. 1 to 6, it can be seen that the technical solution according to the embodiments of the present disclosure has many advantages over conventional solutions. For example, with this technical solution, the attributes of a person in a video can be determined accurately and efficiently, so that the person can be described accurately and the video can be classified accurately, which in turn improves the accuracy of video recommendation and the user experience. At the application level, the technical solution of the embodiments of the present disclosure, as an underlying technique on the content understanding side, can be applied directly to various live-streaming and short-video platforms and applications, serving as a bridge between the users of those platforms and applications and the videos or streamers. Compared with conventional methods that recognize person attributes from static images or voice alone, attributes determined by multi-modal implicit vector fusion better characterize the streamer in a dynamic video, and can support accurate recommendations on the recommendation side for users of a platform or application, thereby enhancing user stickiness and increasing the core competitiveness of the product.

The present disclosure also provides an electronic device, a computer-readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. For example, the computing device 110 shown in FIG. 1 and the apparatus 600 for determining attributes of a person shown in FIG. 6 may be implemented by the electronic device 700. The electronic device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the methods 200 and 300. For example, in some embodiments, the methods 200 and 300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the methods 200 and 300 described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the methods 200 and 300 in any other suitable manner (e.g., by way of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; the present disclosure is not limited in this regard.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
