Object recognition method and device

Document No.: 1783052 Publication date: 2019-12-06

Reading notes: This technology, "Object recognition method and device" (物体识别方法及装置), was designed by 黄飞吴小飞李志豪许松岑刘健庄颜友亮钱莉黄雪妍 on 2019-08-09. Main content: The application relates to the field of artificial intelligence, and specifically to computer vision. It discloses a method for optimizing a user's photographing posture, applied to an electronic device. The method comprises: displaying a shooting interface of a camera of the electronic device; acquiring a framing image of the shooting interface, and determining from the framing image that the shooting interface includes a portrait; entering a posture recommendation mode, and presenting a human body posture recommendation picture to the user in a preset preview mode. The human body posture recommendation picture is at least one picture, selected from a picture library by using metric learning, whose similarity to the framing image ranks near the top; the similarity is an overall similarity that fuses background similarity and foreground similarity.

1. A human posture similar picture recommendation method is characterized by comprising the following steps:

Receiving an input picture, wherein the input picture comprises a portrait;

Selecting, from a picture library, at least one picture with the highest similarity to the input picture as a human posture recommended picture by using metric learning based on multi-level environment information features; wherein the human posture recommendation picture comprises a portrait; the multi-level environment information features include: scene features, spatial distribution features of objects, and foreground human body features;

And presenting the human body posture recommendation picture according to a preset preview mode.

2. the picture recommendation method of claim 1, wherein the method comprises:

Receiving recommendation preference settings of a user; wherein, the selecting at least one picture with the highest similarity with the input picture from a picture library as a human posture recommendation picture by using metric learning based on multi-level environmental information characteristics comprises:

Selecting at least one picture with the highest similarity to the input picture from a picture library as a human posture recommended picture by using metric learning based on multi-level environment information characteristics and combining the recommended preference of the user; the human posture recommendation picture conforms to the recommendation preference of the user.

3. The picture recommendation method according to claim 1 or 2, wherein the selecting at least one picture with the highest similarity to the input picture from a picture library as a human posture recommendation picture by using metric learning based on multi-level environmental information features comprises:

performing feature extraction processing on the input picture to obtain features of the input picture;

calculating the similarity between the features of the input picture and the features of each image in a feature library by using metric learning based on multi-level environment information features; the feature library is obtained by extracting features of a preset number of dimensions for each picture in the picture library;

and selecting, from the picture library according to the calculation result, at least one picture whose similarity ranks near the top as a human posture recommendation picture.

4. The picture recommendation method of claim 3, wherein the method comprises:

Receiving recommendation preference settings of a user;

And screening out pictures which accord with the recommendation preference from the human body posture pictures to serve as final human body posture recommendation pictures.

5. The picture recommendation method of any one of claims 1-4, wherein the portrait is a shooting target, and receiving an input picture comprises: receiving a plurality of input pictures containing the shooting target at different angles;

Selecting at least one picture with the highest similarity to the input picture from a picture library as a human posture recommended picture, by using metric learning based on multi-level environment information features, comprises:

calculating, for each input picture, the most similar picture in the picture library by using metric learning based on multi-level environment information features;

And ranking all the most similar pictures, and selecting at least one top-ranked picture as the human body posture recommendation picture.

6. the picture recommendation method of any one of claims 1-5, wherein the method further comprises:

receiving a user-defined picture uploaded by a user;

and updating the user-defined picture to the picture library.

7. An image recommendation apparatus, characterized in that the apparatus comprises:

the receiving module is used for receiving an input picture, and the input picture comprises a portrait;

the recommendation module is used for selecting at least one picture with the highest similarity with the input picture received by the receiving module from a picture library as a human posture recommendation picture by using metric learning based on multi-level environment information characteristics; wherein the multi-level environmental features include: scene characteristics, spatial distribution characteristics of objects and foreground human body characteristics;

And the output module is used for presenting the human body posture recommendation picture according to a preset preview mode.

8. the picture recommendation device of claim 7, wherein the device further comprises:

The preference setting receiving module is used for receiving the recommendation preference setting of the user; the recommendation module is specifically configured to:

9. the picture recommendation device of claim 1 or 2, wherein the recommendation module comprises:

the characteristic extraction unit is used for carrying out characteristic extraction processing on the input picture to obtain the characteristics of the input picture;

The similarity calculation unit is used for calculating the similarity between the features of the input picture and the features of each image in a feature library by using metric learning based on multi-level environment information features; the feature library is obtained by extracting features of a preset number of dimensions for each picture in the picture library;

And the recommending unit is used for selecting, from the picture library according to the calculation result, at least one picture whose similarity ranks near the top as a human posture recommending picture.

10. the picture recommendation device of claim 7, wherein the receiving module is further configured to receive a plurality of input pictures containing the shooting target at different angles;

The recommendation module comprises:

the similarity calculation unit is used for calculating the most similar picture to each input picture in the picture library based on the metric learning of multi-level environment information characteristics;

and the recommending unit is used for ranking all the most similar pictures and selecting at least one top-ranked picture as the human posture recommending picture.

11. the picture recommendation device according to any one of claims 7-10, wherein said device further comprises:

the user-defined picture receiving module is used for receiving a user-defined picture uploaded by a user;

And the updating module is used for updating the user-defined picture to the picture library.

12. A method for prompting a user to take a picture with a similar composition, characterized in that the method comprises: receiving a set of a plurality of original pictures that are framed by the user at the current position and contain a shooting target at different angles;

Recommending at least one target picture and at least one corresponding original picture to the user, wherein the target picture comprises a recommended human pose, and the target picture and the corresponding original picture have similar background compositions.

13. the method of claim 12, wherein the method further comprises:

and displaying a preview frame on a shooting interface, and displaying the target picture and a preview picture corresponding to the corresponding original picture in the preview frame.

14. An intelligent terminal capable of prompting a user to take pictures with similar compositions, characterized in that the intelligent terminal comprises:

the receiving module is used for receiving a set of a plurality of original pictures that are framed by the user at the current position and contain a shooting target at different angles;

the recommendation module is used for recommending at least one target picture and at least one corresponding original picture to the user, wherein the target picture comprises a recommended human pose, and the target picture and the corresponding original picture have similar background compositions.

15. the intelligent terminal of claim 14, wherein the apparatus further comprises:

And the presentation module is used for displaying a preview frame on a shooting interface, and displaying the target picture and the preview picture corresponding to the corresponding original picture in the preview frame.

16. A method of constructing a human body feature library, the method comprising:

Calculating the similarity between every two human body posture pictures in the human body posture library;

for each picture in the human body posture library, collecting triplet training samples according to the similarity between every two human body posture pictures; wherein each triplet training sample <A, P, N> comprises three human body posture pictures: A is a human body posture picture in the human body posture library, P is a positive sample of picture A, that is, a human body posture picture that can be directly recommended in the shooting scene of picture A, and N is a negative sample of picture A, that is, a human body posture picture that cannot be directly recommended in the shooting scene of picture A;

Training on the triplet training samples by metric learning to obtain a CNN feature extraction model; the CNN feature extraction model maps samples that can be recommended for each other to nearby points in the feature space, and maps samples that cannot be recommended for each other to distant points in the feature space;

and extracting the characteristics of a preset number of dimensions for each picture in the human posture picture library by using the CNN characteristic extraction model to construct a human posture characteristic library.
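The training objective in claim 16 can be illustrated with a small sketch. The claim does not name a specific loss function; the example below assumes the commonly used triplet margin loss on squared Euclidean distances. The function names and the `margin` value are illustrative, and plain lists of floats stand in for the embeddings a CNN would produce.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # assumed value; the source does not specify one
) -> float:
    """Loss is zero once the negative is farther from the anchor than the positive by `margin`."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# A well-separated triplet: the positive is near the anchor, the negative is far.
good = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
# A violating triplet: the negative is nearer to the anchor than the positive.
bad = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.1, 0.0])
print(good, bad)
```

Minimizing such a loss over many triplets pulls mutually recommendable pictures together in feature space and pushes non-recommendable ones apart, which is the property the claim requires of the feature extraction model.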

17. The method of claim 16, wherein calculating the similarity between two human pose pictures in the human pose library comprises:

Calculating the background similarity and the foreground similarity between every two human body posture pictures in the human body posture library, wherein the foreground similarity comprises the foreground human body feature similarity;

and fusing the background similarity and the foreground similarity between every two human posture pictures in the human posture library to obtain the overall similarity between every two human posture pictures in the human posture library.

18. The method of claim 16, wherein the calculating of the background similarity and the foreground similarity between two of the human pose pictures in the human pose library comprises:

Calculating the background similarity between every two human body posture pictures in the human body posture library through a scene classification algorithm and a scene analysis algorithm;

And calculating the foreground similarity between every two human body posture pictures in the human body posture library through a human body attribute extraction algorithm.

19. the method of claim 16, wherein for each picture in the human pose library, collecting triplet training samples according to the similarity between two of the human pose pictures comprises:

For each picture in the human body posture library, taking a plurality of pictures whose similarity to that picture ranks near the top in the human body posture library as positive samples, and taking all the remaining pictures as negative samples, wherein the positive samples and the negative samples serve as the P and N elements of the triplets.
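The sampling rule in claim 19 can be sketched as follows, assuming the pairwise overall similarity is already available as a square matrix. The function name `collect_triplets` and the parameter `num_positives` are hypothetical; a real pipeline would probably subsample the negatives, since enumerating every combination grows quickly with library size.

```python
from typing import List, Tuple


def collect_triplets(
    similarity: List[List[float]],
    num_positives: int,
) -> List[Tuple[int, int, int]]:
    """Build (anchor, positive, negative) index triplets from a pairwise similarity matrix."""
    triplets: List[Tuple[int, int, int]] = []
    count = len(similarity)
    for anchor in range(count):
        others = [j for j in range(count) if j != anchor]
        # Rank the other pictures by similarity to the anchor, highest first.
        others.sort(key=lambda j: similarity[anchor][j], reverse=True)
        positives = others[:num_positives]   # top-ranked pictures
        negatives = others[num_positives:]   # all remaining pictures
        for positive in positives:
            for negative in negatives:
                triplets.append((anchor, positive, negative))
    return triplets


# Toy similarity matrix for four pictures: 0 and 1 are alike, 2 and 3 are alike.
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
triplets = collect_triplets(similarity, num_positives=1)
print(len(triplets), triplets[0])
```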

20. an apparatus for constructing a human body feature library, the apparatus comprising:

The image similarity calculation module is used for calculating the similarity between every two human body posture pictures in the human body posture library;

the training sample acquisition module is used for collecting triplet training samples for each picture in the human body posture library according to the similarity between every two pictures in the human body posture library; wherein each triplet training sample <A, P, N> comprises three human body posture pictures: A is a human body posture picture in the human body posture library, P is a positive sample of picture A, that is, a human body posture picture that can be directly recommended in the shooting scene of picture A, and N is a negative sample of picture A, that is, a human body posture picture that cannot be directly recommended in the shooting scene of picture A;

the CNN feature learning module is used for training on the triplet training samples by metric learning to obtain a CNN feature extraction model; the CNN feature extraction model maps samples that can be recommended for each other to nearby points in the feature space, and maps samples that cannot be recommended for each other to distant points in the feature space;

And the human body posture feature library construction module is used for extracting the features of a preset number of dimensions for each picture in the human body posture picture library by using the CNN feature extraction model so as to construct the human body posture feature library.

21. The apparatus of claim 20, wherein the calculate image similarity module comprises:

The similarity calculation unit is used for calculating the background similarity and the foreground similarity between every two human body posture pictures in the human body posture library;

and the fusion unit is used for fusing the background similarity and the foreground similarity between every two human posture pictures in the human posture library to obtain the overall similarity between every two human posture pictures in the human posture library.

22. The apparatus of claim 20, wherein the training sample acquisition module is configured to:

for each picture in the human body posture library, taking a plurality of pictures with the top similarity ranking in the human body posture library as positive samples, and taking all the rest pictures as negative samples.

23. A method for optimizing a photographing gesture of a user is applied to an electronic device, and is characterized by comprising the following steps:

Displaying a shooting interface of the electronic equipment;

Acquiring a framing image of the shooting interface, wherein the framing image comprises a portrait;

Presenting a human body posture recommendation picture according to a preset preview mode; the human body posture recommendation picture is at least one picture, selected from a picture library by using metric learning, whose similarity to the framing image ranks near the top; and the similarity is an overall similarity that fuses background similarity and foreground similarity.

24. The method of claim 23, wherein prior to presenting the human pose recommendation picture to the user in a predetermined preview manner, the method further comprises:

Performing feature extraction processing on the framing image to obtain features of the framing image;

calculating the similarity between the features of the framing image and the features of each image in a feature library; the feature library is obtained by extracting features of a preset number of dimensions for each picture in the picture library;

and ranking the similarities, and selecting from the picture library at least one picture whose similarity ranks near the top as a human posture recommendation picture.

25. the method of claim 23, wherein prior to presenting the human pose recommendation picture to the user in a predetermined preview manner, the method further comprises:

performing feature extraction processing on the framing image to obtain features of the framing image;

transmitting the framing image features to a cloud server;

Receiving the human body posture recommendation picture; the human body posture recommendation picture is at least one picture that the cloud server selects from a picture library, according to the framing image features, whose similarity to the framing image ranks near the top.

26. the method of any one of claims 23-25, wherein prior to presenting the human gesture recommendation picture to the user in a predetermined preview manner, the method further comprises:

receiving recommendation preference settings of a user;

Selecting, from a picture library, at least one picture whose similarity to the input picture ranks near the top as a human posture recommended picture, based on metric learning and the recommendation preference of the user; the human posture recommendation picture conforms to the recommendation preference of the user.

27. A method according to any one of claims 23 to 26, further comprising:

Receiving a user-defined picture uploaded by a user;

and updating the user-defined picture to the picture library.

28. an electronic device, comprising:

one or more processors;

one or more memories;

a plurality of application programs;

And one or more programs, wherein the one or more programs are stored in the memory, which when executed by the processor, cause the electronic device to perform the steps of:

displaying a shooting interface of a camera of the electronic equipment;

Acquiring a framing image of the shooting interface, and determining that the shooting interface comprises a portrait according to the framing image;

entering a posture recommendation mode, and presenting a human body posture recommendation picture to a user according to a preset preview mode; the human body posture recommendation picture is at least one picture, selected from a picture library by using metric learning, whose similarity to the framing image ranks near the top; and the similarity is an overall similarity that fuses background similarity and foreground similarity.

29. the electronic device of claim 28, wherein the one or more programs, when executed by the processor, cause the electronic device to perform the steps of:

Performing feature extraction processing on the framing image to obtain features of the framing image;

And sequencing the similarity, and selecting at least one picture corresponding to the similarity ranked at the top from the picture library as a human posture recommendation picture.

30. The electronic device of claim 28, wherein the one or more programs, when executed by the processor, cause the electronic device to perform the steps of:

performing feature extraction processing on the framing image to obtain features of the framing image;

in response to the user switching to a cloud intelligent recommendation mode, transmitting the framing image features to a cloud server;

receiving the human body posture recommendation picture; the human body posture recommendation picture is at least one picture that the cloud server selects from a picture library, according to the framing image features, whose similarity to the framing image ranks near the top.

31. The electronic device of any of claims 28-30, wherein the one or more programs, when executed by the processor, cause the electronic device to perform the steps of:

receiving recommendation preference setting of a user;

32. The electronic device of any one of claims 28-31, wherein the one or more programs, when executed by the processor, cause the electronic device to perform the steps of:

receiving a user-defined picture uploaded by a user;

and updating the user-defined picture to the picture library.

33. A computer storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the human pose similar picture recommendation method of any one of claims 1 to 6.

34. a computer storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform a method of optimizing a user's photograph pose as claimed in any one of claims 23 to 27.

35. A computer program product, characterized in that when the computer program product is run on a computer, it causes the computer to execute the human pose similar picture recommendation method according to any one of claims 1 to 6.

36. a computer program product, which, when run on a computer, causes the computer to perform a method of optimizing a user's photographing posture as claimed in any one of claims 23 to 27.

Technical Field

The present application relates to the field of artificial intelligence, and in particular, to an object recognition method and apparatus.

Background

Computer vision is an integral part of intelligent and autonomous systems in many application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and the military. It studies how to use cameras or video cameras together with computers to acquire the data and information about a photographed object that we need. Figuratively speaking, a computer is given eyes (a camera or video camera) and a brain (an algorithm) so that it can identify, track, and measure objects in place of human eyes and thereby perceive its environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make an artificial system "perceive" from images or multidimensional data. In general, computer vision uses imaging systems in place of visual organs to obtain input information, and uses a computer in place of the brain to process and interpret that information. The ultimate research goal of computer vision is to enable a computer to observe and understand the world visually as a human does, and to adapt to its environment autonomously.

Human body posture recommendation is a novel application problem in the field of computer vision. It applies to portrait photography on a mobile phone: when a user takes a portrait in an everyday scene, the method recommends, according to the environment information of the person currently being photographed, a series of professional human body posture pictures that are highly similar to the current environment, so that the person being photographed can choose among them for reference, which improves the posing and the aesthetic quality of the portrait.

Some human body posture recommendation methods already exist in the industry. However, the existing methods either use limited information and produce poor recommendation results that fall short of practical requirements, or use models that are too complex to support real-time recommendation and cannot be deployed on terminal devices with limited computing power, such as mobile phones. The application scenarios of the existing human body posture recommendation methods are therefore very limited.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention provides a technical scheme for recommending human body postures. The scheme extracts the complex environment information in the preview picture by methods such as scene classification, scene analysis, and human body attribute extraction, and then performs information fusion and model training by metric learning. The result is a lightweight, high-accuracy scheme that can be deployed on mobile terminals such as mobile phones for real-time recommendation.

In one aspect, an embodiment of the present application provides a human posture similar picture recommendation method, which comprises the following steps:

receiving an input picture, wherein the input picture comprises a portrait;

Selecting at least one picture with the highest similarity with the input picture from a picture library as a human posture recommended picture by using metric learning based on multi-level environment information characteristics; wherein the multi-level environmental features include: scene characteristics, spatial distribution characteristics of objects and foreground human body characteristics;

And presenting the human body posture recommendation picture to a user according to a preset preview mode.

optionally, the method comprises:

optionally, the selecting, from a picture library, at least one picture with the highest similarity to the input picture as a human posture recommended picture by using metric learning based on multi-level environmental information features includes:

performing feature extraction processing on the input picture to obtain features of the input picture;

Calculating the similarity between the features of the input picture and the features of each image in a feature library by using metric learning based on multi-level environment information features; the feature library is obtained by extracting features of a preset number of dimensions for each picture in the picture library;

And selecting, from the picture library according to the calculation result, at least one picture whose similarity ranks near the top as a human posture recommendation picture.
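The three steps above (extract features, compare against the feature library, keep the top-ranked pictures) amount to a nearest-neighbour lookup. The sketch below assumes cosine similarity as the comparison measure, which the source does not specify, and uses toy 3-dimensional vectors in place of the CNN features; `recommend_top_k` is an illustrative name.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_top_k(
    query_feature: Sequence[float],
    feature_library: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (picture_index, similarity) for the k library pictures most similar to the query."""
    scored = [
        (index, cosine_similarity(query_feature, feature))
        for index, feature in enumerate(feature_library)
    ]
    # Highest similarity first; the sort is stable, so ties keep library order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy features standing in for the precomputed feature library.
library = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]
top = recommend_top_k(query, library, k=2)
print([index for index, _ in top])
```

Because the library features are computed once in advance, only the query picture needs a forward pass at recommendation time, which is consistent with the real-time goal stated in the disclosure.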

Optionally, the method comprises:

Receiving recommendation preference settings of a user;

And screening out pictures which accord with the recommendation preference from the human body posture pictures to serve as final human body posture recommendation pictures.
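The preference screening step can be pictured as a filter applied to the already ranked candidates. This section does not say how preferences are represented; the sketch assumes each library picture carries a set of descriptive tags and that the user's preference is a set of required tags. The `Candidate` class, its fields, and the tag values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Candidate:
    picture_id: str
    similarity: float
    tags: Set[str] = field(default_factory=set)  # hypothetical labels, e.g. {"sitting", "outdoor"}


def filter_by_preference(candidates: List[Candidate], preferred_tags: Set[str]) -> List[Candidate]:
    """Keep candidates that carry every preferred tag, preserving their similarity order."""
    if not preferred_tags:
        return list(candidates)
    return [c for c in candidates if preferred_tags <= c.tags]


# Candidates already ranked by similarity, highest first.
ranked = [
    Candidate("p1", 0.95, {"standing", "outdoor"}),
    Candidate("p2", 0.91, {"sitting", "outdoor"}),
    Candidate("p3", 0.88, {"sitting", "indoor"}),
]
final = filter_by_preference(ranked, {"sitting"})
print([c.picture_id for c in final])
```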

Optionally, receiving the input picture comprises: receiving a plurality of input pictures containing a shooting target at different angles; alternatively, receiving the input picture comprises: receiving at least one input picture that contains the shooting target at a different angle;

calculating the most similar picture in the picture library to each input picture by using metric learning based on multi-level environmental information characteristics;

and sequencing all the most similar pictures, and selecting at least one picture with the top ranking as the human body posture recommendation picture.
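The multi-angle variant can be sketched as a two-stage ranking: find the best library match for each input picture, then rank those best matches against one another. Cosine similarity and the function names are assumptions made for illustration. Each result also records which input angle produced the match, so the corresponding original picture can be shown next to the recommendation.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: Sequence[float], feature_library: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Index and similarity of the single library picture most similar to the query."""
    scores = [cosine_similarity(query, feature) for feature in feature_library]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]


def recommend_from_angles(
    queries: Sequence[Sequence[float]],
    feature_library: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, int, float]]:
    """Rank the per-angle best matches; each result is (angle_index, picture_index, similarity)."""
    matches = []
    for angle_index, query in enumerate(queries):
        picture_index, score = best_match(query, feature_library)
        matches.append((angle_index, picture_index, score))
    matches.sort(key=lambda match: match[2], reverse=True)
    return matches[:k]


library = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
angle_features = [[1.0, 0.2], [0.5, 0.5], [0.1, 1.0]]  # one feature vector per framing angle
result = recommend_from_angles(angle_features, library, k=2)
print([(angle, picture) for angle, picture, _ in result])
```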

optionally, the method further comprises:

receiving a user-defined picture uploaded by a user;

And updating the user-defined picture to the picture library.

in one aspect, an embodiment of the present application provides an image recommendation device, where the device includes:

The receiving module is used for receiving an input picture, and the input picture comprises a portrait;

And the output module is used for presenting the human body posture recommendation picture to a user according to a preset preview mode.

optionally, the apparatus further comprises:

The preference setting receiving module is used for receiving the recommendation preference setting of the user; the recommendation module is specifically configured to:

selecting at least one picture with the highest similarity to the input picture from a picture library as a human posture recommended picture by using metric learning based on multi-level environment information characteristics and combining the recommended preference of the user; the human posture recommendation picture conforms to the recommendation preference of the user.

Optionally, the recommendation module includes:

the characteristic extraction unit is used for carrying out characteristic extraction processing on the input picture to obtain the characteristics of the input picture;

and the recommending unit is used for selecting at least one picture corresponding to the similarity with the top rank from the picture library as a human posture recommending picture according to the calculation result.

optionally, the receiving module is further configured to receive a plurality of input pictures containing a shooting target at different angles;

the recommendation module comprises:

The similarity calculation unit is used for calculating the most similar picture to each input picture in the picture library based on the metric learning of multi-level environment information characteristics;

And the recommending unit is used for ranking all the most similar pictures and selecting at least one top-ranked picture as the human posture recommending picture.

optionally, the apparatus further comprises:

The user-defined picture receiving module is used for receiving a user-defined picture uploaded by a user;

and the updating module is used for updating the user-defined picture to the picture library.

In one aspect, an embodiment of the present application provides a method for prompting a user to take a picture with a similar composition, the method comprising: receiving a set of a plurality of original pictures that are framed by the user at the current position and contain a shooting target at different angles;

optionally, the method further comprises:

Displaying a preview frame on a shooting interface, and displaying in the preview frame the target picture and a preview picture corresponding to the corresponding original picture, accompanied by a text prompt.

In one aspect, an embodiment of the present application provides an intelligent terminal capable of prompting a user to take a picture of a similar composition, where the intelligent terminal includes:

the receiving module is used for receiving a set of a plurality of original pictures that are framed by the user at the current position and contain a shooting target at different angles;

Optionally, the apparatus further comprises:

And the presentation module is used for displaying a preview frame on a shooting interface, and displaying in the preview frame the target picture and a preview picture corresponding to the corresponding original picture, accompanied by a text prompt.

in one aspect, an embodiment of the present application provides a method for constructing a human body feature library, where the method includes:

calculating the similarity between every two human body posture pictures in the human body posture library;

For each picture in the human body posture library, collecting triplet training samples according to the similarity between every two human body posture pictures; wherein each triplet training sample <A, P, N> comprises three human body posture pictures: A is a human body posture picture in the human body posture library, P is a positive sample of picture A, that is, a human body posture picture that can be directly recommended in the shooting scene of picture A, and N is a negative sample of picture A, that is, a human body posture picture that cannot be directly recommended in the shooting scene of picture A;

And extracting the characteristics of a preset number of dimensions for each picture in the human posture picture library by using the CNN characteristic extraction model to construct a human posture characteristic library.
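The triplet collection step above can be sketched as follows. This is a minimal illustration, not the applicant's procedure: the two thresholds `pos_threshold` and `neg_threshold` are hypothetical cut-offs standing in for the unspecified rule that decides whether a picture "can" or "cannot" be directly recommended.

```python
from typing import List, Tuple


def collect_triplets(
    similarity: List[List[float]],
    pos_threshold: float = 0.8,  # hypothetical: at or above this, P is recommendable for A
    neg_threshold: float = 0.3,  # hypothetical: at or below this, N is not recommendable for A
) -> List[Tuple[int, int, int]]:
    """Return (A, P, N) index triplets drawn from a pairwise similarity matrix."""
    n = len(similarity)
    triplets: List[Tuple[int, int, int]] = []
    for a in range(n):
        positives = [p for p in range(n) if p != a and similarity[a][p] >= pos_threshold]
        negatives = [q for q in range(n) if q != a and similarity[a][q] <= neg_threshold]
        # Pair every positive with every negative for this anchor picture.
        for p in positives:
            for q in negatives:
                triplets.append((a, p, q))
    return triplets


# Three pictures: 0 and 1 are similar to each other, 2 is dissimilar to both.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
triplets = collect_triplets(sim)
```

Picture 2 yields no triplet here because it has no sufficiently similar positive sample.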

optionally, the calculating the similarity between every two human body posture pictures in the human body posture library includes:

calculating the background similarity and the foreground similarity between every two human posture pictures in the human posture library;

And fusing the background similarity and the foreground similarity between every two human posture pictures in the human posture library to obtain the overall similarity between every two human posture pictures in the human posture library.
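The fusion rule is not specified in this passage. As one simple possibility, the sketch below fuses the two similarities with a weighted average; the function names and the `background_weight` parameter are assumptions made for illustration only.

```python
from typing import List


def fuse_similarity(background: float, foreground: float,
                    background_weight: float = 0.5) -> float:
    """Weighted average of background and foreground similarity (assumed rule)."""
    if not 0.0 <= background_weight <= 1.0:
        raise ValueError("background_weight must lie in [0, 1]")
    return background_weight * background + (1.0 - background_weight) * foreground


def fuse_matrices(background: List[List[float]], foreground: List[List[float]],
                  background_weight: float = 0.5) -> List[List[float]]:
    """Apply the fusion element-wise to two pairwise similarity matrices."""
    return [
        [fuse_similarity(b, f, background_weight) for b, f in zip(row_b, row_f)]
        for row_b, row_f in zip(background, foreground)
    ]


overall = fuse_matrices([[1.0, 0.8], [0.8, 1.0]], [[1.0, 0.4], [0.4, 1.0]])
```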

optionally, the calculating a background similarity and a foreground similarity between every two of the human body posture pictures in the human body posture library includes:

Calculating the background similarity between every two human body posture pictures in the human body posture library through a scene classification algorithm and a scene analysis algorithm;

And calculating the foreground similarity between every two human body posture pictures in the human body posture library through a human body attribute extraction algorithm.

optionally, for each picture in the human body posture library, collecting a triple training sample according to a similarity between every two of the human body posture pictures, including:

In one aspect, an embodiment of the present application provides an apparatus for constructing a human body feature library, where the apparatus includes:

the image similarity calculation module is used for calculating the similarity between every two human body posture pictures in the human body posture library;

The training sample acquisition module is used for collecting triplet training samples for each picture in the human body posture library according to the similarity between every two human body posture pictures; wherein each triplet training sample <A, P, N> comprises three human posture pictures: A is a human body posture picture in the human body posture library, P is a positive sample of picture A, namely a human body posture picture that can be directly recommended in the shooting scene of picture A, and N is a negative sample of picture A, namely a human body posture picture that cannot be directly recommended in the shooting scene of picture A;

the CNN feature learning module is used for training on the triplet training samples by metric learning to obtain a CNN feature extraction model; the CNN feature extraction model maps samples that can be recommended for each other as close together as possible in the feature space, and maps samples that cannot be recommended for each other as far apart as possible;

And the human body posture feature library construction module is used for extracting the features of the preset number of dimensions for each picture in the human body posture picture library by using the CNN feature extraction model so as to construct the human body posture feature library.
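The training objective of the CNN feature learning module (recommendable samples close together, non-recommendable samples far apart) is commonly expressed as a triplet margin loss. The sketch below shows that common formulation; the text does not name the exact loss, so the Euclidean distance and the `margin` value are assumptions.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Well-separated triplet: positive is near the anchor, negative is far away.
good = triplet_loss([0.0, 0.0], [0.0, 0.1], [1.0, 0.0])
# Violating triplet: the negative is closer to the anchor than the positive.
bad = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.5])
```

Minimizing this loss over many triplets pushes the feature extractor toward the behavior the module describes.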

Optionally, the module for calculating image similarity includes:

The similarity calculation unit is used for calculating the background similarity and the foreground similarity between every two human body posture pictures in the human body posture library;

And the fusion unit is used for fusing the background similarity and the foreground similarity between every two human posture pictures in the human posture library to obtain the overall similarity between every two human posture pictures in the human posture library.

Optionally, the training sample collection module is configured to:

In one aspect, an embodiment of the present application provides a method for optimizing a photographing gesture of a user, which is applied to an electronic device, and the method includes:

displaying a shooting interface of a camera of the electronic equipment;

Acquiring a framing image of the shooting interface, and determining that the shooting interface comprises a portrait according to the framing image;

optionally, after entering the gesture recommendation mode, before presenting the human body gesture recommendation picture to the user in a predetermined preview manner, the method further includes:

Performing feature extraction processing on the view finding image to obtain features of the view finding image;

Sorting the similarities, and selecting from the picture library at least one picture whose similarity ranks highest as the human posture recommendation picture.

Optionally, after entering the gesture recommendation mode, before presenting the human body gesture recommendation picture to the user in a predetermined preview manner, the method further includes:

Performing feature extraction processing on the view finding image to obtain features of the view finding image;

In response to the user switching to a cloud intelligent recommendation mode, transmitting the features of the framing image to a cloud server;

Optionally, after entering the gesture recommendation mode, before presenting the human body gesture recommendation picture to the user in a predetermined preview manner, the method further includes:

Receiving recommendation preference setting of a user;

optionally, the method further comprises:

Receiving a user-defined picture uploaded by a user;

And updating the user-defined picture to the picture library.

in one aspect, an embodiment of the present application provides an electronic device, which includes:

One or more processors;

one or more memories;

a plurality of application programs;

and one or more programs, wherein the one or more programs are stored in the memory, which when executed by the processor, cause the electronic device to perform the steps of:

displaying a shooting interface of a camera of the electronic equipment;

Acquiring a framing image of the shooting interface, and determining that the shooting interface comprises a portrait according to the framing image;

Entering a posture recommendation mode, and presenting a human posture recommendation picture to the user in a predetermined preview manner; wherein the human posture recommendation picture is at least one picture selected from a picture library, using metric learning, whose similarity to the framing image ranks highest; and the similarity is an overall similarity that fuses a background similarity and a foreground similarity.

Optionally, the one or more programs, when executed by the processor, cause the electronic device to perform the steps of:

Performing feature extraction processing on the view finding image to obtain features of the view finding image;

Sorting the similarities, and selecting from the picture library at least one picture whose similarity ranks highest as the human posture recommendation picture.
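The extract, compare, sort, and select flow can be sketched as below. This is illustrative only: the feature vectors are made up, the file names are hypothetical, and cosine similarity is an assumed choice of similarity measure in the learned feature space.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(query: Sequence[float], library: Dict[str, Sequence[float]],
              top_k: int = 2) -> List[Tuple[str, float]]:
    """Score every library picture against the query, sort, and keep the top_k."""
    scored = [(name, cosine_similarity(query, feat)) for name, feat in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical two-dimensional features; real features would come from the CNN model.
library = {
    "pose_a.jpg": [1.0, 0.0],
    "pose_b.jpg": [0.7, 0.7],
    "pose_c.jpg": [0.0, 1.0],
}
result = recommend([0.9, 0.1], library, top_k=2)
```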

Optionally, the one or more programs, when executed by the processor, cause the electronic device to perform the steps of:

performing feature extraction processing on the view finding image to obtain features of the view finding image;

In response to the user switching to a cloud intelligent recommendation mode, transmitting the features of the framing image to a cloud server;

optionally, the one or more programs, when executed by the processor, cause the electronic device to perform the steps of:

Receiving recommendation preference setting of a user;

Selecting, based on metric learning and the recommendation preference of the user, at least one picture from a picture library whose similarity to the input picture ranks highest as the human posture recommendation picture; the human posture recommendation picture conforms to the recommendation preference of the user.

optionally, the one or more programs, when executed by the processor, cause the electronic device to perform the steps of:

Receiving a user-defined picture uploaded by a user;

and updating the user-defined picture to the picture library.

in one aspect, an embodiment of the present application provides a computer storage medium, which includes computer instructions, and when the computer instructions are run on an electronic device, the electronic device is caused to execute the human posture similar picture recommendation method according to any one of the above.

in one aspect, an embodiment of the present application provides a computer program product, which is characterized in that when the computer program product runs on a computer, the computer is caused to execute the human posture similar picture recommendation method as described above.

The embodiments of the invention use multi-level features of images and exploit the information that is useful for human posture recommendation, define the similarity used for human posture recommendation based on that information, and perform information fusion and model training effectively through metric learning. The result is a lightweight, high-accuracy scheme that can be deployed on mobile terminals such as mobile phones to recommend postures in real time.

Further, the user can customize the recommendation gallery and upload custom human posture pictures through a sharing mechanism, so that the local and cloud galleries are continuously updated and expanded.

Furthermore, the user can set preference options according to the current environment, and the human posture pictures the user actually needs are then recommended according to these personalized settings, further improving the user experience.

these and other aspects of the present application will be more readily apparent from the following description of the embodiments.

Drawings

To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of a system architecture according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a CNN feature extraction model provided in an embodiment of the present application;

Fig. 3 is a schematic diagram of an effect provided by an embodiment of the present application;

Fig. 4 is a schematic diagram illustrating an effect provided by an embodiment of the present application;

Fig. 5 is a schematic diagram of a system implementation provided in an embodiment of the present application;

Fig. 6 is a flowchart of a human body posture picture recommendation method according to an embodiment of the present application;

Fig. 7 is a flowchart of a human body posture picture recommendation method according to an embodiment of the present application;

Fig. 8 is a schematic diagram of a network structure for multitask metric learning according to an embodiment of the present application;

Fig. 9a is a schematic diagram illustrating an effect provided by an embodiment of the present application;

Fig. 9b is a schematic diagram illustrating an effect provided by an embodiment of the present application;

Fig. 10 is a schematic diagram of a human body posture picture recommendation method according to an embodiment of the present application;

Fig. 11 is a schematic view of a user interface provided by an embodiment of the present application;

Fig. 12 is a schematic view of a user interface provided by an embodiment of the present application;

Fig. 13 is a flowchart of a human body posture picture recommendation method according to an embodiment of the present application;

Fig. 14 is a schematic diagram of a network structure for multi-task metric learning according to an embodiment of the present disclosure;

Fig. 15 is a schematic view of a user interface provided by an embodiment of the present application;

Fig. 16 is a schematic view of a user interface provided by an embodiment of the present application;

Fig. 17 is a schematic view of a user interface provided by an embodiment of the present application;

Fig. 18 is a schematic view of a user interface provided by an embodiment of the present application;

Fig. 19 is a schematic diagram illustrating a method for recommending a human posture similar picture according to an embodiment of the present application;

Fig. 20 is a schematic view of a human body posture picture recommendation device according to an embodiment of the present application;

Fig. 21 is a schematic diagram illustrating a method for prompting a user to take a picture of a similar composition according to an embodiment of the present application;

Fig. 22 is a schematic diagram of an intelligent terminal capable of prompting a user to take a picture of a similar composition according to an embodiment of the present application;

Fig. 23 is a flowchart of a method for constructing a human body feature library according to an embodiment of the present application;

Fig. 24 is a schematic diagram of an apparatus for constructing a human body feature library according to an embodiment of the present application;

Fig. 25 is a flowchart of a method for optimizing a photographing gesture of a user according to an embodiment of the present disclosure;

Fig. 26 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

Fig. 27 is a schematic diagram of a chip structure according to an embodiment of the present application.

Detailed Description

first, the abbreviations used in the examples of the present application are listed below:

TABLE 1

Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.

(1) Object recognition: determining the category of an object in an image by using image processing, machine learning, computer graphics, and other related methods.

(2) neural network

A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be:

$$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as an input to the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected to a local receptive field of the previous layer to extract the features of that local receptive field, and the local receptive field can be a region composed of several neural units.
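A single neural unit as described above (a weighted sum of the inputs plus a bias, passed through an activation function) can be sketched in a few lines. The sigmoid activation follows the text; the function names and example values are illustrative.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Compute f(sum_s W_s * x_s + b) with f = sigmoid."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# The weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5.
output = neural_unit([1.0, 2.0], [0.5, -0.25], 0.0)
```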

(3) Deep neural network

A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers, where "many" has no particular threshold. Dividing a DNN by the position of its layers, the layers fall into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron of the $i$-th layer is connected to every neuron of the $(i+1)$-th layer. Although a DNN appears complex, the work of each layer is not; it is simply the following linear relational expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called the coefficients), and $\alpha()$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is large. These parameters are defined in a DNN as follows, taking the coefficient $W$ as an example: in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$, where the superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary: the coefficient from the $k$-th neuron at layer $L-1$ to the $j$-th neuron at layer $L$ is defined as $W^{L}_{jk}$. Note that the input layer has no $W$ parameter.

In a deep neural network, more hidden layers make the network more capable of depicting complex situations in the real world. Theoretically, a model with more parameters has higher complexity and larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices; its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
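The per-layer operation described above (an activation applied to the weight matrix times the input vector plus the offset vector) can be sketched as follows. The entry `W[j][k]` plays the role of the coefficient from input neuron k to output neuron j (zero-based here, whereas the text counts from 1); the ReLU activation is an assumption chosen to keep the arithmetic easy to follow.

```python
from typing import Callable, List, Sequence


def relu(z: float) -> float:
    """Rectified linear activation (assumed here; the text does not fix the activation)."""
    return max(0.0, z)


def dense_layer(W: Sequence[Sequence[float]], x: Sequence[float],
                b: Sequence[float], activation: Callable[[float], float]) -> List[float]:
    """Compute y = activation(W x + b); row j of W holds the coefficients into output neuron j."""
    return [
        activation(sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j)
        for row, b_j in zip(W, b)
    ]


W = [[1.0, -1.0],
     [0.5, 0.5]]
x = [2.0, 1.0]
b = [0.0, -2.0]
# Neuron 0: relu(2 - 1 + 0) = 1.0; neuron 1: relu(1 + 0.5 - 2) = 0.0
y = dense_layer(W, x, b, relu)
```

Stacking several such calls, each with its own weight matrix and offset vector, gives the multi-layer network.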

(4) convolutional neural network

A Convolutional Neural Network (CNN) is a deep neural Network with a Convolutional structure. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter and the convolution process may be viewed as convolving an input image or convolved feature plane (feature map) with a trainable filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The underlying principle is: the statistics of a certain part of the image are the same as the other parts. Meaning that image information learned in one part can also be used in another part. The same learned image information can be used for all positions on the image. In the same convolution layer, a plurality of convolution kernels can be used to extract different image information, and generally, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation.

The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
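Weight sharing can be made concrete with a small sketch in which one kernel is slid over every position of the input, so the same weights extract the same kind of information everywhere. The code computes a "valid", stride-1 cross-correlation, which is the operation CNN layers usually implement under the name convolution; the kernel values are illustrative rather than learned.

```python
from typing import List, Sequence


def conv2d(image: Sequence[Sequence[float]],
           kernel: Sequence[Sequence[float]]) -> List[List[float]]:
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map: List[List[float]] = []
    for i in range(out_h):
        row: List[float] = []
        for j in range(out_w):
            acc = 0.0
            # The same kernel weights are reused at every (i, j) position.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0],
          [0.0, -1.0]]  # responds to the difference along the diagonal
feature_map = conv2d(image, kernel)
```

The 3x3 input has nine pixels, but the layer has only four trainable weights, which is the reduction in connections the text refers to.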

(5) loss function

In training a deep neural network, the output of the network is expected to be as close as possible to the value that is actually desired. The weight vectors of each layer can therefore be updated according to the difference between the current predicted value and the desired target value (an initialization process usually precedes the first update, that is, parameters are preset for each layer of the deep neural network). For example, if the predicted value is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the network can predict the desired target value or a value very close to it. It is therefore necessary to define in advance how to compare the difference between the predicted value and the target value; this is the role of the loss function or objective function, which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the deep neural network becomes the process of reducing the loss as much as possible.
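As a concrete instance of "how to compare the difference between the predicted value and the target value", the sketch below uses mean squared error. The text does not say which loss function the embodiments use, so this choice is an assumption for illustration.

```python
from typing import Sequence


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error: larger when predictions are farther from the targets."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("predicted and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


far = mse_loss([1.0, 2.0], [1.0, 4.0])   # one prediction is off by 2
near = mse_loss([1.0, 3.5], [1.0, 4.0])  # the same prediction is now off by 0.5
```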

(6) back propagation algorithm

The convolutional neural network can adopt a Back Propagation (BP) algorithm to correct the size of parameters in the initial super-resolution model in the training process, so that the reconstruction error loss of the super-resolution model is smaller and smaller. Specifically, error loss occurs when an input signal is transmitted in a forward direction until the input signal is output, and parameters in an initial super-resolution model are updated by reversely propagating error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion with error loss as a dominant factor, aiming at obtaining the optimal parameters of the super-resolution model, such as a weight matrix.
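The forward pass, error, and backward update cycle can be illustrated in its simplest form: a single linear unit, where back propagation reduces to one application of the chain rule. This is a toy sketch with made-up data, a squared-error loss, and a hypothetical learning rate; it is not the super-resolution model mentioned in the text.

```python
from typing import List, Tuple


def train_linear_unit(samples: List[Tuple[float, float]],
                      learning_rate: float = 0.1,
                      epochs: int = 1000) -> Tuple[float, float]:
    """Fit y = w * x + b by propagating the error back to w and b."""
    w, b = 0.0, 0.0  # preset initial parameters
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        for x, target in samples:
            y = w * x + b        # forward pass
            error = y - target   # dLoss/dy for Loss = 0.5 * (y - target)^2
            grad_w += error * x  # chain rule: dLoss/dw = dLoss/dy * dy/dw
            grad_b += error      # chain rule: dLoss/db = dLoss/dy * dy/db
        # Move the parameters against the averaged gradient so the loss shrinks.
        w -= learning_rate * grad_w / len(samples)
        b -= learning_rate * grad_b / len(samples)
    return w, b


# Toy data generated from target = 2 * x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = train_linear_unit(samples)
```

In a multi-layer network the same chain rule is applied layer by layer, from the output back toward the input.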

It should be noted that, to better conform to the terminology of the industry, some drawings of the embodiments of the present invention use English descriptions, and the corresponding Chinese definitions are also given in the embodiments. Embodiments of the present application are described below with reference to the drawings.

The technical problem to be solved by the embodiments of the application is human posture recommendation in various daily scenes. When a user shoots a portrait in a daily scene with a terminal device such as a mobile phone, the human posture recommendation method can recommend, according to the environment information of the person being photographed, a series of professional posed portrait pictures that are highly similar to the current environment, for the photographed person to select from and refer to, thereby improving the posing and aesthetic quality of the portrait. This requires a deep understanding of the current environment information to ensure a good recommendation effect, while keeping the model complexity low enough to deploy on terminal devices such as mobile phones.

The embodiments of the application are mainly applied to the following scenes. For assisted portrait photography and intelligent composition, the method of the embodiments can assist the user in posing, improving the interest and aesthetic quality of portrait pictures. The method can also be applied directly to searching for pictures by picture on a mobile device, helping users find highly similar pictures.

Application scenario 1: assisted portrait photography

When a user shoots a portrait in different scenes with a terminal device such as a mobile phone, many photographed persons have no good idea of how to pose, so their poses are monotonous, which affects the overall aesthetic quality of the portrait. The method of the invention obtains the current preview picture from the terminal device, analyzes the environment information of the preview picture and the subject information of the photographed person, and recommends, from a pre-screened professional portrait photography gallery or a user-defined/collected photography gallery, human posture pictures that are highly similar to the current scene and to the attributes of the photographed person (number of people, gender, clothing, and the like) for the photographed person to refer to or imitate, thereby improving the posing of the portrait. As shown in fig. 3, (a) is a picture taken with the initial posture of the photographed person in the current scene, (b) is a human posture picture recommended according to the environment and the subject attributes of the photographed person, and (c) is the shooting result obtained after the photographed person adjusts the posture with reference to the recommended picture. The recommended human posture picture is highly similar to the current environment and clearly helps the photographed person pose.

Application scenario 2: searching picture by picture

When a user searches for a picture by picture on a terminal device such as a mobile phone, a good search result requires using the multi-level useful information of the picture while also respecting the computing power of the mobile device, so a high-precision lightweight solution is needed. The method of the invention makes full use of the rich multi-level information of the image, derives similarities from that information, and uses metric learning to fuse and mine the multi-feature information, yielding a very lightweight retrieval or recommendation scheme that can run in real time on mobile devices such as mobile phones. When a user has a sample picture taken in a certain environment and wants to search for or match pictures of similar environments in the gallery on the mobile phone or in a predefined gallery, the scheme of the invention extracts the features of the sample picture, matches them for similarity against the existing pictures in the gallery, and displays the most similar pictures to the user in order of similarity. As shown in fig. 4, (a) is the sample picture used by the user, and (b) is the similar picture found by the method of the invention; the search result is very similar to the sample picture.

The system architecture provided by the embodiments of the present application is described below.

referring to fig. 1, the present application provides a system architecture 100. As shown in the system architecture 100, the data collecting device 160 is configured to collect training data, which in this embodiment of the present application includes: an image or image patch containing a human body; and stores the training data in the database 130. That is, the database stores a human body posture picture library.

The training device 120 trains the CNN feature extraction model 101 based on the training data maintained in the database 130. Later embodiments of this application describe in more detail how the training device 120 obtains the CNN feature extraction model 101 from the training data. A pre-processed image or image block containing a human body is input into the CNN feature extraction model 101 to obtain features of a predetermined number of dimensions for that image or image block. These features are used to construct a library of human posture features.

The CNN feature extraction model 101 in the embodiments of the present application may be implemented by a convolutional neural network (CNN). It should be noted that, in practical applications, the training data maintained in the database 130 is not necessarily all acquired by the data acquisition device 160, and may also be received from other devices; for example, a user may upload training data to the database directly from an electronic device. It should also be noted that the training device 120 does not necessarily train the CNN feature extraction model 101 based only on the training data maintained in the database 130, and may also obtain training data from the cloud or elsewhere for model training.
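Building the feature library amounts to running every library picture through the feature extractor and storing the fixed-length result. In the sketch below, `toy_extractor` is a hypothetical stand-in for the trained CNN feature extraction model 101, and the picture names and pixel values are made up; only the overall flow reflects the text.

```python
from typing import Callable, Dict, List

Image = List[List[float]]  # a grayscale image as rows of pixel values (assumed format)


def toy_extractor(image: Image) -> List[float]:
    """Hypothetical 2-dimensional 'features': mean and maximum pixel value."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def build_feature_library(images: Dict[str, Image],
                          extract: Callable[[Image], List[float]],
                          dims: int) -> Dict[str, List[float]]:
    """Map each picture name to its feature vector of exactly `dims` dimensions."""
    library: Dict[str, List[float]] = {}
    for name, image in images.items():
        features = extract(image)
        if len(features) != dims:
            raise ValueError(f"{name}: expected {dims} features, got {len(features)}")
        library[name] = features
    return library


pictures = {
    "pose_1": [[0.0, 1.0], [1.0, 0.0]],
    "pose_2": [[0.2, 0.2], [0.2, 0.2]],
}
feature_library = build_feature_library(pictures, toy_extractor, dims=2)
```

Because the library is computed once in advance, only the query image needs to pass through the model at recommendation time.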

the CNN feature extraction model 101 obtained by training according to the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 1, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR, a vehicle-mounted terminal, or a server or a cloud. For example, when the execution device 110 is a mobile phone terminal, the CNN feature extraction model 101 may be packaged into an SDK and directly downloaded to a mobile phone for running. In fig. 1, the execution device 110 is configured with an I/O interface 112 for data interaction with an external device. The user enters data through the I/O interface 112 and optionally may interact with the I/O interface 112 through a client device 140, as described in fig. 1. The input data may include, in an embodiment of the present application: the image including the human body viewed by the user using the electronic device, or the image with the human body saved in the local storage of the execution device by the user.

While the execution device 110 preprocesses the input data, or while the calculation module 111 of the execution device 110 performs calculation or other related processing (such as the process of finding similar pictures described in this application), the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may store the data, instructions, and the like obtained by that processing in the data storage system 150. For example, in one embodiment, the human body posture feature library obtained by the method of the embodiments of the present application may be stored in the data storage system 150.

Finally, the I/O interface 112 returns the processing results, such as the human posture pictures found for recommendation, and presents them to the user.

In the case shown in fig. 1, the user may manually provide the input data through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112; if automatic sending requires the user's authorization, the user may set the corresponding permission in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting the input data fed to the I/O interface 112 and the output results returned by the I/O interface 112 as new sample data and storing them in the database 130. Alternatively, the I/O interface 112 may store the input data and output results shown in the figure directly in the database 130 as new sample data, without collection by the client device 140.

It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation. For example, in fig. 1 the data storage system 150 is an external memory with respect to the execution device 110; in other cases, the data storage system 150 may be disposed in the execution device 110. Optionally, in one embodiment, the client device 140 may also be located in the execution device 110.

The method and apparatus provided in the embodiment of the present application can also be used to expand a training database. For example, the I/O interface 112 of the execution device 110 shown in fig. 1 can send an image processed by the execution device (such as an image containing a portrait, which may be captured by an electronic device such as a smartphone or a digital camera, or uploaded by a user) to the database 130 as training data, so that the training data maintained by the database 130 becomes richer and provides richer training data for the training work of the training device 130.

The method for training the CNN feature extraction model provided by the embodiment of the application involves computer vision processing and may be applied to data processing methods such as data training, machine learning, and deep learning, in which symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like are performed on training data to finally obtain the trained CNN feature extraction model. In addition, the embodiment of the present application inputs input data (e.g., the human posture images in the present application) into the trained CNN feature extraction model to obtain output data (e.g., as mentioned repeatedly below, the features of a predetermined number of dimensions extracted from each human posture image in the human posture image library).

As described in the introduction of the basic concepts, the convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture, where a deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through a machine learning algorithm. As a deep learning architecture, the CNN is a feed-forward artificial neural network in which individual neurons can respond to the images input to it.

As shown in fig. 2, the convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230.

Convolutional layer/pooling layer 220:

Convolutional layer:

The convolutional layer/pooling layer 220 shown in fig. 2 may include layers 221 to 226. For example, in one implementation, 221 is a convolutional layer, 222 is a pooling layer, 223 is a convolutional layer, 224 is a pooling layer, 225 is a convolutional layer, and 226 is a pooling layer; in another implementation, 221 and 222 are convolutional layers, 223 is a pooling layer, 224 and 225 are convolutional layers, and 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.

The inner working principle of a convolutional layer will be described below by taking convolutional layer 221 as an example.

The convolutional layer 221 may include many convolution operators, also called kernels, which act in image processing as filters that extract specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is typically moved across the input image in the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends over the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension. In most cases, however, a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image, so this depth is determined by the number of weight matrices used. Different weight matrices may be used to extract different features of the image: for example, one weight matrix extracts image edge information, another extracts a particular color of the image, and yet another blurs unwanted noise in the image. Because the weight matrices have the same size (rows × columns), the feature maps they extract also have the same size, and these same-sized feature maps are combined to form the output of the convolution operation.
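The stacking behaviour described above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the function name `conv2d`, the absence of padding, and the toy all-ones kernels are choices made here, not details taken from the embodiment.

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    """Unpadded 2-D convolution in the CNN sense (cross-correlation).

    image:   array of shape (H, W, C)
    kernels: array of shape (K, kh, kw, C), i.e. K weight matrices of equal size
    returns: array of shape (H_out, W_out, K)
    """
    H, W, C = image.shape
    K, kh, kw, kc = kernels.shape
    # Each weight matrix must span the full depth of the input.
    assert kc == C, "kernel depth must match input depth"
    h_out = (H - kh) // stride + 1
    w_out = (W - kw) // stride + 1
    out = np.zeros((h_out, w_out, K))
    for k in range(K):                      # one output channel per weight matrix
        for i in range(h_out):
            for j in range(w_out):
                r, c = i * stride, j * stride
                patch = image[r:r + kh, c:c + kw, :]
                out[i, j, k] = np.sum(patch * kernels[k])
    return out

# Toy input: a 6x6 image with 3 channels and four 3x3 kernels (hypothetical values).
image = np.arange(6 * 6 * 3, dtype=float).reshape(6, 6, 3)
kernels = np.ones((4, 3, 3, 3))
out_s1 = conv2d(image, kernels, stride=1)   # window moves one pixel at a time
out_s2 = conv2d(image, kernels, stride=2)   # window moves two pixels at a time
```

With four kernels the output depth is four in both cases, while a larger stride only shrinks the spatial size of the output.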

The weight values in these weight matrices need to be obtained through a large amount of training in practical applications, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.

When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (e.g., 221) tend to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 200 increases, the later convolutional layers (e.g., 226) extract increasingly complex features, such as features with high-level semantics, and features with higher-level semantics are more suitable for the problem to be solved.

Pooling layer:

Since it is often desirable to reduce the number of training parameters, pooling layers are often introduced periodically after convolutional layers. In the layers 221 to 226 illustrated by 220 in fig. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for downsampling the input image to an image of smaller size. The average pooling operator computes the average of the pixel values within a certain range of the image as the result of average pooling. The max pooling operator takes the pixel with the largest value within a certain range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The image output by the pooling layer may be smaller than the image input to it, and each pixel in the output image represents the average value or the maximum value of the corresponding sub-region of the input image.
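As a minimal sketch of the two operators just described, the following NumPy code applies non-overlapping max pooling and average pooling to a small feature map. The function name `pool2d` and the 2×2 window are assumptions made for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over a 2-D feature map of shape (H, W)."""
    H, W = feature_map.shape
    h_out, w_out = H // size, W // size
    # Crop so the map tiles evenly, then view it as (h_out, size, w_out, size) blocks.
    blocks = feature_map[:h_out * size, :w_out * size].reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'avg'")

fm = np.array([[1, 2, 5, 6],
               [3, 4, 7, 8],
               [9, 10, 13, 14],
               [11, 12, 15, 16]], dtype=float)
max_pooled = pool2d(fm, size=2, mode="max")   # each output pixel is the max of a 2x2 region
avg_pooled = pool2d(fm, size=2, mode="avg")   # each output pixel is the mean of a 2x2 region
```

Each output pixel summarizes one 2×2 sub-region of the input, so the 4×4 map shrinks to 2×2.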

Neural network layer 230:

After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the neural network layer 230 to generate an output of one class or of a group of the required number of classes. Accordingly, the neural network layer 230 may include a plurality of hidden layers (231, 232 to 23n shown in fig. 2) and an output layer 240. The parameters contained in the hidden layers may be pre-trained on training data related to a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.

The output layer 240 follows the hidden layers in the neural network layer 230 and is the last layer of the whole convolutional neural network 200. The output layer 240 has a loss function similar to categorical cross-entropy, which is used to calculate the prediction error. Once forward propagation of the whole convolutional neural network 200 (propagation in the direction from 210 to 240 in fig. 2) is completed, back propagation (propagation in the direction from 240 to 210 in fig. 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.

It should be noted that the convolutional neural network 200 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.

The following describes specific embodiments of the present invention in detail.

Fig. 5 is a schematic diagram of a module implementation provided in the embodiment of the present invention. The product form of the embodiment of the invention is a terminal device such as a mobile phone; the method is deployed on a computing node of the relevant device and, through software modification, improves the sense of posture and the aesthetics of portrait photography and intelligent composition in the form of assisted photographing. As shown in fig. 5, the implementation modules mainly include an offline module and an online module. The offline module is divided into two submodules: an image similarity calculation submodule and a CNN feature learning submodule. The online module is divided into two submodules: a CNN feature extraction submodule and an online recommendation submodule. The functions of the modules are described as follows:

(1) Offline module

The offline module is completed before the model is deployed to terminal devices such as mobile phones and can run on any server that meets the training requirements. Its aim is to obtain a lightweight model capable of understanding the environment information of an image and the attribute information of the photographed subject (person), so that the model can make recommendations on terminal devices such as mobile phones.

Image similarity calculation submodule: the human posture pictures in the human posture library are input into the image similarity calculation submodule to obtain the similarity between every two images in the picture library. The similarity comprises a background similarity and a foreground similarity. The background similarity represents the similarity of the environmental scenes of the images (for example, whether both are beach scenes) and is obtained through scene classification and scene analysis. The foreground similarity represents the similarity of subject attributes (for example, whether the subjects are of the same gender and whether their clothing is similar) and is obtained through human body attribute detection. The overall similarity of the pictures is obtained by fusing the foreground similarity and the background similarity, and this overall similarity allows human posture recommendation to be carried out accurately. Optionally, in one embodiment, the fusion of the foreground similarity and the background similarity may be performed by metric learning.

It should be noted that, in one embodiment, the human body posture library is collected in advance, and the library can be subsequently uploaded and expanded by the user.

Searching the internet for recommended posture pictures using an input picture is technically feasible, and combining internet image search (such as Baidu image search) with the method of the present patent may also be useful.

CNN feature learning submodule: according to the image similarity information obtained by the image similarity calculation submodule, a large number of triplet samples are sampled from the human posture library. Each triplet sample <A, P, N> comprises three human posture images: A is a human posture image in the posture library, P is a human posture image that can be directly recommended in the scene captured by image A (high similarity), and N is a human posture image that cannot be directly recommended in the scene captured by image A (low similarity).

Using the large amount of triplet training data, a lightweight CNN feature extraction model is trained by metric learning, so that samples that can be recommended for each other are as close as possible after being mapped to the feature space, and samples that cannot be recommended for each other are as far apart as possible. After training, the metric-learned CNN feature extraction model is used to extract features of a predetermined number of dimensions from each human posture image in the human posture image library (in one possible implementation, this extraction is performed automatically in the background by the CNN feature extraction model), forming a human posture feature library. In one embodiment, the features of a predetermined number of dimensions may be understood as a fixed-length array; for example, if the feature dimension is defined as 10, the feature is [x1, x2, x3, …, x10], and this 10-dimensional feature represents the information of the picture.
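The "close if recommendable, far if not" objective can be made concrete with a small numeric sketch. The code below assumes a hinge-style triplet loss on squared Euclidean distances with a hypothetical margin of 0.2, and uses toy 2-dimensional features in place of real CNN outputs; none of these specifics are stated in the text.

```python
import numpy as np

def triplet_ranking_loss(f_a, f_p, f_n, margin=0.2):
    """Mean hinge loss over a batch of triplets.

    f_a, f_p, f_n: arrays of shape (M, D) holding the features of the
    anchor, positive, and negative image of each of the M triplets.
    """
    d_pos = np.sum((f_a - f_p) ** 2, axis=1)   # anchor-to-positive distance
    d_neg = np.sum((f_a - f_n) ** 2, axis=1)   # anchor-to-negative distance
    # A triplet costs nothing once the negative is farther than the positive by the margin.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Two toy triplets (hypothetical feature values).
f_a = np.array([[0.0, 0.0], [1.0, 1.0]])
f_p = np.array([[0.1, 0.0], [1.0, 0.9]])
f_n = np.array([[1.0, 1.0], [1.0, 1.2]])   # second negative is still too close
loss = triplet_ranking_loss(f_a, f_p, f_n)
```

The first triplet is already well separated and contributes zero, while the second contributes a positive loss because its negative lies almost as close to the anchor as its positive.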

It should be noted that the human body posture gallery stores the original images, which are shown directly to the user at recommendation time, whereas the human body posture feature library stores the features of each original image in the human body posture gallery, which a background algorithm uses to calculate similarity for recommendation.

(2) Online module

CNN feature extraction submodule: in the online stage, the CNN feature extraction model obtained by offline metric-learning training is deployed on a mobile device such as a mobile phone. For the video stream captured by the camera, frames are taken at fixed intervals, the current frame is input into the CNN feature extraction model to extract features of a predetermined number of dimensions, and these features are then input into the online recommendation submodule.

Optionally, the CNN feature extraction model can be packaged into an SDK and downloaded directly to a mobile phone to run.

Online recommendation submodule: the inputs of this submodule are the features of the preview picture, the human posture picture library, and the human posture feature library. The similarity between the features of the preview image and each feature in the feature library is calculated, the similarities are sorted, and the most similar pictures are fed back to the user for selection in a preset preview mode. To speed up similarity calculation and sorting on a device such as a mobile phone, any indexing method can be used, including but not limited to a hash index, a decision tree, and the like.

The method flow of an embodiment of the invention is shown in fig. 6.

(1) Calculate the image similarity. In this step, the similarity between every two pictures in the human posture picture library is calculated. The similarity comprises a background similarity and a foreground similarity. The background similarity represents the similarity of the environmental scenes of the images (for example, whether both are beach scenes) and is obtained through scene classification and scene analysis. The foreground similarity represents the similarity of subject attributes (for example, whether the subjects are of the same gender and whether their clothing is similar) and is obtained through human body attribute detection. The overall similarity of the pictures is obtained by fusing the foreground similarity and the background similarity.

(2) Sample the triplet training samples. For each picture in the human posture library, triplet training samples are collected according to the similarity between every two pictures in the library. For each image in the human posture library, the several most similar images in the recommendation gallery are taken as positive samples (for example, the top K images in the similarity ranking), and the remaining images are taken as negative samples. In this way, a large number of triplet samples can be collected to support the subsequent training process.

(3) Train the CNN feature extraction model. The CNN feature network is trained by metric learning on the sampled triplet training samples. To keep the scene information of the images consistent, a multi-task training scheme combining metric learning and scene classification is adopted, and a ranking loss function and a classification loss function are used together to optimize the model parameters. The labels for the ranking loss function are obtained by triplet sampling, and the scene classification labels may be manually annotated labels or pseudo labels obtained from a scene classification network.

(4) Construct the human posture feature library and extract the features of the preview picture. In this step, the trained CNN feature extraction model is used to extract the image features of the human posture library and the features of the preview image on a device such as a mobile phone. The former can be computed on any server and deployed to the device together with the model, while the latter must be computed on the device in real time.

(5) Online recommendation. Recommendation is made according to the features of the preview picture and the human posture feature library. The similarity between the features of the preview image and each image feature in the feature library is calculated, the similarities are sorted, and the most similar pictures are fed back to the user for selection in a preset preview mode. To speed up similarity calculation and sorting on a device such as a mobile phone, any indexing method can be used, including but not limited to a hash index, a decision tree, and the like.

Embodiment one of the invention

This embodiment of the invention describes a recommendation method and its modules. The main modules of the embodiment are an offline module and an online module. The offline module is divided into two submodules: an image similarity calculation submodule and a CNN feature learning submodule. These two submodules obtain, in an unsupervised manner, an image similarity that is useful for human posture recommendation, and then model the similarity relation using metric learning. The online module is divided into two submodules: a CNN feature extraction submodule and an online recommendation submodule. These two submodules use the CNN feature extraction model obtained by the offline module and are deployed on a mobile device such as a mobile phone to perform real-time online feature extraction and human posture recommendation. The modules of the embodiment are described in detail below:

(1) Offline module

Offline processing is divided into two submodules, the image similarity calculation submodule and the CNN feature learning submodule, which are described in detail below.

Image similarity calculation submodule

The image similarity calculation submodule extracts multiple kinds of useful environment information from an image and fuses this information to compute a similarity targeted at human posture recommendation.

The embodiment of the invention uses three levels of environment information features: scene features, spatial distribution features of objects, and foreground human body features. The scene features are obtained by a pre-trained scene classification network, which in the embodiment of the invention can adopt various architectures, such as the ResNet-152 and DenseNet-161 network structures. The data sets used to train the scene classification network include Places365, the SUN Database, and the like, and cover most scenes of daily life.

The spatial distribution features of objects are obtained by a scene analysis network. The scene analysis network in the embodiment of the present invention may use, but is not limited to, network architectures such as PSP-Net and RefineNet, and the training data sets may include ADE20K and the like.

The human body information is obtained by a human body detection network and a human body attribute network. The human body detection network detects human bodies and outputs human body regions, which serve as the input of the human body attribute network; the human body attribute network identifies the attribute information of each human body, mainly including gender, clothing, and the like. The human body detection and human body attribute networks in the embodiment of the invention can use any high-precision structure. The human body detection model can be trained on public data such as MS COCO, and the human body attribute model can be trained on a database such as PA-100K.

Based on these multiple levels of information, the image similarity is obtained using a multi-stage cascade method, a schematic flow of which is shown in fig. 7.

1 and 2: firstly, giving any picture (the similarity of the computed images in this section is used for generating triple training data, where any picture refers to any image in a training set which may contain the human body posture picture library in fig. 5) and a picture library (the picture library here refers to the training set which may contain the human body posture picture library in fig. 5);

3 and 5: obtaining a candidate similar image set which has similar scenes with the current input image and similar people number and human body attributes (optionally, specific human body attributes comprise gender, clothing and the like) from a picture library according to hard rules of scene classification, human body detection and human body attribute classification,

4: obtaining the characteristics of an image (the image refers to ' any given picture ', and can also be understood as an input picture ') according to a scene analysis network;

6: and calculating the similarity of the object space distribution of the input picture and each picture in the candidate similar image set, then sorting, finally selecting the first K candidate similar images with higher sorting as the similar pictures of the input picture, and taking the rest candidate similar images as the dissimilar pictures. Here, the scene-resolved features can be extracted directly from a specific layer in the pre-trained network, and represent the spatial distribution information of the objects in the image.
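The cascade in steps 1 to 6 can be sketched as a filter followed by a ranking. The sketch below makes several simplifying assumptions that are not in the text: the hard rules are modelled as exact equality of scene label, person count, and attribute tuple; the spatial distribution similarity is taken to be cosine similarity; and the dictionary keys (`scene`, `num_people`, `attributes`, `layout`) are hypothetical names.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def find_similar(query, library, k):
    """Return (similar_ids, dissimilar_ids) for one query picture."""
    # Stage 1: hard rules on scene, number of people, and human attributes.
    candidates = [
        item for item in library
        if item["scene"] == query["scene"]
        and item["num_people"] == query["num_people"]
        and item["attributes"] == query["attributes"]
    ]
    # Stage 2: rank the surviving candidates by spatial layout similarity.
    ranked = sorted(candidates,
                    key=lambda item: cosine(query["layout"], item["layout"]),
                    reverse=True)
    similar = [item["id"] for item in ranked[:k]]
    dissimilar = [item["id"] for item in ranked[k:]]
    return similar, dissimilar

attrs = ("female", "dress")
query = {"id": "q", "scene": "beach", "num_people": 1, "attributes": attrs,
         "layout": np.array([1.0, 0.0, 1.0])}
library = [
    {"id": "a", "scene": "beach", "num_people": 1, "attributes": attrs,
     "layout": np.array([1.0, 0.0, 0.9])},
    {"id": "b", "scene": "beach", "num_people": 1, "attributes": attrs,
     "layout": np.array([0.0, 1.0, 0.0])},
    {"id": "c", "scene": "cafe", "num_people": 1, "attributes": attrs,
     "layout": np.array([1.0, 0.0, 1.0])},   # removed by the scene rule
    {"id": "d", "scene": "beach", "num_people": 1, "attributes": attrs,
     "layout": np.array([1.0, 0.5, 1.0])},
]
similar, dissimilar = find_similar(query, library, k=2)
```

Picture `c` has an identical layout feature but never reaches the ranking stage, which shows why the hard rules are applied first.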

CNN feature learning submodule

Based on the image similarity produced by the image similarity calculation submodule, this submodule samples a large number of triplets for metric learning training, so as to achieve feature fusion. Specifically, each triplet sample <A, P, N> includes three human posture images: A is a human posture image in the posture library, P is a human posture image that can be directly recommended in the scene captured by image A and is called a positive sample, and N is a human posture image that cannot be directly recommended in the scene captured by image A and is called a negative sample. For any picture in the image library, the images that the image similarity calculation submodule finds similar are used as positive samples, and the dissimilar images are used as negative samples. For example, in fig. 8, A and P are pictures that can be recommended for each other, whereas N cannot be recommended for A or P, because N shows a sitting pose in a coffee shop and such a pose cannot be struck in the environment of A and P.

In one embodiment, when the model is trained, each picture corresponds to multiple positive samples and multiple negative samples, so the number of possible triplets is very large; for example, a training set of thirty thousand pictures can generate millions to tens of millions of triplets. In a concrete implementation, however, rules may be set to keep only the more important triplets; for example, only the K1 most similar positive samples and the K2 least similar negative samples are retained for each picture, which limits the number of triplets.
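The K1/K2 rule above bounds the number of triplets per anchor at K1 × K2. A minimal sketch follows, assuming the positives of each anchor are already ordered most similar first and the negatives least similar first; the function name `sample_triplets` and the image identifiers are hypothetical.

```python
def sample_triplets(similar, dissimilar, k1, k2):
    """Build (anchor, positive, negative) triplets.

    similar[a]:    positives of anchor a, most similar first
    dissimilar[a]: negatives of anchor a, least similar first
    At most k1 * k2 triplets are kept per anchor.
    """
    triplets = []
    for anchor, positives in similar.items():
        kept_pos = positives[:k1]
        kept_neg = dissimilar.get(anchor, [])[:k2]
        triplets.extend((anchor, p, n) for p in kept_pos for n in kept_neg)
    return triplets

similar = {"img1": ["img2", "img3", "img4"], "img5": ["img6"]}
dissimilar = {"img1": ["img7", "img8", "img9"], "img5": ["img1", "img2"]}
triplets = sample_triplets(similar, dissimilar, k1=2, k2=2)
```

Here `img1` yields four triplets and `img5` yields two, regardless of how many further positives or negatives exist in the library.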

In this way, a large number of triplet samples can be obtained and used as training data for metric learning. The CNN feature extraction model is trained by metric learning so that images that can be recommended for each other are as close as possible after being mapped to the feature space, and images that cannot be recommended for each other are as far apart as possible. The metric learning model is a triplet network (Triplet Network), whose structure is shown in fig. 8.

The CNN feature extraction model is learned through the triplet network. The triplet network is composed of three CNN branches with shared weights. The CNN can be any lightweight CNN base network that can be deployed to a mobile terminal, including but not limited to ResNet, MobileNet, and the like. The three network branches correspond to the three human posture pictures <A, P, N> in a triplet sample and obtain the feature vectors f(A), f(P), and f(N), respectively, through forward propagation. The invention adopts multi-task training that combines scene classification and metric learning, that is, the model predicts the scene class of each picture while fitting the predefined similarity relation. Assuming that feature extraction by the CNN is represented by the function f(·) and the input triplet is represented by <A, P, N>, the ranking loss function of the triplet network is:

where α is a parameter that defines the distance to be enforced between the positive sample P and the negative sample N during optimization, and M is the number of triplet samples. In addition, each of the three human posture pictures <A, P, N> is assumed to have a scene classification label, which may be a pseudo label obtained from the scene classification network or a correct, manually annotated label. A scene classification loss function is then introduced, as follows:

Here C(·) is a shallow multilayer perceptron used to model the classifier. Thus, both loss functions are optimized simultaneously when the triplet network is trained, which ensures the correctness of the scene and of the similarity at the same time. After training is finished, the features extracted by the CNN can be used directly for human posture recommendation. In addition, the CNN features of the human posture images in the human posture image library need to be extracted offline in advance to construct the human posture feature library, so that efficient matching and recommendation can be performed online.
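The formula images for the two loss functions are not reproduced in this text. As a sketch only, commonly used forms that are consistent with the surrounding definitions of f(·), C(·), α, and M are given below. The squared Euclidean distance, the averaging over M, the cross-entropy form, the label symbol y_X, and the weighting factor λ are all assumptions rather than details confirmed by the source.

```latex
% Ranking (triplet) loss over M triplets <A_i, P_i, N_i> with margin \alpha
L_{\mathrm{rank}} = \frac{1}{M} \sum_{i=1}^{M}
  \max\!\Big(0,\; \lVert f(A_i) - f(P_i) \rVert_2^2
                 - \lVert f(A_i) - f(N_i) \rVert_2^2 + \alpha \Big)

% Scene classification loss: cross-entropy of the classifier C(.) on each image,
% where y_X denotes the (pseudo or manual) scene label of image X
L_{\mathrm{cls}} = -\frac{1}{3M} \sum_{i=1}^{M}
  \sum_{X \in \{A_i, P_i, N_i\}} \log C\big(f(X)\big)_{y_X}

% Joint multi-task objective with a hypothetical weighting factor \lambda
L = L_{\mathrm{rank}} + \lambda \, L_{\mathrm{cls}}
```

Under these assumptions, the ranking term is zero for a triplet once the negative is farther from the anchor than the positive by at least α, and the classification term keeps the features of each image predictive of its scene.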

(2) Online module

Online processing is divided into two submodules, the CNN feature extraction submodule and the online recommendation submodule, which are described in detail below.

CNN feature extraction submodule:

In the online stage, the CNN feature extraction model obtained by offline metric-learning training is deployed on a mobile device such as a mobile phone. Frames are taken at fixed intervals from the video stream captured by the camera. (In the human posture recommendation use case, the input is a video stream, which is used to extract the environmental features currently seen by the camera and to calculate the similarity for posture recommendation.) The frame sampling scheme is directly related to the running time of the model: assuming the model takes t seconds to extract features, 1/t frames per second can be extracted for processing. Each sampled frame is then input into the CNN feature extraction model to extract features of a predetermined number of dimensions. The features used for the current frame may be the CNN features of that single frame, or the features of several adjacent frames may be fused. The resulting features are then input into the online recommendation submodule.
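The relation between extraction time and sampling interval can be sketched as follows. The camera frame rate, the helper names `frame_step` and `fuse_features`, and the choice of simple averaging as the fusion of adjacent frames are assumptions made for illustration; the text does not specify a fusion method.

```python
import math
import numpy as np

def frame_step(camera_fps, extract_seconds):
    """Process every `step`-th frame so that extraction keeps up with the stream.

    With an extraction time of t seconds, about 1/t frames per second can be
    processed, so roughly camera_fps * t camera frames pass per processed frame.
    """
    return max(1, math.ceil(camera_fps * extract_seconds))

def fuse_features(recent_features):
    """Fuse the features of adjacent sampled frames by averaging them."""
    return np.mean(np.stack(recent_features), axis=0)

# Hypothetical numbers: a 30 fps preview stream and 0.25 s per feature extraction.
step = frame_step(camera_fps=30, extract_seconds=0.25)
fused = fuse_features([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
```

In this example about four frames per second are processed, which corresponds to taking every eighth frame of the preview stream.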

Online recommendation submodule:

The inputs of this submodule are the features of the preview image, the human posture picture library, and the human posture feature library. The similarity between the features of the preview image and each feature in the feature library is calculated according to the following formula:

The results are then sorted by similarity, and the most similar pictures are fed back to the user for selection in a preset preview mode. To speed up similarity calculation and sorting on devices such as mobile phones, any indexing method, such as a hash index or a decision tree, can be used.
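The similarity formula itself is not reproduced in this text. The sketch below therefore assumes cosine similarity between the preview feature and each library feature, followed by a brute-force sort; the function name `recommend`, the picture identifiers, and the toy 2-dimensional features are hypothetical. An index structure could replace the exhaustive comparison on a real device.

```python
import numpy as np

def recommend(preview_feature, feature_library, picture_ids, top_k=3):
    """Return the top_k (picture_id, similarity) pairs, most similar first.

    preview_feature: array of shape (D,)
    feature_library: array of shape (num_pictures, D), one row per library picture
    """
    # Normalize so that a dot product equals the cosine similarity.
    lib = feature_library / np.linalg.norm(feature_library, axis=1, keepdims=True)
    query = preview_feature / np.linalg.norm(preview_feature)
    scores = lib @ query
    order = np.argsort(-scores)[:top_k]        # indices sorted by descending score
    return [(picture_ids[i], float(scores[i])) for i in order]

feature_library = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
picture_ids = ["pose_0", "pose_1", "pose_2"]
results = recommend(np.array([1.0, 0.1]), feature_library, picture_ids, top_k=2)
```

The returned identifiers index into the human posture picture library, whose original images are what the user actually sees in the preview.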

Fig. 9a shows the technical effect of the first embodiment of the present invention: the first column of each row is an input picture in a different scene, and the last three columns are human posture pictures recommended by the method of the present invention. In addition, the model realized by this scheme has been successfully deployed on a mobile phone to recommend human postures in real time, and actual field tests show good results. Fig. 9b shows some photos taken by users with the system. The first column is the user's initial pose, the second column is the human posture image recommended by the method, and the third column is the photo taken after the user adjusted the pose with reference to the recommended picture. It can be seen clearly that the photos taken after the user follows the recommendation of the invention have better aesthetics and a better sense of posture.

The embodiment of the invention differs from the prior art in that it uses multi-level features of the image, makes thorough use of the information that is beneficial to human posture recommendation, and defines the similarity used for posture recommendation on the basis of that information. Information fusion and model training are carried out effectively through metric learning, yielding a lightweight and highly accurate scheme that can be deployed on mobile terminals such as mobile phones to recommend postures in real time.

Embodiment two of the invention

This embodiment describes a recommendation flow based on the method. The flow of the method of the second embodiment is shown in fig. 10, and the embodiment of the invention provides an upload mechanism. On one hand, the user can upload personally preferred pictures and add them to the local picture library for use in later photographing, so that these pictures can be recommended in similar scenes afterwards. On the other hand, the embodiment of the invention provides an online sharing mechanism: pictures are added to the picture library in the cloud, and the human posture recommendation picture library is updated for other users to refer to, which further improves the user experience. In both modes, the recommended content is updated automatically according to the pictures in the picture library, so the posture recommendation service can be provided conveniently.

Local intelligent recommendation in the embodiment of the invention is completed entirely on the mobile terminal, such as a mobile phone: both image feature extraction and the online recommendation logic run locally. Matching and recommendation are performed only against the local human posture gallery, and no cloud database is involved. This recommendation scenario does not require the user to upload any information, so user privacy is protected and recommendation is efficient. When the user's local human posture gallery cannot meet the recommendation requirements, the user can switch to the cloud intelligent recommendation mode.

Optionally, the mode can be switched manually. The search can also be run locally and in the cloud at the same time, or locally first and then in the cloud if no suitable picture is found locally. The algorithm of this patent can support all of these modes. Because searching in the cloud requires uploading user data, which could raise a privacy concern if it happened without the user's knowledge, the recommendation mode can be switched manually.

In the cloud intelligent recommendation mode, image feature extraction is still completed locally. The features are then transmitted to a cloud server, matching and recommendation are performed on the remote server according to the recommendation method of the invention, and the recommendation result is returned to the user for previewing and selection in a predefined preview mode. Because the features must be uploaded and the result returned, recommendation efficiency is affected by network bandwidth. However, the human posture gallery in the cloud is usually richer than the local gallery, so the recommendation result is usually better.
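The local, cloud, and combined search modes described above can be sketched as a small dispatcher. Everything named here is hypothetical: the `Mode` enum, `recommend_by_mode`, and the `search_local` / `search_cloud` callables stand in for the on-device and remote matching steps. The sketch assumes that an empty local result means the local gallery could not meet the recommendation requirements, and it merges results in the simultaneous mode without re-ranking, which is a simplification.

```python
from enum import Enum
from typing import Callable, List, Sequence

Feature = Sequence[float]
# A search function takes a feature vector and returns picture IDs, best first.
Search = Callable[[Feature], List[str]]


class Mode(Enum):
    LOCAL = "local"              # match only against the local gallery
    CLOUD = "cloud"              # upload the feature, match on the server
    BOTH = "both"                # search locally and in the cloud together
    LOCAL_FIRST = "local_first"  # fall back to the cloud if local finds nothing


def recommend_by_mode(
    feature: Feature,
    mode: Mode,
    search_local: Search,
    search_cloud: Search,
) -> List[str]:
    """Dispatch a recommendation request according to the selected mode.

    Only the feature vector is handed to search_cloud, never the image,
    since feature extraction is completed locally.
    """
    if mode is Mode.LOCAL:
        return search_local(feature)
    if mode is Mode.CLOUD:
        return search_cloud(feature)
    if mode is Mode.BOTH:
        merged = search_local(feature) + search_cloud(feature)
        return list(dict.fromkeys(merged))  # drop duplicates, keep order
    local_result = search_local(feature)
    return local_result if local_result else search_cloud(feature)
```

Keeping the mode as an explicit argument mirrors the manual switch in the text: the cloud path is only taken when the caller has chosen a mode that allows it.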

A user may obtain personalized human posture pictures in some way, including but not limited to creating custom posture pictures or collecting posture pictures from internet websites. Through the sharing mechanism, the user can share these pictures to the cloud, expanding the cloud picture library for other users to use and refer to. The user can also download favorite human posture pictures from the remote server to expand the local gallery. For each newly added human posture picture, the system automatically extracts its features with the metric learning model of the invention and establishes the correspondence between the original image and its features. Because posture similarity is calculated from the feature vector of each original image, and the original images with the highest similarity are recommended to the user, each original image corresponds to a feature vector with a preset number of dimensions, and the similarity of the feature vectors directly reflects the posture similarity of the original images. A list-like (or dictionary-like) structure is therefore needed to store the ID of each original image in the human posture gallery together with its feature vector. The features are then stored in the human posture feature library, and the original image is stored in the human posture gallery. The embodiment of the invention also provides a mechanism for the user to delete a human posture picture from the local gallery: deleting the corresponding features together with the original picture is sufficient to ensure that the picture is not recommended again.
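The dictionary-like structure described above can be sketched as two maps keyed by picture ID. The class name `PostureLibrary` and the `extract` callable, which stands in for the metric learning model, are hypothetical; the point of the sketch is only that the image and its feature vector are added and deleted together.

```python
from typing import Callable, Dict, List


class PostureLibrary:
    """Keeps the posture gallery and the posture feature library in sync."""

    def __init__(self, extract: Callable[[bytes], List[float]]) -> None:
        self._extract = extract                  # stand-in for the CNN model
        self.images: Dict[str, bytes] = {}       # picture ID -> original image
        self.features: Dict[str, List[float]] = {}  # picture ID -> feature vector

    def add(self, picture_id: str, image: bytes) -> None:
        """Extract features for a new picture and store both entries."""
        feature = self._extract(image)
        self.images[picture_id] = image
        self.features[picture_id] = feature

    def delete(self, picture_id: str) -> None:
        """Remove the image and its features so it cannot be recommended again."""
        self.images.pop(picture_id, None)
        self.features.pop(picture_id, None)
```

Because recommendation only ever iterates over `features`, removing the feature entry is what actually stops a picture from being recommended; removing the image alongside it keeps the two stores consistent.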

When the pictures newly added to the cloud reach a certain scale, they are added to the metric learning training set on the remote server according to the method of the invention. The model is then updated on the basis of both the original data and the newly added data, further improving its robustness and usability, and the updated model is redeployed to users' mobile devices through a system update.
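The update trigger described above can be sketched as a simple threshold check. The class `RetrainScheduler`, its `threshold` parameter, and the `model_version` counter are hypothetical, and the counter increment is only a placeholder for the actual retraining and redeployment, which the text does not detail.

```python
from typing import List


class RetrainScheduler:
    """Collects new cloud pictures and triggers a model update at a threshold."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.pending: List[str] = []       # new pictures not yet trained on
        self.training_set: List[str] = []  # pictures already in the training set
        self.model_version = 0

    def add_picture(self, picture_id: str) -> bool:
        """Record a new picture; return True if this triggered an update."""
        self.pending.append(picture_id)
        if len(self.pending) < self.threshold:
            return False
        # Merge new data into the original training set, then "retrain".
        self.training_set.extend(self.pending)
        self.pending.clear()
        self.model_version += 1  # placeholder for retraining and redeployment
        return True
```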

FIG. 11 is an example of a user uploading a personal preferred picture to join a local picture gallery.

As shown in fig. 11, the user enters the posture recommendation interface (figs. b, c, d) by clicking the posture recommendation icon (the posture icon in fig. 11) on the shooting interface. By default, the interface displays a preset number (9 in this example) of recommended posture pictures (fig. b), with the most preferred picture placed first and marked as selected by default (the first recommended picture, with the check-mark icon, in figs. b/c). If none of the currently recommended pictures satisfies the user, the user can switch to another batch of pictures as shown in fig. b, or pull up to load more as shown in fig. c, so that more recommended pictures are presented. By clicking the upload picture icon (the "add my gesture" icon in fig. d), the user can select a picture from the phone gallery and add it to the posture gallery.

FIG. 12 is another example of a user uploading a personal preferred picture to join a local picture gallery.

As shown in the picture details interface of the phone gallery in fig. 12 (fig. a), the currently displayed picture can be added to the posture gallery by selecting a function menu item (for example, clicking the "add gesture library" menu item shown in fig. b).
