Method, system, and medium for re-illuminating an image using a predicted deep reflection field


Methods, systems, and media for re-illuminating an image using a predicted deep reflection field are provided. In some embodiments, the method comprises: identifying a set of training samples, wherein each training sample comprises (i) a set of one-light-at-a-time (OLAT) images, each captured while one of a plurality of lights arranged on a lighting structure was activated; (ii) a set of spherical color gradient images, each captured while the plurality of lights arranged on the lighting structure were activated to emit respective specific colors; and (iii) a lighting direction, wherein each image in the set of OLAT images and each spherical color gradient image is an image of an object, and wherein the lighting direction indicates a relative orientation of a light with respect to the object; training a convolutional neural network using the set of training samples, wherein training the convolutional neural network comprises, for each training iteration in a series of training iterations and for each training sample in the set of training samples: generating an output predicted image, wherein the output predicted image is a representation of the object associated with the training sample under illumination from the lighting direction associated with the training sample; identifying a ground-truth OLAT image, included in the set of OLAT images of the training sample, that corresponds to the lighting direction of the training sample; calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and updating parameters of the convolutional neural network based on the calculated loss; identifying a test sample comprising a second set of spherical color gradient images and a second lighting direction; and generating, using the trained convolutional neural network, a relit image of the object included in the second set of spherical color gradient images under illumination from the second lighting direction.

1. A method for re-illuminating an image using a deep reflection field, the method comprising:

identifying a set of training samples, wherein each training sample comprises: (i) a set of one-light-at-a-time (OLAT) images, each having been captured when one of a plurality of lights disposed on a lighting structure has been activated; (ii) a set of spherical color gradient images, each having been captured when the plurality of lights disposed on the lighting structure have been activated to emit respective specific colors; and (iii) an illumination direction, wherein each image in the set of OLAT images and each of the spherical color gradient images is an image of an object, and wherein the illumination direction indicates a relative orientation of a light with respect to the object;

training a convolutional neural network using the training sample set, wherein training the convolutional neural network comprises:

for each training iteration in a series of training iterations and for each training sample in the set of training samples:

generating an output predictive image, wherein the output predictive image is a representation of the object associated with the training sample with illumination from the illumination direction associated with the training sample;

identifying a ground-truth OLAT image, included in the set of OLAT images of the training sample, that corresponds to the illumination direction of the training sample;

calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and

updating parameters of the convolutional neural network based on the calculated loss;

identifying a test sample comprising a second set of spherical color gradient images and a second illumination direction; and

generating, using the trained convolutional neural network, a relit image of the object included in each image in the second set of spherical color gradient images with illumination from the second illumination direction.

2. The method of claim 1, wherein the lighting structure is a spherical dome, and wherein the plurality of lights are disposed on a surface of the spherical dome.

3. The method of claim 1 or 2, wherein the loss is calculated using a pre-trained neural network.

4. The method of claim 1 or 2, wherein the loss comprises a first loss component indicating the perceptual difference between the output image and the identified OLAT image based on texture information in each image, and wherein the loss comprises a second loss component indicating the perceptual difference between the output image and the identified OLAT image based on specular reflection information in each image.

5. The method of claim 4, wherein the second loss component is calculated using a trained neural network that has been trained to take OLAT images as input and to generate as output a light direction for a light that generates the OLAT images.

6. The method of any preceding claim, wherein the sets of OLAT images and the sets of spherical color gradient images for each of the training samples are captured from a first plurality of cameras, the first plurality of cameras respectively having viewpoints from a first plurality of viewpoints, and wherein the second set of spherical color gradient images corresponding to the test sample is captured from cameras having viewpoints not included in the first plurality of viewpoints.

7. The method of any preceding claim, further comprising generating an aligned ground-truth OLAT image prior to calculating the loss, wherein the loss is calculated using the aligned ground-truth OLAT image.

8. A system for re-illuminating an image using a deep reflection field, the system comprising:

a memory; and

a hardware processor that, when executing computer-executable instructions stored in the memory, is configured to perform operations comprising:

identifying a set of training samples, wherein each training sample comprises: (i) a set of one-light-at-a-time (OLAT) images, each having been captured when one of a plurality of lights disposed on a lighting structure has been activated; (ii) a set of spherical color gradient images, each having been captured when the plurality of lights disposed on the lighting structure have been activated to emit respective specific colors; and (iii) an illumination direction, wherein each image in the set of OLAT images and each of the spherical color gradient images is an image of an object, and wherein the illumination direction indicates a relative orientation of a light with respect to the object;

training a convolutional neural network using the training sample set, wherein training the convolutional neural network comprises:

for each training iteration in a series of training iterations and for each training sample in the set of training samples:

generating an output predictive image, wherein the output predictive image is a representation of the object associated with the training sample with illumination from the illumination direction associated with the training sample;

identifying a ground-truth OLAT image, included in the set of OLAT images of the training sample, that corresponds to the illumination direction of the training sample;

calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and

updating parameters of the convolutional neural network based on the calculated loss;

identifying a test sample comprising a second set of spherical color gradient images and a second illumination direction; and

generating, using the trained convolutional neural network, a relit image of the object included in each image in the second set of spherical color gradient images with illumination from the second illumination direction.

9. The system of claim 8, wherein the lighting structure is a spherical dome, and wherein the plurality of lights are disposed on a surface of the spherical dome.

10. The system of claim 8 or 9, wherein the loss is calculated using a pre-trained neural network.

11. The system of claim 8 or 9, wherein the loss comprises a first loss component indicating the perceptual difference between the output image and the identified OLAT image based on texture information in each image, and wherein the loss comprises a second loss component indicating the perceptual difference between the output image and the identified OLAT image based on specular reflection information in each image.

12. The system of claim 11, wherein the second loss component is calculated using a trained neural network that has been trained to take an OLAT image as an input and to generate a light direction for a light that generates the OLAT image as an output.

13. The system of any of claims 8 to 12, wherein the sets of OLAT images and the sets of spherical color gradient images for each of the training samples are captured from a first plurality of cameras that respectively have viewpoints from a first plurality of viewpoints, and wherein the second set of spherical color gradient images corresponding to the test sample is captured from cameras that have viewpoints not included in the first plurality of viewpoints.

14. The system of any of claims 8 to 13, wherein the hardware processor is further configured to generate an aligned ground-truth OLAT image prior to calculating the loss, wherein the loss is calculated using the aligned ground-truth OLAT image.

15. A non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for re-illuminating an image using a deep reflection field, the method comprising:

identifying a set of training samples, wherein each training sample comprises: (i) a set of one-light-at-a-time (OLAT) images, each having been captured when one of a plurality of lights disposed on a lighting structure has been activated; (ii) a set of spherical color gradient images, each having been captured when the plurality of lights disposed on the lighting structure have been activated to emit respective specific colors; and (iii) an illumination direction, wherein each image in the set of OLAT images and each of the spherical color gradient images is an image of an object, and wherein the illumination direction indicates a relative orientation of a light with respect to the object;

training a convolutional neural network using the training sample set, wherein training the convolutional neural network comprises:

for each training iteration in a series of training iterations and for each training sample in the set of training samples:

generating an output predictive image, wherein the output predictive image is a representation of the object associated with the training sample with illumination from the illumination direction associated with the training sample;

identifying a ground-truth OLAT image, included in the set of OLAT images of the training sample, that corresponds to the illumination direction of the training sample;

calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and

updating parameters of the convolutional neural network based on the calculated loss;

identifying a test sample comprising a second set of spherical color gradient images and a second illumination direction; and

generating, using the trained convolutional neural network, a relit image of the object included in each image in the second set of spherical color gradient images with illumination from the second illumination direction.

16. The non-transitory computer-readable medium of claim 15, wherein the lighting structure is a spherical dome, and wherein the plurality of lights are disposed on a surface of the spherical dome.

17. The non-transitory computer-readable medium of claim 15 or 16, wherein the loss is calculated using a pre-trained neural network.

18. The non-transitory computer-readable medium of claim 15 or 16, wherein the loss includes a first loss component indicating the perceptual difference between the output image and the identified OLAT image based on texture information in each image, and wherein the loss includes a second loss component indicating the perceptual difference between the output image and the identified OLAT image based on specular reflection information in each image.

19. The non-transitory computer-readable medium of claim 18, wherein the second loss component is calculated using a trained neural network that has been trained to take an OLAT image as an input and to generate a light direction for a light that generates the OLAT image as an output.

20. The non-transitory computer-readable medium of any of claims 15 to 19, wherein the sets of OLAT images and the sets of spherical color gradient images for each of the training samples are captured from a first plurality of cameras that respectively have viewpoints from a first plurality of viewpoints, and wherein the second set of spherical color gradient images corresponding to the test sample is captured from cameras that have viewpoints not included in the first plurality of viewpoints.

21. The non-transitory computer-readable medium of any of claims 15 to 19, wherein the method further comprises generating an aligned ground-truth OLAT image prior to calculating the loss, wherein the loss is calculated using the aligned ground-truth OLAT image.

22. A method for re-illuminating an image using a deep reflection field, the method comprising:

identifying a test sample comprising at least a first spherical color gradient image and a second spherical color gradient image of an object and a first illumination direction; and

generating, using a trained convolutional neural network, a relit image of the object included in each of the spherical color gradient images with illumination from the first illumination direction, the convolutional neural network having been trained by:

identifying a set of training samples, wherein each training sample comprises: (i) a set of one-light-at-a-time (OLAT) images, each having been captured when one of a plurality of lights disposed on a lighting structure has been activated; (ii) a set of spherical color gradient images, each having been captured when the plurality of lights disposed on the lighting structure have been activated to emit respective specific colors; and (iii) an illumination direction, wherein each image in the set of OLAT images and each of the spherical color gradient images is an image of an object, and wherein the illumination direction indicates a relative orientation of a light with respect to the object; and

training the convolutional neural network using the training sample set, wherein training the convolutional neural network comprises:

for each training iteration in a series of training iterations and for each training sample in the set of training samples:

generating an output predictive image, wherein the output predictive image is a representation of the object associated with the training sample with illumination from the illumination direction associated with the training sample;

identifying a ground-truth OLAT image, included in the set of OLAT images of the training sample, that corresponds to the illumination direction of the training sample;

calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and

updating parameters of the convolutional neural network based on the calculated loss.

23. The method of claim 22, wherein the first spherical color gradient image of the object has a lamp color with RGB values ((1+θ_x)/2, (1+θ_y)/2, (1+θ_z)/2), wherein the second spherical color gradient image has a lamp color with RGB values ((1-θ_x)/2, (1-θ_y)/2, (1-θ_z)/2), and wherein θ is a direction vector of the lamp.

Technical Field

The disclosed subject matter relates to methods, systems, and media for re-illuminating an image using a predicted deep reflection field.

Background

Photo-realistic relighting of human faces is a highly sought-after capability, with applications ranging from visual effects to truly immersive virtual experiences. However, a human user is typically able to distinguish a real face from a synthetic rendering. Photo-realistically relighting any human face is therefore a challenge with many difficulties, including modeling sub-surface scattering and blood flow and estimating the interaction between light and individual strands of hair, and is consequently laborious and computationally intensive. In addition, many current methods require a large set of input images and are therefore not suitable for use with moving or dynamic objects.

Therefore, there is a need to provide new methods, systems, and media for re-illuminating an image using a predicted deep reflection field.

Disclosure of Invention

Methods, systems, and media for re-illuminating an image using a predicted deep reflection field are provided.

According to a first aspect of the disclosed subject matter, there is provided a method for re-illuminating an image using a deep reflection field, the method comprising: identifying a set of training samples, wherein each training sample comprises (i) a set of one-light-at-a-time (OLAT) images, each captured when one of a plurality of lights arranged on a lighting structure has been activated; (ii) a set of spherical color gradient images, each captured when the plurality of lights arranged on the lighting structure have been activated to emit respective specific colors; and (iii) a lighting direction, wherein each image in the set of OLAT images and each spherical color gradient image is an image of an object, and wherein the lighting direction indicates a relative orientation of a light with respect to the object; training a convolutional neural network using the set of training samples; identifying a test sample comprising a second set of spherical color gradient images and a second lighting direction; and generating, using the trained convolutional neural network, a relit image of the object included in the second set of spherical color gradient images with illumination from the second lighting direction. In some embodiments, training the convolutional neural network comprises, for each training iteration in a series of training iterations and for each training sample in the set of training samples: generating an output predicted image, wherein the output predicted image is a representation of the object associated with the training sample with illumination from the lighting direction associated with the training sample; identifying a ground-truth OLAT image corresponding to the lighting direction of the training sample and included in the set of OLAT images of the training sample; calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and updating parameters of the convolutional neural network based on the calculated loss.

According to a second aspect of the disclosed subject matter, there is provided a method for training a convolutional neural network to re-illuminate an image using a deep reflection field, the method comprising: identifying a set of training samples, wherein each training sample comprises (i) a set of one-light-at-a-time (OLAT) images, each captured when one of a plurality of lights arranged on a lighting structure has been activated; (ii) a set of spherical color gradient images, each captured when the plurality of lights arranged on the lighting structure have been activated to emit respective specific colors; and (iii) a lighting direction, wherein each image in the set of OLAT images and each spherical color gradient image is an image of an object, and wherein the lighting direction indicates a relative orientation of a light with respect to the object; and training the convolutional neural network using the set of training samples. Training the convolutional neural network comprises, for each training iteration in a series of training iterations and for each training sample in the set of training samples: generating an output predicted image, wherein the output predicted image is a representation of the object associated with the training sample with illumination from the lighting direction associated with the training sample; identifying a ground-truth OLAT image corresponding to the lighting direction of the training sample and included in the set of OLAT images of the training sample; calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and updating parameters of the convolutional neural network based on the calculated loss.

According to a third aspect of the disclosed subject matter, there is provided a method for re-illuminating an image using a deep reflection field, the method comprising: identifying a test sample comprising at least a first spherical color gradient image and a second spherical color gradient image of an object and a first lighting direction; and generating, using a trained convolutional neural network, a relit image of the object included in each spherical color gradient image with illumination from the first lighting direction. In some embodiments, the convolutional neural network is trained by: identifying a set of training samples, wherein each training sample comprises (i) a set of one-light-at-a-time (OLAT) images, each captured when one of a plurality of lights arranged on a lighting structure has been activated; (ii) a set of spherical color gradient images, each captured when the plurality of lights arranged on the lighting structure have been activated to emit respective specific colors; and (iii) a lighting direction, wherein each image in the set of OLAT images and each spherical color gradient image is an image of the object, and wherein the lighting direction indicates a relative orientation of a light with respect to the object; and training the convolutional neural network using the set of training samples, wherein training the convolutional neural network comprises, for each training iteration in a series of training iterations and for each training sample in the set of training samples: generating an output predicted image, wherein the output predicted image is a representation of the object associated with the training sample with illumination from the lighting direction associated with the training sample; identifying a ground-truth OLAT image corresponding to the lighting direction of the training sample and included in the set of OLAT images of the training sample; calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and updating parameters of the convolutional neural network based on the calculated loss.

In the first, second or third aspect, in some embodiments, the lighting structure is a spherical dome, and the plurality of lamps are arranged on a surface of the spherical dome.

In the first, second or third aspect, in some embodiments, the loss is calculated using a pre-trained neural network.

In the first, second or third aspect, in some embodiments, the loss comprises a first loss component indicating a perceptual difference between the output image and the identified OLAT image based on low frequency information in each image, and the loss comprises a second loss component indicating a perceptual difference between the output image and the identified OLAT image based on high frequency information in each image. In some embodiments, the second loss component is calculated using a trained neural network that has been trained to take OLAT images as input, and to generate as output the light direction of the lights used to generate the OLAT images.
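
As a rough illustration only, a two-component loss of this kind might be combined as in the following sketch. The specific feature extractors `feature_net` and `light_direction_net`, the use of an L1 distance, and the weighting are assumptions made for the example and are not details taken from this disclosure.

```python
import torch.nn.functional as F

def combined_loss(predicted, ground_truth_olat, feature_net, light_direction_net,
                  w_low=1.0, w_high=1.0):
    """Hypothetical two-component perceptual loss.

    predicted, ground_truth_olat: (N, 3, H, W) image tensors.
    feature_net: a pre-trained network whose features capture low-frequency /
        texture content (e.g., a VGG-style feature extractor).
    light_direction_net: a network pre-trained to regress the light direction
        from an OLAT image; comparing its outputs (or intermediate activations)
        emphasizes high-frequency / specular content.
    """
    # First component: perceptual difference in a texture-oriented feature space.
    low_freq_loss = F.l1_loss(feature_net(predicted), feature_net(ground_truth_olat))

    # Second component: difference as seen by the light-direction network,
    # which is sensitive to specular highlights.
    high_freq_loss = F.l1_loss(light_direction_net(predicted),
                               light_direction_net(ground_truth_olat))

    return w_low * low_freq_loss + w_high * high_freq_loss
```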

In the first, second or third aspects, in some embodiments, the set of OLAT images and the set of spherical color gradient images for each training sample are captured from a first plurality of cameras that respectively have viewpoints from a first plurality of viewpoints, and the second set of spherical color gradient images corresponding to the test sample are captured from cameras that have viewpoints not included in the first plurality of viewpoints.

In the first, second or third aspect, in some embodiments, the method further comprises generating an aligned ground-truth OLAT image prior to calculating the loss, wherein the loss is calculated using the aligned ground-truth OLAT image.

In the first, second or third aspect, in some embodiments, the first spherical color gradient image of the object has a lamp color with RGB values ((1+θ_x)/2, (1+θ_y)/2, (1+θ_z)/2), the second spherical color gradient image has a lamp color with RGB values ((1-θ_x)/2, (1-θ_y)/2, (1-θ_z)/2), and θ is a direction vector of the lamp.

In accordance with some embodiments of the disclosed subject matter, there is provided a system for re-illuminating an image using a deep reflection field, the system comprising: a memory; and a hardware processor that, when executing computer-executable instructions stored in the memory, is configured to: identify a set of training samples, wherein each training sample comprises (i) a set of one-light-at-a-time (OLAT) images, each captured when one of a plurality of lights arranged on a lighting structure has been activated; (ii) a set of spherical color gradient images, each captured when the plurality of lights arranged on the lighting structure have been activated to emit respective specific colors; and (iii) a lighting direction, wherein each image in the set of OLAT images and each spherical color gradient image is an image of an object, and wherein the lighting direction indicates a relative orientation of a light with respect to the object; train a convolutional neural network using the set of training samples; identify a test sample comprising a second set of spherical color gradient images and a second lighting direction; and generate, using the trained convolutional neural network, a relit image of the object included in the second set of spherical color gradient images with illumination from the second lighting direction. In some embodiments, training the convolutional neural network comprises, for each training iteration in a series of training iterations and for each training sample in the set of training samples: generating an output predicted image, wherein the output predicted image is a representation of the object associated with the training sample with illumination from the lighting direction associated with the training sample; identifying a ground-truth OLAT image corresponding to the lighting direction of the training sample and included in the set of OLAT images of the training sample; calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and updating parameters of the convolutional neural network based on the calculated loss.

In accordance with some embodiments of the disclosed subject matter, there is provided a system comprising a memory; and a hardware processor, which when executing computer executable instructions stored in the memory is configured to perform a method according to any aspect or embodiment described herein.

In accordance with some embodiments of the disclosed subject matter, there is provided a non-transitory computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for re-illuminating an image using a deep reflection field, the method comprising: identifying a set of training samples, wherein each training sample comprises (i) a set of one-light-at-a-time (OLAT) images, each captured when one of a plurality of lights arranged on a lighting structure has been activated; (ii) a set of spherical color gradient images, each captured when the plurality of lights arranged on the lighting structure have been activated to emit respective specific colors; and (iii) a lighting direction, wherein each image in the set of OLAT images and each spherical color gradient image is an image of an object, and wherein the lighting direction indicates a relative orientation of a light with respect to the object; training a convolutional neural network using the set of training samples; identifying a test sample comprising a second set of spherical color gradient images and a second lighting direction; and generating, using the trained convolutional neural network, a relit image of the object included in the second set of spherical color gradient images with illumination from the second lighting direction. In some embodiments, training the convolutional neural network comprises, for each training iteration in a series of training iterations and for each training sample in the set of training samples: generating an output predicted image, wherein the output predicted image is a representation of the object associated with the training sample with illumination from the lighting direction associated with the training sample; identifying a ground-truth OLAT image corresponding to the lighting direction of the training sample and included in the set of OLAT images of the training sample; calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and updating parameters of the convolutional neural network based on the calculated loss.

According to some embodiments of the disclosed subject matter, there is provided a computer-readable medium containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method according to any aspect or embodiment described herein. In some embodiments, the computer readable medium may be a non-transitory computer readable medium.

In accordance with some embodiments of the disclosed subject matter, there is provided a system for re-illuminating an image using a deep reflection field, the system comprising: means for identifying a set of training samples, wherein each training sample comprises (i) a set of one-light-at-a-time (OLAT) images, each captured when one of a plurality of lights arranged on a lighting structure has been activated; (ii) a set of spherical color gradient images, each captured when the plurality of lights arranged on the lighting structure have been activated to emit respective specific colors; and (iii) a lighting direction, wherein each image in the set of OLAT images and each spherical color gradient image is an image of an object, and wherein the lighting direction indicates a relative orientation of a light with respect to the object; means for training a convolutional neural network using the set of training samples; means for identifying a test sample comprising a second set of spherical color gradient images and a second lighting direction; and means for generating, using the trained convolutional neural network, a relit image of the object included in the second set of spherical color gradient images with illumination from the second lighting direction. In some embodiments, the means for training the convolutional neural network comprises, for each training iteration in a series of training iterations and for each training sample in the set of training samples: means for generating an output predicted image, wherein the output predicted image is a representation of the object associated with the training sample with illumination from the lighting direction associated with the training sample; means for identifying a ground-truth OLAT image corresponding to the lighting direction of the training sample and included in the set of OLAT images of the training sample; means for calculating a loss indicative of a perceptual difference between the output predicted image and the identified ground-truth OLAT image; and means for updating parameters of the convolutional neural network based on the calculated loss.

Drawings

Various objects, features and advantages of the disclosed subject matter can be more fully understood by reference to the following detailed description of the disclosed subject matter when considered in connection with the following drawings, wherein like reference numerals refer to like elements.

Fig. 1 shows an illustrative example of a process for re-illuminating an image using a predicted deep reflection field, in accordance with some embodiments of the disclosed subject matter.

Fig. 2 shows an illustrative example of a process for generating training samples for training a network to re-illuminate an image, in accordance with some embodiments of the disclosed subject matter.

Fig. 3 shows an illustrative example of a process for training a neural network to re-illuminate an image with a predicted deep reflectance field in accordance with some embodiments of the disclosed subject matter.

Fig. 4 shows a schematic diagram of an illustrative system suitable for implementing the mechanisms described herein to re-illuminate an image using a predicted deep reflection field, in accordance with some embodiments of the disclosed subject matter.

Fig. 5 illustrates a detailed example of hardware that may be used in the server and/or user device of fig. 4 in accordance with some embodiments of the disclosed subject matter.

Fig. 6 shows a schematic diagram of an illustrative architecture of a neural network that may be trained to re-illuminate an image using a predicted deep reflectance field, in accordance with some embodiments of the disclosed subject matter.

Detailed Description

According to various embodiments, mechanisms (which may include methods, systems, and media) are provided for re-illuminating an image using a predicted deep reflection field.

In some embodiments, the mechanisms described herein may use spherical color gradient images of an object to generate a relit image of the object. Additional details regarding spherical color gradient images can be found, for example, in "Cosine Lobe Based Relighting from Gradient Illumination Photographs" by G. Fyffe et al., SIGGRAPH '09. In some embodiments, a spherical color gradient image may be captured using a set of lights (e.g., two hundred lights, three hundred lights, etc.) arranged on a lighting structure (such as a spherical dome) and each programmed to emit light of a particular color and intensity. In some embodiments, the mechanisms described herein may generate images of an object in a lighting environment with illumination from a specified illumination direction relative to the object using a relatively small number (e.g., two and/or any other suitable number) of spherical color gradient images of the object, and are therefore less computationally intensive than current methods. It should be noted that, in some embodiments, the object may be any suitable object, such as a portion of a human (e.g., a face and/or any other suitable portion), an animal, a plant, and/or any other suitable object. Additionally, it should be noted that, because a small number of spherical color gradient images of an object may be captured in a short time, by using a relatively small number of spherical color gradient images as input, the mechanisms described herein may generate a relit image of the object without requiring the object to sit or remain stationary for a long time. Thus, the mechanisms described herein may be used with moving or dynamic objects as well as stationary objects.

The mechanisms described herein may use a trained neural network that has been trained to learn a complete 4D reflectance field from spherical color gradient images to generate relit images of an object from spherical color gradient images of the object. In particular, the neural network may be trained using both spherical color gradient images, captured with all of the lights of the lighting structure activated, and a one-light-at-a-time (OLAT) set of images, captured with each light of the lighting structure activated individually. The neural network can then be trained to reconstruct, from the spherical color gradient images, a particular OLAT image with light emanating from a particular illumination direction. For example, in some embodiments, a neural network may be trained to reconstruct an OLAT image corresponding to a particular light located at a particular orientation relative to the object, the particular orientation being associated with a particular illumination direction (e.g., illumination from a light source located at a particular spherical coordinate and/or any other suitable location). It should be noted that, after training, the neural network may be used to generate a relit image of the object using only the spherical color gradient images. That is, by using the spherical color gradient images and the OLAT images during training to learn the mapping from a particular illumination direction to the corresponding OLAT image, the neural network may generate a relit image using only the spherical color gradient images. It should be noted that a more detailed technique for training the neural network is shown in figs. 3 and 6 and described below in connection with figs. 3 and 6.
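
For illustration, a minimal inference sketch in Python/PyTorch is given below. The input layout (two RGB gradient images stacked along the channel axis), the tiling of the illumination direction into per-pixel channels, and the `model` interface are assumptions made for the sake of the example and may differ from the network actually described in this disclosure.

```python
import torch

def relight(model, gradient_images, light_direction):
    """Predict a relit (OLAT-like) image from spherical color gradient images.

    gradient_images: (N, 6, H, W) tensor holding two RGB spherical color
        gradient images stacked along the channel axis (assumed layout).
    light_direction: (N, 3) unit vectors from the object toward the light.
    model: a trained convolutional network mapping the combined input to an
        (N, 3, H, W) relit image.
    """
    n, _, h, w = gradient_images.shape
    # One plausible conditioning: tile the direction into a per-pixel map and
    # concatenate it with the image channels.
    direction_map = light_direction.view(n, 3, 1, 1).expand(n, 3, h, w)
    net_input = torch.cat([gradient_images, direction_map], dim=1)  # (N, 9, H, W)
    with torch.no_grad():
        return model(net_input)  # (N, 3, H, W) relit image
```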

In some embodiments, the spherical color gradient image and the OLAT image may be captured in any suitable manner. For example, in some embodiments, the spherical color gradient image and the OLAT image may be captured using a lighting environment generated by any one of a set of lights arranged on a lighting structure (e.g., a spherical dome, a lighting arm, and/or any other suitable type of lighting structure). As a more specific example, as described below in connection with fig. 2, a spherical color gradient image may have a lighting environment generated by activating each lamp of the lighting structure at a particular intensity and color, respectively. As another more specific example, the OLAT images may each have a lighting environment generated by activating a single lamp of the lighting structure, as described below in connection with fig. 2. It should be noted that in some embodiments, each image may be captured by any suitable camera having any suitable viewpoint of the object.

It should be noted that, in some embodiments, each OLAT image may correspond to activation of a particular light of the lighting structure, which in turn corresponds to a particular lighting direction. In some embodiments, the relit image may be an image having a lighting environment corresponding to a light whose lighting direction matches one of the OLAT images. For example, where a particular OLAT image corresponds to a light source having particular spherical coordinates (e.g., (r, 30°, 60°) and/or any other suitable spherical coordinates), the relit image may be an image of the object lit with a light source at the same spherical coordinates. Additionally or alternatively, in some embodiments, the relit image may be an image having a lighting environment corresponding to a light whose lighting direction is not included in any of the OLAT images used during training of the neural network. That is, in some embodiments, the neural network may be trained such that it can interpolate between the illumination directions included in the OLAT images used during training. Additionally, it should be noted that, in some embodiments, the images used during training of the neural network (e.g., OLAT images and/or spherical color gradient images) may be captured from cameras in a set of cameras that each have a particular viewpoint of the object. For example, in some embodiments, the images used during training may be captured from one of five cameras that each have a different orientation relative to the object. In some embodiments, a relit image generated by the trained neural network may be generated using spherical color gradient images of the object captured from a camera that was not used to capture the training images. That is, in some embodiments, the neural network may learn to generalize across viewpoints during training.

Turning to fig. 1, an illustrative example 100 of a process for re-illuminating an image using a predicted deep reflection field is shown in accordance with some embodiments of the disclosed subject matter. In some embodiments, the blocks of process 100 may be performed on any suitable device, such as a server, desktop computer, laptop computer, and/or any other suitable device. In some embodiments, multiple devices may be used to perform the blocks of process 100.

The process 100 may begin at 102 by identifying a training sample set. In some embodiments, each training sample may include a set of images of the subject under different lighting conditions. For example, in some embodiments, the set of images may include images of an object captured using a set of lights of a lighting structure (e.g., lights disposed on a spherical dome, lights disposed on a lighting arm, and/or any other suitable type of lighting structure). As a more specific example, in some embodiments, the set of images may include an OLAT image set, where each OLAT image is captured with one lamp of the set of lamps of the lighting structure activated. In some such embodiments, the set of OLAT images may include any suitable number of OLAT images (e.g., one hundred, two hundred, three hundred, four hundred, and/or any other suitable number). It should be noted that in some embodiments, the number of OLAT images in an OLAT image group may correspond to the number of lights on the lighting structure. For example, in the case of three hundred lamps on a lighting structure, there may be three hundred OLAT images in an OLAT image set. As another more specific example, in some embodiments, the set of images may include a set of spherical color gradient images. In some embodiments, the set of spherical color gradient images may include any suitable number of spherical color gradient images (e.g., two, three, four, and/or any other suitable number). In some embodiments, the spherical color gradient image may be an image of an object captured using any of the lights of the lighting structure, which may each be programmed to have any suitable color (e.g., any suitable RGB color value and/or any other suitable color) and/or intensity.

It should be noted that in some embodiments, the set of images associated with a particular training sample may be captured from cameras having the same viewpoint of the subject. For example, in some embodiments, the set of OLAT images and the set of spherical color gradient images may be captured separately using cameras having the same orientation relative to the object. Additionally, it should be noted that in some embodiments, the set of images corresponding to different training samples may be captured by different cameras having different viewpoints of the subject. For example, in some embodiments, a first set of images corresponding to a first training sample may be captured using a first camera located at a particular position relative to the subject, and a second set of images corresponding to a second training sample may be captured using a second camera located at a different position relative to the subject. As a more specific example, where the illumination structure is a spherical dome, the first camera may be located directly in front of the object, while the second camera may be located to the side of the object. It should be noted that in some embodiments, any suitable number of cameras, each located at any suitable location, may be used to capture images associated with a training sample. Additionally, it should be noted that the training sample set may include images of any suitable number of subjects (e.g., five, ten, twenty, and/or any other suitable number). In some embodiments, the object may correspond to any suitable entity, such as a human being, a human face, an animal, an object (e.g., a plant, furniture, etc.), and/or any other suitable object of an image.
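
As an illustration of how one training sample might be organized in code, a hypothetical container is sketched below; the array layouts, field names, and identifiers are assumptions for the example and are not part of this disclosure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    """Hypothetical container for one training sample as described above."""
    olat_images: np.ndarray       # (num_lights, H, W, 3): one image per activated light
    gradient_images: np.ndarray   # (num_gradients, H, W, 3): e.g., num_gradients == 2
    light_directions: np.ndarray  # (num_lights, 3): unit direction per OLAT image
    camera_id: int                # index of the camera (viewpoint) that captured the set
    subject_id: int               # index of the photographed subject
```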

It should be noted that the techniques for capturing the images associated with each training sample are described in more detail below in conjunction with fig. 2.

In some embodiments, the process 100 may identify the training sample set in any suitable manner. For example, in some embodiments, process 100 may access a database of images and may construct a training sample set based on the retrieved images. It should be noted that, in some embodiments, the process 100 may use the retrieved images to construct a training sample set and a validation sample set. In some such embodiments, the validation sample set may be similar to the training sample set, but may be disregarded during training of the neural network, as described below in connection with block 104. Additionally, it should be noted that, in some embodiments, the training sample set may include images that are all captured using a particular set of cameras. In some such embodiments, the validation sample set may be constructed to include images captured with cameras that are not included in the set of cameras used to capture the images in the training sample set.

At 104, the process 100 may train the neural network, using the training samples, to generate an image of the object relit with illumination from an indicated lighting environment. In some embodiments, the neural network may be trained to take any suitable image of the object as an input and generate an image of the object relit using the indicated lighting environment as an output. For example, in some embodiments, the neural network may be trained to take as input spherical color gradient images of the object and an indicated lighting environment (e.g., lighting of a particular color from a particular direction, and/or any other suitable lighting environment), and to produce as output an image of the object relit using the indicated lighting environment.

In some embodiments, the neural network may have any suitable type of architecture. For example, in some embodiments, the neural network may be an encoder-decoder network. As a more specific example, in some embodiments, the neural network may be a U-Net convolutional neural network that uses an encoder path (which includes any suitable number of encoder layers) to iteratively generate feature maps associated with the images corresponding to each training sample and uses a corresponding decoder path (having a number of decoding layers corresponding to the number of encoding layers) that uses the feature maps generated by the encoder layers to generate output images. It should be noted that an example of such a U-Net architecture is shown in fig. 6 and described below in connection with fig. 6. Additionally, it should be noted that a detailed technique for training the neural network is shown in fig. 3 and described below in connection with fig. 3.
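
The following is a heavily simplified sketch of a U-Net-style encoder-decoder with skip connections, intended only to illustrate the encoder path, decoder path, and skip connections mentioned above; the depth, channel counts, and conditioning scheme of the network in this disclosure are not reproduced here.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Illustrative U-Net-style encoder-decoder (assumes H and W divisible by 4)."""

    def __init__(self, in_channels=9, out_channels=3, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_channels, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = nn.Sequential(nn.Conv2d(base * 4, base * 2, 3, padding=1), nn.ReLU())
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(base, out_channels, 1)

    def forward(self, x):
        e1 = self.enc1(x)   # full-resolution features
        e2 = self.enc2(e1)  # 1/2 resolution
        e3 = self.enc3(e2)  # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))  # skip connection from e2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection from e1
        return self.out(d1)
```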

At 106, the process 100 may use the trained neural network to generate a relit image of the object, taking as input images of the object captured under spherical gradient illumination and an indicated lighting environment. In some embodiments, the indicated lighting environment may correspond to any suitable lighting environment. For example, in some embodiments, the indicated lighting environment may include one or more lights from a particular direction relative to the object. It should be noted that, in some embodiments, the position of a light may be indicated in any suitable manner, such as using spherical coordinates relative to the center of a spherical illumination dome, using spherical coordinates relative to the position of the object, and/or in any other suitable manner. It should be noted that, in some embodiments, each illumination direction of the one or more lights may correspond to a particular position of a light associated with an OLAT image included in the training samples. Alternatively, in some embodiments, the illumination direction may be a direction not included in the OLAT images. For example, where the set of OLAT images included in the training samples includes a first OLAT image having a light at spherical coordinates (5, 30°, 65°) and a second OLAT image having a light at spherical coordinates (5, 35°, 65°), the neural network may generate an output image having an illumination direction interpolated between the two OLAT images (e.g., having spherical coordinates (5, 32°, 65°), and/or any other suitable interpolated coordinates). Additionally, it should be noted that, in some embodiments, the indicated lighting environment may include a superposition of multiple lights, each having an indicated illumination direction (each of these lights may correspond to an OLAT image in the training samples or may be in a direction interpolated between OLAT images in the training samples). In some embodiments, the lighting environment may also indicate the color of each of the one or more lights, for example using RGB values and/or in any other suitable manner.
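
One plausible way to realize the superposition described above, assuming a per-direction prediction helper such as the relight() sketch given earlier, is to sum the per-light predictions scaled by each light's RGB color; the weighting and normalization here are assumptions rather than details of this disclosure.

```python
import torch

def render_environment(model, gradient_images, lights):
    """Compose a relit image for a multi-light environment.

    lights: iterable of (direction, rgb) pairs, where direction is a (3,)
        unit vector and rgb is a (3,) color/intensity for that light.
    """
    result = None
    for direction, rgb in lights:
        d = torch.as_tensor(direction, dtype=torch.float32).view(1, 3)
        olat_pred = relight(model, gradient_images, d)       # (1, 3, H, W), see earlier sketch
        tint = torch.as_tensor(rgb, dtype=torch.float32).view(1, 3, 1, 1)
        contribution = olat_pred * tint                       # scale by the light's color
        result = contribution if result is None else result + contribution
    return result
```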

It should be noted that, in some embodiments, the generated images may be stored in any suitable manner. For example, in some embodiments, the generated image may be stored as a new image in association with information associated with the generated image, such as information indicative of the lighting environment in which the object was relit to generate the generated image. Additionally or alternatively, in some embodiments, the generated image may be presented in any suitable manner. For example, in some embodiments, the generated image may be presented in a user interface (such as a user interface presented on a user device from which generation of a relit image using the trained neural network is initiated at block 106).

Turning to fig. 2, an illustrative example 200 of a process for generating training samples for training a neural network to generate a relit image is shown in accordance with some embodiments of the disclosed subject matter. In some embodiments, the blocks of process 200 may be performed by any suitable device. For example, in some embodiments, the blocks of process 200 may be performed by a device (such as a desktop or laptop computer) that controls the lights of the lighting structure and/or one or more cameras suitable for capturing images from any suitable viewpoint. In some embodiments, images captured during execution of the blocks of process 200 may be stored on the device executing process 200. Additionally or alternatively, in some embodiments, images captured during execution of the blocks of process 200 may be transmitted to a different device (e.g., a server storing an image database and/or any other suitable server) for storage.

Process 200 may begin at 202 by capturing a series of OLAT images of a particular image object. In some embodiments, the OLAT image may be captured using lights arranged on any suitable lighting structure. For example, in some embodiments, the lighting structure may be a spherical dome with any suitable number (e.g., one hundred, two hundred, three hundred, four hundred, etc.) of lamps disposed at any suitable location on the spherical dome. In some such embodiments, the spherical dome may have any suitable diameter (e.g., 3.5 meters, 4 meters, and/or any other suitable diameter). As another example, in some embodiments, the illumination structure may be an illumination arm that may be moved to different positions. It should be noted that in some embodiments, the lamp may be any suitable type of lamp. For example, in some embodiments, the lights may be LED lights that are each programmed to have any suitable RGB value. As another example, in some embodiments, the lamp may be a tungsten lamp. It should be noted that in some embodiments, each lamp may be controlled by a driver, which may cause the lamp to emit light having the indicated intensity and/or color.

In some embodiments, each OLAT image in the series of OLAT images may be captured when one light of the lighting structure is activated and the other lights of the lighting structure are not activated. In some embodiments, the activated light may be programmed to have any suitable color. For example, in some embodiments, the activated light may have a white color. As a more specific example, in some embodiments, the activated light may be programmed to have an RGB value of (255, 255, 255). In some embodiments, each OLAT image in the series of OLAT images may be captured with a different light of the lighting structure activated. For example, in some embodiments, the series of OLAT images may be captured by iterating through the lights of the lighting structure one at a time, where each image in the series of OLAT images corresponds to activation of a different light of the lighting structure.

In some embodiments, each OLAT image may be captured by any suitable camera.

In some embodiments, the camera may have any suitable viewpoint of the object. That is, in some embodiments, the camera may be located at any suitable orientation relative to the object. In some embodiments, process 200 may have access to any suitable number of cameras (e.g., one, two, five, ten, and/or any other suitable number) that are each positioned at a different orientation relative to the object to capture images of the object from different viewpoints. In some embodiments, each camera in the camera or set of cameras may be of any suitable model (e.g., a Sony IMX253 camera and/or any other suitable model).

In some embodiments, each camera may be capable of capturing images at any suitable resolution (e.g., 10MP, 12MP, 15MP, and/or any other suitable resolution) and at any suitable rate (e.g., 60Hz and/or any other suitable rate). It should be noted that in some embodiments, the camera(s) and lights of the lighting structure may be synchronized via any suitable type of hardware trigger.

At 204, the process 200 may capture a tracking frame image taken with all of the lights of the lighting structure activated. In some embodiments, by activating all of the lights of the lighting structure, an image of the object may be captured with uniform illumination.

In some embodiments, the tracking frame images may be used in any suitable manner to compensate for movement of the object between successive OLAT images, as described above in connection with block 202. For example, in some embodiments, a tracking frame image may be captured after a subset of the OLAT images has been captured. Process 200 may then loop back to block 202 to capture a second subset of the OLAT images, and may continue to iterate through blocks 202 and 204 until all OLAT images have been captured. As a more specific example, where there are 331 lights on the lighting structure, and therefore 331 OLAT images are to be captured, the process 200 may capture a tracking frame image after 11 OLAT images are captured, and may loop through blocks 202 and 204 after every 11 OLAT images.
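
A minimal sketch of the interleaved capture described in blocks 202 and 204 is shown below; `set_color()`, `capture()`, and the `lights` and `camera` objects are hypothetical hardware-control abstractions, not an actual API.

```python
def capture_olat_sequence(lights, camera, tracking_interval=11):
    """Sketch of interleaved OLAT and tracking-frame capture (hypothetical API)."""
    olat_images, tracking_frames = [], []
    for i, light in enumerate(lights):
        # One-light-at-a-time: only the current light is on, programmed to white.
        for other in lights:
            other.set_color((0, 0, 0))
        light.set_color((255, 255, 255))
        olat_images.append(camera.capture())

        # After every `tracking_interval` OLAT captures, record a fully lit tracking frame.
        if (i + 1) % tracking_interval == 0:
            for lamp in lights:
                lamp.set_color((255, 255, 255))
            tracking_frames.append((i, camera.capture()))
    return olat_images, tracking_frames
```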

In some embodiments, the tracking frames may be used in any suitable manner to compute the correspondence of the object across all OLAT images. For example, in some embodiments, after all OLAT images are captured, the process 200 may designate the final tracking frame (i.e., the most recently captured tracking frame image) as a reference, and then may use any suitable method or technique(s) to calculate a dense optical flow field across the tracking frames. In some embodiments, process 200 may then provide correspondence across the OLAT images by interpolating the optical flow field linearly in time. It should be noted that, in some embodiments, the correspondence across OLAT images may be provided at any suitable point in time, such as at a point in time after all OLAT images have been captured, before training of the neural network begins (as described below in connection with fig. 3), and/or at any other suitable point in time.
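
As an illustration of this correspondence computation, the sketch below computes dense optical flow between consecutive tracking frames and linearly interpolates it in time to the intervening OLAT frames; Farneback flow via OpenCV is used here only as one example of a suitable technique, and the interpolation scheme is an assumption.

```python
import cv2

def interpolated_flows(tracking_frames, num_olat_between):
    """Propagate correspondences to OLAT frames by linearly interpolating
    dense optical flow between consecutive tracking frames.

    tracking_frames: list of grayscale uint8 images, in capture order.
    num_olat_between: number of OLAT frames captured between consecutive
        tracking frames.
    """
    flows = []
    for prev, nxt in zip(tracking_frames[:-1], tracking_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 25, 3, 5, 1.2, 0)
        # Assume roughly linear motion between tracking frames, so each
        # intermediate OLAT frame receives a fraction of the full flow.
        for k in range(1, num_olat_between + 1):
            flows.append(flow * (k / (num_olat_between + 1)))
    return flows
```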

At 206, process 200 may capture a spherical color gradient image using the lights of the illumination structure. In some embodiments, process 200 may capture any suitable number of spherical color gradient images (e.g., one, two, five, and/or any other suitable number). In some embodiments, process 200 may use a camera to capture spherical color gradient images in any suitable orientation relative to the object. For example, in some embodiments, a spherical color gradient image may be captured using the same camera (and thus with the same viewpoint of the object) used to capture the OLAT image at block 202 and/or the tracking frame image at block 204.

In some embodiments, a spherical color gradient image may be captured with the lamps of the lighting structure programmed to have any suitable color(s). For example, in some embodiments, each lamp of the lighting structure may be programmed to emit a color based on the relative position of the lamp on the lighting structure. As a more specific example, in the case of capturing two spherical color gradient images, for a lamp whose direction vector is θ (where θx is the x-component of the direction vector θ, θy is the y-component of the direction vector θ, and θz is the z-component of the direction vector θ), the lamp can be programmed to emit, for the first of the two spherical color gradient images, a lamp color with the following RGB values:

((1+θx)/2, (1+θy)/2, (1+θz)/2),

and, for the second of the two spherical color gradient images, a lamp color with the following RGB values:

((1−θx)/2, (1−θy)/2, (1−θz)/2).

It should be noted that in some embodiments, the lamp pattern of the lamps used to capture the spherical color gradient images may be selected to satisfy any suitable criteria. For example, in some embodiments, the lamps may be programmed to emit colors for each spherical color gradient image such that, when summed together, the lamp patterns produce a fully-on white-light condition that shows the total reflection (i.e., diffuse plus specular reflection) of the object, and such that the difference of the spherical color gradient images encodes the average reflectance direction as RGB color channels (which may be a strong cue for the surface normal). Additionally, in some embodiments, the magnitude of the difference image relative to the sum image may be a function not only of the Bidirectional Reflectance Distribution Function (BRDF), but also of local self-occlusion, which may provide a hint for occlusion estimation. In some embodiments, the lamp patterns used during the capture of the spherical color gradient images may therefore provide both geometric and albedo information to the neural network.
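
A minimal numerical check of these two lamp patterns, written in Python/NumPy, is shown below; the function name is illustrative only.

import numpy as np

def gradient_lamp_colors(direction):
    """RGB colors (in [0, 1]) for one lamp with unit direction vector `direction`
    under the two spherical color gradient patterns described above."""
    d = np.asarray(direction, dtype=np.float64)
    first = (1.0 + d) / 2.0   # pattern 1: ((1+θx)/2, (1+θy)/2, (1+θz)/2)
    second = (1.0 - d) / 2.0  # pattern 2: ((1-θx)/2, (1-θy)/2, (1-θz)/2)
    return first, second

# Summing the two patterns gives (1, 1, 1) for every lamp (a fully-on white
# condition), while their difference equals the lamp direction itself, so the
# difference image encodes the average reflectance direction as RGB channels.
c1, c2 = gradient_lamp_colors([0.0, 0.0, 1.0])
assert np.allclose(c1 + c2, 1.0)
assert np.allclose(c1 - c2, [0.0, 0.0, 1.0])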

Turning to fig. 3, an illustrative example 300 of a process for training a neural network to generate a relight image of an object using a deep reflection field is shown in accordance with some embodiments of the disclosed subject matter. In some embodiments, the blocks of process 300 may be performed on any suitable device or any suitable combination of devices. For example, in some embodiments, the blocks of process 300 may be performed on any suitable number of Graphics Processing Units (GPUs), such as one, five, ten, twelve, and/or any other suitable number.

The process 300 may begin at 302 by identifying a set of training samples, where each training sample comprises a set of images of an object. As described above in connection with fig. 1 and 2, the set of images of the object corresponding to one training sample may include any suitable images, such as a series of OLAT images captured with different lights of the lighting structure activated as described above in connection with block 202, a series of spherical color gradient images as described above in connection with block 206, and/or any other suitable images. In some embodiments, each image in the set of images corresponding to one training sample may be captured from the same camera and thus may have the same viewpoint of the object. It should be noted that in some embodiments, different training samples may correspond to different objects and may be associated with images captured from different orientations relative to the object. In some embodiments, the process 300 may identify the training sample set in any suitable manner. For example, in some embodiments, process 300 may retrieve the set of images corresponding to each training sample from a database of images. It should be noted that in some embodiments, each image associated with the training sample set may have any suitable size (e.g., 2560 pixels by 3072 pixels, and/or any other suitable size).

At 304, the process 300 may generate a cropped image for each image associated with the training sample. In some embodiments, each cropped image may have any suitable size (e.g., 512 pixels by 512 pixels, and/or any other suitable size). In some embodiments, the process 300 may generate the cropped images corresponding to each image associated with the training sample in any suitable manner. For example, in some embodiments, the process 300 may identify a random pixel location within the images associated with the training sample, and may crop each of the images associated with the training sample to the cropped image size centered on the identified random pixel location.

It should be noted that in some embodiments, the process 300 may identify any suitable subset of the set of images associated with the training sample, and the process 300 may crop the images in the identified subset of the set of images. For example, in some embodiments, process 300 may crop any of the spherical color gradient images (e.g., all of the spherical color gradient images and/or any suitable subset of the spherical color gradient images). As another example, in some embodiments, the process 300 may identify any suitable subset of OLAT images, and the process 300 may crop the OLAT images in the identified suitable subset of OLAT images. In some such embodiments, the subset of the OLAT images may be identified in any suitable manner. For example, in some embodiments, a subset of the OLAT images may be randomly selected. In some embodiments, the subset of OLAT images may include any suitable number of OLAT images (e.g., five, ten, twenty, and/or any other suitable number). In some embodiments, the number of OLAT images in the subset of OLAT images may be based on the number of GPUs performing process 300. For example, where 12 GPUs are used to perform the process 300, the process 300 may select 12 OLAT images from an OLAT image group.
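
The following Python sketch shows one way the random cropping and the random OLAT subset selection described above could be combined; the crop size of 512 pixels and the subset size of 12 simply echo the examples in the text, and the function name is hypothetical.

import numpy as np

def make_training_crops(gradient_images, olat_images, crop_size=512,
                        num_olat=12, rng=None):
    """Crop all images of one training sample at the same random location,
    and randomly select a subset of the OLAT images (e.g., one per GPU)."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = gradient_images[0].shape[:2]
    half = crop_size // 2
    # Random crop center, kept far enough from the border for a full crop.
    cy = rng.integers(half, height - half)
    cx = rng.integers(half, width - half)

    def crop(image):
        return image[cy - half:cy + half, cx - half:cx + half]

    subset_indices = rng.choice(len(olat_images), size=num_olat, replace=False)
    cropped_gradients = [crop(img) for img in gradient_images]
    cropped_olats = {int(i): crop(olat_images[int(i)]) for i in subset_indices}
    return cropped_gradients, cropped_olats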

At 306, the process 300 may identify an illumination direction. In some embodiments, the illumination direction may be identified in any suitable manner. For example, in some embodiments, the process 300 may identify a random illumination direction. As a more specific example, in some embodiments, the process 300 may identify a random illumination direction corresponding to a location on a sphere associated with any suitable spherical coordinates. It should be noted that in some embodiments, the illumination direction may additionally indicate the distance of the light source from the object. That is, in some embodiments, the illumination direction may be indicated in spherical coordinates as (r, θ, φ), where r indicates the distance of the lamp from the object, and θ and φ indicate the relative orientation of the lamp and the object in angular coordinates. Additionally, it should be noted that in some embodiments, the illumination direction may be the direction and/or distance corresponding to the light source associated with one of the OLAT images. Further, where a subset of the OLAT images is identified at 304, the illumination direction may correspond to the direction and/or distance of a light source associated with an OLAT image in the subset of the OLAT images.
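
As one possible concretization, the sketch below picks the illumination direction by sampling one lamp from the OLAT subset chosen at block 304, so that a matching reference true-phase OLAT image is guaranteed to exist; the data layout (a mapping from lamp index to spherical coordinates) is an assumption for illustration.

import numpy as np

def pick_illumination_direction(light_coords, subset_indices, rng=None):
    """Pick the illumination direction for one training iteration.

    light_coords maps each lamp index to its spherical coordinates
    (r, theta, phi) relative to the object; subset_indices are the OLAT
    images selected for this iteration.  Choosing the direction of one of
    those lamps guarantees a matching reference true-phase OLAT image.
    """
    rng = np.random.default_rng() if rng is None else rng
    chosen = int(rng.choice(list(subset_indices)))
    r, theta, phi = light_coords[chosen]
    return chosen, (r, theta, phi)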

At 308, the process 300 may use the cropped image generated at block 304 and the illumination direction identified at block 306 as inputs to the neural network.

It should be noted that in some embodiments, the neural network may have any suitable architecture. For example, in some embodiments, the neural network may be a Convolutional Neural Network (CNN). As a more specific example, in some embodiments, the neural network may be a U-Net convolutional neural network. In some embodiments, the U-Net architecture can have an encoder path including a series of encoder layers and a decoder path having a series of decoder layers connected to the encoder path. In some such embodiments, the U-Net architecture can use the encoder path to iteratively generate a series of feature maps corresponding to the input image. The U-Net architecture may then use the decoder path to reconstruct an input image with a different illumination direction (e.g., the illumination direction identified at block 306) using the series of decoder layers. It should be noted that a more detailed technique for using the U-Net architecture is shown and described below in connection with fig. 6.

At 310, the process 300 may generate an output image corresponding to a region of the image included in the cropped image with illumination from the indicated illumination direction. For example, where each cropped image shows a particular portion of the face of the subject (e.g., the mouth of the subject, the nose of the subject, the shoulders of the subject, etc.), the output image may correspond to the same portion of the face of the subject and may be lit with illumination corresponding to the illumination direction indicated in the input, as described above in connection with block 308.

In some embodiments, process 300 may generate the output image in any suitable manner and using any suitable technique(s). For example, where the neural network is a U-Net convolutional neural network, the process 300 may iteratively generate a continuous feature map corresponding to the input image using the encoder layer of the U-Net convolutional neural network. In some embodiments, after iterating through all encoder layers of the U-Net convolutional neural network, the process 300 can traverse a series of decoder layers of the U-Net convolutional neural network using a feature map generated by an encoder of the U-Net convolutional neural network. In some embodiments, the final decoder of the U-Net convolutional neural network may generate an output image that is a reproduction of the input image that is illuminated from the indicated illumination direction.

Turning to fig. 6, an illustrative example 600 of a U-Net architecture that can be used to generate a relight image using deep reflection fields is shown in accordance with some embodiments of the disclosed subject matter. It should be noted that architecture 600 shows an illustrative example of an architecture that may be used at inference time, that is, after the neural network shown in architecture 600 has been trained, to generate output images for input test images. However, as described below, the same architecture may be used to train the neural network. It should be noted that the image sizes shown in fig. 6 may correspond to the size of the images (i.e., the uncropped images) used at inference time (e.g., after the network has been trained). Although the architecture 600 may be used during training of the neural network, in some embodiments, the training images may have a smaller size (i.e., the crop size, as described above in connection with block 304 of fig. 3).

The architecture 600 may receive as input a series of images and an illumination direction, as shown in FIG. 6. For example, the input may include a set of spherical color gradient images and a set of OLAT images, as described above in connection with blocks 302 and 304 of FIG. 3. It should be noted that in some embodiments, each image may be of any suitable size W × H. Additionally, it should be noted that in some embodiments, the illumination direction may be appended as RGB values to each pixel of the spherical color gradient images. Thus, where two spherical color gradient images are used (as described above in connection with block 206 of fig. 2), the two spherical color gradient images may be combined with the illumination direction to generate a tensor of size W × H × 9. That is, each pixel of each spherical color gradient image may be associated with the RGB values of that pixel, such that each of the two spherical color gradient images has a size of W × H × 3, and the illumination direction may be appended to each pixel of the spherical color gradient images, where the illumination direction itself is specified as RGB channel values for each pixel. It should be noted that for the test images used with the trained neural network at inference time, the images may each have a first width and height (e.g., 2560 pixels by 3072 pixels, as shown in fig. 6). However, during training of the neural network, each image may be a different cropped image of a smaller size (e.g., 512 pixels by 512 pixels, and/or any other suitable crop size). In addition, it should be noted that the inputs as shown in fig. 6 only include the illumination direction and the spherical color gradient images, since these are the inputs used by the trained neural network at inference time. During training of the neural network, the network may additionally take as input the OLAT images or the identified subset of OLAT images, as described above in connection with block 202 of fig. 2 and blocks 302 and 304 of fig. 3.
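
A minimal sketch of how the two spherical color gradient images and the illumination direction could be packed into the W × H × 9 input tensor is shown below (NumPy, channels-last layout assumed for illustration).

import numpy as np

def build_network_input(gradient_image_1, gradient_image_2, direction_rgb):
    """Stack two H x W x 3 spherical color gradient images with the
    illumination direction (encoded as an RGB triple appended to every
    pixel) into a single H x W x 9 input tensor."""
    height, width, _ = gradient_image_1.shape
    direction_plane = np.broadcast_to(
        np.asarray(direction_rgb, dtype=np.float32), (height, width, 3))
    return np.concatenate(
        [gradient_image_1.astype(np.float32),
         gradient_image_2.astype(np.float32),
         direction_plane], axis=-1)  # shape: (H, W, 9)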

As shown in fig. 6, architecture 600 may have an encoder path 602. In some embodiments, the encoder path 602 may include a series of encoder layers. For example, architecture 600 shows eight encoder layers in encoder path 602. In some embodiments, each encoder layer may take an image as input and may generate a feature map of the image. In some embodiments, the generated feature map may then become an input image that is fed to the next encoder layer. In some embodiments, each generated feature map of the input image may have a smaller width and height relative to the input to the encoder layer, but a greater number of channels (e.g., greater depth) relative to the input to the encoder layer. For example, as shown in fig. 6, the first encoder layer may take an input image of size 2560 × 3072 and may generate a feature map of size 1280 × 1536.

As shown in fig. 6, architecture 600 may have a decoder path 612. In some embodiments, the decoder path 612 may include a series of decoder layers. In some embodiments, the number of decoder layers included in the decoder path 612 may be the same as the number of encoder layers included in the encoder path 602. For example, as shown in fig. 6, there are eight decoder layers included in the decoder path 612. In some embodiments, each decoder layer may receive a feature map as an input and generate an output image that is larger in size than the size of the input feature map. For example, as shown in fig. 6, the final decoder layer may receive as input a feature map of size 1280 × 1536, and may generate as output an image of size 2560 × 3072.

It should be noted that in some embodiments, the neural network may learn the features represented in the image by passing the set of input images through the series of encoder layers, and may learn to reconstruct the image using the feature map, but with the illumination direction indicated in the input, by passing the feature map through the series of decoder layers.

In some embodiments, each encoder layer in the encoder path 602 may have a convolution block 604, a rectifier block 606, a normalization block 608, and a max-pooling block 610.

In some embodiments, the convolution block 604 may include applying any suitable convolution kernel of any suitable size (e.g., 3 × 3, and/or any other suitable size) to the image at the input of the encoder layer. It should be noted that in some embodiments, the convolution kernel applied at each convolution block may be changed as the loss function is optimized during training of the neural network, as described below in connection with block 312.

In some embodiments, the rectifier block 606 may include applying any suitable rectifier function or activation function to the output of the convolution block 604. In some embodiments, any suitable type of rectifier or activation function may be used, such as a softplus rectifier, a half-wave rectifier, a logistic rectifier, a hyperbolic tangent activation function, a rectified linear unit (ReLU) activation function, a leaky ReLU activation function, and/or any other suitable type of rectifier or activation function.

In some embodiments, normalization block 608 may perform any suitable normalization on the output of rectifier block 606. For example, in some embodiments, the normalization block 608 may normalize the pixel values to any suitable range. In some embodiments, the normalization may be a batch normalization across all images associated with the training samples.

In some embodiments, the max-pooling block 610 may pool values within the image at the output of the normalization block 608 to downsample the image. In some embodiments, the max pooling may be performed in any suitable manner and using blocks of any suitable size. For example, in some embodiments, a 2 × 2 kernel may be applied to blocks of the image, where the maximum value in each 2 × 2 block is selected as the output value, thereby downsampling the image while selecting the relatively important features to be retained in the output feature map.
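
The following PyTorch sketch shows one encoder layer assembled from the blocks just described; the specific channel counts, the leaky-ReLU choice, and batch normalization are illustrative assumptions within the ranges the text allows.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer as described above: 3x3 convolution, activation,
    normalization, then 2x2 max pooling that halves the spatial size."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.activation = nn.LeakyReLU(0.2)   # any suitable rectifier
        self.norm = nn.BatchNorm2d(out_channels)
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, x):
        features = self.norm(self.activation(self.conv(x)))
        # Return both the pooled output (fed to the next encoder layer) and
        # the pre-pooling features (kept for the skip connection).
        return self.pool(features), features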

In some embodiments, the decoder layer of the decoder path 612 may include a convolution block 614, a rectifier block 616, a normalization block 618, and an upsampling block 620. In some embodiments, the convolution block 614, rectifier block 616, and normalization block 618 may be similar to the convolution block 604, rectifier block 606, and normalization block 608, respectively.

In some embodiments, the upsampling block 620 may take a feature map as input and may generate a larger image relative to the size of the input feature map. For example, as shown in fig. 6, the upsampling block 620 may take a feature map of size 160 × 192 as an input and may generate an image of size 320 × 384 as an output. In some embodiments, the upsampling block 620 may upsample in any suitable manner and using any suitable upsampling technique. For example, in some embodiments, the upsampling block 620 may use bilinear upsampling. In some embodiments, any other suitable upsampling technique may be used, such as transposed convolution (e.g., using a 2 × 2 convolution kernel, and/or any other suitable convolution kernel), unpooling, interpolation (e.g., bilinear interpolation, cubic interpolation, nearest neighbor interpolation, and/or any other suitable interpolation), and/or any other suitable upsampling technique.

It should be noted that, as shown in fig. 6, the output of the upsampling block 620 may be concatenated with the feature map from the encoder layer of the same depth using a skip connection 622. In some embodiments, by propagating the feature map from the encoder layer to the decoder layer at the corresponding depth, the decoder layer is able to use the feature map generated by the encoder layer at the corresponding depth.
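
A corresponding decoder layer, again as an illustrative PyTorch sketch, combines the convolution, rectifier, and normalization blocks with bilinear upsampling and a skip concatenation of the encoder feature map at the same depth.

import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: convolution, activation, normalization, bilinear
    upsampling, then concatenation with the encoder feature map of the
    same depth (the skip connection)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.activation = nn.LeakyReLU(0.2)
        self.norm = nn.BatchNorm2d(out_channels)
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear',
                                    align_corners=False)

    def forward(self, x, skip_features):
        x = self.norm(self.activation(self.conv(x)))
        x = self.upsample(x)
        # Skip connection: concatenate along the channel dimension so the
        # decoder can reuse features computed by the encoder at this depth.
        return torch.cat([x, skip_features], dim=1)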

Referring again to fig. 3, at 312, the process 300 may calculate a loss based on the output image generated at block 310 (referred to herein as I_pred) and a reference true-phase (ground-truth) OLAT image. In some embodiments, the process 300 may identify the reference true-phase OLAT image (referred to herein as I_GT) by identifying the OLAT image, included in the set of images corresponding to the training sample, that was captured with the light from the direction corresponding to the illumination direction identified at block 306. For example, if the current run of the neural network has the illumination direction identified at block 306 as the lamp associated with spherical coordinates (3.5, 30°, 60°), then the process 300 may identify the OLAT image having the same illumination direction.

In some embodiments, the process 300 may calculate a loss indicating the difference between I_pred and I_GT in a feature space (that is, rather than calculating a pixel-by-pixel difference). For example, the loss can be defined as:

L(I_pred, I_GT) = ||Perc(I_pred) − Perc(I_GT)||_2,

where Perc(·) indicates a mapping into the perceptual feature space.

In some embodiments, the loss function L may be further divided into two loss components, L_pre-trained and L_specific. In some embodiments, the two loss components may be combined as:

L = L_pre-trained + λ·L_specific,

where λ may be a weight of any suitable value (e.g., 0.5, and/or any other suitable value). In some embodiments, L_pre-trained and L_specific may each indicate the loss, or the difference between I_pred and I_GT, based on different image details. For example, in some embodiments, L_pre-trained may indicate the loss based on a first type of characteristic or information associated with the image (such as texture information, and/or any other suitable type of image characteristic or information). As another example, in some embodiments, L_specific may indicate the loss based on a second type of characteristic or information associated with the image (such as specular reflection (specularity), and/or any other suitable type of image characteristic or information). Note that, in some embodiments, L_pre-trained may indicate the loss based on relatively low-frequency detail compared to the detail represented by L_specific.

In some embodiments, each of L_pre-trained and L_specific may be determined using a trained convolutional neural network (e.g., a VGG network and/or any other suitable network), referred to herein as a loss network. In some embodiments, the loss (e.g., L_pre-trained or L_specific) may be calculated using the activation layers of a trained loss network. For example, in some embodiments, the loss network may be trained using any suitable training images. After training the loss network, the loss specific to I_pred may be calculated by computing, at different activation layers of each trained loss network, the L2 norm of the feature-space difference between I_pred and I_GT. It should be noted that, in some embodiments, I_pred and I_GT may be OLAT images, as described above. In some embodiments, any suitable number of activation layers may be used for each trained loss network (e.g., five, and/or any other suitable number), and these activation layers may be combined in any suitable manner (e.g., as a weighted sum, and/or in any other suitable manner). Additionally, it should be noted that, in some embodiments, the activations may be rescaled in any suitable manner (e.g., by feature length, and/or in any other suitable manner) to ensure that each activation layer contributes in the same way to the final loss.

In some embodiments, the loss network used for calculating L_pre-trained may be pre-trained using any suitable image database (such as ImageNet, and/or any other suitable image database) in order to compute the perceptual loss between I_pred and I_GT. In some embodiments, the loss network used for calculating L_pre-trained may use a VGG architecture and/or any other suitable convolutional neural network architecture having any suitable number of convolutional layers.

In some embodiments, the loss network used for calculating L_specific may have any suitable architecture, such as a VGG architecture and/or any other suitable convolutional neural network architecture having any suitable number of convolutional layers. In some embodiments, a task that regresses high-frequency details (such as specular reflections) may be used to train the loss network used for calculating L_specific. For example, in some embodiments, the loss network used for calculating L_specific may be trained using patches taken from the reference true-phase OLAT images I_GT as input, where the model is trained to determine which light direction generated the input patch. In some embodiments, the loss network may be trained such that an L2 loss or an L1 loss is minimized. It should be noted that, in some embodiments, minimizing the L1 loss may produce sharper results for the image-to-image conversion task.
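
The sketch below shows one way the combined loss L = L_pre-trained + λ·L_specific could be assembled in PyTorch: the pre-trained term uses features from an ImageNet-trained VGG-16, while specific_net stands in for the separately trained light-direction regression network described above. The particular VGG activation layers, the per-layer rescaling, and λ = 0.5 are assumptions for illustration.

import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """Combined loss L = L_pre-trained + lambda * L_specific, computed as L2
    distances between activations of feature-extraction networks."""

    def __init__(self, specific_net, lam=0.5, vgg_layers=(3, 8, 15, 22, 29)):
        super().__init__()
        # ImageNet-pretrained VGG-16 features (torchvision >= 0.13 API).
        self.vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.specific_net = specific_net  # stands in for the light-direction regressor
        self.lam = lam
        self.vgg_layers = set(vgg_layers)

    def _vgg_features(self, image):
        # Inputs are assumed to be 3-channel images already normalized for VGG.
        feats, x = [], image
        for idx, layer in enumerate(self.vgg):
            x = layer(x)
            if idx in self.vgg_layers:
                # Rescale by feature length so every layer contributes equally.
                feats.append(x / (x.norm() + 1e-8))
        return feats

    def forward(self, pred, target):
        loss_pretrained = sum(
            torch.norm(fp - ft) for fp, ft in
            zip(self._vgg_features(pred), self._vgg_features(target)))
        loss_specific = torch.norm(
            self.specific_net(pred) - self.specific_net(target))
        return loss_pretrained + self.lam * loss_specific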

It should be noted that, in some embodiments, I_GT and I_pred may be aligned in any suitable manner to prevent slight pixel misalignment between the two images from complicating the loss calculation. In some embodiments, any suitable alignment technique may be used to align I_GT and I_pred. For example, in some embodiments, the offsets in the x and y directions (referred to herein as x′ and y′) may be calculated as:

x′, y′ = argmin_{x,y} Σ_{u,v} ||I_GT(u − x, v − y) − I_pred(u, v)||_1,

where I(u, v) indicates an intensity value at pixel position (u, v). In some embodiments, any suitable window (e.g., a [−20, 20] × [−20, 20] window, and/or any other suitable window) may be used to sample the offsets x and y. In some embodiments, the offsets x′ and y′ may be used to generate an aligned reference true-phase image, I_GT(u − x′, v − y′). In some embodiments, as described above, the aligned reference true-phase image may then be used to calculate the loss (e.g., L_specific and/or L_pre-trained).
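
A brute-force Python/NumPy version of this alignment search is sketched below; it uses the [−20, 20] window from the example above and simple wrap-around shifting for the final aligned image, and the function name is illustrative only.

import numpy as np

def align_ground_truth(i_gt, i_pred, window=20):
    """Find offsets (x', y') in a [-window, window] square that minimize the
    summed absolute difference between the shifted I_GT and I_pred, then
    return I_GT shifted by those offsets."""
    h, w = i_pred.shape[:2]
    best = (0, 0)
    best_cost = np.inf
    # Compare only the interior region so every candidate shift stays in bounds.
    target = i_pred[window:h - window, window:w - window]
    for x in range(-window, window + 1):
        for y in range(-window, window + 1):
            shifted = i_gt[window - y:h - window - y, window - x:w - window - x]
            cost = np.abs(shifted.astype(np.float64) - target).sum()
            if cost < best_cost:
                best_cost, best = cost, (x, y)
    x_opt, y_opt = best
    # np.roll wraps at the border; the wrapped strip is negligible for small offsets.
    aligned = np.roll(i_gt, shift=(y_opt, x_opt), axis=(0, 1))
    return aligned, (x_opt, y_opt)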

Referring again to fig. 3, the process 300 may update any suitable weights or parameters based on the calculated loss to minimize the loss function, and may loop back to block 304 and may generate a cropped image for the new training sample.

It should be noted that process 300 may optimize the loss function using any suitable technique or combination of techniques. For example, in some embodiments, the process 300 may use any suitable optimizer (e.g., ADAM, and/or any other suitable optimizer). As another example, in some embodiments, the process 300 may use any suitable learning rate (e.g., 1e-4, and/or any other suitable learning rate). In some embodiments, the process 300 may use an exponential decay of the learning rate with any suitable decay rate (e.g., 0.1, and/or any other suitable decay rate) over any suitable number of iterations (e.g., 1e6 iterations, and/or any other suitable number). In some embodiments, the neural network may be optimized over any suitable number of iterations until the training converges.
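
For completeness, the optimizer setup described above (ADAM, a 1e-4 learning rate, and an exponential decay of 0.1 over 1e6 iterations) could be expressed in PyTorch roughly as follows; this is a sketch rather than the exact training configuration.

import torch

def build_optimizer(model, base_lr=1e-4, decay_rate=0.1, decay_steps=1e6):
    """Adam optimizer with an exponential learning-rate decay: the rate is
    multiplied by decay_rate over every decay_steps training iterations,
    applied smoothly per step."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: decay_rate ** (step / decay_steps))
    return optimizer, scheduler

# Typical usage inside the training loop (sketch):
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()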

Turning to fig. 4, an illustrative example 400 of hardware for re-illuminating an image using a predicted deep reflection field is shown that may be used in accordance with some embodiments of the disclosed subject matter. As illustrated, hardware 400 may include a server 402, a communication network 404, and/or one or more user devices 406, such as user devices 408 and 410.

Server 402 may be any suitable server(s) for storing information, data, programs, and/or any other suitable type of content. In some embodiments, the server 402 may perform any suitable functions. For example, in some embodiments, the server 402 may be used to train a neural network to re-illuminate an image using a predicted deep reflectance field, as described above in connection with fig. 1 and 3. As another example, in some embodiments, server 402 may be used to generate any suitable training samples for use by a neural network, as described above in connection with fig. 2. It should be noted that in some embodiments, the server 402 may store images that are used to train the neural network on different devices. In some such embodiments, the server 402 may transmit the images used to train the neural network to the device in any suitable manner.

In some embodiments, the communication network 404 may be any suitable combination of one or more wired and/or wireless networks. For example, the communication network 404 may include any one or more of the internet, an intranet, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a Digital Subscriber Line (DSL) network, a frame relay network, an Asynchronous Transfer Mode (ATM) network, a Virtual Private Network (VPN), and/or any other suitable communication network. The user device 406 may be connected to the communication network 404 via one or more communication links (e.g., communication link 412), and the communication network 404 may be linked to the server 402 via one or more communication links (e.g., communication link 414). The communication links may be any communication links suitable for communicating data between user devices 406 and server 402, such as network links, dial-up links, wireless links, hardwired links, any other suitable communication links, or any suitable combination of such links.

The user device 406 may include any user device or devices suitable for training a neural network to relight images, suitable for generating relight images using a trained neural network, transmitting instructions to a server (e.g., server 402 described above) to train a neural network using a server, and/or transmitting instructions to a server to generate relight images using a trained neural network. For example, in some embodiments, the user device 406 may perform any of the blocks of the processes 100, 200, and 300 as shown in fig. 1-3 and described above in connection with fig. 1-3. As another example, in some embodiments, the user device 406 may transmit instructions to the server 402 to perform any of the functions described above in connection with fig. 1 and 3. In some embodiments, user device 406 may comprise any suitable type of device. For example, in some embodiments, user device 406 may comprise a desktop computer, a laptop computer, a mobile phone, a tablet computer, and/or any other suitable type of user device.

Although the server 402 is illustrated as one device, in some embodiments any suitable number of devices may be used to perform the functions performed by the server 402. For example, in some embodiments, multiple devices may be used to implement the functions performed by server 402.

Although two user devices 408 and 410 are shown in fig. 4 to avoid complicating the figure, in some embodiments, any suitable number of user devices and/or any suitable type of user devices may be used.

In some embodiments, the server 402 and the user device 406 may be implemented using any suitable hardware. For example, in some embodiments, devices 402 and 406 may be implemented using any suitable general purpose or special purpose computer. For example, a mobile phone may be implemented using a special purpose computer. Any such general purpose or special purpose computer may include any suitable hardware. For example, as illustrated in the example hardware 500 of fig. 5, such hardware may include a hardware processor 502, memory and/or storage 504, an input device controller 506, an input device 508, a display/audio driver 510, display and audio output circuitry 512, communication interface(s) 514, an antenna 516, and a bus 518.

In some embodiments, hardware processor 502 may include any suitable hardware processor, such as a microprocessor, microcontroller, digital signal processor(s), dedicated logic, and/or any other suitable circuitry for controlling the functions of a general purpose or special purpose computer. In some embodiments, hardware processor 502 may be controlled by a server program stored in memory and/or storage of a server (such as server 402). In some embodiments, the hardware processor 502 may be controlled by computer programs stored in memory and/or storage 504 of the user device 406.

In some embodiments, memory and/or storage 504 may be any suitable memory and/or storage for storing programs, data, and/or any other suitable information. For example, memory and/or storage 504 may include random access memory, read only memory, flash memory, hard disk memory, optical media, and/or any other suitable memory.

In some embodiments, the input device controller 506 may be any suitable circuitry for controlling and receiving input from one or more input devices 508. For example, the input device controller 506 may be circuitry for receiving input from a touch screen, from a keyboard, from one or more buttons, from voice recognition circuitry, from a microphone, from a camera, from an optical sensor, from an accelerometer, from a temperature sensor, from a near field sensor, from a pressure sensor, from an encoder, and/or any other type of input device.

In some embodiments, display/audio driver 510 may be any suitable circuitry for controlling and driving the output of one or more display/audio output devices 512. For example, the display/audio driver 510 may be circuitry for driving a touch screen, a flat panel display, a cathode ray tube display, a projector, one or more speakers, and/or any other suitable display and/or presentation device.

Communication interface(s) 514 may be any suitable circuitry for interfacing with one or more communication networks, such as computer network 404. For example, interface(s) 514 may include network interface card circuitry, wireless communication circuitry, and/or any other suitable type of communication network circuitry.

In some embodiments, antenna 516 may be any suitable antenna or antennas for wireless communication with a communication network (e.g., communication network 404). In some embodiments, antenna 516 may be omitted.

In some embodiments, bus 518 may be any suitable mechanism for communicating between two or more components 502, 504, 506, 510, and 514.

According to some embodiments, any other suitable components may be included in hardware 500.

In some embodiments, at least some of the above-described blocks of the processes of fig. 1-3 may be performed or carried out in any order or sequence, which is not limited to the order and sequence shown in and described in connection with the figures. Further, some of the above blocks of fig. 1-3 may be performed or carried out substantially simultaneously or in parallel where appropriate to reduce latency and processing time. Additionally or alternatively, some of the above-described blocks of the processes of fig. 1-3 may be omitted.

In some embodiments, any suitable computer-readable medium may be used to store instructions for performing the functions and/or processes herein. For example, in some embodiments, the computer-readable medium may be transitory or non-transitory. For example, non-transitory computer-readable media may include media such as non-transitory forms of magnetic media (such as hard disks, floppy disks, and/or any other suitable magnetic media), non-transitory forms of optical media (such as compact discs, digital video discs, Blu-ray discs, and/or any other suitable optical media), non-transitory forms of semiconductor media (such as flash memories, electrically programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), and/or any other suitable semiconductor media), any suitable media that are not fleeting or devoid of any semblance of permanence during transmission, and/or any suitable tangible media. As another example, transitory computer-readable media may include signals on networks, in wires, conductors, optical fibers, circuits, any suitable media that are fleeting and devoid of any semblance of permanence during transmission, and/or any suitable intangible media.

Accordingly, methods, systems, and media for re-illuminating an image using a predicted deep reflection field are provided.

While the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention may be made without departing from the spirit and scope of the invention, which is limited only by the claims which follow. The features of the disclosed embodiments may be combined and rearranged in various ways.
