Neural Super-sampling for Real-time Rendering
In one embodiment, a method includes receiving a pair of stereo images having a resolution lower than a target resolution, generating an initial first feature map for a first image of the pair based on first channels associated with the first image and generating an initial second feature map for a second image of the pair based on second channels associated with the second image, generating a first feature map based on combining the first channels with the initial first feature map, generating a second feature map based on combining the second channels with the initial second feature map, up-sampling the first feature map and the second feature map to the target resolution, warping the up-sampled second feature map, and generating a reconstructed image corresponding to the first image having the target resolution based on the up-sampled first feature map and the up-sampled and warped second feature map.
This application is a continuation under 35 U.S.C. § 120 of U.S. patent application Ser. No. 17/039,263, filed 30 Sep. 2020, which claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 63/027,258, filed 19 May 2020, which is incorporated herein by reference.
TECHNICAL FIELD
This disclosure generally relates to graphic rendering, and in particular relates to graphic rendering for graphics applications.
BACKGROUND
Virtual reality (VR) is a simulated experience that can be similar to or completely different from the real world. Applications of virtual reality can include entertainment (e.g., video games) and educational purposes (e.g., medical or military training). Other, distinct types of VR-style technology include augmented reality and mixed reality. Currently, standard virtual reality systems use either virtual reality headsets or multi-projected environments to generate realistic images, sounds and other sensations that simulate a user's physical presence in a virtual environment. A person using virtual reality equipment is able to look around the artificial world, move around in it, and interact with virtual features or items. The effect is commonly created by VR headsets consisting of a head-mounted display with a small screen in front of the eyes, but can also be created through specially designed rooms with multiple large screens. Virtual reality typically incorporates auditory and video feedback, but may also allow other types of sensory and force feedback through haptic technology.
Augmented reality (AR) is an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, sometimes across multiple sensory modalities, including visual, auditory, haptic, somatosensory and olfactory. AR can be defined as a system that fulfills three basic features: a combination of real and virtual worlds, real-time interaction, and accurate 3D registration of virtual and real objects. The overlaid sensory information can be constructive (i.e. additive to the natural environment), or destructive (i.e. masking of the natural environment). This experience is seamlessly interwoven with the physical world such that it is perceived as an immersive aspect of the real environment. In this way, augmented reality alters one's ongoing perception of a real-world environment, whereas virtual reality completely replaces the user's real-world environment with a simulated one.
SUMMARY OF PARTICULAR EMBODIMENTS
Due to higher resolutions and refresh rates, as well as more photorealistic effects, real-time rendering has become increasingly challenging for video games, emerging virtual/augmented reality headsets, and other graphics applications. To meet this demand, modern graphics hardware and game engines often reduce the computational cost by rendering at a lower resolution and then up-sampling to the native resolution. Following the recent advances in image and video super-resolution in computer vision, the embodiments disclosed herein propose a machine learning approach that is specifically tailored for high-quality up-sampling of rendered content in real-time applications including video games, virtual reality, augmented reality, mixed reality, or any suitable graphics applications. One insight of the embodiments disclosed herein may be that in rendered content, the image pixels are point-sampled, but precise temporal dynamics is available. The embodiments disclosed combine this specific information that is typically available in modern renderers (i.e., depth and dense motion vectors) with a novel temporal network design that takes into account such specifics and is aimed at maximizing video quality while delivering real-time performance. By training on a large synthetic dataset rendered from multiple 3D scenes with recorded camera motion, the embodiments disclosed demonstrate high fidelity and temporally stable results in real time, even in the highly challenging 4×4 up-sampling scenario, significantly outperforming existing super-resolution and temporal antialiasing work.
In particular embodiments, a computing system may receive a first frame and one or more second frames of a video having a resolution lower than a target resolution. The first frame may be associated with a first time and each second frame may be associated with a second time prior to the first time. The computing system may generate a first feature map for the first frame and one or more second feature maps for the one or more second frames. In particular embodiments, the computing system may then up-sample the first feature map and the one or more second feature maps to the target resolution. The computing system may warp each of the one or more up-sampled second feature maps according to a motion estimation between the associated second time and the first time. The computing system may further generate a reconstructed frame corresponding to the first frame by using a machine-learning model to process the up-sampled first feature map and the one or more up-sampled and warped second feature maps, the reconstructed frame having the target resolution.
Embodiments of the invention may include or be implemented in conjunction with an artificial reality system. Artificial reality is a form of reality that has been adjusted in some manner before presentation to a user, which may include, e.g., a virtual reality (VR), an augmented-reality (AR), a mixed reality (MR), a hybrid reality, or some combination and/or derivatives thereof. Artificial reality content may include completely generated content or generated content combined with captured content (e.g., real-world photographs). The artificial reality content may include video, audio, haptic feedback, or some combination thereof, and any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer). Additionally, in some embodiments, artificial reality may be associated with applications, products, accessories, services, or some combination thereof, that are, e.g., used to create content in an artificial reality and/or used in (e.g., perform activities in) an artificial reality. The artificial reality system that provides the artificial reality content may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, or any other hardware platform capable of providing artificial reality content to one or more viewers.
The embodiments disclosed herein are only examples, and the scope of this disclosure is not limited to them. Particular embodiments may include all, some, or none of the components, elements, features, functions, operations, or steps of the embodiments disclosed herein. Embodiments according to the invention are in particular disclosed in the attached claims directed to a method, a storage medium, a system and a computer program product, wherein any feature mentioned in one claim category, e.g. method, can be claimed in another claim category, e.g. system, as well. The dependencies or references back in the attached claims are chosen for formal reasons only. However any subject matter resulting from a deliberate reference back to any previous claims (in particular multiple dependencies) can be claimed as well, so that any combination of claims and the features thereof are disclosed and can be claimed regardless of the dependencies chosen in the attached claims. The subject-matter which can be claimed comprises not only the combinations of features as set out in the attached claims but also any other combination of features in the claims, wherein each feature mentioned in the claims can be combined with any other feature or combination of other features in the claims. Furthermore, any of the embodiments and features described or depicted herein can be claimed in a separate claim and/or in any combination with any embodiment or feature described or depicted herein or with any of the features of the attached claims.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Real-time rendering for modern desktop, mobile, virtual reality, augmented reality applications, or any suitable graphics application may be challenging due to increasing display resolutions and demands for photorealistic visual quality. As an example and not by way of limitation, a virtual reality (VR) headset or an augmented reality (AR) headset may require rendering 2880×1600 pixels at 90-144 Hz and recent gaming monitors may support 3840×2160 resolution at 144 Hz, which, together with the recent advances in physically based shading and real-time ray tracing, may set a high demand on computational power for high-quality rendering.
A multitude of techniques have been introduced to address this problem in recent years. One technique applies fixed foveated rendering, for which peripheral regions are rendered at low resolution. Another technique employs gaze-contingent foveated reconstruction by rendering non-uniform sparse pixel samples followed by neural reconstruction. Another technique introduces the temporal antialiasing upscaling (TAAU) method which utilizes pixel color statistics and temporal accumulation for super-sampling. Variable rate shading has been introduced recently to accelerate rendering by reducing the shading complexity for foveated and high-resolution displays. Another technique has recently released deep-learned super-sampling (DLSS) that up-samples low-resolution rendered content with a neural network in real-time. However, these methods either introduce obvious visual artifacts into the up-sampled images, especially at up-sampling ratios higher than 2×2, or rely on proprietary technologies and/or hardware that may be unavailable on all platforms.
The embodiments disclosed herein introduce a method that may be easy to integrate with modern game engines and may require no special hardware (e.g., eye tracking) or software (e.g., proprietary drivers for DLSS), making it applicable to a wider variety of existing software platforms, acceleration hardware and displays. In particular embodiments, a computing system may take common inputs from modern game engines, i.e., color, depth and motion vectors at a lower resolution, and significantly up-sample the input imagery to the target high resolution using a temporal convolutional neural network. Unlike most existing real-time super-sampling methods, which typically aim for no more than 2×2 up-sampling in practice, the embodiments disclosed herein may allow for compelling 4×4 up-sampling from highly aliased input and produce high fidelity and temporally stable results in real-time.
While prominent advances have been demonstrated for photographic image and video up-sampling with deep learning techniques, these methods may not apply to rendered content. The fundamental difference in image formation between rendered and photographic images may be that each sample in the rendering is a point sample in both space and time, in contrast to a pixel area integral in photographic images. Therefore, the rendered content may be highly aliased, especially at a low resolution. This may make up-sampling for rendered content both an antialiasing and an interpolation problem, rather than the deblurring problem studied in existing super-resolution work in the computer vision community. On the other hand, pixel samples in real-time rendering may be accurate, and more importantly, motion vectors (i.e., geometric correspondences between pixels in sequential frames) may be available nearly for free at subpixel precision. These inputs may bring both new benefits and challenges into the super-resolution problem for rendering, which motivates the embodiments disclosed herein to revisit the deep learning techniques for rendering.
Large datasets may be necessary for training robust networks. To train for temporal stability, the datasets should also represent realistic camera motions (e.g., with large rotation and translation). The embodiments disclosed herein found that no existing datasets may satisfy our requirements. Therefore, the embodiments disclosed herein build a large-scale dataset generation pipeline in Unity (i.e., a cross-platform game engine), replay head motion captured from VR user studies, and render color, depth and motion vectors for thousands of frames for each of our representative dynamic scenes. This new dataset may enable us to train and test neural networks on realistic use cases, including the disclosed architecture herein and existing learned super-resolution methods. With such comparisons, the embodiments disclosed herein demonstrate that our network significantly outperforms prior state-of-the-art learned super-resolution and temporal antialiasing upscaling work.
The technical contributions of the embodiments disclosed herein may be summarized as follows.
- The embodiments disclosed herein introduce a temporal neural network tailored for image super-sampling of rendered content that employs rich rendering attributes (i.e., color, depth, and motion vectors) and that is optimized for real-time applications including video games, virtual reality, augmented reality, mixed reality, or any suitable graphics applications.
- The embodiments disclosed herein demonstrate the first learned super-sampling method that achieves significant 4×4 super-sampling with high spatial and temporal fidelity.
- The embodiments disclosed herein significantly outperform prior work, including real-time temporal antialiasing upscaling and state-of-the-art image and video super-resolution methods, both in terms of visual fidelity and quantitative metrics of image quality.
In rendering, a motion vector points at an analytically computed screen-space location where a 3D point that is visible at the current frame may appear in the previous frame, with a subpixel precision, as shown in
In particular embodiments, a computing system may first warp previous frames to align with the current frame, in order to reduce the required receptive field and complexity of the reconstruction network. In contrast to existing work, however, to better exploit the specifics of rendered data, i.e., point-sampled colors and subpixel-precise motion vectors, the computing system may apply the frame warping at the target (high) resolution space rather than at the input (low) resolution. In particular embodiments, up-sampling the first feature map and the one or more second feature maps to the target resolution may be based on zero up-sampling. Specifically, the computing system may project the input pixels to the high-resolution space, prior to the warping, by zero-upsampling.
As the rendered motion vectors do not reflect disocclusion or shading changes between frames, the warped previous frames may contain invalid pixels mismatching with the current frame, which may mislead the post-reconstruction. To address this problem, the embodiments disclosed herein include a reweighting mechanism before the reconstruction network to ideally de-select those invalid pixels. The reweighting mechanism may be related to the confidence map approaches used for multi-frame blending in various applications. In contrast to these methods, however, the computing system may utilize a neural network to learn the reweighting weights.
Lastly, the preprocessed previous frames (after zero-upsampling, warping and reweighting) may be stacked together with the current frame (after zero-upsampling), and fed into a reconstruction network for generating the desired high-resolution image.
Feature Extraction. In particular embodiments, generating the first feature map for the first frame and the one or more second feature maps for the one or more second frames may be based on one or more convolutional neural networks. The feature extraction module may contain a 3-layer convolutional neural network. This subnetwork may process each input frame individually and share weights across all frames except for the current frame. In particular embodiments, generating each of the first feature map for the first frame and the one or more second feature maps for the one or more second frames may comprise learning an initial feature map for each of the first frame and the one or more second frames and combining the initial feature map, a corresponding input color, and a corresponding depth for each of the first frame and the one or more second frames to generate each of the first feature map and the one or more second feature maps. In particular embodiments, the initial feature map may be based on a first number of channels whereas each of the first feature map and the one or more second feature maps may be based on a second number of channels. As an example and not by way of limitation, for each frame, the subnetwork may take color and depth as input and generate 8-channel learned features, which are then concatenated with the input color and depth, resulting in 12-channel features in total.
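As an example and not by way of limitation, the channel bookkeeping described above can be sketched in plain Python. The helper name `concat_features` and the nested-list layout (height × width × channels) are assumptions made for illustration, and the learned features are a zero-filled placeholder rather than the output of an actual 3-layer convolutional network.

```python
def concat_features(rgbd, learned):
    """Concatenate two H x W x C maps along the channel axis."""
    return [[px_a + px_b for px_a, px_b in zip(row_a, row_b)]
            for row_a, row_b in zip(rgbd, learned)]


height, width = 2, 3

# Input color (3 channels) plus depth (1 channel) for every pixel.
rgbd = [[[0.5, 0.5, 0.5, 1.0] for _ in range(width)] for _ in range(height)]

# Placeholder for the 8-channel learned features (the CNN is not implemented here).
learned = [[[0.0] * 8 for _ in range(width)] for _ in range(height)]

features = concat_features(rgbd, learned)
print(len(features[0][0]))  # 12 channels per pixel
```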
Temporal Reprojection. To reduce the required receptive field and thus complexity of the reconstruction network, the computing system may apply temporal reprojection to project pixel samples and learned features of each previous frame to the current, by using the rendered motion vectors. In order to fully exploit the subpixel backward motion vectors, the computing system may conduct the temporal reprojection at the target (high) resolution space. First, the computing system may project the pixel samples from input (low) resolution space to the high-resolution space, by zero up-sampling. The zero up-sampling may comprise assigning each input pixel of each of the first feature map and the one or more second feature maps to its corresponding pixel at the target resolution and setting all missing pixels around the input pixel as zeros. The location of each input pixel may fall equally in between s pixels in the high resolution, where s is the up-sampling ratio. Zero up-sampling may be chosen for its efficiency and because it provides the network information on which samples are valid or invalid.
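A minimal sketch of zero up-sampling on a single-channel map follows. The function name `zero_upsample` and the `offset` parameter are hypothetical; in particular, where each input sample lands inside its s×s block is an assumption of this sketch (top-left by default), since the text above only states that the input location falls between high-resolution pixels.

```python
def zero_upsample(image, s, offset=0):
    """Scatter each input pixel into an s-times larger grid; all other pixels stay 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * (w * s) for _ in range(h * s)]
    for y in range(h):
        for x in range(w):
            # Assumption: the sample is written at (offset, offset) inside its s x s block.
            out[y * s + offset][x * s + offset] = image[y][x]
    return out


low_res = [[1.0, 2.0],
           [3.0, 4.0]]
for row in zero_upsample(low_res, 2):
    print(row)
```

Because no interpolation is applied, the original point samples are preserved exactly, and the zeros mark which high-resolution pixels carry no rendered sample.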
In particular embodiments, the computing system may determine the motion estimation between the associated second time and the first time. The determining may comprise identifying a motion vector for the corresponding second frame having the resolution lower than the target resolution and resizing the motion vector to the target resolution based on bilinear up-sampling. The computing system may resize the rendered low-resolution map of motion vectors to high resolution simply by bilinear up-sampling, taking advantage of the fact that the motion vectors are piece-wise smooth. While such simple up-sampling may introduce errors to the up-sampled map at discontinuous regions, it may well recover the majority of regions compared to ground truth. In particular embodiments, warping each of the one or more up-sampled second feature maps may comprise using the motion estimation with bilinear interpolation during warping. In other words, the computing system may apply backward warping of the zero-upsampled previous frames using the up-sampled motion vectors, while bilinear interpolation may be adopted during warping.
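The two steps above (bilinear resizing of the motion vectors, then backward warping with bilinear interpolation) can be sketched as follows. This is an illustrative sketch under several assumptions that are not stated in the text: maps are nested lists, the x and y motion components are stored as separate grids, motion vectors are expressed in low-resolution pixel units (hence the multiplication by `s`), pixel centers are aligned with the half-pixel convention, and out-of-range lookups are clamped to the border.

```python
import math


def bilinear_sample(img, x, y):
    """Bilinearly interpolate a 2D grid at a fractional position (clamped to the border)."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1.0 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1.0 - fx) + img[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


def upsample_motion(mv, s):
    """Resize one motion-vector component to s-times the resolution.

    Assumption: values are in low-resolution pixels, so they are scaled by s.
    """
    h, w = len(mv), len(mv[0])
    return [[s * bilinear_sample(mv, (x + 0.5) / s - 0.5, (y + 0.5) / s - 0.5)
             for x in range(w * s)] for y in range(h * s)]


def backward_warp(prev, mv_x, mv_y):
    """For each current-frame pixel, fetch the previous-frame value the motion vector points at."""
    h, w = len(prev), len(prev[0])
    return [[bilinear_sample(prev, x + mv_x[y][x], y + mv_y[y][x])
             for x in range(w)] for y in range(h)]


# High-resolution previous frame (for example, one zero-upsampled feature channel).
prev_hi = [[0.0, 1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0, 7.0]]
# Low-resolution motion: every pixel came from half a low-res pixel to the right.
mv_x_hi = upsample_motion([[0.5, 0.5]], 2)
mv_y_hi = upsample_motion([[0.0, 0.0]], 2)
warped = backward_warp(prev_hi, mv_x_hi, mv_y_hi)
print(warped)
```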
Performing warping at the zero-upsampled target resolution space may reduce the effect of low-pass interpolation during warping and thus protect the high-frequency information contained in the rendered point samples. This may make the embodiments disclosed herein distinct from existing super-resolution work that typically warps frames at the input low resolution space.
Feature Reweighting. The rendered motion vectors may not reflect dynamic disocclusions or shading changes between frames. Thus, the warped frames may contain artifacts such as ghosting at disocclusion regions and mismatched pixels at inconsistent shading regions.
To address this problem, the embodiments disclosed herein introduce a feature reweighting module to be able to mask out these mismatched samples. In particular embodiments, the computing system may input the up-sampled first feature map and the one or more up-sampled and warped second feature maps to a feature reweighting module. The feature reweighting module may be based on one or more convolutional neural networks. The computing system may generate, by the feature reweighting module, a pixel-wise weighting map for each of the one or more up-sampled and warped second feature maps. The computing system may further multiply the pixel-wise weighting map with the corresponding up-sampled and warped second feature map to generate a reweighted feature map for the corresponding second frame. As an example and not by way of limitation, the feature reweighting module may be a 3-layer convolutional neural network, which may take the RGB-D of the zero-upsampled current frame as well as the zero-upsampled, warped previous frames as input, and generate a pixel-wise weighting map for each previous frame, with values between 0 and 10, where 10 is a hyperparameter. The hyperparameter may be set to allow the learned map to not just attenuate, but also amplify the features per pixel, and empirically the embodiments disclosed herein found the dynamic range of 10 was enough.
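As an example and not by way of limitation, the pixel-wise multiplication can be sketched as below. The names `to_weight` and `reweight` are hypothetical, and mapping a raw network activation into the 0-to-10 range with a scaled tanh is an assumption of this sketch; the text above specifies only the range, not the activation used to obtain it.

```python
import math

MAX_WEIGHT = 10.0  # the dynamic-range hyperparameter described above


def to_weight(raw):
    """Map an unbounded activation into (0, MAX_WEIGHT). Assumption: scaled tanh."""
    return MAX_WEIGHT * 0.5 * (math.tanh(raw) + 1.0)


def reweight(features, weight_map):
    """Multiply every channel of each pixel by that pixel's weight.

    features: H x W x C nested lists; weight_map: H x W nested lists.
    """
    return [[[c * w for c in pixel] for pixel, w in zip(f_row, w_row)]
            for f_row, w_row in zip(features, weight_map)]


warped_prev = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]  # 1 x 2 pixels, 3 channels
weights = [[0.0, 2.0]]                               # mask out pixel 0, amplify pixel 1
print(reweight(warped_prev, weights))
```

A weight of 0 removes a mismatched sample entirely, a weight below 1 attenuates it, and a weight above 1 amplifies it.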
Then each weighting map may be multiplied with all features of the corresponding previous frame. The reason for feeding only RGB-D, instead of the whole 12-channel features, into the reweighting network may be to further reduce the network complexity. The network details are given in
Reconstruction. In particular embodiments, generating the reconstructed frame corresponding to the first frame may comprise combining the up-sampled first feature map and the reweighted feature maps associated with the one or more second frames. Finally, the features of the current frame and the reweighted features of previous frames may be concatenated and fed into a reconstruction network, which may output the recovered high-resolution image of the current frame. In other words, the machine-learning model for generating the reconstructed frame corresponding to the first frame may be based on a convolutional neural network with one or more skip connections. The embodiments disclosed herein adopt a 3-scale, 10-layer U-Net with skip connections for the reconstruction subnetwork. The network details are given in
Color Space. In particular embodiments, the first frame may comprise an RGB image. The computing system may optionally convert the input RGB image of the first frame to a YCbCr image in the YCbCr color space, before feeding it to the neural network. The direct output of the network and the training loss may stay in YCbCr space, before the result is converted back to RGB space for viewing. While optional, the embodiments disclosed herein experimentally find that the color space conversion slightly improves reconstruction quality, i.e., a 0.1 dB improvement in peak signal-to-noise ratio (PSNR).
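A minimal sketch of the color space round trip is given below. The text above does not state which YCbCr variant is used; this sketch assumes full-range ITU-R BT.601 coefficients with components in [0, 1], and the function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion (assumed variant); all components in [0, 1]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 + (b - y) / 1.772
    cr = 0.5 + (r - y) / 1.402
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Inverse of rgb_to_ycbcr, used to bring the network output back to RGB for viewing."""
    r = y + 1.402 * (cr - 0.5)
    b = y + 1.772 * (cb - 0.5)
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


pixel = (0.2, 0.4, 0.6)
restored = ycbcr_to_rgb(*rgb_to_ycbcr(*pixel))
print(restored)
```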
The training loss of our method, as given in Eq. (1), may be a weighted combination of the perceptual loss computed from a pretrained VGG-16 network and the structural similarity index (SSIM).
where x and
In particular embodiments, the computing system may need to render content from stereo images from AR/VR headsets. In this case, the computing system may additionally leverage the particular information provided by AR/VR headsets for reconstruction. As mentioned above, the computing system may use previous frames to provide additional information to help fill in the missing information of the up-sampled current frame. With AR/VR headsets, for each timestamp, the computing system may need to render a pair of stereo images, one for each eye of a user. The two stereo images may provide slightly different information about the same scene since they are rendered from different viewpoints. Such a difference may be considered as additional information, which may be conceptually similar to the previous frames. As an example and not by way of limitation, the first frame may comprise a first stereo image captured by a first camera. Each of the one or more second frames may comprise a second stereo image captured by a second camera. For two stereo images, when up-sampling the first stereo image, the computing system may use the second stereo image to provide the additional information needed for filling in the missing information. Similar to the previous frames, the computing system may extract features from the second stereo image and up-sample the features and the RGB-D information, as the second stereo image may also be generated at low resolution like the first stereo image. Then the computing system may perform warping. The warping may not be based on motion vectors. Instead, since the geometry of the rendered scene (e.g., the depth and location of objects) and the relative position between the first camera taking the first stereo image and the second camera taking the second stereo image are known, the computing system may warp the feature map of the second stereo image to the viewpoint of the first camera.
In other words, warping each of the one or more up-sampled second feature maps may comprise warping the up-sampled second feature map of the second stereo image to a viewpoint of the first camera. The computing system may then perform feature reweighting based on the warped image of the second stereo image. After reweighting, the computing system may further perform reconstruction using a process similar to the one described above.
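As an example and not by way of limitation, the viewpoint warp can be sketched for the simplest stereo geometry. This sketch assumes a rectified pinhole pair (identical intrinsics, second camera displaced only along the horizontal axis), so that the correspondence reduces to a horizontal disparity of focal length × baseline / depth. The function names, the focal length in pixels, and the baseline value are illustrative assumptions; an actual headset renderer may use general projection matrices instead.

```python
def disparity(depth, focal_px, baseline):
    """Horizontal shift, in pixels, between the two views for a point at this depth."""
    return focal_px * baseline / depth


def sample_row(row, x):
    """Linearly interpolate a row at a fractional column (clamped to the border)."""
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(x)
    x1 = min(x0 + 1, len(row) - 1)
    fx = x - x0
    return row[x0] * (1.0 - fx) + row[x1] * fx


def warp_second_to_first(second, first_depth, focal_px, baseline):
    """Backward-warp the second view's map into the first camera's viewpoint.

    Assumption: the second camera sits `baseline` to the right of the first,
    so a point seen at column x in the first view appears at x - disparity
    in the second view.
    """
    return [[sample_row(s_row, x - disparity(z, focal_px, baseline))
             for x, z in enumerate(d_row)]
            for s_row, d_row in zip(second, first_depth)]


second_view = [[10.0, 20.0, 30.0, 40.0]]
depth_first = [[2.0, 2.0, 2.0, 2.0]]
print(warp_second_to_first(second_view, depth_first, focal_px=4.0, baseline=0.5))
```

Occluded regions are not handled in this sketch; as described above, the subsequent feature reweighting step is intended to suppress such mismatched samples.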
Another type of information to leverage may be the head motion of the user wearing an AR/VR headset. The computing system may generate motion vectors based on the head motion instead of using the motion vectors provided by game engines. In particular embodiments, the first frame and the one or more second frames may be received from a client device. The first frame and the one or more second frames may be associated with a head motion detected by the client device. Accordingly, the motion estimation may be determined based on the head motion.
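One way such head-motion-based motion vectors could be derived is by reprojecting each pixel with its depth, sketched below. This is an illustrative sketch only: it assumes a pinhole camera with focal length in pixels, depth measured along the optical axis, a static scene, and a known relative pose (`rot`, `trans`) that maps points from the current camera frame to the previous camera frame. None of these names or conventions come from the text above.

```python
def head_motion_vector(u, v, depth, focal_px, cx, cy, rot, trans):
    """Return (du, dv) pointing from the current pixel to its previous-frame location."""
    # Unproject the current pixel to a 3D point in the current camera frame.
    p = ((u - cx) * depth / focal_px, (v - cy) * depth / focal_px, depth)
    # Move the point into the previous camera frame using the head-motion pose.
    q = [sum(rot[i][j] * p[j] for j in range(3)) + trans[i] for i in range(3)]
    # Project into the previous frame's image plane.
    u_prev = focal_px * q[0] / q[2] + cx
    v_prev = focal_px * q[1] / q[2] + cy
    return u_prev - u, v_prev - v


identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# A small sideways head translation between the two frames.
mv = head_motion_vector(60.0, 40.0, depth=2.0, focal_px=100.0, cx=50.0, cy=50.0,
                        rot=identity, trans=[0.1, 0.0, 0.0])
print(mv)
```

Motion vectors computed this way account only for camera motion, so independently moving objects would still need renderer-provided vectors or would be treated as mismatches by the reweighting step.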
In particular embodiments, the computing system may train a separate network for each 3D scene unless specified in the experiments. Large datasets may be necessary for training robust networks. The embodiments disclosed herein collected several representative, dynamic scenes in Unity and built a large-scale dataset generation program to render the training and test data. The program replays head motions that were captured from user studies in a VR headset, and renders color, depth and motion vectors of every frame.
Specifically, the computing system may render 100 videos from each scene, and each video contains 60 frames. Each video's camera starts from a random position in the scene and moves as defined in a pre-captured head motion path that is randomly selected for each video from a large candidate pool. For reference images, the computing system may first render the images at 4800×2700 with 8×MSAA and then downscale the images to 1600×900 with 3×3 box filters to further reduce aliasing. For low-resolution input images, the computing system may turn off MSAA and adjust mip level bias for texture sampling to match the selected mip level with the full resolution images. The mip level bias approach may be applied to reduce prefiltering in the rendered low-resolution images and may be similarly done in existing super-sampling algorithms such as TAAU.
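The reference-image downscaling step (4800×2700 to 1600×900 with 3×3 box filters) can be sketched on a single-channel image as follows. The function name `box_downscale` and the nested-list representation are assumptions for illustration.

```python
def box_downscale(image, k):
    """Average each non-overlapping k x k block into one output pixel."""
    h, w = len(image) // k, len(image[0]) // k
    return [[sum(image[y * k + dy][x * k + dx]
                 for dy in range(k) for dx in range(k)) / (k * k)
             for x in range(w)] for y in range(h)]


tile = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]
print(box_downscale(tile, 3))     # one pixel holding the block mean
print(4800 // 3, 2700 // 3)       # reference resolution after the 3x3 box filter
```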
During training, 80 videos are used to generate training batches, 10 for validation batches, and the remaining 10 are for testing. For training and validation, the computing system may divide the images into overlapped patches with resolution 256×256 pixels, while for testing the computing system may run the network on the full frames with 1600×900 pixels. Our network may be fully convolutional, so it may be able to take any resolution as input.
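As an example and not by way of limitation, overlapped 256×256 patches over a 1600×900 frame could be enumerated as sketched below. The stride of 128 pixels (a 50% overlap) and the extra border-aligned patch are assumptions of this sketch; the text above states only that the patches overlap.

```python
def patch_starts(length, patch, stride):
    """Start offsets along one axis so that patches cover the full extent."""
    if length < patch:
        raise ValueError("image is smaller than the patch size")
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # extra patch flush with the far border
    return starts


def patch_origins(height, width, patch=256, stride=128):
    """Top-left (y, x) corners of all overlapped patches."""
    return [(y, x)
            for y in patch_starts(height, patch, stride)
            for x in patch_starts(width, patch, stride)]


origins = patch_origins(900, 1600)
print(len(origins))  # number of patches per frame under the assumed stride
```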
The computing system may train our networks with TensorFlow. The network weights may be initialized following a conventional work. The ADAM method with default hyperparameters may be used for training optimization, with learning rate 1e−4, batch size 8, and 100 epochs of the data. Each network may take around 1.5 days to train on a Titan V GPU.
After training, the network models may be optimized with Nvidia TensorRT at 16-bit precision and tested on a Titan V GPU. In Table 1, the embodiments disclosed herein report the total runtime of our method for 4×4 super-sampling at varying target resolutions, including 720p (1280×720), Oculus Rift (1080×1200) and 1080p (1920×1080). In Table 2, the embodiments disclosed herein report the runtime breakdown of our method with 4×4 super-sampling at 1080p. The runtime is reported in milliseconds (ms).
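The input resolutions implied by 4×4 super-sampling at the listed target resolutions may be computed as in the following sketch. The helper `input_resolution` is a hypothetical name, and the sketch assumes the target dimensions divide evenly by the ratio, which holds for the three resolutions above.

```python
def input_resolution(target_w, target_h, ratio=4):
    """Input size for ratio x ratio super-sampling to the target size."""
    if target_w % ratio or target_h % ratio:
        raise ValueError("target size must be divisible by the ratio")
    return target_w // ratio, target_h // ratio


TARGETS = {
    "720p": (1280, 720),
    "Oculus Rift": (1080, 1200),
    "1080p": (1920, 1080),
}

inputs = {name: input_resolution(w, h) for name, (w, h) in TARGETS.items()}
# Each target pixel count is ratio * ratio = 16 times the input pixel count.
pixel_ratio = (1920 * 1080) // (inputs["1080p"][0] * inputs["1080p"][1])
```

For example, the 1080p target corresponds to a 480×270 input, so the renderer shades one sixteenth as many pixels.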
To study the trade-off between network complexity and reconstruction quality, in Tables 1, 2 and 3, the embodiments disclosed herein report two flavors of our method, i.e., the primary network, namely “Ours”, and a lighter version, namely “Ours-Fast”. The hyperparameters of the primary network are given in
The embodiments disclosed herein compare our method to several state-of-the-art super-resolution methods, including single image super-resolution methods ESPCN, EDSR and RCAN, and video super-resolution methods VESPCN and DUF. The embodiments disclosed herein re-implemented and trained all the methods on the same datasets as in our method with the same training procedure. For the video super-resolution methods, the embodiments disclosed herein adjusted their networks to take only current and previous frames as input, avoiding any future frames. The number of previous input frames used in the video super-resolution methods is also increased to 4 to match our method.
The embodiments disclosed herein evaluate the results with three quality metrics: peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and spatio-temporal entropic difference (STRRED). PSNR and SSIM are well-known metrics for single image assessment; higher is better. STRRED is widely used for video quality assessment, which includes temporal stability; lower is better. The embodiments disclosed herein evaluate the results on four representative scenes, namely Robots, Village, DanceStudio and Spaceship. In Table 4, the embodiments disclosed herein compare the above quality metrics, averaged over 10 test videos from our dataset.
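Of the three metrics, PSNR has the simplest closed form, PSNR = 10·log10(peak² / MSE), and may be sketched as follows. The function name `psnr` and the assumption that pixel values lie in [0, 1] (peak = 1.0) are illustrative choices; SSIM and STRRED are more involved and are not sketched here.

```python
import numpy as np


def psnr(reference: np.ndarray, test: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))


reference = np.zeros((4, 4, 3))
degraded = np.full((4, 4, 3), 0.1)  # uniform error of 0.1 per pixel
score = psnr(reference, degraded)   # MSE = 0.01, i.e. about 20 dB
```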
In addition, the embodiments disclosed herein compare to the temporal antialiasing upscaling (TAAU) method from Unreal Engine (i.e., a real-time 3D creation platform). The computing system took the Robots scene as an example, and converted it to Unreal to collect the TAAU results.
Rendering Efficiency. The embodiments disclosed herein take the Spaceship scene as a representative scenario to demonstrate how the end-to-end rendering efficiency may be improved by applying our method. The computing system renders on an Nvidia Titan RTX GPU using the expensive and high-quality ray-traced global illumination effect available in Unity. The render pass for a full resolution image may take 140.6 ms at 1600×900. On the other hand, rendering the image at 400×225 takes 26.40 ms, followed by our method, which may take 17.68 ms (the primary network) to up-sample the image to the target 1600×900 resolution, totaling 44.08 ms. This leads to an over 3× rendering performance improvement, while providing high-fidelity results.
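The arithmetic behind the reported improvement may be checked with the short sketch below, which uses only the timings stated above; the variable names are illustrative.

```python
# Timings in milliseconds, as reported for the Spaceship scene.
full_res_render_ms = 140.6   # direct render at 1600x900
low_res_render_ms = 26.40    # render at 400x225
network_ms = 17.68           # primary network, up-sampling to 1600x900

super_sampled_total_ms = low_res_render_ms + network_ms
speedup = full_res_render_ms / super_sampled_total_ms

print(f"total: {super_sampled_total_ms:.2f} ms, speedup: {speedup:.2f}x")
```

The combined cost is 44.08 ms, which gives a speedup of roughly 3.19×, consistent with the "over 3×" figure.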
Generalization. While the embodiments disclosed herein choose to train a network for each scene to maximize its quality, an open question may be how it generalizes across scenes. In Table 3, the embodiments disclosed herein report the quality of our primary network trained jointly on all four scenes (“Ours-AllScenes”) and trained on all scenes but the one tested (“Ours-AllButOne”), respectively, and compare them to the primary network trained on each scene separately (“Ours”). The test quality reduces slightly with Ours-AllScenes (0.05-0.4 dB in PSNR) and more with Ours-AllButOne (0.5-1 dB in PSNR). However, both networks still noticeably outperform all comparison methods that are trained on each scene separately. This indicates that the network may generalize across scenes with different appearance although including the test scenes into training datasets seems to always improve the quality. However, a full evaluation of network generalization may require collecting more scenes.
Previous Frames. In Table 5, the embodiments disclosed herein report the reconstruction quality by using a varying number of previous frames. The quality increases as more previous frames are used. However, the network runtime likewise increases. Of note is that runtime may be dominated by the reconstruction sub-network (Table 2). Only the first layer of this sub-network may be affected by the number of frames, so adding more previous frames may only slightly increase runtime. Thus, applications may vary this parameter to reach a sweet spot in the quality/runtime trade-off.
Super-sampling Ratios. In Table 6, the embodiments disclosed herein report the reconstruction quality of our method with varying super-sampling ratios from 2×2 to 6×6. In this experiment, the embodiments disclosed herein keep the target resolution the same and vary the input image resolution according to the super-sampling ratio. As expected, the reconstruction quality gracefully improves as the super-sampling ratio reduces. Additionally, to verify the performance advantage of our method at varying super-sampling ratios, the embodiments disclosed herein train all existing methods with 2×2 up-sampling and report the results in Table 7. Our method significantly outperforms the existing work.
Quality Gain from Additional Inputs. While our method outperforms all compared methods by a large margin, it is useful to understand the quality gain from its additional depth and motion vector inputs. The embodiments disclosed herein revise the VESPCN method to take the same depth and motion vector input as ours, namely “VESPCN+”, where the motion vectors replace the optical flow estimation module in the original VESPCN and the depth is fed as an additional channel together with the RGB color input. As reported in Table 8, with the additional inputs, VESPCN+ improves moderately (1.1-1.3 dB in PSNR) upon VESPCN; however, it is still noticeably worse (2.2-3.1 dB in PSNR) than our method. This indicates that both the additional inputs and the specifically tailored network design of our method may play important roles in our performance achievement.
Zero-Upsampling and Warping. Our method may project input pixels to the target (high) resolution space by zero-upsampling, and then warp the up-sampled previous frames to the current frame for post-processing. To understand its impact on the reconstruction quality, the embodiments disclosed herein experiment with alternative ways for temporal reprojection, i.e., replacing zero-upsampling with bilinear up-sampling and/or warping at the input (low) resolution space instead, and the results are reported in Table 9. We observe about 1 dB improvement in PSNR by warping at the target resolution compared to at the input resolution, and about 0.3 dB additional improvement by using zero-upsampling compared to bilinear up-sampling. This may indicate the benefit of our approach tailored for effectively leveraging the rendering-specific inputs, i.e., point-sampled color and subpixel-precise motion vectors.
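The two operations discussed above, zero-upsampling followed by warping at the target resolution, may be illustrated with the following NumPy sketch. Several details are assumptions made for illustration: each input pixel is placed at the top-left position of its ratio×ratio block, the motion vectors are given in target-resolution pixels as (dx, dy) offsets from a current-frame pixel to its location in the previous frame, and out-of-bounds lookups are clamped to the border. The function names are hypothetical.

```python
import numpy as np


def zero_upsample(feat: np.ndarray, ratio: int) -> np.ndarray:
    """Place each input pixel at the target resolution; fill the rest with 0."""
    h, w, c = feat.shape
    out = np.zeros((h * ratio, w * ratio, c), dtype=feat.dtype)
    out[::ratio, ::ratio] = feat  # assumption: top-left of each block
    return out


def backward_warp(prev: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Resample prev (H x W x C) at positions offset by motion (H x W x 2)."""
    h, w, _ = prev.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + motion[..., 0], 0, w - 1)  # source x, clamped
    sy = np.clip(ys + motion[..., 1], 0, h - 1)  # source y, clamped
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx = (sx - x0)[..., None]
    fy = (sy - y0)[..., None]
    # Bilinear interpolation between the four neighboring samples.
    top = prev[y0, x0] * (1 - fx) + prev[y0, x1] * fx
    bottom = prev[y1, x0] * (1 - fx) + prev[y1, x1] * fx
    return top * (1 - fy) + bottom * fy


low_res = np.array([[1.0, 2.0], [3.0, 4.0]]).reshape(2, 2, 1)
up = zero_upsample(low_res, ratio=2)              # 4 x 4 x 1, mostly zeros
shift_right = np.zeros((4, 4, 2))
shift_right[..., 0] = 1.0                         # look one pixel to the right
warped = backward_warp(up, shift_right)
```

Because the up-sampled map keeps the original point samples intact and leaves the remaining positions at zero, the warp moves each sample with the precision of the motion vectors rather than blending it with interpolated neighbors beforehand.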
Network Modules. In Table 10, the embodiments disclosed herein report the ablation experiments for analyzing the quality improvements from the feature extraction and feature reweighting modules. Average results are reported on the 10 test videos of the Robots scene. While the numeric results show only minor improvements from the reweighting module, the results are averaged over large amounts of data, and the regions affected by disocclusion and mismatched pixels (the parts of images most impacted by this module) only make up a relatively small part of the images.
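The effect of the reweighting module on the warped features may be sketched as a pixel-wise multiplication, as recited in the claims. In the sketch below the weighting map is written by hand purely for illustration, whereas in the embodiments it is produced by a convolutional neural network; combining the feature maps by channel concatenation is likewise an assumption, and the function names are hypothetical.

```python
import numpy as np


def reweight(warped_features: np.ndarray, weight_map: np.ndarray) -> np.ndarray:
    """Scale every channel of each pixel by that pixel's weight."""
    if weight_map.shape != warped_features.shape[:2] + (1,):
        raise ValueError("weight map must have shape H x W x 1")
    return warped_features * weight_map  # broadcast over the channel axis


def combine(current_features: np.ndarray, reweighted: np.ndarray) -> np.ndarray:
    """Assumption: feature maps are combined by channel concatenation."""
    return np.concatenate([current_features, reweighted], axis=-1)


current = np.ones((2, 2, 3))
warped = np.full((2, 2, 3), 2.0)
# Hand-written map: 0 suppresses a mismatched pixel, 1 keeps it unchanged.
weights = np.array([[1.0, 0.0], [0.5, 1.0]]).reshape(2, 2, 1)
reweighted = reweight(warped, weights)
combined = combine(current, reweighted)
```

A weight near zero removes the contribution of a disoccluded or mismatched pixel, which is why the module mainly affects a small portion of each image.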
Discussion with DLSS. While DLSS (i.e., a conventional work) also aims for learned super-sampling of rendered content, no public information is available on the details of its algorithm, performance or training datasets, which may make direct comparisons impossible. Instead, the embodiments disclosed herein provide a preliminary ballpark analysis of its quality performance with respect to our method, albeit on different types of scenes. Specifically, the embodiments disclosed herein took the game “Islands of Nyne” supporting DLSS as an example, and captured two pairs of representative screenshots, where each pair of screenshots includes the DLSS-upsampled image and the full-resolution image with no up-sampling, both at 4K resolution. The content is chosen to be similar to our Spaceship and Robots scenes in terms of geometric and material complexity, with metallic (glossy) boxes and walls and some thin structures (railings, geometric floor tiles). The computing system computed the PSNR and SSIM of the up-sampled images after masking out mismatched pixels due to dynamic objects, plotted the numerical quality as a distribution, and added our result's quality to the same chart.
Embodiments of the invention may include or be implemented in conjunction with an artificial reality system. Artificial reality is a form of reality that has been adjusted in some manner before presentation to a user, which may include, e.g., a virtual reality (VR), an augmented-reality (AR), a mixed reality (MR), a hybrid reality, or some combination and/or derivatives thereof. Artificial reality content may include completely generated content or generated content combined with captured content (e.g., real-world photographs). The artificial reality content may include video, audio, haptic feedback, or some combination thereof, and any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer). Additionally, in some embodiments, artificial reality may be associated with applications, products, accessories, services, or some combination thereof, that are, e.g., used to create content in an artificial reality and/or used in (e.g., perform activities in) an artificial reality. The artificial reality system that provides the artificial reality content may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, or any other hardware platform capable of providing artificial reality content to one or more viewers.
This disclosure contemplates any suitable number of computer systems 1300. This disclosure contemplates computer system 1300 taking any suitable physical form. As an example and not by way of limitation, computer system 1300 may be an embedded computer system, a system-on-chip (SOC), a single-board computer system (SBC) (such as, for example, a computer-on-module (COM) or system-on-module (SOM)), a desktop computer system, a laptop or notebook computer system, an interactive kiosk, a mainframe, a mesh of computer systems, a mobile telephone, a personal digital assistant (PDA), a server, a tablet computer system, or a combination of two or more of these. Where appropriate, computer system 1300 may include one or more computer systems 1300; be unitary or distributed; span multiple locations; span multiple machines; span multiple data centers; or reside in a cloud, which may include one or more cloud components in one or more networks. Where appropriate, one or more computer systems 1300 may perform without substantial spatial or temporal limitation one or more steps of one or more methods described or illustrated herein. As an example and not by way of limitation, one or more computer systems 1300 may perform in real time or in batch mode one or more steps of one or more methods described or illustrated herein. One or more computer systems 1300 may perform at different times or at different locations one or more steps of one or more methods described or illustrated herein, where appropriate.
In particular embodiments, computer system 1300 includes a processor 1302, memory 1304, storage 1306, an input/output (I/O) interface 1308, a communication interface 1310, and a bus 1312. Although this disclosure describes and illustrates a particular computer system having a particular number of particular components in a particular arrangement, this disclosure contemplates any suitable computer system having any suitable number of any suitable components in any suitable arrangement.
In particular embodiments, processor 1302 includes hardware for executing instructions, such as those making up a computer program. As an example and not by way of limitation, to execute instructions, processor 1302 may retrieve (or fetch) the instructions from an internal register, an internal cache, memory 1304, or storage 1306; decode and execute them; and then write one or more results to an internal register, an internal cache, memory 1304, or storage 1306. In particular embodiments, processor 1302 may include one or more internal caches for data, instructions, or addresses. This disclosure contemplates processor 1302 including any suitable number of any suitable internal caches, where appropriate. As an example and not by way of limitation, processor 1302 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction caches may be copies of instructions in memory 1304 or storage 1306, and the instruction caches may speed up retrieval of those instructions by processor 1302. Data in the data caches may be copies of data in memory 1304 or storage 1306 for instructions executing at processor 1302 to operate on; the results of previous instructions executed at processor 1302 for access by subsequent instructions executing at processor 1302 or for writing to memory 1304 or storage 1306; or other suitable data. The data caches may speed up read or write operations by processor 1302. The TLBs may speed up virtual-address translation for processor 1302. In particular embodiments, processor 1302 may include one or more internal registers for data, instructions, or addresses. This disclosure contemplates processor 1302 including any suitable number of any suitable internal registers, where appropriate. Where appropriate, processor 1302 may include one or more arithmetic logic units (ALUs); be a multi-core processor; or include one or more processors 1302. 
Although this disclosure describes and illustrates a particular processor, this disclosure contemplates any suitable processor.
In particular embodiments, memory 1304 includes main memory for storing instructions for processor 1302 to execute or data for processor 1302 to operate on. As an example and not by way of limitation, computer system 1300 may load instructions from storage 1306 or another source (such as, for example, another computer system 1300) to memory 1304. Processor 1302 may then load the instructions from memory 1304 to an internal register or internal cache. To execute the instructions, processor 1302 may retrieve the instructions from the internal register or internal cache and decode them. During or after execution of the instructions, processor 1302 may write one or more results (which may be intermediate or final results) to the internal register or internal cache. Processor 1302 may then write one or more of those results to memory 1304. In particular embodiments, processor 1302 executes only instructions in one or more internal registers or internal caches or in memory 1304 (as opposed to storage 1306 or elsewhere) and operates only on data in one or more internal registers or internal caches or in memory 1304 (as opposed to storage 1306 or elsewhere). One or more memory buses (which may each include an address bus and a data bus) may couple processor 1302 to memory 1304. Bus 1312 may include one or more memory buses, as described below. In particular embodiments, one or more memory management units (MMUs) reside between processor 1302 and memory 1304 and facilitate accesses to memory 1304 requested by processor 1302. In particular embodiments, memory 1304 includes random access memory (RAM). This RAM may be volatile memory, where appropriate. Where appropriate, this RAM may be dynamic RAM (DRAM) or static RAM (SRAM). Moreover, where appropriate, this RAM may be single-ported or multi-ported RAM. This disclosure contemplates any suitable RAM. Memory 1304 may include one or more memories 1304, where appropriate. 
Although this disclosure describes and illustrates particular memory, this disclosure contemplates any suitable memory.
In particular embodiments, storage 1306 includes mass storage for data or instructions. As an example and not by way of limitation, storage 1306 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a Universal Serial Bus (USB) drive or a combination of two or more of these. Storage 1306 may include removable or non-removable (or fixed) media, where appropriate. Storage 1306 may be internal or external to computer system 1300, where appropriate. In particular embodiments, storage 1306 is non-volatile, solid-state memory. In particular embodiments, storage 1306 includes read-only memory (ROM). Where appropriate, this ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory or a combination of two or more of these. This disclosure contemplates mass storage 1306 taking any suitable physical form. Storage 1306 may include one or more storage control units facilitating communication between processor 1302 and storage 1306, where appropriate. Where appropriate, storage 1306 may include one or more storages 1306. Although this disclosure describes and illustrates particular storage, this disclosure contemplates any suitable storage.
In particular embodiments, I/O interface 1308 includes hardware, software, or both, providing one or more interfaces for communication between computer system 1300 and one or more I/O devices. Computer system 1300 may include one or more of these I/O devices, where appropriate. One or more of these I/O devices may enable communication between a person and computer system 1300. As an example and not by way of limitation, an I/O device may include a keyboard, keypad, microphone, monitor, mouse, printer, scanner, speaker, still camera, stylus, tablet, touch screen, trackball, video camera, another suitable I/O device or a combination of two or more of these. An I/O device may include one or more sensors. This disclosure contemplates any suitable I/O devices and any suitable I/O interfaces 1308 for them. Where appropriate, I/O interface 1308 may include one or more device or software drivers enabling processor 1302 to drive one or more of these I/O devices. I/O interface 1308 may include one or more I/O interfaces 1308, where appropriate. Although this disclosure describes and illustrates a particular I/O interface, this disclosure contemplates any suitable I/O interface.
In particular embodiments, communication interface 1310 includes hardware, software, or both providing one or more interfaces for communication (such as, for example, packet-based communication) between computer system 1300 and one or more other computer systems 1300 or one or more networks. As an example and not by way of limitation, communication interface 1310 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network. This disclosure contemplates any suitable network and any suitable communication interface 1310 for it. As an example and not by way of limitation, computer system 1300 may communicate with an ad hoc network, a personal area network (PAN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), or one or more portions of the Internet or a combination of two or more of these. One or more portions of one or more of these networks may be wired or wireless. As an example, computer system 1300 may communicate with a wireless PAN (WPAN) (such as, for example, a BLUETOOTH WPAN), a WI-FI network, a WI-MAX network, a cellular telephone network (such as, for example, a Global System for Mobile Communications (GSM) network), or other suitable wireless network or a combination of two or more of these. Computer system 1300 may include any suitable communication interface 1310 for any of these networks, where appropriate. Communication interface 1310 may include one or more communication interfaces 1310, where appropriate. Although this disclosure describes and illustrates a particular communication interface, this disclosure contemplates any suitable communication interface.
In particular embodiments, bus 1312 includes hardware, software, or both coupling components of computer system 1300 to each other. As an example and not by way of limitation, bus 1312 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HYPERTRANSPORT (HT) interconnect, an Industry Standard Architecture (ISA) bus, an INFINIBAND interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus or a combination of two or more of these. Bus 1312 may include one or more buses 1312, where appropriate. Although this disclosure describes and illustrates a particular bus, this disclosure contemplates any suitable bus or interconnect.
Herein, a computer-readable non-transitory storage medium or media may include one or more semiconductor-based or other integrated circuits (ICs) (such as, for example, field-programmable gate arrays (FPGAs) or application-specific ICs (ASICs)), hard disk drives (HDDs), hybrid hard drives (HHDs), optical discs, optical disc drives (ODDs), magneto-optical discs, magneto-optical drives, floppy diskettes, floppy disk drives (FDDs), magnetic tapes, solid-state drives (SSDs), RAM-drives, SECURE DIGITAL cards or drives, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these, where appropriate. A computer-readable non-transitory storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile, where appropriate.
Herein, “or” is inclusive and not exclusive, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A or B” means “A, B, or both,” unless expressly indicated otherwise or indicated otherwise by context. Moreover, “and” is both joint and several, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A and B” means “A and B, jointly or severally,” unless expressly indicated otherwise or indicated otherwise by context.
The scope of this disclosure encompasses all changes, substitutions, variations, alterations, and modifications to the example embodiments described or illustrated herein that a person having ordinary skill in the art would comprehend. The scope of this disclosure is not limited to the example embodiments described or illustrated herein. Moreover, although this disclosure describes and illustrates respective embodiments herein as including particular components, elements, features, functions, operations, or steps, any of these embodiments may include any combination or permutation of any of the components, elements, features, functions, operations, or steps described or illustrated anywhere herein that a person having ordinary skill in the art would comprehend. Furthermore, reference in the appended claims to an apparatus or system or a component of an apparatus or system being adapted to, arranged to, capable of, configured to, enabled to, operable to, or operative to perform a particular function encompasses that apparatus, system, component, whether or not it or that particular function is activated, turned on, or unlocked, as long as that apparatus, system, or component is so adapted, arranged, capable, configured, enabled, operable, or operative. Additionally, although this disclosure describes or illustrates particular embodiments as providing particular advantages, particular embodiments may provide none, some, or all of these advantages.
Claims
1. A method comprising, by one or more computing systems:
- receiving a pair of stereo images having a resolution lower than a target resolution;
- generating (a) an initial first feature map for a first image of the pair of stereo images based on one or more first channels associated with the first image and (b) an initial second feature map for a second image of the pair of stereo images based on one or more second channels associated with the second image;
- generating a first feature map based on combining the one or more first channels with the initial first feature map;
- generating a second feature map based on combining the one or more second channels with the initial second feature map;
- up-sampling the first feature map and the second feature map to the target resolution;
- warping the up-sampled second feature map; and
- generating a reconstructed image corresponding to the first image having the target resolution based on the up-sampled first feature map and the up-sampled and warped second feature map.
2. The method of claim 1, wherein the first or second image comprises an RGB image with depth information.
3. The method of claim 2, further comprising:
- converting the RGB image with depth information to a YCbCr image.
4. The method of claim 1, wherein generating the first or second feature map is based on one or more convolutional neural networks.
5. The method of claim 1, wherein each of the initial first feature map and the initial second feature map is based on a first number of channels, and wherein each of the first feature map and the second feature map is based on a second number of channels.
6. The method of claim 1, wherein up-sampling the first feature map and the second feature map to the target resolution is based on zero up-sampling, wherein the zero up-sampling comprises:
- assigning each input pixel of each of the first feature map and the second feature map to its corresponding pixel at the target resolution; and
- setting all missing pixels around the input pixel as zeros.
7. The method of claim 1, wherein warping the up-sampled second feature map is based on a motion estimation associated with the pair of stereo images.
8. The method of claim 7, wherein the pair of stereo images are received from a client device, wherein the method further comprises determining the motion estimation based on a head motion detected by the client device, comprising:
- identifying a motion vector based on the head motion; and
- resizing the motion vector to the target resolution based on bilinear up-sampling.
9. The method of claim 7, wherein warping the up-sampled second feature map comprises using the motion estimation with bilinear interpolation during warping.
10. The method of claim 1, wherein up-sampling the first feature map to the target resolution is based on information associated with the second image.
11. The method of claim 1, further comprising:
- inputting the up-sampled first feature map and the up-sampled and warped second feature map to a feature reweighting module, wherein the feature reweighting module is based on one or more convolutional neural networks.
12. The method of claim 11, further comprising:
- generating, by the feature reweighting module, a pixel-wise weighting map for the up-sampled and warped second feature map; and
- multiplying the pixel-wise weighting map with the up-sampled and warped second feature map to generate a reweighted feature map for the second image.
13. The method of claim 12, wherein generating the reconstructed image corresponding to the first image comprises:
- combining the up-sampled first feature map and the reweighted feature map for the second image.
14. The method of claim 1, wherein generating the reconstructed image corresponding to the first image is based on a machine-learning model, wherein the machine-learning model is based on a convolutional neural network with one or more skip connections.
15. The method of claim 1, wherein the first image is captured by a first camera, wherein the second image is captured by a second camera.
16. The method of claim 15, wherein warping the up-sampled second feature map comprises:
- warping the up-sampled second feature map of the second image to a viewpoint of the first camera.
17. One or more computer-readable non-transitory storage media embodying software that is operable when executed to:
- receive a pair of stereo images having a resolution lower than a target resolution;
- generate (a) an initial first feature map for a first image of the pair of stereo images based on one or more first channels associated with the first image and (b) an initial second feature map for a second image of the pair of stereo images based on one or more second channels associated with the second image;
- generate a first feature map based on combining the one or more first channels with the initial first feature map;
- generate a second feature map based on combining the one or more second channels with the initial second feature map;
- up-sample the first feature map and the second feature map to the target resolution;
- warp the up-sampled second feature map; and
- generate a reconstructed image corresponding to the first image having the target resolution based on the up-sampled first feature map and the up-sampled and warped second feature map.
18. A system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to:
- receive a pair of stereo images having a resolution lower than a target resolution;
- generate (a) an initial first feature map for a first image of the pair of stereo images based on one or more first channels associated with the first image and (b) an initial second feature map for a second image of the pair of stereo images based on one or more second channels associated with the second image;
- generate a first feature map based on combining the one or more first channels with the initial first feature map;
- generate a second feature map based on combining the one or more second channels with the initial second feature map;
- up-sample the first feature map and the second feature map to the target resolution;
- warp the up-sampled second feature map; and
- generate a reconstructed image corresponding to the first image having the target resolution based on the up-sampled first feature map and the up-sampled and warped second feature map.
Type: Application
Filed: May 16, 2022
Publication Date: Sep 1, 2022
Inventors: Lei Xiao (Redmond, WA), Salah Eddine Nouri (Bellevue, WA), Douglas Robert Lanman (Bellevue, WA), Anton S Kaplanyan (Redmond, WA), Alexander Jobe Fix (Seattle, WA), Matthew Steven Chapman (Redmond, WA)
Application Number: 17/745,641