Patents by Inventor Willi Menapace
Willi Menapace has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20260148438
Abstract: Hierarchical patch-wise diffusion models (HPDMs) use a diffusion paradigm that learns a hierarchical distribution of patches instead of whole videos for efficient patch-wise training of diffusion models. To enforce consistency between the patches, deep context fusion may be used to propagate the context information from low-scale to high-scale patches in a hierarchical manner. To accelerate patch-wise training and inference, adaptive computation also may be used to allocate more computational resources and network capacity towards coarse image details and to cheapen synthesis of high-frequency texture details. All the processing stages are jointly trained to provide spatially aligned global context to the higher levels of the cascade. As a result, the model does not operate on the full-resolution inputs, which allows the model to be trained on high-resolution video datasets in an end-to-end fashion.
Type: Application
Filed: December 11, 2025
Publication date: May 28, 2026
Inventors: Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Sergey Tulyakov
-
Publication number: 20260112124
Abstract: Examples relate to systems and methods for generating digital effects experiences. The system performs operations including accessing a set of instructions that defines a digital effects experience. The system processes the set of instructions by a generative machine learning model to generate one or more digital effects comprising the digital effects experience. The system continuously processes one or more inputs, received by a user device, while the one or more digital effects comprising the digital effects experience are presented on the user device, along with the set of instructions in real time by the generative machine learning model to update presentation of the one or more digital effects comprising the digital effects experience.
Type: Application
Filed: October 18, 2024
Publication date: April 23, 2026
Inventors: Willi Menapace, Robert Cornelius Murphy, Aliaksandr Siarohin, Aleksei Stoliar, Sergey Tulyakov
-
Publication number: 20260089303
Abstract: A method for generating photorealistic 4D scenes from text inputs is disclosed. The method utilizes a text-to-video diffusion model to generate a reference video and a freeze-time video. A canonical 3D representation is reconstructed using deformable 3D Gaussian Splats (D-3DGS) based on the freeze-time video. Temporal deformations are learned to capture dynamic interactions in the reference video. The method employs a novel Score Distillation Sampling strategy combining multi-view and temporal aspects to enhance consistency and robustness. The resulting 4D scenes feature multiple objects interacting with detailed background environments, viewable from different angles and times. The method enables flexible camera control and integration with augmented and virtual reality applications. Some examples include features such as image-to-4D generation.
Type: Application
Filed: September 23, 2024
Publication date: March 26, 2026
Inventors: Hsin-Ying Lee, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Chaoyang Wang, Heng Yu, Peiye Zhuang
-
Publication number: 20260073578
Abstract: The present disclosure addresses technological challenges arising in the field of artificial intelligence (AI) with respect to inefficient use of computing resources and runtime delay. In particular, the present disclosure provides for development of a machine learning model that generates an image sample for a video in a single forward pass. The development of this machine learning model uses an adversarial training approach involving training two machine learning models, a generator model and a discriminator model. With the generator model trained in this way, the generator model can be used to generate image samples for a video in a single forward pass.
Type: Application
Filed: September 10, 2024
Publication date: March 12, 2026
Inventors: Junli Cao, Anil Kag, Yanyu Li, Willi Menapace, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Sergey Tulyakov, Yushu Wu, Zhixing Zhang
-
Publication number: 20260051121
Abstract: An environment synthesis framework generates virtual environments from a synthesized two-dimensional (2D) satellite map of a geographic area, a three-dimensional (3D) voxel environment, and a voxel-based neural rendering framework. In an example implementation, the synthesized 2D satellite map is generated by a map synthesis generative adversarial network (GAN) which is trained using sample city datasets. The multi-stage framework lifts the 2D map into a set of 3D octrees, generates an octree-based 3D voxel environment, and then converts it into a texturized 3D virtual environment using a neural rendering GAN and a set of pseudo ground truth images. The resulting 3D virtual environment is texturized, lifelike, editable, traversable in virtual reality (VR) and augmented reality (AR) experiences, and very large in scale.
Type: Application
Filed: October 23, 2025
Publication date: February 19, 2026
Inventors: Menglei Chai, Hsin-Ying Lee, Chieh Lin, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov
-
Patent number: 12524925
Abstract: Hierarchical patch-wise diffusion models (HPDMs) use a diffusion paradigm that learns a hierarchical distribution of patches instead of whole videos for efficient patch-wise training of diffusion models. To enforce consistency between the patches, deep context fusion may be used to propagate the context information from low-scale to high-scale patches in a hierarchical manner. To accelerate patch-wise training and inference, adaptive computation also may be used to allocate more computational resources and network capacity towards coarse image details and to cheapen synthesis of high-frequency texture details. All the processing stages are jointly trained to provide spatially aligned global context to the higher levels of the cascade. As a result, the model does not operate on the full-resolution inputs, which allows the model to be trained on high-resolution video datasets in an end-to-end fashion.
Type: Grant
Filed: February 9, 2024
Date of Patent: January 13, 2026
Assignee: Snap Inc.
Inventors: Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Sergey Tulyakov
-
Publication number: 20250356569
Abstract: Unsupervised volumetric 3D animation (UVA) of non-rigid deformable objects without annotations learns the 3D structure and dynamics of objects solely from single-view red/green/blue (RGB) videos and decomposes the single-view RGB videos into semantically meaningful parts that can be tracked and animated. Using a 3D autodecoder framework, paired with a keypoint estimator via a differentiable perspective-n-point (PnP) algorithm, the UVA model learns the underlying object 3D geometry and parts decomposition in an entirely unsupervised manner from still or video images. This allows the UVA model to perform 3D segmentation, 3D keypoint estimation, novel view synthesis, and animation. The UVA model can obtain animatable 3D objects from a single or a few images. The UVA method also features a space in which all objects are represented in their canonical, animation-ready form. Applications include the creation of lenses from images or videos for social media applications.
Type: Application
Filed: July 29, 2025
Publication date: November 20, 2025
Inventors: Menglei Chai, Hsin-Ying Lee, Willi Menapace, Kyle Olszewski, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Sergey Tulyakov
-
Publication number: 20250345710
Abstract: A framework trains game-engine-like neural models from annotated videos to generate a Learnable Game Engine (LGE) that maintains states of the scene, objects and agents in it, and enables rendering the environment from a controllable viewpoint. The LGE models the logic of the game and the rules of physics, making it possible for the user to play the game by specifying both high- and low-level action sequences. The LGE also unlocks a director's mode where the game is played by plotting behind the scenes, specifying high-level actions and goals for the agents using text-based instructions. To implement the director's mode, a trained diffusion-based animation model navigates the scene using high-level constraints, to enable play against an adversary, and to devise the strategy to win a point. To render the resulting state of the environment and its agents, a compositional neural radiance field (NeRF) representation is used in a synthesis model.
Type: Application
Filed: July 16, 2025
Publication date: November 13, 2025
Inventors: Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov
-
Patent number: 12469217
Abstract: An environment synthesis framework generates virtual environments from a synthesized two-dimensional (2D) satellite map of a geographic area, a three-dimensional (3D) voxel environment, and a voxel-based neural rendering framework. In an example implementation, the synthesized 2D satellite map is generated by a map synthesis generative adversarial network (GAN) which is trained using sample city datasets. The multi-stage framework lifts the 2D map into a set of 3D octrees, generates an octree-based 3D voxel environment, and then converts it into a texturized 3D virtual environment using a neural rendering GAN and a set of pseudo ground truth images. The resulting 3D virtual environment is texturized, lifelike, editable, traversable in virtual reality (VR) and augmented reality (AR) experiences, and very large in scale.
Type: Grant
Filed: December 29, 2022
Date of Patent: November 11, 2025
Assignee: Snap Inc.
Inventors: Menglei Chai, Hsin-Ying Lee, Chieh Lin, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov
-
Patent number: 12400388
Abstract: Unsupervised volumetric 3D animation (UVA) of non-rigid deformable objects without annotations learns the 3D structure and dynamics of objects solely from single-view red/green/blue (RGB) videos and decomposes the single-view RGB videos into semantically meaningful parts that can be tracked and animated. Using a 3D autodecoder framework, paired with a keypoint estimator via a differentiable perspective-n-point (PnP) algorithm, the UVA model learns the underlying object 3D geometry and parts decomposition in an entirely unsupervised manner from still or video images. This allows the UVA model to perform 3D segmentation, 3D keypoint estimation, novel view synthesis, and animation. The UVA model can obtain animatable 3D objects from a single or a few images. The UVA method also features a space in which all objects are represented in their canonical, animation-ready form. Applications include the creation of lenses from images or videos for social media applications.
Type: Grant
Filed: December 28, 2022
Date of Patent: August 26, 2025
Assignee: Snap Inc.
Inventors: Menglei Chai, Hsin-Ying Lee, Willi Menapace, Kyle Olszewski, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Sergey Tulyakov
-
Publication number: 20250265448
Abstract: An asymmetrically distributed convolution-attention neural network (AsCAN) includes a simple hybrid architecture in which the number of convolutional and transformer blocks is asymmetrically distributed in different processing stages. AsCAN adopts more convolutional blocks in the early processing stages, where the feature maps have relatively large spatial sizes, and more transformer blocks at the later processing stages. Transformer layers are incorporated in the early processing stages as well, except that fewer transformer blocks are used compared to convolutions in the early part. This trend is reversed at the lower resolution in the later processing stages. This uneven distribution of the convolutional and transformer blocks yields better throughput due to improved accelerator utilization at various batch sizes during the inference stage.
Type: Application
Filed: February 21, 2024
Publication date: August 21, 2025
Inventors: Junli Cao, Anil Kag, Willi Menapace, Jian Ren, Aliaksandr Siarohin, Sergey Tulyakov
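Illustrative note: the asymmetric block distribution described in the abstract above can be sketched as a per-stage layout. This is a minimal sketch only, not the patented architecture. The function name `ascan_stage_layout`, the four-stage setup, the conv fractions, and the choice to place convolutional blocks before transformer blocks within a stage are all assumptions made for illustration.

```python
def ascan_stage_layout(blocks_per_stage, conv_fraction_per_stage):
    """Return one list of block kinds ("conv" or "transformer") per stage.

    Hypothetical helper: the abstract states only that early stages use
    more convolutional blocks and later stages use more transformer
    blocks. The exact counts and ordering here are assumptions.
    """
    if len(blocks_per_stage) != len(conv_fraction_per_stage):
        raise ValueError("need one conv fraction per stage")
    layout = []
    for n_blocks, conv_fraction in zip(blocks_per_stage, conv_fraction_per_stage):
        if not 0.0 <= conv_fraction <= 1.0:
            raise ValueError("conv fraction must lie in [0, 1]")
        n_conv = round(n_blocks * conv_fraction)
        # Assumed ordering: convolutional blocks first, then transformer blocks.
        layout.append(["conv"] * n_conv + ["transformer"] * (n_blocks - n_conv))
    return layout


# Assumed example: four stages of four blocks each, conv-heavy early
# (large feature maps) and transformer-heavy late (small feature maps).
layout = ascan_stage_layout([4, 4, 4, 4], [0.75, 0.75, 0.25, 0.25])
for stage_index, blocks in enumerate(layout):
    print(stage_index, blocks)
```

Both block kinds appear in every stage of this example, which matches the abstract's statement that transformer layers are also present early, only in smaller numbers.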
-
Patent number: 12390738
Abstract: A framework trains game-engine-like neural models from annotated videos to generate a Learnable Game Engine (LGE) that maintains states of the scene, objects and agents in it, and enables rendering the environment from a controllable viewpoint. The LGE models the logic of the game and the rules of physics, making it possible for the user to play the game by specifying both high- and low-level action sequences. The LGE also unlocks a director's mode where the game is played by plotting behind the scenes, specifying high-level actions and goals for the agents using text-based instructions. To implement the director's mode, a trained diffusion-based animation model navigates the scene using high-level constraints, to enable play against an adversary, and to devise the strategy to win a point. To render the resulting state of the environment and its agents, a compositional neural radiance field (NeRF) representation is used in a synthesis model.
Type: Grant
Filed: March 14, 2023
Date of Patent: August 19, 2025
Assignee: Snap Inc.
Inventors: Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov
-
Publication number: 20250259339
Abstract: Hierarchical patch-wise diffusion models (HPDMs) use a diffusion paradigm that learns a hierarchical distribution of patches instead of whole videos for efficient patch-wise training of diffusion models. To enforce consistency between the patches, deep context fusion may be used to propagate the context information from low-scale to high-scale patches in a hierarchical manner. To accelerate patch-wise training and inference, adaptive computation also may be used to allocate more computational resources and network capacity towards coarse image details and to cheapen synthesis of high-frequency texture details. All the processing stages are jointly trained to provide spatially aligned global context to the higher levels of the cascade. As a result, the model does not operate on the full-resolution inputs, which allows the model to be trained on high-resolution video datasets in an end-to-end fashion.
Type: Application
Filed: February 9, 2024
Publication date: August 14, 2025
Inventors: Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Sergey Tulyakov
-
Publication number: 20250245897
Abstract: A text-to-video framework including a far-reaching interleaved transformer (FIT) block configured to learn a compressed representation of video input using a set of learnable latent tokens. The FIT block includes a diffusion framework and joint spatiotemporal modeling. The FIT block performs patchification of the video input to produce a sequence of patch tokens that are divided into groups. The FIT block instantiates the set of latent tokens and applies a sequence of computational blocks, and projects the patch tokens to generate video frames.
Type: Application
Filed: January 31, 2024
Publication date: July 31, 2025
Inventors: Tsai-Shien Chen, Yuwei Fang, Anil Kag, Willi Menapace, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Sergey Tulyakov
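Illustrative note: the patchification and grouping step mentioned in the abstract above can be sketched at the level of token indices. This is a sketch under stated assumptions, not the method claimed in the application. The function name `patchify_and_group`, the (frames, height, width) layout, the patch size, and the rule of grouping neighbouring patches into fixed-size blocks are all hypothetical, since the abstract does not say how groups are formed. The code handles only patch coordinates, not pixel data.

```python
from itertools import product


def patchify_and_group(video_shape, patch_size, group_size):
    """Map each group index to the patch-token coordinates it contains.

    video_shape: (frames, height, width) of the input video.
    patch_size:  (frames, height, width) covered by one patch token.
    group_size:  (frames, height, width) of one group, counted in patches.
    All three parameters are assumptions made for illustration.
    """
    if any(dim % patch for dim, patch in zip(video_shape, patch_size)):
        raise ValueError("video shape must be divisible by patch size")
    # Number of patch tokens along each axis.
    grid = tuple(dim // patch for dim, patch in zip(video_shape, patch_size))
    if any(g % s for g, s in zip(grid, group_size)):
        raise ValueError("patch grid must be divisible by group size")
    groups = {}
    for coords in product(*(range(n) for n in grid)):
        # Assumed rule: neighbouring patches share a group.
        key = tuple(c // s for c, s in zip(coords, group_size))
        groups.setdefault(key, []).append(coords)
    return groups


# Assumed example: 8 frames of 32x32 pixels, 2x8x8 patches, 2x2x2 groups.
groups = patchify_and_group((8, 32, 32), (2, 8, 8), (2, 2, 2))
print(len(groups), "groups of", len(groups[(0, 0, 0)]), "tokens each")
```

With these assumed sizes the video yields a 4x4x4 grid of 64 patch tokens, split into 8 groups of 8 tokens.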
-
Publication number: 20240307783
Abstract: A framework trains game-engine-like neural models from annotated videos to generate a Learnable Game Engine (LGE) that maintains states of the scene, objects and agents in it, and enables rendering the environment from a controllable viewpoint. The LGE models the logic of the game and the rules of physics, making it possible for the user to play the game by specifying both high- and low-level action sequences. The LGE also unlocks a director's mode where the game is played by plotting behind the scenes, specifying high-level actions and goals for the agents using text-based instructions. To implement the director's mode, a trained diffusion-based animation model navigates the scene using high-level constraints, to enable play against an adversary, and to devise the strategy to win a point. To render the resulting state of the environment and its agents, a compositional neural radiance field (NeRF) representation is used in a synthesis model.
Type: Application
Filed: March 14, 2023
Publication date: September 19, 2024
Inventors: Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov
-
Publication number: 20240221309
Abstract: An environment synthesis framework generates virtual environments from a synthesized two-dimensional (2D) satellite map of a geographic area, a three-dimensional (3D) voxel environment, and a voxel-based neural rendering framework. In an example implementation, the synthesized 2D satellite map is generated by a map synthesis generative adversarial network (GAN) which is trained using sample city datasets. The multi-stage framework lifts the 2D map into a set of 3D octrees, generates an octree-based 3D voxel environment, and then converts it into a texturized 3D virtual environment using a neural rendering GAN and a set of pseudo ground truth images. The resulting 3D virtual environment is texturized, lifelike, editable, traversable in virtual reality (VR) and augmented reality (AR) experiences, and very large in scale.
Type: Application
Filed: December 29, 2022
Publication date: July 4, 2024
Inventors: Menglei Chai, Hsin-Ying Lee, Chieh Lin, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov
-
Publication number: 20240221258
Abstract: Unsupervised volumetric 3D animation (UVA) of non-rigid deformable objects without annotations learns the 3D structure and dynamics of objects solely from single-view red/green/blue (RGB) videos and decomposes the single-view RGB videos into semantically meaningful parts that can be tracked and animated. Using a 3D autodecoder framework, paired with a keypoint estimator via a differentiable perspective-n-point (PnP) algorithm, the UVA model learns the underlying object 3D geometry and parts decomposition in an entirely unsupervised manner from still or video images. This allows the UVA model to perform 3D segmentation, 3D keypoint estimation, novel view synthesis, and animation. The UVA model can obtain animatable 3D objects from a single or a few images. The UVA method also features a space in which all objects are represented in their canonical, animation-ready form. Applications include the creation of lenses from images or videos for social media applications.
Type: Application
Filed: December 28, 2022
Publication date: July 4, 2024
Inventors: Menglei Chai, Hsin-Ying Lee, Willi Menapace, Kyle Olszewski, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Sergey Tulyakov