Patents by Inventor Zhiding Yu
Zhiding Yu has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20250020481
Abstract: Apparatuses, systems, and techniques are presented to make determinations about objects in an environment. In at least one embodiment, a neural network can be used to determine one or more positions of one or more objects within a three-dimensional (3D) environment and to generate a segmented map of the 3D environment based, at least in part, on one or more two-dimensional (2D) images of the one or more objects.
Type: Application
Filed: April 7, 2022
Publication date: January 16, 2025
Inventors: Enze Xie, Zhiding Yu, Jonah Philion, Anima Anandkumar, Sanja Fidler, Jose Manuel Alvarez Lopez
-
Publication number: 20240416963
Abstract: Apparatuses, systems, and techniques for using one or more machine learning processes (e.g., neural network(s)) to predict occupancy using an image input. In at least one embodiment, image data is processed using a neural network to predict occupancy in a 3D voxel space. In at least one embodiment, image data is processed using a neural network to detect objects in a 3D space.
Type: Application
Filed: October 12, 2023
Publication date: December 19, 2024
Inventors: Zhiqi Li, Zhiding Yu, David Austin, Shiyi Lan, Jan Kautz, Jose Manuel Alvarez Lopez
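A minimal PyTorch sketch of the image-to-voxel-occupancy idea described above; the `ToyOccupancyNet` module, the tiny encoder, and the 16x16x4 grid are hypothetical stand-ins, not the patented architecture.

```python
# Hedged sketch: map one camera image to a grid of voxel occupancy logits.
import torch
import torch.nn as nn

class ToyOccupancyNet(nn.Module):
    def __init__(self, voxel_grid=(16, 16, 4)):
        super().__init__()
        self.grid = voxel_grid
        # Tiny convolutional encoder for a single RGB camera image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Project the pooled image feature to one logit per voxel.
        n_voxels = voxel_grid[0] * voxel_grid[1] * voxel_grid[2]
        self.head = nn.Linear(64, n_voxels)

    def forward(self, image):
        feat = self.encoder(image).flatten(1)   # (B, 64)
        logits = self.head(feat)                # (B, X*Y*Z)
        return logits.view(-1, *self.grid)      # (B, X, Y, Z) occupancy logits

occ = ToyOccupancyNet()(torch.randn(1, 3, 224, 224))
print(occ.shape)  # torch.Size([1, 16, 16, 4])
```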
-
Patent number: 12169882
Abstract: Embodiments of the present disclosure relate to learning dense correspondences for images. Systems and methods are disclosed that disentangle structure and texture (or style) representations of GAN-synthesized images by learning a dense pixel-level correspondence map for each image during image synthesis. A canonical coordinate frame is defined and a structure latent code for each generated image is warped to align with the canonical coordinate frame. In sum, the structure associated with the latent code is mapped into a shared coordinate space (canonical coordinate space), thereby establishing correspondences in the shared coordinate space. A correspondence generation system receives the warped coordinate correspondences as an encoded image structure. The encoded image structure and a texture latent code are used to synthesize an image. The shared coordinate space enables propagation of semantic labels from reference images to synthesized images.
Type: Grant
Filed: September 1, 2022
Date of Patent: December 17, 2024
Assignee: NVIDIA Corporation
Inventors: Sifei Liu, Jiteng Mu, Shalini De Mello, Zhiding Yu, Jan Kautz
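The warp-to-canonical-frame step can be pictured with `torch.nn.functional.grid_sample`; in this sketch the coordinate map is a random stand-in for the learned dense pixel-level correspondence map, so it only illustrates the mechanics of the warp.

```python
# Sketch of warping structure features into a shared ("canonical") frame.
import torch
import torch.nn.functional as F

feat = torch.randn(1, 8, 32, 32)   # structure features of one generated image
# Correspondence map: for each canonical pixel, where to sample in the image,
# in normalized [-1, 1] coordinates (stand-in: identity grid plus jitter).
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 32),
                        torch.linspace(-1, 1, 32), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).unsqueeze(0)   # (1, 32, 32, 2)
coords = coords + 0.02 * torch.randn_like(coords)

canonical = F.grid_sample(feat, coords, align_corners=True)
print(canonical.shape)  # torch.Size([1, 8, 32, 32]), now in the shared frame
```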
-
Publication number: 20240386586
Abstract: In various examples, systems and methods are disclosed relating to using neural networks for object detection or instance/semantic segmentation for, without limitation, autonomous or semi-autonomous systems and applications. In some implementations, one or more neural networks receive an image (or other sensor data representation) and a bounding shape corresponding to at least a portion of an object in the image. The bounding shape can include or be labeled with an identifier, class, and/or category of the object. The neural network can determine a mask for the object based at least on processing the image and the bounding shape. The mask can be used for various applications, such as annotating masks for vehicle or machine perception and navigation processes.
Type: Application
Filed: May 19, 2023
Publication date: November 21, 2024
Applicant: NVIDIA Corporation
Inventors: Alperen Degirmenci, Jiwoong Choi, Zhiding Yu, Ke Chen, Shubhranshu Singh, Yashar Asgarieh, Subhashree Radhakrishnan, James Skinner, Jose Manuel Alvarez Lopez
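One common way to condition a network on a bounding shape is to rasterize it as an extra input channel; the sketch below assumes that design, and the toy `BoxPromptedMaskNet` is illustrative rather than the disclosed model.

```python
# Hedged sketch of box-prompted mask prediction: the bounding shape becomes
# a fourth input channel and a small FCN predicts per-pixel mask logits.
import torch
import torch.nn as nn

class BoxPromptedMaskNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),   # RGB + box channel
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),                         # per-pixel mask logit
        )

    def forward(self, image, box):
        # box = (x0, y0, x1, y1) in pixel coordinates.
        b, _, h, w = image.shape
        prompt = torch.zeros(b, 1, h, w, device=image.device)
        x0, y0, x1, y1 = box
        prompt[:, :, y0:y1, x0:x1] = 1.0
        return self.net(torch.cat([image, prompt], dim=1))

mask_logits = BoxPromptedMaskNet()(torch.randn(1, 3, 64, 64), (10, 12, 40, 50))
print(mask_logits.shape)  # torch.Size([1, 1, 64, 64])
```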
-
Publication number: 20240378799
Abstract: In various examples, bi-directional projection techniques may be used to generate enhanced Bird's-Eye View (BEV) representations. For example, a system(s) may generate one or more BEV features associated with a BEV of an environment using a projection process that associates 2D image features to one or more first locations of a 3D space. At least partially using the BEV feature(s), the system(s) may determine one or more second locations of the 3D space that correspond to one or more regions of interest in the environment. The system(s) may then generate one or more additional BEV features corresponding to the second location(s) using a different projection process that associates the second location(s) from the 3D space to at least a portion of the 2D image features. The system(s) may then generate an updated BEV of the environment based at least on the BEV feature(s) and/or the additional BEV feature(s).
Type: Application
Filed: April 22, 2024
Publication date: November 14, 2024
Inventors: Zhiqi Li, Zhiding Yu, Animashree Anandkumar, Jose Manuel Alvarez Lopez
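A toy rendition of the two projection passes: a lookup table stands in for the camera geometry, a first pass lifts image features into BEV cells, and a second pass re-queries image features only at salient cells. The `image_to_bev` helper and the index math are fabricated for illustration.

```python
# Toy bi-directional 2D <-> BEV projection under a fabricated camera mapping.
import torch

def image_to_bev(img_feat, uv_index):
    """Lift 2D features to BEV cells via a precomputed (cell -> pixel) lookup."""
    b, c, h, w = img_feat.shape
    flat = img_feat.flatten(2)                      # (B, C, H*W)
    idx = uv_index.view(1, 1, -1).expand(b, c, -1)  # (B, C, n_cells)
    return flat.gather(2, idx)                      # (B, C, n_cells)

B, C, H, W, N = 1, 8, 32, 32, 64                    # 8x8 BEV grid, flattened
img_feat = torch.randn(B, C, H, W)
uv_index = torch.randint(0, H * W, (N,))            # stand-in camera geometry

bev = image_to_bev(img_feat, uv_index)              # first (image -> BEV) pass
roi = bev.norm(dim=1).topk(8, dim=1).indices        # pick salient BEV cells
# Second (BEV -> image) pass: re-query image features only at the ROI cells.
bev_refined = image_to_bev(img_feat, uv_index[roi[0]])
print(bev.shape, bev_refined.shape)
```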
-
Publication number: 20240312219
Abstract: In various examples, temporal-based perception for autonomous or semi-autonomous systems and applications is described. Systems and methods are disclosed that use a machine learning model (MLM) to intrinsically fuse feature maps associated with different sensors and different instances in time. To generate a feature map, image data generated using image sensors (e.g., cameras) located around a vehicle is processed using an MLM that is trained to generate the feature map. The MLM may then fuse the feature maps in order to generate a final feature map associated with a current instance in time. The feature maps associated with the previous instances in time may be preprocessed using one or more layers of the MLM, where the one or more layers are associated with performing temporal transformation before the fusion is performed. The MLM may then use the final feature map to generate one or more outputs.
Type: Application
Filed: March 16, 2023
Publication date: September 19, 2024
Inventors: Jiwoong Choi, Jose Manuel Alvarez Lopez, Shiyi Lan, Yashar Asgarieh, Zhiding Yu
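The fuse-after-temporal-transformation flow might be sketched as follows; the `TemporalFuser` module, the per-step convolutions standing in for the temporal transformation layers, and all sizes are assumptions.

```python
# Illustrative temporal fusion: past feature maps pass through a "temporal
# transformation" layer each, then a 1x1 conv fuses them with the current map.
import torch
import torch.nn as nn

class TemporalFuser(nn.Module):
    def __init__(self, channels=16, history=2):
        super().__init__()
        # One transformation layer per past time step (ego-motion stand-in).
        self.temporal = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1)
                                      for _ in range(history))
        self.fuse = nn.Conv2d(channels * (history + 1), channels, 1)

    def forward(self, current, past):
        aligned = [layer(p) for layer, p in zip(self.temporal, past)]
        return self.fuse(torch.cat([current] + aligned, dim=1))

f = TemporalFuser()
out = f(torch.randn(1, 16, 50, 50), [torch.randn(1, 16, 50, 50) for _ in range(2)])
print(out.shape)  # torch.Size([1, 16, 50, 50]), the fused current-time map
```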
-
Publication number: 20240265690
Abstract: A vision-language model learns skills and domain knowledge via distinct and separate task-specific neural networks, referred to as experts. Each expert is independently optimized for a specific task, facilitating the use of domain-specific data and architectures that are not feasible with a single large neural network trained for multiple tasks. The vision-language model is implemented as an ensemble of pre-trained experts and is trained more efficiently than a single large neural network. During training, the vision-language model integrates specialized skills and domain knowledge, rather than trying to simultaneously learn multiple tasks, resulting in effective multi-modal learning.
Type: Application
Filed: December 19, 2023
Publication date: August 8, 2024
Inventors: Animashree Anandkumar, Linxi Fan, Zhiding Yu, Chaowei Xiao, Shikun Liu
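A hedged sketch of the ensemble-of-frozen-experts idea: tiny linear layers stand in for the pre-trained experts, and only a small gating head is trained. `ExpertEnsemble` and every dimension here are hypothetical.

```python
# Frozen task-specific "experts" produce features; a trainable gate mixes them.
import torch
import torch.nn as nn

class ExpertEnsemble(nn.Module):
    def __init__(self, experts, dim=32):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        for p in self.experts.parameters():
            p.requires_grad = False                # experts stay frozen
        self.gate = nn.Linear(dim * len(experts), len(experts))
        self.head = nn.Linear(dim, 10)             # e.g., an answer classifier

    def forward(self, x):
        feats = [e(x) for e in self.experts]       # each (B, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, dim=1)), dim=1)
        mixed = sum(w.unsqueeze(1) * f for w, f in zip(weights.T, feats))
        return self.head(mixed)

experts = [nn.Linear(64, 32) for _ in range(3)]    # stand-ins for vision/language experts
print(ExpertEnsemble(experts)(torch.randn(2, 64)).shape)  # torch.Size([2, 10])
```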
-
Publication number: 20240249538
Abstract: 3D object detection is a computer vision task that generally detects (e.g., classifies and localizes) objects in 3D space from the 2D images or videos that capture the objects. Current techniques used for 3D object detection rely on machine learning processes that learn to detect 3D objects from existing images annotated with high-quality 3D information, including depth information generally obtained using lidar technology. However, due to lidar's limited measurable range, current machine learning solutions to 3D object detection do not support detection of 3D objects beyond the lidar range, which is needed for numerous applications, including autonomous driving applications where existing close- or mid-range 3D object detection does not always meet the safety-critical requirements of autonomous driving. The present disclosure provides for 3D object detection using a technique that supports long-range detection (i.e., detection beyond the lidar range).
Type: Application
Filed: July 18, 2023
Publication date: July 25, 2024
Inventors: Zetong Yang, Zhiding Yu, Ren Hao Wang, Chris Choy, Anima Anandkumar, Jose M. Alvarez Lopez
-
Publication number: 20240221166
Abstract: Video instance segmentation is a computer vision task that aims to detect, segment, and track objects continuously in videos. It can be used in numerous real-world applications, such as video editing, three-dimensional (3D) reconstruction, 3D navigation (e.g., for autonomous driving and/or robotics), and viewpoint estimation. However, current machine learning-based processes employed for video instance segmentation are lacking, particularly because the densely annotated videos needed for supervised training of high-quality models are not readily available and are not easily generated. To address the issues in the prior art, the present disclosure provides point-level supervision for video instance segmentation in a manner that allows the resulting machine learning model to handle any object category.
Type: Application
Filed: December 22, 2023
Publication date: July 4, 2024
Inventors: Zhiding Yu, Shuaiyi Huang, De-An Huang, Shiyi Lan, Subhashree Radhakrishnan, Jose M. Alvarez Lopez, Anima Anandkumar
-
Publication number: 20240169545
Abstract: Class-agnostic object mask generation uses a vision transformer-based auto-labeling framework requiring only images and object bounding boxes to generate object (segmentation) masks. The generated object masks, images, and object labels may then be used to train instance segmentation models or other neural networks to localize and segment objects with pixel-level accuracy. The generated object masks may supplement or replace conventional human-generated annotations. The human-generated annotations may be misaligned with the object boundaries, resulting in poor-quality labeled segmentation masks. In contrast with conventional techniques, the generated object masks are class-agnostic and are automatically generated based only on a bounding box image region without relying on either labels or semantic information.
Type: Application
Filed: July 20, 2023
Publication date: May 23, 2024
Inventors: Shiyi Lan, Zhiding Yu, Subhashree Radhakrishnan, Jose Manuel Alvarez Lopez, Animashree Anandkumar
-
Patent number: 11960570
Abstract: A multi-level contrastive training strategy for training a neural network relies on image pairs (no other labels) to learn semantic correspondences at the image level and region or pixel level. The neural network is trained using contrasting image pairs including different objects and corresponding image pairs including different views of the same object. Conceptually, contrastive training pulls corresponding image pairs closer and pushes contrasting image pairs apart. An image-level contrastive loss is computed from the outputs (predictions) of the neural network and used to update parameters (weights) of the neural network via backpropagation. The neural network is also trained via pixel-level contrastive learning using only image pairs. Pixel-level contrastive learning receives an image pair, where each image includes an object in a particular category.
Type: Grant
Filed: August 25, 2021
Date of Patent: April 16, 2024
Assignee: NVIDIA Corporation
Inventors: Taihong Xiao, Sifei Liu, Shalini De Mello, Zhiding Yu, Jan Kautz
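The image-level "pull pairs together, push others apart" objective is commonly realized as an InfoNCE-style loss; the sketch below shows that generic form, not the patent's exact multi-level formulation.

```python
# Minimal InfoNCE-style contrastive loss over a batch of embedding pairs:
# each anchor's positive is at the matching batch index; all others repel.
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.T / temperature         # (B, B) similarity matrix
    targets = torch.arange(a.size(0))      # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```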
-
Publication number: 20240104842
Abstract: A method for generating, by an encoder-based model, a three-dimensional (3D) representation of a two-dimensional (2D) image is provided. The encoder-based model is trained to infer the 3D representation using a synthetic training data set generated by a pre-trained model. The pre-trained model is a 3D generative model that produces a 3D representation and a corresponding 2D rendering, which can be used to train a separate encoder-based model for downstream tasks like estimating a triplane representation, neural radiance field, mesh, depth map, 3D keypoints, or the like, given a single input image, using the pseudo ground truth 3D synthetic training data set. In a particular embodiment, the encoder-based model is trained to predict a triplane representation of the input image, which can then be rendered by a volume renderer according to pose information to generate an output image of the 3D scene from the corresponding viewpoint.
Type: Application
Filed: September 22, 2023
Publication date: March 28, 2024
Inventors: Koki Nagano, Alexander Trevithick, Chao Liu, Eric Ryan Chan, Sameh Khamis, Michael Stengel, Zhiding Yu
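The training setup can be pictured as distillation from a frozen generator: it synthesizes (3D representation, 2D rendering) pairs, and the encoder learns to regress the representation from the rendering alone. The linear `generator` and `renderer` below are toy stand-ins for the 3D generative model and volume renderer.

```python
# Sketch of distilling a frozen 3D generative model into an image encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

generator = nn.Linear(64, 128)   # stand-in: latent -> 3D representation
renderer = nn.Linear(128, 96)    # stand-in: 3D representation -> 2D "image"
encoder = nn.Sequential(nn.Linear(96, 256), nn.ReLU(), nn.Linear(256, 128))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for step in range(200):
    with torch.no_grad():                   # the pre-trained model stays frozen
        z = torch.randn(32, 64)
        rep = generator(z)                  # pseudo ground-truth 3D representation
        image = renderer(rep)               # corresponding 2D rendering
    loss = F.mse_loss(encoder(image), rep)  # encoder: image -> 3D representation
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final distillation loss: {loss.item():.4f}")
```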
-
Publication number: 20240095534
Abstract: Apparatuses, systems, and techniques to select outputs of neural networks. In at least one embodiment, a most consistent output of one or more pre-trained neural networks is to be selected. In at least one embodiment, a most consistent output of one or more pre-trained neural networks is to be selected based, at least in part, on a plurality of variances of one or more inputs to the one or more neural networks.
Type: Application
Filed: September 7, 2023
Publication date: March 21, 2024
Inventors: Anima Anandkumar, Chaowei Xiao, Weili Nie, De-An Huang, Zhiding Yu, Manli Shu
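A rough sketch of variance-based selection: each candidate network is run on several perturbed copies of the input, and the one whose predictions vary least is chosen. The Gaussian perturbation and the `most_consistent` helper are assumptions for illustration.

```python
# Consistency-based selection among pre-trained models (toy stand-ins).
import torch
import torch.nn as nn

def most_consistent(models, x, n_perturb=8, noise=0.05):
    variances = []
    for m in models:
        preds = torch.stack([m(x + noise * torch.randn_like(x)).softmax(dim=1)
                             for _ in range(n_perturb)])
        variances.append(preds.var(dim=0).mean().item())  # lower = more consistent
    best = min(range(len(models)), key=variances.__getitem__)
    return models[best](x), best

models = [nn.Linear(16, 4) for _ in range(3)]   # pretend pre-trained networks
out, chosen = most_consistent(models, torch.randn(2, 16))
print(chosen, out.shape)
```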
-
Publication number: 20240087222
Abstract: An artificial intelligence framework is described that incorporates a number of neural networks and a number of transformers for converting a two-dimensional image into three-dimensional semantic information. Neural networks convert one or more images into a set of image feature maps, depth information associated with the one or more images, and query proposals based on the depth information. A first transformer implements a cross-attention mechanism to process the set of image feature maps in accordance with the query proposals. The output of the first transformer is combined with a mask token to generate initial voxel features of the scene. A second transformer implements a self-attention mechanism to convert the initial voxel features into refined voxel features, which are up-sampled and processed by a lightweight neural network to generate the three-dimensional semantic information, which may be used by, e.g., an autonomous vehicle for various advanced driver assistance system (ADAS) functions.
Type: Application
Filed: November 20, 2023
Publication date: March 14, 2024
Inventors: Yiming Li, Zhiding Yu, Christopher B. Choy, Chaowei Xiao, Jose Manuel Alvarez Lopez, Sanja Fidler, Animashree Anandkumar
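The two-stage attention flow maps naturally onto `torch.nn.MultiheadAttention`; below, random tensors stand in for the image feature maps, the depth-derived query proposals, and the mask tokens, so only the dataflow is faithful.

```python
# Toy two-stage attention: cross-attention with depth-derived queries,
# then self-attention over the combined voxel features.
import torch
import torch.nn as nn

d = 32
cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
self_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

img_feats = torch.randn(1, 196, d)    # flattened image feature map
queries = torch.randn(1, 50, d)       # stand-in for depth-based query proposals
mask_token = torch.zeros(1, 14, d)    # tokens for unobserved voxels

voxels, _ = cross_attn(queries, img_feats, img_feats)  # stage 1: cross-attention
voxels = torch.cat([voxels, mask_token], dim=1)        # combine with mask tokens
refined, _ = self_attn(voxels, voxels, voxels)         # stage 2: self-attention
print(refined.shape)  # torch.Size([1, 64, 32]), refined voxel features
```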
-
Publication number: 20240078423
Abstract: A vision transformer (ViT) is a deep learning model that performs one or more vision processing tasks. ViTs may be modified to include a global task that clusters images with the same concept together to produce semantically consistent relational representations, as well as a local task that guides the ViT to discover object-centric semantic correspondence across images. A database of concepts and associated features may be created and used to train the global and local tasks, which may then enable the ViT to perform visual relational reasoning faster, without supervision, and outside of a synthetic domain.
Type: Application
Filed: August 22, 2022
Publication date: March 7, 2024
Inventors: Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Anima Anandkumar
-
Publication number: 20240062534
Abstract: A vision transformer (ViT) is a deep learning model that performs one or more vision processing tasks. ViTs may be modified to include a global task that clusters images with the same concept together to produce semantically consistent relational representations, as well as a local task that guides the ViT to discover object-centric semantic correspondence across images. A database of concepts and associated features may be created and used to train the global and local tasks, which may then enable the ViT to perform visual relational reasoning faster, without supervision, and outside of a synthetic domain.
Type: Application
Filed: August 22, 2022
Publication date: February 22, 2024
Inventors: Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Anima Anandkumar
-
Patent number: 11899749
Abstract: In various examples, training methods are described to generate a trained neural network that is robust to various environmental features. In an embodiment, training includes modifying images of a dataset and generating bounding boxes and/or other segmentation information for the modified images, which is used to train a neural network.
Type: Grant
Filed: March 15, 2021
Date of Patent: February 13, 2024
Assignee: NVIDIA Corporation
Inventors: Subhashree Radhakrishnan, Partha Sriram, Farzin Aghdasi, Seunghwan Cha, Zhiding Yu
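Regenerating labels for a modified image can be as simple as transforming the boxes with the same geometry; this sketch uses a horizontal flip, and the choice of transform is purely illustrative.

```python
# Sketch of regenerating bounding-box labels after an image modification.
import torch

def hflip_with_boxes(image, boxes):
    """image: (C, H, W); boxes: (N, 4) as (x0, y0, x1, y1) in pixels."""
    _, _, w = image.shape
    flipped = torch.flip(image, dims=[2])          # mirror left-right
    x0, y0, x1, y1 = boxes.unbind(dim=1)
    new_boxes = torch.stack([w - x1, y0, w - x0, y1], dim=1)
    return flipped, new_boxes

img, boxes = torch.randn(3, 64, 64), torch.tensor([[10., 5., 30., 40.]])
aug_img, aug_boxes = hflip_with_boxes(img, boxes)
print(aug_boxes)  # tensor([[34., 5., 54., 40.]])
```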
-
Publication number: 20240037756
Abstract: Apparatuses, systems, and techniques to track one or more objects in one or more frames of a video. In at least one embodiment, one or more objects in one or more frames of a video are tracked based on, for example, one or more sets of embeddings.
Type: Application
Filed: May 5, 2023
Publication date: February 1, 2024
Inventors: De-An Huang, Zhiding Yu, Anima Anandkumar
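Embedding-based tracking can be illustrated with a greedy cosine-similarity association between consecutive frames; the `associate` helper and the 0.5 threshold are hypothetical.

```python
# Toy embedding-based association across two frames (greedy, for brevity).
import torch
import torch.nn.functional as F

def associate(prev_emb, curr_emb, threshold=0.5):
    sim = F.normalize(prev_emb, dim=1) @ F.normalize(curr_emb, dim=1).T
    matches = []
    for i in range(sim.size(0)):
        j = sim[i].argmax().item()
        if sim[i, j] > threshold:
            matches.append((i, j))   # track i continues as detection j
    return matches

prev = torch.randn(3, 64)            # embeddings of 3 tracked objects
curr = torch.cat([prev[1:] + 0.01 * torch.randn(2, 64), torch.randn(1, 64)])
print(associate(prev, curr))         # typically [(1, 0), (2, 1)]
```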
-
Publication number: 20240013504
Abstract: One embodiment of a method for training a machine learning model includes receiving a training data set that includes at least one image, text referring to at least one object included in the at least one image, and at least one bounding box annotation associated with the at least one object, and performing, based on the training data set, one or more operations to generate a trained machine learning model to segment images based on text, where the one or more operations to generate the trained machine learning model include minimizing a loss function that comprises at least one of a multiple instance learning loss term or an energy loss term.
Type: Application
Filed: October 31, 2022
Publication date: January 11, 2024
Inventors: Zhiding Yu, Boyi Li, Chaowei Xiao, De-An Huang, Weili Nie, Linxi Fan, Anima Anandkumar
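A multiple-instance-learning term for box supervision is often written so that every row and column inside the box must contain at least one positive pixel while everything outside is negative; the `mil_box_loss` sketch below follows that common pattern and is not taken from the patent.

```python
# Sketch of an MIL-style loss for box-supervised mask training (simplified).
import torch
import torch.nn.functional as F

def mil_box_loss(mask_logits, box):
    """mask_logits: (H, W); box: (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = box
    prob = mask_logits.sigmoid()
    inside = prob[y0:y1, x0:x1]
    # Each row/column inside the box is a positive "bag" (contains the object).
    pos = torch.cat([inside.max(dim=0).values, inside.max(dim=1).values])
    loss_pos = F.binary_cross_entropy(pos, torch.ones_like(pos))
    # Everything outside the box is negative (zero weight inside the box).
    neg_mask = torch.ones_like(prob)
    neg_mask[y0:y1, x0:x1] = 0
    loss_neg = F.binary_cross_entropy(prob, torch.zeros_like(prob), weight=neg_mask)
    return loss_pos + loss_neg

print(mil_box_loss(torch.randn(32, 32), (8, 8, 24, 24)).item())
```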
-
Publication number: 20230385687
Abstract: Approaches for training data set size estimation for machine learning model systems and applications are described. Examples include a machine learning model training system that estimates target data requirements for training a machine learning model, given an approximate relationship between training data set size and model performance using one or more validation score estimation functions. To derive a validation score estimation function, a regression data set is generated from training data, and subsets of the regression data set are used to train the machine learning model. A validation score is computed for the subsets and used to compute regression function parameters to curve fit the selected regression function to the training data set. The validation score estimation function is then solved to provide an estimate of the number of additional training samples needed for the validation score estimation function to meet or exceed a target validation score.
Type: Application
Filed: May 31, 2022
Publication date: November 30, 2023
Inventors: Rafid Reza Mahmood, James Robert Lucas, David Jesus Acuna Marrero, Daiqing Li, Jonah Philion, Jose Manuel Alvarez Lopez, Zhiding Yu, Sanja Fidler, Marc Law
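The curve-fitting step can be illustrated with SciPy: fit a saturating power law to (subset size, validation score) pairs, then invert it at the target score. The functional form v(n) = a - b * n^(-c) and the synthetic data below are assumptions, not the patent's chosen regression function.

```python
# Sketch of estimating the training set size needed for a target score.
import numpy as np
from scipy.optimize import curve_fit

def v_est(n, a, b, c):
    # Saturating power law: score approaches ceiling a as n grows.
    return a - b * np.power(n, -c)

sizes = np.array([500, 1000, 2000, 4000, 8000], dtype=float)
scores = np.array([0.62, 0.70, 0.76, 0.80, 0.83])   # synthetic validation scores
(a, b, c), _ = curve_fit(v_est, sizes, scores, p0=(0.9, 5.0, 0.5), maxfev=10000)

target = 0.85
# Invert v(n) = target  =>  n = (b / (a - target)) ** (1 / c)
if a > target:
    n_needed = (b / (a - target)) ** (1.0 / c)
    print(f"estimated samples for score {target}: {n_needed:.0f} "
          f"({n_needed - sizes[-1]:.0f} more than the largest subset)")
else:
    print("fitted score ceiling is below the target; more data will not suffice")
```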