Patents by Inventor Junnan LI

Junnan LI has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 12657400
    Abstract: Embodiments described herein provide a method of generating a vision-language task output to a text instruction relating to an input image, the method comprising receiving, via a data interface, the input image and the text instruction comprising an instruction relating to the image. The method further includes encoding, via an image encoder, the image into a first image representation. The method further includes generating, by a multimodal encoder, a second image representation based on cross-attending the first image representation to the text instruction. The method further includes generating, by a neural network based language model, a vision-language task output in response to the text instruction based on an input combining the second image representation and the text instruction.
    Type: Grant
    Filed: November 9, 2023
    Date of Patent: June 16, 2026
    Assignee: Salesforce, Inc.
    Inventors: Wenliang Dai, Junnan Li, Chu Hong Hoi, Dongxu Li
  • Publication number: 20260154879
    Abstract: Embodiments described herein provide systems and methods of subject-driven image generation. In at least one embodiment, a system receives, via a data interface, an image containing a subject, a text description of the subject in the image, and a text prompt relating to a different rendition of the subject. The system encodes, via an image encoder, the image into an image feature vector. The system encodes, via a text encoder, the text description into a text feature vector. The system generates, by a multimodal encoder, a vector representation of the subject based on the image feature vector and the text feature vector. The system generates, by a neural network based image generation model, an output image based on an input combining the text prompt and the vector representation.
    Type: Application
    Filed: January 23, 2026
    Publication date: June 4, 2026
    Inventors: Junnan Li, Chu Hong Hoi, Dongxu Li
  • Patent number: 12566823
    Abstract: An interpolative centroid contrastive learning (ICCL) framework is disclosed for learning a more discriminative representation for tail classes. Specifically, data samples, such as natural images, are projected into a low-dimensional embedding space, and class centroids for respective classes are created as average embeddings of samples that belong to a respective class. Virtual training samples are then created by interpolating two images from two samplers: a class-agnostic sampler which returns all images from both the head class and the tail class with an equal probability, and a class-aware sampler which focuses more on tail-class images by sampling images from the tail class with a higher probability compared to images from the head class. The sampled images, e.g., images from the class-agnostic sampler and images from the class-aware sampler may be interpolated to generate interpolated images.
    Type: Grant
    Filed: March 1, 2021
    Date of Patent: March 3, 2026
    Assignee: Salesforce, Inc.
    Inventors: Anthony Meng Huat Tiong, Junnan Li, Chu Hong Hoi
  • Patent number: 12536725
    Abstract: Embodiments described herein provide systems and methods of subject-driven image generation. In at least one embodiment, a system receives, via a data interface, an image containing a subject, a text description of the subject in the image, and a text prompt relating to a different rendition of the subject. The system encodes, via an image encoder, the image into an image feature vector. The system encodes, via a text encoder, the text description into a text feature vector. The system generates, by a multimodal encoder, a vector representation of the subject based on the image feature vector and the text feature vector. The system generates, by a neural network based image generation model, an output image based on an input combining the text prompt and the vector representation.
    Type: Grant
    Filed: October 31, 2023
    Date of Patent: January 27, 2026
    Assignee: Salesforce, Inc.
    Inventors: Junnan Li, Chu Hong Hoi, Dongxu Li
  • Patent number: 12506970
    Abstract: The present invention provides an image processing apparatus (10) including a detection unit (12) that detects a plurality of predetermined points of a body of each of a plurality of persons from an image in an image circle of a fisheye image, a gravity direction determination unit (13) that determines a gravity direction in a position of each of the plurality of persons from the plurality of predetermined points, a reference point decision unit (14) that decides a reference point, based on the gravity direction in the position of each of the plurality of persons, a complementary circular image generation unit (16) that generates a complementary circular image that is a circular image acquired by adding a complementary image to the image in the image circle of the fisheye image, and that has, as a center, the reference point different from a center of the image in the image circle, and an expansion unit (17) that panoramically expands the complementary circular image, based on the reference point, and generat
    Type: Grant
    Filed: July 30, 2024
    Date of Patent: December 23, 2025
    Assignee: NEC CORPORATION
    Inventors: Jianquan Liu, Junnan Li
  • Patent number: 12468952
    Abstract: Embodiments described herein provide systems and methods for noise-robust contrastive learning. In view of the need for a noise-robust learning system, embodiments described herein provide a contrastive learning mechanism that combats noise by learning robust representations of the noisy data samples. Specifically, the training images are projected into a low-dimensional subspace, and the geometric structure of the subspace is regularized with: (1) a consistency contrastive loss that enforces images with perturbations to have similar embeddings; and (2) a prototypical contrastive loss augmented with a predetermined learning principle, which encourages the embedding for a linearly-interpolated input to have the same linear relationship with respect to the class prototypes. The low-dimensional embeddings are also trained to reconstruct the high-dimensional features, which preserves the learned information and regularizes the classifier.
    Type: Grant
    Filed: September 9, 2020
    Date of Patent: November 11, 2025
    Assignee: Salesforce, Inc.
    Inventors: Junnan Li, Chu Hong Hoi
  • Patent number: 12462592
    Abstract: Embodiments described herein provide a multimodal vision-language model. The multimodal vision-language model contains a Generalist Multimodal Transformer capable of completing multiple tasks using the same set of parameters learned from pre-training. The Generalist Multimodal Transformer allows alignment between frozen, unimodal encoders, such as image encoders and large language models. The Generalist Multimodal Transformer eliminates the need for fine-tuning the image encoders and large language models.
    Type: Grant
    Filed: January 27, 2023
    Date of Patent: November 4, 2025
    Assignee: Salesforce, Inc.
    Inventors: Junnan Li, Chu Hong Hoi
  • Patent number: 12450428
    Abstract: Embodiments described herein provide a contrastive learning framework that leverages hard negative examples that are mined globally from the entire training corpus for a given query to improve the quality of code and natural language representations. Specifically, similar examples from the training corpus are extracted and used as hard negatives in an online manner during training while keeping the minibatch construction random.
    Type: Grant
    Filed: November 19, 2021
    Date of Patent: October 21, 2025
    Assignee: Salesforce, Inc.
    Inventors: Akhilesh Deepak Gotmare, Junnan Li, Shafiq Rayhan Joty, Chu Hong Hoi
  • Patent number: 12430849
    Abstract: A method of training a neural network based three-dimensional (3D) encoder is provided. A first plurality of samples of a training dataset are generated using a first 3D model. An image generator with multi-view rendering is used to generate a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model. A first language model is used to generate a plurality of texts corresponding to the plurality of 2D images respectively. A first text for a first image is generated by using one or more text descriptions generated by the first language model. A point cloud is generated by randomly sampling points in the 3D model. The first plurality of samples are generated using the plurality of 2D images, the corresponding plurality of texts, and the point cloud. The neural network based 3D encoder is trained using the training dataset including the first plurality of samples.
    Type: Grant
    Filed: October 24, 2023
    Date of Patent: September 30, 2025
    Assignee: Salesforce, Inc.
    Inventors: Le Xue, Ning Yu, Shu Zhang, Junnan Li, Caiming Xiong, Silvio Savarese, Juan Carlos Niebles Duque, Ran Xu
  • Patent number: 12400068
    Abstract: Embodiments are directed to translating a natural language query into a code snippet in a programming language that semantically represents the query. The embodiments include a cascading neural network that includes an encoder network and a classifier network. The encoder network is faster but less accurate than the classifier network. The encoder network is trained using a contrastive learning framework to identify code candidates from a large set of code snippets. The classifier network is trained using a binary classifier to identify the code snippet that semantically represents the query from the code candidates.
    Type: Grant
    Filed: January 28, 2022
    Date of Patent: August 26, 2025
    Assignee: Salesforce, Inc.
    Inventors: Akhilesh Deepak Gotmare, Junnan Li, Chu Hong Hoi
  • Publication number: 20250245973
    Abstract: Embodiments described herein provide bootstrapping language-image pre-training for unified vision-language understanding and generation (BLIP), a unified vision-language pre-training (VLP) framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP enables a wider range of downstream tasks, improving on both shortcomings of existing models.
    Type: Application
    Filed: April 18, 2025
    Publication date: July 31, 2025
    Inventors: Junnan Li, Chu Hong Hoi
  • Patent number: 12374099
    Abstract: Embodiments described herein provide a zero-shot visual question answering (VQA) framework, which conjoins foundation network models with zero additional training. A first image and a question relating to the first image are received. The first image is divided into a plurality of image patches. A plurality of relevant image patches that are relevant to the question are determined, using a first neural network model, from the plurality of image patches. A plurality of image captions are generated, using a second neural network model, based on the plurality of relevant image patches. An answer to the question is generated based on the plurality of image captions.
    Type: Grant
    Filed: September 23, 2022
    Date of Patent: July 29, 2025
    Assignee: Salesforce, Inc.
    Inventors: Anthony Meng Huat Tiong, Junnan Li, Chu Hong Hoi
  • Patent number: 12354013
    Abstract: Embodiments described herein provide masked self-training (MaST), an unsupervised learning approach leveraging two complementary sources of supervision: pseudo-labels and raw image pixels. Specifically, MaST jointly optimizes three objectives to finetune a pre-trained classification model on unlabeled images: (1) a self-training objective to learn global task-specific class prediction; (2) a masked image modeling objective to learn local pixel-level information; and (3) a global-local feature alignment objective to bridge the knowledge learned from the two sources of supervision.
    Type: Grant
    Filed: May 27, 2022
    Date of Patent: July 8, 2025
    Assignee: Salesforce, Inc.
    Inventors: Junnan Li, Chu Hong Hoi
  • Patent number: 12314861
    Abstract: Embodiments described herein provide an approach (referred to as the “Co-training” mechanism throughout this disclosure) that jointly learns two representations of the training data: their class probabilities and low-dimensional embeddings. Specifically, two representations of each image sample are generated: a class probability produced by the classification head and a low-dimensional embedding produced by the projection head. The classification head is trained using memory-smoothed pseudo-labels, where pseudo-labels are smoothed by aggregating information from nearby samples in the embedding space. The projection head is trained using contrastive learning on a pseudo-label graph, where samples with similar pseudo-labels are encouraged to have similar embeddings.
    Type: Grant
    Filed: January 28, 2021
    Date of Patent: May 27, 2025
    Assignee: Salesforce, Inc.
    Inventors: Junnan Li, Chu Hong Hoi
  • Patent number: 12299961
    Abstract: Embodiments described herein provide systems, methods, and devices for pre-training a multimodal encoder-decoder (MED) model for vision-language tasks. A method may include encoding, by an image encoder of the MED, an image into an image representation; encoding, by a text encoder of the MED, a text into a text representation; generating, by an image-grounded text encoder of the MED, a multimodal representation based on the image representation and the text; generating, by an image-grounded text decoder of the MED, a predicted text based on the image representation and the text; generating, through an image-text matching (ITM) head, a binary classification indicating whether the image and the text are a match; computing a first loss, a second loss (the ITM loss), and a third loss based on the image representation, text representation, binary classification, predicted text and text; and jointly updating the MED based on the first loss, the second loss and the third loss.
    Type: Grant
    Filed: May 16, 2022
    Date of Patent: May 13, 2025
    Assignee: Salesforce, Inc.
    Inventors: Junnan Li, Chu Hong Hoi
  • Patent number: 12288380
    Abstract: Embodiments described herein provide systems, methods, and devices for generating enhanced vision-language training data. A method may include: receiving, from a communication interface, a first training dataset of image-text pairs and a second training dataset of annotated image-text pairs; fine-tuning an image-grounded text decoder and an image-grounded text encoder using the second training dataset of annotated image-text pairs; generating, by the fine-tuned image-grounded text decoder, a predicted text based on a training image from the first training dataset; generating, by the fine-tuned image-grounded text encoder, a filtering decision based on the training image and the predicted text; adding the training image and the predicted text to form a third training dataset of image-text pairs depending on the filtering decision; and training a vision-language model using the third training dataset of image-text pairs.
    Type: Grant
    Filed: May 16, 2022
    Date of Patent: April 29, 2025
    Assignee: Salesforce, Inc.
    Inventors: Junnan Li, Chu Hong Hoi
  • Patent number: 12271792
    Abstract: Embodiments described herein provide visual-and-language (V+L) systems and methods for learning vision and language representations. Specifically, a method may comprise receiving a training dataset comprising a plurality of image samples and a plurality of text samples; encoding the plurality of image samples into a plurality of encoded image samples and the plurality of text samples into a plurality of encoded text samples; computing a first loss objective based on the plurality of encoded image samples and the plurality of encoded text samples; encoding a first subset of the plurality of encoded image samples and a second subset of the plurality of encoded text samples into a plurality of encoded image-text samples; computing a second loss objective based on the plurality of encoded image-text samples; and updating the V+L model based at least in part on the first loss objective and the second loss objective.
    Type: Grant
    Filed: July 8, 2021
    Date of Patent: April 8, 2025
    Assignee: Salesforce, Inc.
    Inventors: Junnan Li, Chu Hong Hoi
  • Patent number: 12210976
    Abstract: Embodiments described herein provide systems and methods for learning representation from unlabeled videos. Specifically, a method may comprise generating a set of strongly-augmented samples and a set of weakly-augmented samples from the unlabeled video samples; generating a set of predictive logits by inputting the set of strongly-augmented samples into a student model and a first teacher model; generating a set of artificial labels by inputting the set of weakly-augmented samples to a second teacher model that operates in parallel to the first teacher model, wherein the second teacher model shares one or more model parameters with the first teacher model; computing a loss objective based on the set of predictive logits and the set of artificial labels; updating student model parameters based on the loss objective via backpropagation; and updating the shared parameters for the first teacher model and the second teacher model based on the updated student model parameters.
    Type: Grant
    Filed: March 31, 2021
    Date of Patent: January 28, 2025
    Assignee: Salesforce, Inc.
    Inventors: Hualin Liu, Chu Hong Hoi, Junnan Li
  • Patent number: 12198432
    Abstract: Embodiments describe a method of video-text pre-learning to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align and prompt framework provides a video and language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes a prompting entity modeling that enables the model to capture fine-grained region-entity alignment.
    Type: Grant
    Filed: December 30, 2021
    Date of Patent: January 14, 2025
    Assignee: Salesforce, Inc.
    Inventors: Dongxu Li, Junnan Li, Chu Hong Hoi
  • Publication number: 20240388804
    Abstract: The present invention provides an image processing apparatus (10) including a detection unit (12) that detects a plurality of predetermined points of a body of each of a plurality of persons from an image in an image circle of a fisheye image, a gravity direction determination unit (13) that determines a gravity direction in a position of each of the plurality of persons from the plurality of predetermined points, a reference point decision unit (14) that decides a reference point, based on the gravity direction in the position of each of the plurality of persons, a complementary circular image generation unit (16) that generates a complementary circular image that is a circular image acquired by adding a complementary image to the image in the image circle of the fisheye image, and that has, as a center, the reference point different from a center of the image in the image circle, and an expansion unit (17) that panoramically expands the complementary circular image, based on the reference point, and generat
    Type: Application
    Filed: July 30, 2024
    Publication date: November 21, 2024
    Applicant: NEC Corporation
    Inventors: Jianquan Liu, Junnan Li
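
Illustrative note on patent 12566823: the abstract describes class centroids computed as average embeddings, plus virtual samples interpolated from a class-agnostic sampler and a class-aware sampler. The Python sketch below is one minimal reading of those three steps, not the patented implementation. All function names are hypothetical, plain lists of floats stand in for images and embeddings, and the class-aware sampler is assumed to pick a class uniformly before picking a sample, which is only one way to give tail classes a higher probability.

```python
import random
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Return one centroid per class: the average embedding of its samples."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def class_agnostic_sample(labels, rng):
    """Pick any sample index with equal probability."""
    return rng.randrange(len(labels))


def class_aware_sample(labels, rng):
    """Assumed scheme: pick a class uniformly, then a sample inside it.

    Each sample of a rare (tail) class is therefore drawn more often
    than each sample of a frequent (head) class.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen_class = rng.choice(sorted(by_class))
    return rng.choice(by_class[chosen_class])


def interpolate(x_a, x_b, lam):
    """Linearly mix two samples: lam * x_a + (1 - lam) * x_b."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
```

In use, one index would be drawn from each sampler and the two corresponding samples passed to `interpolate` with a mixing weight `lam` between 0 and 1; how that weight is chosen is not stated in the abstract and is left open here.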