Patents by Inventor Junnan LI

Junnan LI has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20240160858
    Abstract: Embodiments described herein provide a method of generating a vision-language task output in response to a text instruction relating to an input image. The method comprises receiving, via a data interface, the input image and the text instruction. The method further includes encoding, via an image encoder, the image into a first image representation. The method further includes generating, by a multimodal encoder, a second image representation based on cross-attending the first image representation to the text instruction. The method further includes generating, by a neural network based language model, a vision-language task output in response to the text instruction based on an input combining the second image representation and the text instruction.
    Type: Application
    Filed: November 9, 2023
    Publication date: May 16, 2024
    Inventors: Wenliang Dai, Junnan Li, Chu Hong Hoi, Dongxu Li
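As a rough illustration of the instruction-conditioned pipeline in publication 20240160858 above, the sketch below wires an image encoder, a cross-attending multimodal encoder, and a language model together in PyTorch. All module choices, names, and dimensions (linear stand-ins, a GRU as the language model, etc.) are assumptions for illustration, not the claimed implementation.

```python
# Minimal sketch (not the patented implementation): an instruction-aware
# vision-language pipeline. Module names and dimensions are assumptions.
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Cross-attends the image representation to the text instruction."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_repr, instruction_emb):
        # image tokens attend to instruction tokens -> instruction-aware image representation
        out, _ = self.cross_attn(query=image_repr, key=instruction_emb, value=instruction_emb)
        return out

class VisionLanguagePipeline(nn.Module):
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.image_encoder = nn.Linear(512, dim)        # stand-in image encoder
        self.text_embed = nn.Embedding(vocab, dim)      # stand-in instruction embedding
        self.multimodal_encoder = MultimodalEncoder(dim)
        self.language_model = nn.GRU(dim, dim, batch_first=True)  # stand-in language model
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, image_feats, instruction_ids):
        first_repr = self.image_encoder(image_feats)              # (B, N, D) first image representation
        instr = self.text_embed(instruction_ids)                  # (B, T, D)
        second_repr = self.multimodal_encoder(first_repr, instr)  # (B, N, D) second image representation
        lm_input = torch.cat([second_repr, instr], dim=1)         # combine image repr + instruction
        hidden, _ = self.language_model(lm_input)
        return self.lm_head(hidden)                               # token logits for the task output

model = VisionLanguagePipeline()
logits = model(torch.randn(2, 16, 512), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 24, 1000])
```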
  • Publication number: 20240161369
    Abstract: Embodiments described herein provide systems and methods of subject-driven image generation. In at least one embodiment, a system receives, via a data interface, an image containing a subject, a text description of the subject in the image, and a text prompt relating to a different rendition of the subject. The system encodes, via an image encoder, the image into an image feature vector. The system encodes, via a text encoder, the text description into a text feature vector. The system generates, by a multimodal encoder, a vector representation of the subject based on the image feature vector and the text feature vector. The system generates, by a neural network based image generation model, an output image based on an input combining the text prompt and the vector representation.
    Type: Application
    Filed: October 31, 2023
    Publication date: May 16, 2024
    Inventors: Junnan Li, Chu Hong Hoi, Dongxu Li
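A minimal sketch of the subject-driven generation flow in publication 20240161369 above, with stand-in modules (a linear image encoder, an embedding text encoder, and a toy generator in place of a diffusion model); all names and shapes are assumptions, not the patented system.

```python
# Sketch only: subject-driven image generation with assumed stand-in modules.
import torch
import torch.nn as nn

class SubjectConditionedGenerator(nn.Module):
    def __init__(self, dim=128, vocab=1000, image_feat_dim=512):
        super().__init__()
        self.image_encoder = nn.Linear(image_feat_dim, dim)     # image -> image feature vector
        self.text_encoder = nn.Embedding(vocab, dim)            # description / prompt tokens -> vectors
        self.multimodal_encoder = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.generator = nn.Sequential(                         # stand-in for a diffusion image model
            nn.Linear(dim, 1024), nn.ReLU(), nn.Linear(1024, 3 * 32 * 32)
        )

    def forward(self, subject_image_feats, description_ids, prompt_ids):
        img = self.image_encoder(subject_image_feats)                   # (B, N, D)
        desc = self.text_encoder(description_ids)                       # (B, T, D)
        # multimodal encoder fuses image and description into a subject representation
        subject, _ = self.multimodal_encoder(query=desc, key=img, value=img)
        prompt = self.text_encoder(prompt_ids)                          # (B, P, D)
        cond = torch.cat([prompt, subject], dim=1).mean(dim=1)          # combined conditioning input
        return self.generator(cond).view(-1, 3, 32, 32)                 # toy "output image"

gen = SubjectConditionedGenerator()
out = gen(torch.randn(1, 16, 512), torch.randint(0, 1000, (1, 6)), torch.randint(0, 1000, (1, 8)))
print(out.shape)  # torch.Size([1, 3, 32, 32])
```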
  • Publication number: 20240160853
    Abstract: Embodiments described herein provide a multimodal vision-language model. The multimodal vision-language model contains a Generalist Multimodal Transformer capable of completing multiple tasks using the same set of parameters learned from pre-training. The Generalist Multimodal Transformer allows alignment between frozen, unimodal encoders, such as image encoders and large language models. The Generalist Multimodal Transformer eliminates the need for fine-tuning the image encoders and large language models.
    Type: Application
    Filed: January 27, 2023
    Publication date: May 16, 2024
    Inventors: Junnan Li, Chu Hong Hoi
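The sketch below illustrates the frozen-encoder arrangement described in publication 20240160853 above: a small trainable transformer aligns features from a frozen image encoder with a frozen language model, so neither frozen model is fine-tuned. The specific modules and sizes are placeholders, not the patent's architecture.

```python
# Sketch: a small trainable transformer bridging a frozen image encoder and a
# frozen language model (names and sizes are assumptions, not the patent's code).
import torch
import torch.nn as nn

frozen_image_encoder = nn.Linear(512, 256)                   # stand-in, kept frozen
frozen_language_model = nn.GRU(256, 256, batch_first=True)   # stand-in, kept frozen
for p in list(frozen_image_encoder.parameters()) + list(frozen_language_model.parameters()):
    p.requires_grad = False                                  # neither frozen model is fine-tuned

bridge = nn.TransformerEncoder(                              # the only trainable component
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2
)

image_feats = torch.randn(2, 16, 512)
with torch.no_grad():
    img_tokens = frozen_image_encoder(image_feats)           # frozen unimodal features
aligned = bridge(img_tokens)                                 # learn alignment with the frozen LM space
lm_out, _ = frozen_language_model(aligned)
print(lm_out.shape)  # torch.Size([2, 16, 256])
```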
  • Publication number: 20240161520
    Abstract: Embodiments described herein provide a multimodal vision-language model. The multimodal vision-language model contains a Generalist Multimodal Transformer capable of completing multiple tasks using the same set of parameters learned from pre-training. The Generalist Multimodal Transformer allows alignment between frozen, unimodal encoders, such as image encoders and large language models. The Generalist Multimodal Transformer eliminates the need for fine-tuning the image encoders and large language models.
    Type: Application
    Filed: January 27, 2023
    Publication date: May 16, 2024
    Inventors: Junnan Li, Chu Hong Hoi
  • Publication number: 20240119257
    Abstract: Embodiments described herein provide systems and methods for zero-shot visual question answering. A first image and a first question relating to the visual content of the first image are received. One or more image captions relevant to the first question are determined using a visual-language neural model by identifying portions of the first image relevant to the first question. Answer candidates are generated using the one or more image captions. Synthetic question-answer pairs are generated by pairing the answer candidates with synthetic questions generated from them. A prompt is generated by concatenating the synthetic question-answer pairs. A first answer to the first question is generated using a language network model with an input of the first question prepended with the prompt.
    Type: Application
    Filed: January 4, 2023
    Publication date: April 11, 2024
    Inventors: Jiaxian GUO, Junnan LI, Chu Hong HOI
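The prompt-construction step of publication 20240119257 above can be pictured as below. The caption model, answer-candidate extraction, and question generator are assumed to exist elsewhere and are passed in as callables; the helper names are hypothetical.

```python
# Sketch of the prompt-construction step, assuming caption generation,
# answer-candidate extraction, and question synthesis are provided elsewhere.
def build_vqa_prompt(question, captions, answer_candidates, make_synthetic_question):
    # generate synthetic question-answer pairs from the answer candidates
    synthetic_pairs = [(make_synthetic_question(ans, captions), ans) for ans in answer_candidates]
    # concatenate the synthetic pairs into an in-context prompt
    prompt = "\n".join(f"Question: {q} Answer: {a}" for q, a in synthetic_pairs)
    # prepend the prompt to the real question for the language model
    return f"{prompt}\nQuestion: {question} Answer:"

# toy stand-ins for the model-driven steps
captions = ["a brown dog running on grass"]
candidates = ["dog", "grass"]
fake_question_generator = lambda ans, caps: f"What is shown in the image related to '{ans}'?"
print(build_vqa_prompt("What animal is in the picture?", captions, candidates, fake_question_generator))
```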
  • Publication number: 20240054350
    Abstract: Embodiments described herein provide systems and methods for federated learning. A central system may store a neural network model which has a body of a number of layers, and a classification layer comprising class prototypes which classifies the latent representations output by the body of the model. The central system may initialize the class prototypes so that they are uniformly distributed in the representation space. The model and class prototypes may be broadcast to a number of client systems, which update the body of the model locally while keeping the class prototypes fixed. The clients may return information to the central system including updated local model parameters, and a local representation of the classes based on the latent representation of items in the local training data. Based on the information from the clients, the neural network model may be updated. This process may be repeated iteratively.
    Type: Application
    Filed: December 9, 2022
    Publication date: February 15, 2024
    Inventors: Yutong Dai, Zeyuan Chen, Junnan Li
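A toy sketch of one federated round in the spirit of publication 20240054350 above: the server broadcasts a model body plus fixed class prototypes, clients update only the body against the frozen prototypes, and the server averages the returned parameters. The prototype initialization, models, and aggregation rule here are simplifications, not the patented method.

```python
# Sketch: federated rounds with frozen class prototypes (random unit vectors stand in
# for the uniformly distributed initialization; stand-in models and data).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, dim = 10, 32
prototypes = F.normalize(torch.randn(num_classes, dim), dim=1)  # broadcast, kept fixed
global_body = nn.Sequential(nn.Linear(16, dim), nn.ReLU(), nn.Linear(dim, dim))

def client_update(body, x, y, steps=5, lr=0.01):
    body = copy.deepcopy(body)                       # local copy of the broadcast body
    opt = torch.optim.SGD(body.parameters(), lr=lr)
    for _ in range(steps):
        z = F.normalize(body(x), dim=1)
        logits = z @ prototypes.t()                  # classify against the fixed prototypes
        loss = F.cross_entropy(logits, y)
        opt.zero_grad(); loss.backward(); opt.step()
    return body.state_dict()

# one round: broadcast, local updates, simple parameter averaging on the server
# (clients would also return local class representations; omitted in this sketch)
clients = [(torch.randn(64, 16), torch.randint(0, num_classes, (64,))) for _ in range(3)]
local_states = [client_update(global_body, x, y) for x, y in clients]
avg_state = {k: torch.stack([s[k] for s in local_states]).mean(0) for k in local_states[0]}
global_body.load_state_dict(avg_state)
```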
  • Publication number: 20230419652
    Abstract: Embodiments described herein provide a zero-shot visual question answering (VQA) framework, which conjoins foundation network models with zero additional training. A first image and a question relating to the first image are received. The first image is divided into a plurality of image patches. A plurality of relevant image patches that are relevant to the question are determined, using a first neural network model, from the plurality of image patches. A plurality of image captions are generated, using a second neural network model, based on the plurality of relevant image patches. An answer to the question is generated based on the plurality of image captions.
    Type: Application
    Filed: September 23, 2022
    Publication date: December 28, 2023
    Inventors: Anthony Meng Huat Tiong, Junnan Li, Chu Hong Hoi
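Publication 20230419652 above composes pretrained models with zero additional training; a schematic version of that flow is sketched below with toy stand-ins for the three networks (patch relevance, captioning, question answering). Everything here is illustrative, not the claimed framework.

```python
# Sketch of the modular zero-shot VQA flow, with the pretrained models
# replaced by callables passed in as arguments (an assumption for illustration).
def zero_shot_vqa(image, question, patchify, relevance_model, caption_model, qa_model, top_k=4):
    patches = patchify(image)                                  # divide the image into patches
    scores = [relevance_model(p, question) for p in patches]   # score patches against the question
    ranked = sorted(zip(scores, range(len(patches))), reverse=True)
    relevant = [patches[i] for _, i in ranked[:top_k]]         # keep the most relevant patches
    captions = [caption_model(p) for p in relevant]            # caption the relevant patches
    return qa_model(question, captions)                        # answer from the captions

# toy stand-ins so the sketch runs end to end
answer = zero_shot_vqa(
    image=list(range(16)),
    question="what color is the ball?",
    patchify=lambda img: [img[i:i + 4] for i in range(0, len(img), 4)],
    relevance_model=lambda patch, q: sum(patch),
    caption_model=lambda patch: f"a patch with values {patch}",
    qa_model=lambda q, caps: caps[0],
)
print(answer)
```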
  • Publication number: 20230359900
    Abstract: Embodiments described herein provide masked self-training (MaST), an unsupervised learning approach leveraging two complementary sources of supervision: pseudo-labels and raw image pixels. Specifically, MaST jointly optimizes three objectives to finetune a pre-trained classification model on unlabeled images: (1) a self-training objective to learn global task-specific class prediction; (2) a masked image modeling objective to learn local pixel-level information; and (3) a global-local feature alignment objective to bridge the knowledge learned from the two sources of supervision.
    Type: Application
    Filed: May 27, 2022
    Publication date: November 9, 2023
    Inventors: Junnan Li, Chu Hong Hoi
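A compact sketch of how the three MaST-style objectives described above could be combined into a single finetuning loss, using toy linear networks and flattened "images"; the objective names follow the abstract, while the code is only a simplified approximation under assumed shapes.

```python
# Sketch of the three objectives combined into one loss (toy networks and random data;
# the real method finetunes a pre-trained classification model on unlabeled images).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(256, 64)          # stand-in backbone
classifier = nn.Linear(64, 10)        # task head
decoder = nn.Linear(64, 256)          # head for masked image modeling

x = torch.randn(32, 256)              # unlabeled "images" as flat features
mask = (torch.rand_like(x) > 0.5).float()
x_masked = x * mask

with torch.no_grad():                 # pseudo-labels from the current model (self-training)
    pseudo = classifier(encoder(x)).argmax(dim=1)

feat_full = encoder(x)
feat_masked = encoder(x_masked)

loss_self_train = F.cross_entropy(classifier(feat_masked), pseudo)       # (1) global class prediction
loss_mim = F.mse_loss(decoder(feat_masked) * (1 - mask), x * (1 - mask)) # (2) reconstruct masked pixels
loss_align = F.mse_loss(feat_masked, feat_full.detach())                 # (3) global-local feature alignment
loss = loss_self_train + loss_mim + loss_align
loss.backward()
```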
  • Patent number: 11776236
    Abstract: The system and method are directed to prototypical contrastive learning (PCL). PCL explicitly encodes the hierarchical semantic structure of the dataset into the learned embedding space and prevents the network from exploiting low-level cues for solving the unsupervised learning task. PCL introduces prototypes as latent variables to help find the maximum-likelihood estimate of the network parameters in an expectation-maximization framework. PCL iteratively performs an E-step that finds prototypes via clustering and an M-step that optimizes the network with a contrastive loss.
    Type: Grant
    Filed: February 2, 2022
    Date of Patent: October 3, 2023
    Assignee: Salesforce.com, Inc.
    Inventors: Junnan Li, Chu Hong Hoi
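One EM round of the prototypical contrastive learning described in patent 11776236 above might look roughly like the following: an E-step that clusters current embeddings into prototypes, and an M-step that optimizes a temperature-scaled contrastive loss against those prototypes. The encoder, data, and k-means settings are toy assumptions.

```python
# Sketch of one EM round of prototypical contrastive learning
# (k-means E-step, InfoNCE-style M-step; toy encoder and data).
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

encoder = nn.Linear(128, 32)
data = torch.randn(256, 128)

# E-step: cluster current embeddings to obtain prototypes and assignments
with torch.no_grad():
    emb = F.normalize(encoder(data), dim=1)
km = KMeans(n_clusters=8, n_init=10).fit(emb.numpy())
prototypes = F.normalize(torch.tensor(km.cluster_centers_, dtype=torch.float32), dim=1)
assignments = torch.tensor(km.labels_, dtype=torch.long)

# M-step: pull each embedding toward its prototype and away from the others
opt = torch.optim.SGD(encoder.parameters(), lr=0.01)
z = F.normalize(encoder(data), dim=1)
logits = z @ prototypes.t() / 0.1            # temperature-scaled similarity to prototypes
loss = F.cross_entropy(logits, assignments)  # contrastive loss over prototypes
opt.zero_grad(); loss.backward(); opt.step()
```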
  • Publication number: 20230237773
    Abstract: Embodiments described herein provide bootstrapping language-image pre-training for unified vision-language understanding and generation (BLIP), a unified VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP enables a wider range of downstream tasks, improving on the shortcomings of existing models.
    Type: Application
    Filed: May 16, 2022
    Publication date: July 27, 2023
    Inventors: Junnan Li, Chu Hong Hoi
  • Publication number: 20230237772
    Abstract: Embodiments described herein provide bootstrapping language-image pre-training for unified vision-language understanding and generation (BLIP), a unified VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP enables a wider range of downstream tasks, improving on the shortcomings of existing models.
    Type: Application
    Filed: May 16, 2022
    Publication date: July 27, 2023
    Inventors: Junnan Li, Chu Hong Hoi
  • Publication number: 20230154188
    Abstract: Embodiments described herein provide a method of video-text pre-training to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
    Type: Application
    Filed: December 30, 2021
    Publication date: May 18, 2023
    Inventors: Dongxu Li, Junnan Li, Chu Hong Hoi
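Publications 20230154188/20230154146 above describe encoding frames and text independently and then fusing them; a bare-bones version of that structure is sketched below (prompting entity modeling omitted, and all modules are stand-ins rather than the patented architecture).

```python
# Sketch of the encode-independently-then-fuse structure for video-text pairs.
import torch
import torch.nn as nn

class VideoTextModel(nn.Module):
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.frame_proj = nn.Linear(512, dim)
        self.video_encoder = nn.TransformerEncoder(              # frames encoded independently of text
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
        self.text_embed = nn.Embedding(vocab, dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
        self.multimodal_encoder = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, frame_feats, text_ids):
        v = self.video_encoder(self.frame_proj(frame_feats))         # (B, F, D) sparse frames
        t = self.text_encoder(self.text_embed(text_ids))             # (B, T, D)
        fused, _ = self.multimodal_encoder(query=t, key=v, value=v)  # cross-modal interaction
        return v, t, fused

model = VideoTextModel()
v, t, fused = model(torch.randn(2, 8, 512), torch.randint(0, 1000, (2, 12)))
print(fused.shape)  # torch.Size([2, 12, 256])
```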
  • Publication number: 20230154146
    Abstract: Embodiments described herein provide a method of video-text pre-training to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
    Type: Application
    Filed: December 30, 2021
    Publication date: May 18, 2023
    Inventors: Dongxu Li, Junnan Li, Chu Hong Hoi
  • Publication number: 20230109681
    Abstract: Embodiments are directed to translating a natural language query into a code snippet in a programming language that semantically represents the query. The embodiments include a cascading neural network that includes an encoder network and a classifier network. The encoder network is faster but less accurate than the classifier network. The encoder network is trained using a contrastive learning framework to identify code candidates from a large set of code snippets. The classifier network is trained using a binary classifier to identify the code snippet that semantically represents the query from the code candidates.
    Type: Application
    Filed: January 28, 2022
    Publication date: April 13, 2023
    Inventors: Akhilesh Deepak Gotmare, Junnan Li, Chu Hong Hoi
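The cascade in publication 20230109681 above can be approximated as a two-stage search: a fast bi-encoder shortlists candidates by similarity, and a slower binary classifier rescores the shortlist. The sketch below assumes random features and placeholder networks.

```python
# Sketch of the two-stage cascade: a fast bi-encoder shortlists candidates,
# a slower binary classifier picks the final snippet (toy models and data).
import torch
import torch.nn as nn
import torch.nn.functional as F

query_encoder = nn.Linear(64, 32)      # fast encoder network (bi-encoder)
code_encoder = nn.Linear(64, 32)
classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))  # slower, more accurate

query_feat = torch.randn(1, 64)
code_feats = torch.randn(500, 64)      # a large set of code snippets

# stage 1: retrieve top-k candidates by embedding similarity
q = F.normalize(query_encoder(query_feat), dim=1)
c = F.normalize(code_encoder(code_feats), dim=1)
topk = torch.topk((q @ c.t()).squeeze(0), k=10).indices

# stage 2: score each (query, candidate) pair with the binary classifier
pair_inputs = torch.cat([query_feat.expand(10, -1), code_feats[topk]], dim=1)
best = topk[classifier(pair_inputs).squeeze(1).argmax()]
print(f"best matching snippet index: {best.item()}")
```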
  • Patent number: 11599792
    Abstract: A method provides learning with noisy labels. The method includes generating a first network of a machine learning model with a first set of parameter initial values, and generating a second network of the machine learning model with a second set of parameter initial values. First clean probabilities for samples in a training dataset are generated using the second network. A first labeled dataset and a first unlabeled dataset are generated from the training dataset based on the first clean probabilities. The first network is trained based on the first labeled dataset and first unlabeled dataset to update parameters of the first network.
    Type: Grant
    Filed: November 19, 2019
    Date of Patent: March 7, 2023
    Assignee: SALESFORCE.COM, INC.
    Inventors: Junnan Li, Chu Hong Hoi
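Patent 11599792 above divides training data using clean probabilities produced by a second network; the sketch below approximates that step with per-sample losses fit by a two-component Gaussian mixture (an assumption for illustration) before training the first network on the resulting split.

```python
# Sketch of the clean/noisy split driven by one network for training the other
# (per-sample losses fit with a two-component mixture; toy models and data).
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

net1 = nn.Linear(32, 5)    # first network (to be trained)
net2 = nn.Linear(32, 5)    # second network (produces clean probabilities)

x = torch.randn(200, 32)
noisy_y = torch.randint(0, 5, (200,))

# clean probabilities: small loss under net2 -> likely a clean label
with torch.no_grad():
    losses = F.cross_entropy(net2(x), noisy_y, reduction="none").reshape(-1, 1).numpy()
gmm = GaussianMixture(n_components=2).fit(losses)
clean_prob = gmm.predict_proba(losses)[:, gmm.means_.argmin()]  # component with the smaller mean loss

clean_mask = torch.tensor(clean_prob > 0.5)
labeled_x, labeled_y = x[clean_mask], noisy_y[clean_mask]       # labeled set (keep labels)
unlabeled_x = x[~clean_mask]                                    # unlabeled set (labels discarded)

# train net1 on the split (supervised part shown; the unlabeled part would be semi-supervised)
opt = torch.optim.SGD(net1.parameters(), lr=0.01)
loss = F.cross_entropy(net1(labeled_x), labeled_y)
opt.zero_grad(); loss.backward(); opt.step()
```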
  • Publication number: 20220391755
    Abstract: Embodiments described herein provide visual-and-language (V+L) systems and methods for learning vision and language representations. Specifically, a method may comprise receiving a training dataset comprising a plurality of image samples and a plurality of text samples; encoding the plurality of image samples into a plurality of encoded image samples and the plurality of text samples into a plurality of encoded text samples; computing a first loss objective based on the plurality of encoded image samples and the plurality of encoded text samples; encoding a first subset of the plurality of encoded image samples and a second subset of the plurality of encoded text samples into a plurality of encoded image-text samples; computing a second loss objective based on the plurality of encoded image-text samples; and updating the V+L model based at least in part on the first loss objective and the second loss objective.
    Type: Application
    Filed: July 8, 2021
    Publication date: December 8, 2022
    Inventors: Junnan Li, Chu Hong Hoi
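A simplified rendering of the two losses in publication 20220391755 above: a contrastive objective on the separately encoded images and texts, followed by a second objective on jointly encoded image-text pairs (image-text matching is assumed here as that second objective; the encoders and dimensions are placeholders).

```python
# Sketch of the two-stage loss: contrastive on unimodal encodings, then a
# second loss on fused image-text encodings (stand-in encoders, toy data).
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(512, 256)
text_encoder = nn.Embedding(1000, 256)
fusion = nn.MultiheadAttention(256, 4, batch_first=True)
match_head = nn.Linear(256, 2)          # matched / not matched

img = torch.randn(8, 512)               # batch of paired image and text samples
txt_ids = torch.randint(0, 1000, (8, 12))

# first loss: image-text contrastive over the unimodal encodings
zi = F.normalize(image_encoder(img), dim=1)
zt = F.normalize(text_encoder(txt_ids).mean(dim=1), dim=1)
sim = zi @ zt.t() / 0.07
targets = torch.arange(8)
loss_contrastive = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

# second loss: encode image-text pairs jointly, then score the pairing
fused, _ = fusion(query=text_encoder(txt_ids), key=image_encoder(img).unsqueeze(1),
                  value=image_encoder(img).unsqueeze(1))
loss_match = F.cross_entropy(match_head(fused.mean(dim=1)), torch.ones(8, dtype=torch.long))
loss = loss_contrastive + loss_match
loss.backward()
```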
  • Publication number: 20220374595
    Abstract: Embodiments described herein provide a contrastive learning framework that leverages hard negative examples mined globally from the entire training corpus for a given query to improve the quality of code and natural language representations. Specifically, similar examples from the training corpus are extracted and used as hard negatives in an online manner during training while keeping the minibatch construction random.
    Type: Application
    Filed: November 19, 2021
    Publication date: November 24, 2022
    Inventors: Akhilesh Deepak Gotmare, Junnan Li, Shafiq Rayhan Joty, Chu Hong Hoi
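Publication 20220374595 above mines hard negatives globally rather than only within the minibatch; the sketch below shows that retrieval step with random embeddings standing in for the encoded corpus.

```python
# Sketch of global hard-negative mining: for each query, pull the most similar
# non-matching examples from the whole corpus into its contrastive loss
# (toy embeddings; in practice the corpus index would be refreshed periodically).
import torch
import torch.nn.functional as F

corpus_emb = F.normalize(torch.randn(1000, 64), dim=1)   # embeddings of the full training corpus
query_emb = F.normalize(torch.randn(8, 64), dim=1)       # current minibatch queries
positive_idx = torch.arange(8)                           # each query's true match in the corpus

sim = query_emb @ corpus_emb.t()                         # (8, 1000) similarity to every corpus item
sim[torch.arange(8), positive_idx] = -1.0                # exclude the positive itself
hard_negative_idx = sim.topk(k=5, dim=1).indices         # top-5 most similar non-matches per query

# assemble logits of [positive, hard negatives] for an InfoNCE-style loss
pos = (query_emb * corpus_emb[positive_idx]).sum(dim=1, keepdim=True)
neg = torch.einsum("qd,qkd->qk", query_emb, corpus_emb[hard_negative_idx])
logits = torch.cat([pos, neg], dim=1) / 0.07
loss = F.cross_entropy(logits, torch.zeros(8, dtype=torch.long))  # the positive is at index 0
print(loss.item())
```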
  • Publication number: 20220247924
    Abstract: The present invention provides an image processing apparatus (10) including a detection unit (12) that detects a plurality of predetermined points of a body of each of a plurality of persons from an image in an image circle of a fisheye image, a gravity direction determination unit (13) that determines a gravity direction in a position of each of the plurality of persons from the plurality of predetermined points, a reference point decision unit (14) that decides a reference point based on the gravity direction in the position of each of the plurality of persons, a complementary circular image generation unit (16) that generates a complementary circular image, which is a circular image acquired by adding a complementary image to the image in the image circle of the fisheye image and which has, as a center, the reference point different from a center of the image in the image circle, and an expansion unit (17) that panoramically expands the complementary circular image based on the reference point and generates a panorama image.
    Type: Application
    Filed: June 13, 2019
    Publication date: August 4, 2022
    Applicant: NEC Corporation
    Inventors: Jianquan LIU, Junnan LI
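The expansion stage of publication 20220247924 above could be prototyped along the lines below, using OpenCV's polar unwarp as a stand-in for the panoramic expansion; the detection, gravity-direction, and reference-point units are assumed to have already produced reference_point, and the geometry is only a rough approximation of the claimed complementary-image construction.

```python
# Rough sketch of the expansion step only (toy image, assumed reference point).
import cv2
import numpy as np

# toy stand-in for the image in the image circle of a fisheye image
fisheye = np.zeros((800, 800, 3), dtype=np.uint8)
cv2.circle(fisheye, (400, 400), 390, (0, 255, 0), 20)
h, w = fisheye.shape[:2]

reference_point = (430, 370)   # assumed output of the gravity-direction / reference-point units

# complementary circular image: pad the original so a full circle centered on the
# reference point (rather than on the image-circle center) fits inside the canvas
corners = np.array([[0, 0], [w, 0], [0, h], [w, h]], dtype=np.float32)
radius = int(np.ceil(np.linalg.norm(corners - np.float32(reference_point), axis=1).max()))
canvas = cv2.copyMakeBorder(fisheye, radius, radius, radius, radius, cv2.BORDER_CONSTANT, value=0)
center = (reference_point[0] + radius, reference_point[1] + radius)

# panoramic expansion of the complementary circular image around the reference point
panorama = cv2.warpPolar(canvas, (radius, 1440), center, radius, cv2.WARP_POLAR_LINEAR)
panorama = cv2.rotate(panorama, cv2.ROTATE_90_COUNTERCLOCKWISE)   # angle along width, radius along height
print(panorama.shape)
```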
  • Publication number: 20220245850
    Abstract: The present invention provides a processing apparatus (20) including a first generation unit (22) that generates, from a plurality of time-series images, three-dimensional feature information indicating a time change of a feature in each position in each of the plurality of images, a second generation unit (23) that generates person position information indicating a position in which a person is present in each of the plurality of images, and an estimation unit (24) that estimates person behavior indicated by the plurality of images, based on the time change of the feature, indicated by the three-dimensional feature information, at the position in which the person is present as indicated by the person position information.
    Type: Application
    Filed: June 13, 2019
    Publication date: August 4, 2022
    Applicant: NEC Corporation
    Inventors: Jianquan LIU, Junnan LI
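A toy rendering of the three units in publication 20220245850 above: three-dimensional features over time, a person-position mask, and behavior estimation from the features at the person's position. Shapes, layers, and the pooling rule are assumptions for illustration only.

```python
# Sketch: 3-D features over time, a person-position mask, behavior estimation
# from the features where the person is present (toy tensors and layers).
import torch
import torch.nn as nn

frames = torch.randn(1, 3, 16, 64, 64)     # (batch, channels, time, height, width)

# first generation unit: three-dimensional feature information (time change per position)
feature_3d = nn.Conv3d(3, 8, kernel_size=3, padding=1)(frames)        # (1, 8, 16, 64, 64)

# second generation unit: person position information (here a binary mask per frame)
person_mask = torch.zeros(1, 1, 16, 64, 64)
person_mask[..., 20:40, 25:45] = 1.0                                   # person occupies this region

# estimation unit: pool features only where the person is present, then classify behavior
masked = feature_3d * person_mask
pooled = masked.sum(dim=(2, 3, 4)) / person_mask.sum()                 # (1, 8) person-region descriptor
behavior_logits = nn.Linear(8, 5)(pooled)                              # 5 assumed behavior classes
print(behavior_logits.shape)  # torch.Size([1, 5])
```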
  • Publication number: 20220156593
    Abstract: Embodiments described herein provide systems and methods for learning representation from unlabeled videos. Specifically, a method may comprise generating a set of strongly-augmented samples and a set of weakly-augmented samples from the unlabeled video samples; generating a set of predictive logits by inputting the set of strongly-augmented samples into a student model and a first teacher model; generating a set of artificial labels by inputting the set of weakly-augmented samples to a second teacher model that operates in parallel to the first teacher model, wherein the second teacher model shares one or more model parameters with the first teacher model; computing a loss objective based on the set of predictive logits and the set of artificial labels; updating student model parameters based on the loss objective via backpropagation; and updating the shared parameters for the first teacher model and the second teacher model based on the updated student model parameters.
    Type: Application
    Filed: March 31, 2021
    Publication date: May 19, 2022
    Inventors: Hualin Liu, Chu Hong Hoi, Junnan Li
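Finally, a rough sketch of the student/teacher scheme in publication 20220156593 above: strong and weak views of unlabeled samples, artificial labels from a teacher pass on the weak views, a loss on the strong-view predictions, and an update of the shared teacher parameters from the student (an EMA rule is assumed here). The linear models and 1-D "videos" are placeholders.

```python
# Sketch: student + teacher with shared parameters, strong/weak views,
# artificial labels, and an assumed EMA teacher update (toy models and data).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Linear(128, 10)
teacher = copy.deepcopy(student)            # parameters shared by both teacher passes
for p in teacher.parameters():
    p.requires_grad = False

videos = torch.randn(16, 128)
strong = videos + 0.5 * torch.randn_like(videos)   # strongly-augmented samples
weak = videos + 0.05 * torch.randn_like(videos)    # weakly-augmented samples

# predictive logits from the student and the first teacher on the strong views
student_logits = student(strong)
with torch.no_grad():
    teacher_logits_strong = teacher(strong)              # first teacher pass
    artificial_labels = teacher(weak).softmax(dim=1)     # second teacher pass on weak views

# loss between the strong-view predictions and the artificial labels
pred_prob = (student_logits.softmax(dim=1) + teacher_logits_strong.softmax(dim=1)) / 2
loss = F.kl_div(pred_prob.log(), artificial_labels, reduction="batchmean")

opt = torch.optim.SGD(student.parameters(), lr=0.01)
opt.zero_grad(); loss.backward(); opt.step()

# update the shared teacher parameters from the updated student (EMA)
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.99).add_(ps, alpha=0.01)
```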