Patents by Inventor Junnan LI
Junnan LI has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12354013
Abstract: Embodiments described herein provide masked self-training (MaST), an unsupervised learning approach that leverages two complementary sources of supervision: pseudo-labels and raw image pixels. Specifically, MaST jointly optimizes three objectives to finetune a pre-trained classification model on unlabeled images: (1) a self-training objective to learn global task-specific class prediction; (2) a masked image modeling objective to learn local pixel-level information; and (3) a global-local feature alignment objective to bridge the knowledge learned from the two sources of supervision.
Type: Grant
Filed: May 27, 2022
Date of Patent: July 8, 2025
Assignee: Salesforce, Inc.
Inventors: Junnan Li, Chu Hong Hoi
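A minimal sketch of how the three objectives might be combined in one finetuning step, in PyTorch. The model interface (a classifier that also returns a global feature, plus a hypothetical `decode_masked` head for masked image modeling) and the loss weights are illustrative assumptions, not the patented implementation:

```python
import torch
import torch.nn.functional as F

def mast_loss(model, images, masked_images, pseudo_labels,
              w_mim=1.0, w_align=0.5):
    # pseudo_labels: (B, C) soft targets, e.g. from the model's own
    # predictions on unmasked views (an assumption for this sketch)
    logits, global_feat = model(images)                       # task head + global feature
    recon, local_feats = model.decode_masked(masked_images)   # hypothetical MIM head

    # (1) self-training: match class predictions to pseudo-labels
    # (soft-target cross entropy requires PyTorch >= 1.10)
    loss_st = F.cross_entropy(logits, pseudo_labels)

    # (2) masked image modeling: reconstruct raw pixel values
    loss_mim = F.l1_loss(recon, images)

    # (3) global-local alignment: pull pooled local features toward
    # the global feature via cosine distance
    loss_align = 1 - F.cosine_similarity(
        global_feat, local_feats.mean(dim=1), dim=-1).mean()

    return loss_st + w_mim * loss_mim + w_align * loss_align
```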
-
Patent number: 12314861
Abstract: Embodiments described herein provide an approach (referred to as the "Co-training" mechanism throughout this disclosure) that jointly learns two representations of the training data: their class probabilities and their low-dimensional embeddings. Specifically, two representations of each image sample are generated: a class probability produced by the classification head and a low-dimensional embedding produced by the projection head. The classification head is trained using memory-smoothed pseudo-labels, where pseudo-labels are smoothed by aggregating information from nearby samples in the embedding space. The projection head is trained using contrastive learning on a pseudo-label graph, where samples with similar pseudo-labels are encouraged to have similar embeddings.
Type: Grant
Filed: January 28, 2021
Date of Patent: May 27, 2025
Assignee: Salesforce, Inc.
Inventors: Junnan Li, Chu Hong Hoi
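A sketch of the two heads' losses under stated assumptions: the memory-bank shapes, the 0.5/0.5 smoothing mix, the temperature, and the graph agreement threshold are all illustrative choices, not values from the patent.

```python
import torch
import torch.nn.functional as F

def cotraining_losses(probs, embeds, mem_probs, mem_embeds,
                      graph_threshold=0.8, tau=0.1):
    """probs: (B, C) classification-head probabilities; embeds: (B, D)
    L2-normalized projection-head embeddings; mem_*: a memory bank of
    past samples' probabilities and embeddings."""
    with torch.no_grad():
        # memory-smoothed pseudo-labels: aggregate class probabilities of
        # nearby memory samples, weighted by embedding similarity
        weights = F.softmax(embeds @ mem_embeds.t() / tau, dim=1)
        pseudo = 0.5 * probs + 0.5 * weights @ mem_probs

    # classification head: cross-entropy against the smoothed pseudo-labels
    loss_cls = -(pseudo * probs.clamp_min(1e-8).log()).sum(dim=1).mean()

    # pseudo-label graph: connect samples whose pseudo-labels agree
    graph = (pseudo @ pseudo.t() >= graph_threshold).float()

    # projection head: graph-based contrastive loss, so samples with
    # similar pseudo-labels are pulled toward similar embeddings
    log_prob = F.log_softmax(embeds @ embeds.t() / tau, dim=1)
    loss_ctr = -(graph * log_prob).sum(1).div(graph.sum(1).clamp_min(1)).mean()

    return loss_cls + loss_ctr
```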
-
Patent number: 12299961
Abstract: Embodiments described herein provide systems, methods, and devices for pre-training a multimodal encoder-decoder (MED) model for vision-language tasks. A method may include: encoding, by an image encoder of the MED, an image into an image representation; encoding, by a text encoder of the MED, a text into a text representation; generating, by an image-grounded text encoder of the MED, a multimodal representation based on the image representation and the text; generating, by an image-grounded text decoder of the MED, a predicted text based on the image representation and the text; generating, through an image-text matching (ITM) head, a binary classification indicating whether the image and the text are a match; computing a first loss, a second (ITM) loss, and a third loss based on the image representation, text representation, binary classification, predicted text, and text; and jointly updating the MED based on the first, second, and third losses.
Type: Grant
Filed: May 16, 2022
Date of Patent: May 13, 2025
Assignee: Salesforce, Inc.
Inventors: Junnan Li, Chu Hong Hoi
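The three losses could be assembled roughly as below. The `med` module interface is hypothetical, and treating the first loss as image-text contrastive and the third as language modeling are assumptions consistent with the abstract, not the claimed code:

```python
import torch
import torch.nn.functional as F

def med_pretrain_step(med, image, text_ids, match_labels, tau=0.07):
    img_repr = med.image_encoder(image)       # (B, D) unimodal image representation
    txt_repr = med.text_encoder(text_ids)     # (B, D) unimodal text representation

    # first loss: contrastive alignment of the two unimodal representations
    sim = F.normalize(img_repr, dim=-1) @ F.normalize(txt_repr, dim=-1).t()
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_1 = (F.cross_entropy(sim / tau, targets) +
              F.cross_entropy(sim.t() / tau, targets)) / 2

    # second (ITM) loss: binary classification of matched/unmatched pairs
    itm_logits = med.itm_head(med.grounded_encoder(img_repr, text_ids))
    loss_itm = F.cross_entropy(itm_logits, match_labels)

    # third loss: predict the text given the image (next-token prediction)
    lm_logits = med.grounded_decoder(img_repr, text_ids[:, :-1])
    loss_3 = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                             text_ids[:, 1:].reshape(-1))

    return loss_1 + loss_itm + loss_3   # jointly update the MED on the sum
```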
-
Patent number: 12288380
Abstract: Embodiments described herein provide systems, methods, and devices for generating enhanced vision-language training data. A method may include: receiving, from a communication interface, a first training dataset of image-text pairs and a second training dataset of annotated image-text pairs; fine-tuning an image-grounded text decoder and an image-grounded text encoder using the second training dataset of annotated image-text pairs; generating, by the fine-tuned image-grounded text decoder, a predicted text based on a training image from the first training dataset; generating, by the fine-tuned image-grounded text encoder, a filtering decision based on the training image and the predicted text; adding the training image and the predicted text to form a third training dataset of image-text pairs depending on the filtering decision; and training a vision-language model using the third training dataset of image-text pairs.
Type: Grant
Filed: May 16, 2022
Date of Patent: April 29, 2025
Assignee: Salesforce, Inc.
Inventors: Junnan Li, Chu Hong Hoi
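The caption-and-filter bootstrapping reduces to a short loop; the function names below are placeholders for the fine-tuned decoder and encoder:

```python
def bootstrap_dataset(web_images, captioner, filt):
    """captioner: fine-tuned image-grounded text decoder (placeholder name);
    filt: fine-tuned image-grounded text encoder producing a filtering
    decision (placeholder name)."""
    third_dataset = []
    for image in web_images:                      # first dataset's images
        predicted_text = captioner.generate(image)
        if filt.is_match(image, predicted_text):  # keep only matching pairs
            third_dataset.append((image, predicted_text))
    return third_dataset                          # used to train the VL model
```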
-
Patent number: 12271792
Abstract: Embodiments described herein provide visual-and-language (V+L) systems and methods for learning vision and language representations. Specifically, a method may comprise: receiving a training dataset comprising a plurality of image samples and a plurality of text samples; encoding the plurality of image samples into a plurality of encoded image samples and the plurality of text samples into a plurality of encoded text samples; computing a first loss objective based on the plurality of encoded image samples and the plurality of encoded text samples; encoding a first subset of the plurality of encoded image samples and a second subset of the plurality of encoded text samples into a plurality of encoded image-text samples; computing a second loss objective based on the plurality of encoded image-text samples; and updating the V+L model based at least in part on the first loss objective and the second loss objective.
Type: Grant
Filed: July 8, 2021
Date of Patent: April 8, 2025
Assignee: Salesforce, Inc.
Inventors: Junnan Li, Chu Hong Hoi
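A compact sketch of the two loss objectives, assuming a contrastive first loss over the unimodal encodings and a matching-style second loss over fused subsets; both choices and the model interface are illustrative:

```python
import torch
import torch.nn.functional as F

def vl_losses(model, images, texts, subset_idx, match_labels, tau=0.07):
    img = F.normalize(model.encode_image(images), dim=-1)   # (B, D)
    txt = F.normalize(model.encode_text(texts), dim=-1)     # (B, D)

    # first loss objective: align matching image/text encodings
    sim = img @ txt.t() / tau
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_1 = (F.cross_entropy(sim, targets) +
              F.cross_entropy(sim.t(), targets)) / 2

    # second loss objective: encode selected subsets into fused
    # image-text samples and score them (e.g. matched vs. unmatched)
    fused = model.encode_pair(images[subset_idx], texts[subset_idx])
    loss_2 = F.cross_entropy(model.pair_head(fused), match_labels)

    return loss_1, loss_2   # both update the V+L model
```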
-
Patent number: 12210976
Abstract: Embodiments described herein provide systems and methods for learning representations from unlabeled videos. Specifically, a method may comprise: generating a set of strongly-augmented samples and a set of weakly-augmented samples from the unlabeled video samples; generating a set of predictive logits by inputting the set of strongly-augmented samples into a student model and a first teacher model; generating a set of artificial labels by inputting the set of weakly-augmented samples to a second teacher model that operates in parallel to the first teacher model, wherein the second teacher model shares one or more model parameters with the first teacher model; computing a loss objective based on the set of predictive logits and the set of artificial labels; updating student model parameters based on the loss objective via backpropagation; and updating the shared parameters for the first teacher model and the second teacher model based on the updated student model parameters.
Type: Grant
Filed: March 31, 2021
Date of Patent: January 28, 2025
Assignee: Salesforce, Inc.
Inventors: Hualin Liu, Chu Hong Hoi, Junnan Li
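One way the student/teacher updates could fit together; treating the shared teacher parameters as an exponential moving average (EMA) of the student, and averaging the student and first-teacher logits, are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # shared teacher parameters follow the updated student (assumes the
    # teacher and student share the same architecture)
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)

def training_step(student, teacher1, teacher2, strong, weak, optimizer):
    # predictive logits from strongly-augmented samples
    logits = (student(strong) + teacher1(strong).detach()) / 2
    # artificial labels from weakly-augmented samples (parallel teacher)
    with torch.no_grad():
        labels = teacher2(weak).softmax(dim=1)
    # soft-target cross entropy (PyTorch >= 1.10); backprop updates the student
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher1, student)
    ema_update(teacher2, student)
    return loss
```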
-
Patent number: 12198432
Abstract: Embodiments described herein provide a method of video-text pre-training to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework provides a video-and-language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
Type: Grant
Filed: December 30, 2021
Date of Patent: January 14, 2025
Assignee: Salesforce, Inc.
Inventors: Dongxu Li, Junnan Li, Chu Hong Hoi
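Schematically, the independent encoding followed by cross-modal fusion looks like the sketch below; the interfaces are hypothetical, both encoders are assumed to share a feature width, and prompting entity modeling is omitted:

```python
import torch

def encode_video_text(video_encoder, text_encoder, mm_encoder,
                      frames, text_ids):
    frame_feats = video_encoder(frames)    # (B, T, D): sparse frames, transformer
    text_feats = text_encoder(text_ids)    # (B, L, D): independent text encoding
    # multi-modal encoder captures cross-modal interaction over the
    # concatenated frame and text sequences
    fused = mm_encoder(torch.cat([frame_feats, text_feats], dim=1))
    return frame_feats, text_feats, fused
```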
-
Publication number: 20240388804
Abstract: The present invention provides an image processing apparatus (10) including: a detection unit (12) that detects a plurality of predetermined points of a body of each of a plurality of persons from an image in an image circle of a fisheye image; a gravity direction determination unit (13) that determines a gravity direction in a position of each of the plurality of persons from the plurality of predetermined points; a reference point decision unit (14) that decides a reference point based on the gravity direction in the position of each of the plurality of persons; a complementary circular image generation unit (16) that generates a complementary circular image, which is a circular image acquired by adding a complementary image to the image in the image circle of the fisheye image, and which has, as a center, the reference point different from a center of the image in the image circle; and an expansion unit (17) that panoramically expands the complementary circular image based on the reference point and generates a panoramic image.
Type: Application
Filed: July 30, 2024
Publication date: November 21, 2024
Applicant: NEC Corporation
Inventors: Jianquan Liu, Junnan Li
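For illustration, panoramic expansion about a reference point can be written as a polar-to-rectangular unwrap; this NumPy sketch is a generic unwrapping centered on an arbitrary reference point, not the claimed apparatus:

```python
import numpy as np

def panoramic_expand(circular_img, ref_x, ref_y, radius,
                     out_h=256, out_w=1024):
    """Unwrap a circular image into a panorama around (ref_x, ref_y).
    radius: maximum sampling distance from the reference point."""
    h, w = circular_img.shape[:2]
    out = np.zeros((out_h, out_w) + circular_img.shape[2:],
                   circular_img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            theta = 2 * np.pi * j / out_w   # angle around the reference point
            r = radius * i / out_h          # distance from the reference point
            x = int(ref_x + r * np.cos(theta))
            y = int(ref_y + r * np.sin(theta))
            if 0 <= x < w and 0 <= y < h:
                out[i, j] = circular_img[y, x]
    return out
```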
-
Publication number: 20240370718
Abstract: Embodiments described herein provide a method of generating a multi-modal task output from a text instruction relating to inputs of multiple different modalities (e.g., text, audio, video, 3D). The method comprises: receiving, via a data interface, a first input of a first modality, a second input of a second modality, and the text instruction relating to the first and second inputs; encoding, by a first multimodal encoder adapted for the first modality, the first input of the first modality into a first encoded representation conditioned on the text instruction; encoding, by a second multimodal encoder adapted for the second modality, the second input of the second modality into a second encoded representation conditioned on the text instruction; and generating, by a neural network based language model, the multi-modal task output based on an input combining the first encoded representation, the second encoded representation, and the text instruction.
Type: Application
Filed: December 29, 2023
Publication date: November 7, 2024
Inventors: Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Silvio Savarese, Shafiq Rayhan Joty, Ran Xu, Caiming Xiong, Juan Carlos Niebles Duque
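The data flow condenses to a few lines; every callable below is a placeholder supplied by the caller, not an API from the publication:

```python
def multimodal_task_output(encoder_a, encoder_b, language_model,
                           input_a, input_b, instruction):
    # each modality-specific encoder is conditioned on the text instruction
    rep_a = encoder_a(input_a, instruction)   # first modality (e.g., audio)
    rep_b = encoder_b(input_b, instruction)   # second modality (e.g., video)
    # the language model generates the task output from the combination of
    # both encoded representations and the instruction itself
    return language_model(rep_a, rep_b, instruction)
```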
-
Patent number: 12118741
Abstract: The present invention provides a processing apparatus (20) including: a first generation unit (22) that generates, from a plurality of time-series images, three-dimensional feature information indicating a time change of a feature in each position in each of the plurality of images; a second generation unit (23) that generates person position information indicating a position in which a person is present in each of the plurality of images; and an estimation unit (24) that estimates person behavior indicated by the plurality of images, based on the time change of the feature indicated by the three-dimensional feature information in the position in which the person is present, as indicated by the person position information.
Type: Grant
Filed: June 13, 2019
Date of Patent: October 15, 2024
Assignee: NEC CORPORATION
Inventors: Jianquan Liu, Junnan Li
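As a rough sketch of the estimation step (tensor shapes assumed for illustration): mask the spatio-temporal feature volume with the person positions, pool, and classify into behaviors.

```python
import torch

def estimate_behavior(feat3d, person_mask, classifier):
    # feat3d: (B, C, T, H, W) time change of features at each position
    # person_mask: (B, 1, T, H, W), 1 where a person is present
    masked = feat3d * person_mask           # keep features at person positions
    pooled = masked.flatten(2).mean(dim=2)  # (B, C) spatio-temporal pooling
    return classifier(pooled)               # behavior logits
```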
-
Patent number: 12112523
Abstract: Embodiments described herein provide a CROss-Modal Distribution Alignment (CROMDA) model for vision-language pretraining, which can be used for retrieval downstream tasks. In the CROMDA model, global cross-modal representations are aligned in each modality. Specifically, a uni-modal global similarity between an image/text and the image/text feature queue is computed. A softmax-normalized distribution is then generated based on the computed similarity; the distribution thus takes advantage of the global structure of the queue. CROMDA then aligns the two distributions and learns a modality-invariant global representation. In this way, CROMDA obtains an invariance property in each modality, where images with similar text representations should be similar, and vice versa.
Type: Grant
Filed: January 31, 2022
Date of Patent: October 8, 2024
Assignee: Salesforce, Inc.
Inventors: Shu Zhang, Junnan Li, Ran Xu, Caiming Xiong, Chetan Ramaiah
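A hedged sketch of the alignment step, assuming the image and text queues index the same past samples, and using a symmetrized KL divergence with a fixed temperature (both assumptions, not the patented formulation):

```python
import torch.nn.functional as F

def cromda_align_loss(img_feat, txt_feat, img_queue, txt_queue, tau=0.07):
    # softmax-normalized uni-modal global similarities against the queues;
    # features and queues are assumed L2-normalized, queues equal-sized
    d_img = F.softmax(img_feat @ img_queue.t() / tau, dim=1)
    d_txt = F.softmax(txt_feat @ txt_queue.t() / tau, dim=1)
    # align the two distributions (symmetrized KL divergence)
    kl_1 = F.kl_div(d_img.clamp_min(1e-8).log(), d_txt,
                    reduction="batchmean")
    kl_2 = F.kl_div(d_txt.clamp_min(1e-8).log(), d_img,
                    reduction="batchmean")
    return (kl_1 + kl_2) / 2
```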
-
Publication number: 20240331365
Abstract: The present invention provides a processing system (10) including: a sample image generation unit (11) that generates a plurality of sample images, each being associated with a partial region of a first image generated using a first lens; an estimation unit (12) that generates an image content estimation result indicating a content for each of the sample images using an estimation model generated by machine learning using a second image generated using a second lens differing from the first lens; a task execution unit (14) that estimates a relative positional relationship of a plurality of the sample images in the first image; a determination unit (15) that determines whether an estimation result of the relative positional relationship is correct; and a correction unit (16) that corrects a value of a parameter of the estimation model when the estimation result of the relative positional relationship is determined to be incorrect.
Type: Application
Filed: June 11, 2024
Publication date: October 3, 2024
Applicant: NEC Corporation
Inventors: Jianquan Liu, Junnan Li
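A sketch of the correction loop under stated assumptions: all helpers (`estimate_layout`, `layout_loss`) and the layout representation are caller-supplied placeholders, and a gradient step stands in for the parameter correction.

```python
def adaptation_step(model, patches, true_layout, estimate_layout,
                    layout_loss, optimizer):
    """patches: sample images, each a partial region of the first-lens
    image; model: estimation model trained on second-lens images."""
    contents = [model(p) for p in patches]     # content estimate per sample image
    est = estimate_layout(contents)            # relative positional relationship
    if est != true_layout:                     # determination: estimate incorrect
        # correction: adjust model parameters toward the right layout
        loss = layout_loss(contents, true_layout)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```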
-
Publication number: 20240312128
Abstract: A method of training a neural network based three-dimensional (3D) encoder is provided. A first plurality of samples of a training dataset are generated using a first 3D model. An image generator with multi-view rendering is used to generate a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model. A first language model is used to generate a plurality of texts corresponding to the plurality of 2D images respectively. A first text for a first image is generated by using one or more text descriptions generated by the first language model. A point cloud is generated by randomly sampling points in the 3D model. The first plurality of samples are generated using the plurality of 2D images, the corresponding plurality of texts, and the point cloud. The neural network based 3D encoder is trained using the training dataset including the first plurality of samples.
Type: Application
Filed: October 24, 2023
Publication date: September 19, 2024
Inventors: Le Xue, Ning Yu, Shu Zhang, Junnan Li, Caiming Xiong, Silvio Savarese, Juan Carlos Niebles Duque, Ran Xu
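Assembling one training sample might look like the sketch below; the rendering and captioning backends are placeholders, and `model_3d.vertices` is assumed to be a list of 3D points:

```python
import random

def build_sample(model_3d, render_view, caption, num_points=1024,
                 num_views=8):
    # multi-view rendering: 2D images from different viewpoints
    images = [render_view(model_3d, view) for view in range(num_views)]
    # a text per image from the first language model
    texts = [caption(img) for img in images]
    # point cloud by randomly sampling points in the 3D model
    points = random.sample(model_3d.vertices, num_points)
    return images, texts, points   # one sample for training the 3D encoder
```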
-
Publication number: 20240301719
Abstract: This disclosure provides a lock having: a rotating component whose rotation drives the retractable movement of a latch bolt, the rotating component being equipped with a trigger component; a first sensing component fixedly arranged at a first position of the lock, where the first position is the position indicated by a knob fixedly connected to the rotating component when the lock is in the unlocked state, and when the trigger component is at the first position, the first sensing component generates a first trigger signal; and a second sensing component fixedly arranged at a second position of the lock, where the second position is the position indicated by the knob when the lock is in the locked state, and when the trigger component is at the second position, the second sensing component generates a second trigger signal.
Type: Application
Filed: March 8, 2024
Publication date: September 12, 2024
Inventors: Junnan Li, Da Liang, Liying Chen, Yixi Peng, Dezhou Chang, Ping Liu, Dejun Liu
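Purely for illustration, the two trigger signals map to lock states as below; this is a toy interpretation of the signals, not the disclosed circuitry:

```python
def lock_state(first_trigger: bool, second_trigger: bool) -> str:
    # first sensor marks the unlocked position, second marks the locked one
    if first_trigger and not second_trigger:
        return "unlocked"
    if second_trigger and not first_trigger:
        return "locked"
    return "in transition"   # trigger component between the two positions
```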
-
Patent number: 12081873
Abstract: The present invention provides an image processing apparatus (10) including: a detection unit (12) that detects a plurality of predetermined points of a body of each of a plurality of persons from an image in an image circle of a fisheye image; a gravity direction determination unit (13) that determines a gravity direction in a position of each of the plurality of persons from the plurality of predetermined points; a reference point decision unit (14) that decides a reference point based on the gravity direction in the position of each of the plurality of persons; a complementary circular image generation unit (16) that generates a complementary circular image, which is a circular image acquired by adding a complementary image to the image in the image circle of the fisheye image, and which has, as a center, the reference point different from a center of the image in the image circle; and an expansion unit (17) that panoramically expands the complementary circular image based on the reference point and generates a panoramic image.
Type: Grant
Filed: June 13, 2019
Date of Patent: September 3, 2024
Assignee: NEC CORPORATION
Inventors: Jianquan Liu, Junnan Li
-
Publication number: 20240289606
Abstract: Embodiments described herein provide a mixture of encoder-decoder Transformer frameworks for multi-task pretraining and flexible finetuning for both code understanding and generation tasks. Specifically, the framework is built on multimodal encoder and decoder modules. During pre-training, the encoder-decoder framework is trained with multiple learning objectives, including a diverse set of self-supervised tasks, over two major stages of pretraining on unimodal and bimodal data.
Type: Application
Filed: February 24, 2023
Publication date: August 29, 2024
Inventors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Junnan Li, Chu Hong Hoi
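The two-stage, multi-objective schedule could be expressed as below; the objective callables are placeholders for the self-supervised tasks, which the abstract does not enumerate:

```python
def pretrain(model, unimodal_batches, bimodal_batches,
             stage1_objectives, stage2_objectives, optimizer):
    # stage 1: pretrain on unimodal (code-only) data
    for batch in unimodal_batches:
        loss = sum(obj(model, batch) for obj in stage1_objectives)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # stage 2: pretrain on bimodal (code + text) data
    for batch in bimodal_batches:
        loss = sum(obj(model, batch) for obj in stage2_objectives)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```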
-
Patent number: 12056610
Abstract: A learning mechanism for partially-labeled web images is provided that corrects noisy labels during learning. Specifically, the mechanism employs a momentum prototype that represents the common characteristics of a specific class. One training objective is to minimize the difference between the normalized embedding of a training image sample and the momentum prototype of the corresponding class. Meanwhile, during the training process, the momentum prototype is used to generate a pseudo-label for the training image sample, which can then be used to identify and remove out-of-distribution (OOD) samples and to correct the noisy labels from the original partially-labeled training images. The momentum prototype for each class is in turn constantly updated based on the embeddings of new training samples and their pseudo-labels.
Type: Grant
Filed: August 28, 2020
Date of Patent: August 6, 2024
Assignee: Salesforce, Inc.
Inventors: Junnan Li, Chu Hong Hoi
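A hedged sketch of the prototype loop: the OOD threshold, the momentum decay, and using prototype similarity directly as the training objective are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mopro_step(embed, prototypes, momentum=0.999, ood_threshold=0.3):
    """embed: (B, D) L2-normalized embeddings of new training samples;
    prototypes: (C, D) L2-normalized momentum prototypes, one per class."""
    sim = embed @ prototypes.t()                  # similarity to each prototype
    pseudo = sim.argmax(dim=1)                    # pseudo-label per sample
    confidence = sim.max(dim=1).values
    keep = confidence >= ood_threshold            # drop OOD (low-confidence) samples

    # objective: minimize the distance between each kept embedding and
    # the momentum prototype of its (pseudo-labeled) class
    kept_sim = sim[keep].gather(1, pseudo[keep].unsqueeze(1))
    loss = (1 - kept_sim).mean()

    # momentum update of prototypes from kept samples and pseudo-labels
    with torch.no_grad():
        for i in torch.nonzero(keep).flatten():
            c = pseudo[i]
            prototypes[c] = F.normalize(
                momentum * prototypes[c] + (1 - momentum) * embed[i], dim=0)
    return loss, pseudo, keep
```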
-
Patent number: 12039772
Abstract: The present invention provides a processing system (10) including: a sample image generation unit (11) that generates a plurality of sample images, each being associated with a partial region of a first image generated using a first lens; an estimation unit (12) that generates an image content estimation result indicating a content for each of the sample images using an estimation model generated by machine learning using a second image generated using a second lens differing from the first lens; a task execution unit (14) that estimates a relative positional relationship of a plurality of the sample images in the first image; a determination unit (15) that determines whether an estimation result of the relative positional relationship is correct; and a correction unit (16) that corrects a value of a parameter of the estimation model when the estimation result of the relative positional relationship is determined to be incorrect.
Type: Grant
Filed: April 5, 2019
Date of Patent: July 16, 2024
Assignee: NEC CORPORATION
Inventors: Jianquan Liu, Junnan Li
-
Patent number: 11989941
Abstract: Embodiments described herein provide a method of video-text pre-training to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework provides a video-and-language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
Type: Grant
Filed: December 30, 2021
Date of Patent: May 21, 2024
Assignee: Salesforce, Inc.
Inventors: Dongxu Li, Junnan Li, Chu Hong Hoi
-
Publication number: 20240161520
Abstract: Embodiments described herein provide a multimodal vision-language model. The multimodal vision-language model contains a Generalist Multimodal Transformer capable of completing multiple tasks using the same set of parameters learned during pre-training. The Generalist Multimodal Transformer allows alignment between frozen unimodal encoders, such as image encoders, and large language models, and eliminates the need for fine-tuning the image encoders and large language models.
Type: Application
Filed: January 27, 2023
Publication date: May 16, 2024
Inventors: Junnan Li, Chu Hong Hoi
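A sketch of bridging frozen unimodal models with a small trainable transformer module; the class name, dimensions, and query count below are illustrative, not values from the publication:

```python
import torch
import torch.nn as nn

class GeneralistBridge(nn.Module):
    """Trainable bridge between a frozen image encoder and a frozen LLM:
    learned queries cross-attend to frozen image features and are
    projected into the LLM's embedding space as soft prompts."""
    def __init__(self, img_dim=1024, hidden=768, llm_dim=4096, n_query=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_query, hidden) * 0.02)
        self.cross_attn = nn.MultiheadAttention(
            hidden, 8, kdim=img_dim, vdim=img_dim, batch_first=True)
        self.proj = nn.Linear(hidden, llm_dim)   # into the frozen LLM's space

    def forward(self, frozen_img_feats):          # (B, N, img_dim), no grads needed
        q = self.queries.expand(frozen_img_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, frozen_img_feats, frozen_img_feats)
        return self.proj(out)                     # (B, n_query, llm_dim) soft prompts
```

Because the image encoder and language model stay frozen, only the bridge's parameters receive gradients, which is what removes the need to fine-tune them.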