Patents by Inventor Jiahui YU
Jiahui YU has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12282857
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training neural networks through contrastive learning. In particular, the contrastive learning is modified to use a relative margin to adjust a training pair's contribution to optimization.
Type: Grant
Filed: September 27, 2024
Date of Patent: April 22, 2025
Assignee: Google LLC
Inventors: Siyuan Qiao, Chenxi Liu, Jiahui Yu, Yonghui Wu
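The abstract does not specify how the relative margin enters the objective; below is a minimal PyTorch sketch of one plausible reading, where a margin is subtracted from the positive-pair logits of an InfoNCE-style loss so that easy pairs contribute less to optimization. The margin and temperature values, and the placement of the margin on the diagonal, are illustrative assumptions rather than details from the patent.

```python
import torch
import torch.nn.functional as F

def margin_contrastive_loss(a, b, margin=0.1, temperature=0.07):
    """InfoNCE-style contrastive loss with a margin subtracted from each
    positive pair's logit (assumed placement; not taken from the patent).
    a, b: [batch, dim] embeddings of paired views."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # [batch, batch] similarities
    # Shrink only the diagonal (positive-pair) logits by the margin.
    logits = logits - (margin / temperature) * torch.eye(len(a))
    targets = torch.arange(len(a))
    return F.cross_entropy(logits, targets)

# Toy usage: two views of 8 items embedded in 16 dimensions.
loss = margin_contrastive_loss(torch.randn(8, 16), torch.randn(8, 16))
```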
-
Publication number: 20250124708
Abstract: Provided is an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. Some example implementations include a model which can be referred to as VideoCoCa. Example implementations reuse a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, a cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, aspects of the present disclosure leverage the finding that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to “flattened frame embeddings”, yielding a strong zero-shot transfer baseline for many video-text tasks.
Type: Application
Filed: December 8, 2023
Publication date: April 17, 2025
Inventors: Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Jiahui Yu
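A rough sketch of the "flattened frame embeddings" idea: per-frame tokens from a (hypothetical) frozen image encoder are flattened into one long sequence across frames, and a small set of learned queries attends over it, CoCa-style. The module name, sizes, and query count below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """Learned queries cross-attend over a flat sequence of frame tokens
    (a sketch of the CoCa-style pooling VideoCoCa reuses; sizes illustrative)."""
    def __init__(self, dim=256, num_queries=8, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                        # tokens: [batch, seq, dim]
        q = self.queries.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)      # [batch, num_queries, dim]
        return pooled

# Per-frame tokens from a stand-in frozen image encoder:
# 2 videos x 4 frames x 16 tokens x 256 dims, flattened across frames.
frame_tokens = torch.randn(2, 4, 16, 256).flatten(1, 2)   # [2, 64, 256]
pooled = AttentionalPooler()(frame_tokens)
```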
-
Publication number: 20250118291
Abstract: Methods, computer systems, and apparatus, including computer programs encoded on computer storage media, for training an audio-processing neural network that includes at least (1) a first encoder network having a first set of encoder network parameters and (2) a decoder network having a set of decoder network parameters. The system obtains a set of un-labeled audio data segments, and generates, from the set of un-labeled audio data segments, a set of encoder training examples. The system performs training of a second encoder neural network that includes at least the first encoder neural network on the set of generated encoder training examples. The system also obtains one or more labeled training examples, and performs training of the audio-processing neural network on the labeled training examples.
Type: Application
Filed: January 30, 2023
Publication date: April 10, 2025
Inventors: Chung-Cheng CHIU, Weikeng QIN, Jiahui YU, Yonghui WU, Yu ZHANG
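The two-stage recipe the abstract describes, pretraining an encoder on unlabeled audio and then training the full network on labeled data, might look roughly like the following. The mask-and-reconstruct objective and every module here are generic stand-ins, not the patent's specific method.

```python
import torch
import torch.nn as nn

enc = nn.GRU(80, 128, batch_first=True)            # stand-in first encoder
head = nn.Linear(128, 80)                          # reconstruction head (stage 1 only)
dec = nn.Linear(128, 32)                           # stand-in decoder (32 output labels)

# Stage 1: self-supervised pretraining on unlabeled audio segments.
opt = torch.optim.Adam(list(enc.parameters()) + list(head.parameters()))
for _ in range(3):
    x = torch.randn(4, 50, 80)                     # unlabeled log-mel segments
    masked = x.clone()
    masked[:, 20:30] = 0.0                         # mask a span of frames
    h, _ = enc(masked)
    loss = ((head(h) - x)[:, 20:30] ** 2).mean()   # predict the masked span
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: supervised training of the full audio-processing network.
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
x, y = torch.randn(4, 50, 80), torch.randint(0, 32, (4, 50))
h, _ = enc(x)
loss = nn.functional.cross_entropy(dec(h).transpose(1, 2), y)
loss.backward(); opt.step()
```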
-
Publication number: 20250111235
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training neural networks through contrastive learning. In particular, the contrastive learning is modified to use a relative margin to adjust a training pair's contribution to optimization.
Type: Application
Filed: September 27, 2024
Publication date: April 3, 2025
Inventors: Siyuan Qiao, Chenxi Liu, Jiahui Yu, Yonghui Wu
-
Publication number: 20250111671
Abstract: Methods and systems for media item characterization based on multimodal embeddings are provided herein. A media item including a sequence of video frames is identified. A set of video embeddings representing visual features of the sequence of video frames is obtained. A set of audio embeddings representing audio features of the sequence of video frames is obtained. A set of audiovisual embeddings is generated based on the set of video embeddings and the set of audio embeddings. Each of the set of audiovisual embeddings represents a visual feature and an audio feature of a respective video frame of the sequence of video frames. One or more media characteristics associated with the media item are determined based on the set of audiovisual embeddings.
Type: Application
Filed: September 27, 2024
Publication date: April 3, 2025
Inventors: Tao Zhu, Jiahui Yu, Jingchen Feng, Kai Chen, Pooya Abolghasemi, Gagan Bansal, Jieren Xu, Hui Miao, Yaping Zhang, Shuchao Bi, Yonghui Wu, Claire Cui, Rohan Anil
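A minimal sketch of the fusion step: per-frame video and audio embeddings are combined into one audiovisual embedding per frame, which then drives a media-characteristic prediction. Concatenate-and-project fusion, mean pooling, and all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AudiovisualFuser(nn.Module):
    """Fuses per-frame video and audio embeddings into one audiovisual
    embedding per frame (concatenate-and-project is an assumed choice)."""
    def __init__(self, video_dim=128, audio_dim=64, out_dim=128, num_labels=10):
        super().__init__()
        self.proj = nn.Linear(video_dim + audio_dim, out_dim)
        self.classifier = nn.Linear(out_dim, num_labels)

    def forward(self, video_emb, audio_emb):       # [batch, frames, dim] each
        av = self.proj(torch.cat([video_emb, audio_emb], dim=-1))
        return self.classifier(av.mean(dim=1))     # pool frames, predict characteristics

logits = AudiovisualFuser()(torch.randn(2, 16, 128), torch.randn(2, 16, 64))
```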
-
Publication number: 20250054322
Abstract: Systems and methods for attribute recognition can include obtaining an image and a text string. The text string can be processed with a language model to generate a set of candidate attributes based on sequence-based prediction. The image and the candidate attributes can be processed with an image-text model to determine a likelihood that each respective candidate attribute is depicted in the image. The likelihood determination can then be utilized to determine a predicted attribute for the object of interest.
Type: Application
Filed: July 29, 2024
Publication date: February 13, 2025
Inventors: Keren Ye, Yicheng Zhu, Junjie Ke, Jiahui Yu, Leonidas John Guibas, Peyman Milanfar, Feng Yang
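The two-step flow might look like the sketch below: a language model proposes candidate attributes (stubbed here as a fixed list and random embeddings), and an image-text similarity score picks the best-supported one. Cosine similarity over embeddings is a stand-in for the patent's image-text model.

```python
import torch
import torch.nn.functional as F

def predict_attribute(image_emb, candidate_embs, candidates):
    """Scores language-model-proposed candidate attributes against an image
    embedding and returns the best match (cosine similarity is a stand-in
    for the image-text model)."""
    sims = F.cosine_similarity(image_emb.unsqueeze(0), candidate_embs, dim=-1)
    probs = sims.softmax(dim=0)
    best = int(probs.argmax())
    return candidates[best], float(probs[best])

# Hypothetical candidates a language model might propose for a chair's material:
candidates = ["wood", "metal", "plastic"]
attr, p = predict_attribute(torch.randn(64), torch.randn(3, 64), candidates)
```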
-
Publication number: 20250037426
Abstract: A method includes obtaining video datasets each including pairs of a training video and a ground-truth action classification of the training video. The method also includes generating an action recognition model that includes a shared encoder model and action classification heads. The number of action classification heads may be equal to the number of video datasets, and each action classification head may be configured to, based on an output of the shared encoder model, classify training videos sampled from a corresponding video dataset. The method also includes determining, by the action recognition model and for each training video sampled from the video datasets, an inferred action classification. The method further includes determining a loss value based on the inferred action classifications and the ground-truth action classifications, and adjusting parameters of the action recognition model based on the loss value.
Type: Application
Filed: December 9, 2022
Publication date: January 30, 2025
Inventors: Bowen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M. Dai, Ruoming Pang, Fei Sha
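A compact sketch of the architecture as described: one shared encoder, one classification head per dataset, with each sampled video routed to the head of the dataset it came from. The encoder, head shapes, and class counts are illustrative stand-ins.

```python
import torch
import torch.nn as nn

class MultiDatasetActionModel(nn.Module):
    """Shared encoder with one action-classification head per video dataset
    (the mean-pooling encoder and all sizes are illustrative stand-ins)."""
    def __init__(self, dim=64, classes_per_dataset=(10, 20, 5)):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(32, dim), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(dim, c) for c in classes_per_dataset])

    def forward(self, frames, dataset_idx):        # frames: [batch, time, 32]
        h = self.encoder(frames).mean(dim=1)       # shared representation
        return self.heads[dataset_idx](h)          # route to that dataset's head

model = MultiDatasetActionModel()
logits = model(torch.randn(4, 8, 32), dataset_idx=1)   # videos from dataset 1
loss = nn.functional.cross_entropy(logits, torch.randint(0, 20, (4,)))
```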
-
Patent number: 12190869
Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers, each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.
Type: Grant
Filed: September 29, 2022
Date of Patent: January 7, 2025
Assignee: Google LLC
Inventors: Tara N. Sainath, Rami Botros, Anmol Gulati, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu
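For intuition about the linear attention this abstract mentions, here is a non-causal linear-attention sketch in which a kernel feature map replaces the softmax, so cost grows linearly in sequence length. The ReLU feature map is a common simplification of the Performer's random-feature kernel; a streaming causal variant would accumulate the key-value statistics as running prefix sums instead.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Linear attention: phi(q) (phi(k)^T v) instead of softmax(q k^T) v.
    The ReLU feature map phi is a simplification, not the patent's kernel."""
    q, k = torch.relu(q), torch.relu(k)            # feature maps phi(.)
    kv = torch.einsum("btd,bte->bde", k, v)        # [batch, d, e], linear in T
    z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + eps)
    return torch.einsum("btd,bde,bt->bte", q, kv, z)

out = linear_attention(torch.randn(2, 100, 16), torch.randn(2, 100, 16),
                       torch.randn(2, 100, 16))   # [2, 100, 16]
```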
-
Publication number: 20240404238
Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pre-training a machine learning model (e.g., a Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN, from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
Type: Application
Filed: October 5, 2022
Publication date: December 5, 2024
Inventors: Jiahui Yu, Vijay Vasudevan, Alexander Yeong-Shiuh Ku, Yonghui Wu, Jason Michael Baldridge, Yuanzhong Xu, Jing Yu Koh, Thang Minh Luong, Gunjan Baid, Zirui Wang, Han Zhang, Xin Li
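The autoregressive stage of this approach, predicting rasterized image tokens one by one, might look like the sketch below. The ViT-VQGAN tokenizer is stubbed out with random token ids, and the codebook and model sizes are illustrative.

```python
import torch
import torch.nn as nn

class TokenAutoregressor(nn.Module):
    """Stage-2 sketch of vector-quantized image modeling: predict the next
    discrete image token given the previous ones (tokenizer stubbed out)."""
    def __init__(self, codebook_size=1024, dim=128):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, codebook_size)

    def forward(self, tokens):                     # tokens: [batch, seq]
        t = tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.out(self.trunk(self.embed(tokens), mask=causal))

tokens = torch.randint(0, 1024, (2, 64))           # stand-in rasterized image tokens
logits = TokenAutoregressor()(tokens[:, :-1])      # predict each next token
loss = nn.functional.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])
```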
-
Patent number: 12154581
Abstract: An automated speech recognition (ASR) model includes a first encoder, a second encoder, and a decoder. The first encoder receives, as input, a sequence of acoustic frames, and generates, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The second encoder receives, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature frame. The decoder receives, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a first probability distribution over possible speech recognition hypotheses.
Type: Grant
Filed: April 21, 2021
Date of Patent: November 26, 2024
Assignee: Google LLC
Inventors: Arun Narayanan, Tara Sainath, Chung-Cheng Chiu, Ruoming Pang, Rohit Prabhavalkar, Jiahui Yu, Ehsan Variani, Trevor Strohman
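The cascade in this abstract maps naturally onto two stacked encoders feeding one decoder, as in the sketch below; the GRU and linear modules are simple stand-ins for the patent's encoders and decoder.

```python
import torch
import torch.nn as nn

class CascadedEncoderASR(nn.Module):
    """Two cascaded encoders feeding one decoder: encoder1 -> first
    higher-order features -> encoder2 -> second higher-order features ->
    decoder. All modules are simplified stand-ins."""
    def __init__(self, feat=80, dim=128, vocab=64):
        super().__init__()
        self.encoder1 = nn.GRU(feat, dim, batch_first=True)
        self.encoder2 = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.Linear(dim, vocab)

    def forward(self, frames):                     # frames: [batch, time, feat]
        h1, _ = self.encoder1(frames)              # first higher-order features
        h2, _ = self.encoder2(h1)                  # second higher-order features
        return self.decoder(h2).log_softmax(-1)   # per-step hypothesis distribution

log_probs = CascadedEncoderASR()(torch.randn(2, 50, 80))   # [2, 50, 64]
```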
-
Publication number: 20240362453
Abstract: Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
Type: Application
Filed: July 8, 2024
Publication date: October 31, 2024
Inventors: Anmol Gulati, Weikeng Qin, Zhengdong Zhang, Ruoming Pang, Niki Parmar, Jiahui Yu, Wei Han, Chung-Cheng Chiu, Yu Zhang, Yonghui Wu, Shibo Wang
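A simplified Conformer block in the macaron arrangement the abstract describes: a half-step feed-forward block, self-attention for global interactions, a depthwise convolution for local correlations, and a second half-step feed-forward block. Gating, relative positional encoding, and dropout are omitted, and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Simplified macaron-style Conformer block sketch (details omitted)."""
    def __init__(self, dim=144, heads=4, kernel=15):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.norm_attn = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(dim)
        self.dwconv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.ff2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.norm_out = nn.LayerNorm(dim)

    def forward(self, x):                          # x: [batch, time, dim]
        x = x + 0.5 * self.ff1(x)                  # half-step feed-forward
        q = self.norm_attn(x)
        x = x + self.attn(q, q, q)[0]              # global interactions
        c = self.norm_conv(x).transpose(1, 2)      # conv expects [batch, dim, time]
        x = x + self.dwconv(c).transpose(1, 2)     # local correlations
        x = x + 0.5 * self.ff2(x)                  # second half-step feed-forward
        return self.norm_out(x)

y = ConformerBlock()(torch.randn(2, 50, 144))      # shape preserved: [2, 50, 144]
```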
-
Patent number: 12094453
Abstract: A computer-implemented method of training a streaming speech recognition model that includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames. The streaming speech recognition model is configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens. The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability of emitting one of the label tokens and determining a second probability of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.
Type: Grant
Filed: September 9, 2021
Date of Patent: September 17, 2024
Assignee: Google LLC
Inventors: Jiahui Yu, Chung-cheng Chiu, Bo Li, Shuo-yiin Chang, Tara Sainath, Wei Han, Anmol Gulati, Yanzhang He, Arun Narayanan, Yonghui Wu, Ruoming Pang
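A highly simplified sketch of the tuning idea: the sequence-level alignment log-probability gets an extra lam-weighted copy of the label-emission term, nudging the model toward emitting labels rather than blanks. A real transducer-style loss sums over all alignments; the single-alignment, per-step stand-in below is only for illustration, and lam is an assumed hyperparameter.

```python
import torch

def tuned_alignment_loss(label_logp, blank_logp, lam=0.01):
    """Alignment loss where the label-emission term gets extra weight lam.
    label_logp, blank_logp: [steps] per-step log-probabilities (toy inputs)."""
    base = (label_logp + blank_logp).sum()          # stand-in alignment log-prob
    return -(base + lam * label_logp.sum())         # boost the label term by lam

loss = tuned_alignment_loss(torch.log(torch.rand(10)), torch.log(torch.rand(10)))
```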
-
Patent number: 12079703
Abstract: Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
Type: Grant
Filed: December 31, 2020
Date of Patent: September 3, 2024
Assignee: GOOGLE LLC
Inventors: Anmol Gulati, Ruoming Pang, Niki Parmar, Jiahui Yu, Wei Han, Chung-Cheng Chiu, Yu Zhang, Yonghui Wu, Shibo Wang, Weikeng Qin, Zhengdong Zhang
-
Publication number: 20240112088
Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pretraining a machine learning model (e.g., a Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN, from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
Type: Application
Filed: November 27, 2023
Publication date: April 4, 2024
Inventors: Jiahui Yu, Xin Li, Han Zhang, Vijay Vasudevan, Alexander Yeong-Shiuh Ku, Jason Michael Baldridge, Yuanzhong Xu, Jing Yu Koh, Thang Minh Luong, Gunjan Baid, Zirui Wang, Yonghui Wu
-
Publication number: 20230351149
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing multi-modal inputs using contrastive captioning neural networks.
Type: Application
Filed: April 28, 2023
Publication date: November 2, 2023
Inventors: Jiahui Yu, Zirui Wang, Vijay Vasudevan, Ho Man Yeung, Seyed Mojtaba Seyedhosseini Tarzjani, Yonghui Wu
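A contrastive captioning network of this kind typically trains with an image-text contrastive term plus a captioning (next-token) term. The sketch below shows one assumed combination; the weight w, temperature, and toy shapes are not from the patent.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(img_emb, txt_emb, caption_logits, caption_ids, w=1.0):
    """Combined objective sketch for a contrastive captioner: symmetric
    InfoNCE over paired image/text embeddings plus a captioning term,
    mixed with an assumed weight w."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / 0.07                  # [batch, batch] similarities
    targets = torch.arange(len(img))
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2
    captioning = F.cross_entropy(caption_logits.transpose(1, 2), caption_ids)
    return contrastive + w * captioning

loss = coca_style_loss(torch.randn(4, 64), torch.randn(4, 64),
                       torch.randn(4, 12, 1000), torch.randint(0, 1000, (4, 12)))
```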
-
Publication number: 20230281400
Abstract: Example embodiments of the present disclosure relate to systems and methods for pretraining image-processing models on weakly-supervised image-text pairs. The pretraining can include receiving a training sequence for the machine-learned image-processing model. The training sequence can include text tokens and image tokens. A prefix sequence can contain the image tokens. A remainder sequence can include a remainder set of the text tokens. The pretraining can include determining, using the prefix sequence as an input to the machine-learned image-processing model, an objective based on recovery of the remainder sequence. The pretraining can include updating one or more learnable parameters of the machine-learned image-processing model based on the objective.
Type: Application
Filed: March 3, 2022
Publication date: September 7, 2023
Inventors: Zirui Wang, Jiahui Yu, Yuan Cao, Wei Yu, Zihang Dai
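The prefix/remainder split might be wired up as follows: image tokens plus the leading text tokens form the prefix, and the model is trained to recover the remaining text tokens. The tiny GRU "model", the projection, and all sizes are stand-ins for the machine-learned image-processing model, not its actual architecture.

```python
import torch
import torch.nn as nn

# PrefixLM-style pretraining sketch: the prefix (image tokens + leading text
# tokens) conditions generation of the remainder text tokens.
vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)
img_proj = nn.Linear(32, dim)                      # projects image patch features
model = nn.GRU(dim, dim, batch_first=True)        # stand-in sequence model
out = nn.Linear(dim, vocab)

image_tokens = img_proj(torch.randn(2, 16, 32))    # [batch, img_seq, dim]
text = torch.randint(0, vocab, (2, 12))
prefix_len = 4                                     # leading text kept in the prefix
prefix = torch.cat([image_tokens, embed(text[:, :prefix_len])], dim=1)
remainder = text[:, prefix_len:]                   # tokens to recover

h_prefix, state = model(prefix)                    # encode the prefix
h, _ = model(embed(remainder[:, :-1]), state)      # teacher-forced decoding
logits = out(torch.cat([h_prefix[:, -1:], h], dim=1))   # predict each remainder token
loss = nn.functional.cross_entropy(logits.transpose(1, 2), remainder)
```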
-
Publication number: 20230237993
Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label and a difference between the contextual and streaming prediction data. The operations can include adjusting parameters of the speech recognition model.
Type: Application
Filed: October 1, 2021
Publication date: July 27, 2023
Inventors: Jiahui Yu, Ruoming Pang, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Wu
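One plausible reading of this training loop: the contextual (full-context) mode is supervised by the ground truth, and the streaming mode is supervised by both the ground truth and the contextual mode's predictions via a distillation term. The model(x, mode=...) interface, and the toy model that zeroes the backward direction in streaming mode, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dual_mode_step(model, speech, label_ids):
    """One training step sketch for a multi-mode ASR model: the streaming
    mode learns from the ground truth and from the contextual mode's
    predictions (an assumed in-place distillation scheme)."""
    ctx_logits = model(speech, mode="contextual")
    loss_ctx = F.cross_entropy(ctx_logits.transpose(1, 2), label_ids)
    stream_logits = model(speech, mode="streaming")
    loss_stream = F.cross_entropy(stream_logits.transpose(1, 2), label_ids)
    distill = F.kl_div(stream_logits.log_softmax(-1),
                       ctx_logits.detach().softmax(-1), reduction="batchmean")
    return loss_ctx + loss_stream + distill

class ToyDualModeASR(torch.nn.Module):
    """Bidirectional GRU; streaming mode drops the backward (future) context."""
    def __init__(self, feat=80, vocab=32):
        super().__init__()
        self.rnn = torch.nn.GRU(feat, 64, batch_first=True, bidirectional=True)
        self.out = torch.nn.Linear(128, vocab)
    def forward(self, x, mode):
        h, _ = self.rnn(x)
        if mode == "streaming":                    # keep only the forward direction
            h = torch.cat([h[..., :64], torch.zeros_like(h[..., 64:])], dim=-1)
        return self.out(h)

loss = dual_mode_step(ToyDualModeASR(), torch.randn(2, 30, 80),
                      torch.randint(0, 32, (2, 30)))
```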
-
Publication number: 20230130634
Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers, each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.
Type: Application
Filed: September 29, 2022
Publication date: April 27, 2023
Applicant: Google LLC
Inventors: Tara N. Sainath, Rami Botros, Anmol Gulati, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu
-
Publication number: 20230107493
Abstract: A method includes receiving a sequence of input audio frames corresponding to an utterance captured by a user device, the utterance including a plurality of words. For each input audio frame, the method includes predicting, using a word boundary detection model configured to receive the sequence of input audio frames as input, whether the input audio frame is a word boundary. The method includes batching the input audio frames into a plurality of batches based on the input audio frames predicted as word boundaries, wherein each batch includes a corresponding plurality of batched input audio frames. For each of the plurality of batches, the method includes processing, using a speech recognition model, the corresponding plurality of batched input audio frames in parallel to generate a speech recognition result.
Type: Application
Filed: September 21, 2022
Publication date: April 6, 2023
Applicant: Google LLC
Inventors: Shaan Jagdeep Patrick Bijwadia, Tara N. Sainath, Jiahui Yu, Shuo-yiin Chang, Yanzhang He
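The batching step might look like the sketch below: frames the word-boundary model flags close out a batch, and each batch's frames can then be handed to the recognizer for parallel processing. The probability threshold is an assumed hyperparameter, and the random inputs stand in for a real boundary detector's output.

```python
import torch

def batch_at_word_boundaries(frames, boundary_probs, threshold=0.5):
    """Splits a stream of audio frames into batches that end at frames the
    word-boundary model flags. frames: [time, feat]; boundary_probs: [time]."""
    cuts = (boundary_probs > threshold).nonzero().flatten().tolist()
    batches, start = [], 0
    for cut in cuts:
        batches.append(frames[start:cut + 1])      # batch ends at the boundary
        start = cut + 1
    if start < len(frames):
        batches.append(frames[start:])             # trailing partial batch
    return batches

frames = torch.randn(20, 80)                       # toy log-mel frames
probs = torch.rand(20)                             # stand-in boundary probabilities
batches = batch_at_word_boundaries(frames, probs)
# Each batch's frames can now be processed by the speech recognizer in parallel.
```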
-
Publication number: 20220405579
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for selecting a neural network to perform a particular machine learning task while satisfying a set of constraints.
Type: Application
Filed: March 3, 2021
Publication date: December 22, 2022
Inventors: Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Mintzer Bender, Pieter-Jan Kindermans, Mingxing Tan, Xiaodan Song, Ruoming Pang, Quoc V. Le
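At selection time, a constraint-satisfying search could reduce to filtering candidate networks by the constraints and keeping the best performer, as in this sketch. The candidates, metrics, and constraint values are made up for illustration and are not search results from the patent.

```python
# Constraint-filtered selection sketch: among candidate architectures, keep
# those meeting every constraint and return the most accurate one.
candidates = [
    {"name": "net-a", "accuracy": 0.81, "latency_ms": 12.0, "params_m": 5.1},
    {"name": "net-b", "accuracy": 0.84, "latency_ms": 25.0, "params_m": 9.8},
    {"name": "net-c", "accuracy": 0.83, "latency_ms": 14.5, "params_m": 6.3},
]

def select(candidates, max_latency_ms, max_params_m):
    """Returns the most accurate candidate satisfying all constraints."""
    feasible = [c for c in candidates
                if c["latency_ms"] <= max_latency_ms
                and c["params_m"] <= max_params_m]
    return max(feasible, key=lambda c: c["accuracy"]) if feasible else None

best = select(candidates, max_latency_ms=15.0, max_params_m=8.0)   # -> net-c
```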