Patents by Inventor Kaizhi Qian

Kaizhi Qian has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Self-supervised speech recognition

Patent number: 12211491

Abstract: One or more computer processors obtain an initial subnetwork at a target sparsity and an initial pruning mask from a pre-trained self-supervised learning (SSL) speech model. The one or more computer processors finetune the initial subnetwork, comprising: the one or more computer processors zero out one or more masked weights in the initial subnetwork specified by the initial pruning mask; the one or more computer processors train a new subnetwork from the zeroed out subnetwork; the one or more computer processors prune one or more weights of lowest magnitude in the new subnetwork regardless of network structure to satisfy the target sparsity. The one or more computer processors classify an audio segment with the finetuned subnetwork.
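The zero-out, train, and re-prune cycle described in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the flat weight list, and the `updates` list standing in for a training step are all hypothetical, and the sketch assumes that zeroed weights remain trainable between pruning steps.

```python
def apply_mask(weights, mask):
    """Zero out every weight whose mask entry is 0 (pruned)."""
    return [w * m for w, m in zip(weights, mask)]


def magnitude_prune(weights, target_sparsity):
    """Return a 0/1 mask that prunes the lowest-magnitude weights.

    Unstructured: weights are ranked globally by absolute value,
    regardless of which layer or row they belong to.
    """
    n_prune = int(len(weights) * target_sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def finetune_round(weights, mask, updates, target_sparsity):
    """One hypothetical round: zero out, 'train', then re-prune."""
    zeroed = apply_mask(weights, mask)
    # Stand-in for real training: add a fixed update to each weight.
    trained = [w + u for w, u in zip(zeroed, updates)]
    new_mask = magnitude_prune(trained, target_sparsity)
    return apply_mask(trained, new_mask), new_mask


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
initial_mask = magnitude_prune(weights, target_sparsity=0.5)
updates = [0.0, 0.5, -0.3, 0.0, 0.0, 0.0]  # illustrative values only
new_weights, new_mask = finetune_round(weights, initial_mask, updates, 0.5)
```

In this toy run the second weight is masked initially but survives the second pruning step after its update, while the third weight is pruned instead; the number of zeroed weights stays at the target sparsity throughout.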

Type: Grant

Filed: May 9, 2022

Date of Patent: January 28, 2025

Assignee: International Business Machines Corporation

Inventors: Cheng-I Lai, Yang Zhang, Kaizhi Qian, Chuang Gan, James R. Glass, Alexander Haojan Liu

DETECTING ACTIONS IN VIDEO USING MACHINE LEARNING AND BASED ON BIDIRECTIONAL FEEDBACK BETWEEN PREDICTED TYPE AND PREDICTED EXTENT

Publication number: 20240303508

Abstract: Techniques of video processing for action detection using machine learning. An action depicted in a video is identified. A type of the action is predicted based on a classification module of one or more machine learning models. A video clip depicting the action is predicted in the video. To that end, a starting point and an ending point of the video clip in the video are determined. The video clip is predicted based on a localization module of the one or more machine learning models. A refinement is performed that includes refining the type of the action based on the video clip or refining the video clip based on the type of the action. An indication of the refined type or of the refined video clip is output.

Type: Application

Filed: March 8, 2023

Publication date: September 12, 2024

Inventors: Bo Wu, Chuang Gan, Kaizhi Qian, Pin-Yu Chen

Global prosody style transfer without text transcriptions

Patent number: 11996083

Abstract: A computer-implemented method is provided of using a machine learning model for disentanglement of prosody in spoken natural language. The method includes encoding, by a computing device, the spoken natural language to produce content code. The method further includes resampling, by the computing device without text transcriptions, the content code to obscure the prosody by applying an unsupervised technique to the machine learning model to generate prosody-obscured content code. The method additionally includes decoding, by the computing device, the prosody-obscured content code to synthesize speech indirectly based upon the content code.

Type: Grant

Filed: June 3, 2021

Date of Patent: May 28, 2024

Assignee: International Business Machines Corporation

Inventors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Jinjun Xiong, Chuang Gan, David Cox

SELF-SUPERVISED SPEECH REPRESENTATIONS BY DISENTANGLING SPEAKERS

Publication number: 20240170007

Abstract: A method, computer system and computer program product is presented for providing a self-supervised speech representation. In one embodiment, audio input is received including speech utterances. A label sequence is generated from these speech utterances by a teacher label generator. A speech representation is generated of a partially masked version of the speech utterance using a speech representation network. The speech utterance is passed into two random transformations that alter only speaker information prior to the partial masking. A predictor will then predict the label sequence. In one embodiment performance-based assessment is made on a cross-entropy loss between the generated label sequence and a predicted label sequence.
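The cross-entropy loss between the generated label sequence and the predicted label sequence, mentioned at the end of this abstract, can be sketched as follows. This is an illustration under stated assumptions, not the method claimed in the application: the function name, the three-label vocabulary, and the per-frame probability lists are hypothetical.

```python
import math


def sequence_cross_entropy(labels, predicted_probs):
    """Mean cross-entropy over a label sequence.

    labels: one integer teacher label per frame.
    predicted_probs: for each frame, the predictor's probabilities
    over the label vocabulary.
    """
    total = 0.0
    for label, probs in zip(labels, predicted_probs):
        # Penalize low probability assigned to the teacher's label.
        total += -math.log(probs[label])
    return total / len(labels)


# Hypothetical teacher labels and predictor outputs for three frames.
teacher_labels = [2, 0, 1]
predicted = [
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
]
loss = sequence_cross_entropy(teacher_labels, predicted)
```

The loss shrinks toward zero as the predictor puts more probability on the label the teacher generated for each frame.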

Type: Application

Filed: November 7, 2022

Publication date: May 23, 2024

Inventors: Kaizhi Qian, Yang Zhang, Chuang Gan, Dakuo Wang, Bo Wu

Audio Understanding with Fixed Language Models

Publication number: 20240127001

Abstract: Techniques for audio understanding using fixed language models are provided. In one aspect, a system for performing audio understanding tasks includes: a fixed text embedder for, on receipt of a prompt sequence having (e.g., from 0-10) demonstrations of an audio understanding task followed by a new question, converting the prompt sequence into text embeddings; a pretrained audio encoder for converting the prompt sequence into audio embeddings; and a fixed autoregressive language model for answering the new question using the text embeddings and the audio embeddings. A method for performing audio understanding tasks is also provided.

Type: Application

Filed: October 12, 2022

Publication date: April 18, 2024

Inventors: Kaizhi Qian, Yang Zhang, Chuang Gan, Bo Wu, Zhenfang Chen

Skeleton-based action recognition using bi-directional spatial-temporal transformer

Patent number: 11854305

Abstract: A bi-directional spatial-temporal transformer neural network (BDSTT) is trained to predict original coordinates of a skeletal joint in a specific frame through relative relationships of the skeletal joint to other joints and to the state of the skeletal joint in other frames. Obtain a plurality of frames comprising coordinates of the skeletal joint and coordinates of other joints. Produce a spatially masked frame by masking the original coordinates of the skeletal joint. Provide the specific frame, the spatially masked frame, and at least one more frame to a coordinate prediction head of the BDSTT. Obtain, from the coordinate prediction head, a prediction of coordinates for the skeletal joint. Adjust parameters of the BDSTT until a mean-squared error, between the prediction of coordinates for the skeletal joint and the original coordinates of the skeletal joint, converges.
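The spatial masking step and the mean-squared error objective in this abstract can be illustrated with a small sketch. The helper names, the 3D coordinate tuples, the all-zero mask value, and the hard-coded prediction are assumptions for illustration; the abstract does not specify them, and no transformer is modeled here.

```python
def mask_joint(frame, joint_index, mask_value=(0.0, 0.0, 0.0)):
    """Return a copy of the frame with one joint's coordinates masked."""
    masked = list(frame)
    masked[joint_index] = mask_value
    return masked


def mean_squared_error(predicted, original):
    """Mean of squared differences across coordinate components."""
    return sum((p - o) ** 2 for p, o in zip(predicted, original)) / len(original)


# One frame with three joints, each an (x, y, z) coordinate.
frame = [(0.0, 1.0, 0.0), (0.5, 1.5, 0.0), (1.0, 2.0, 0.0)]
masked_frame = mask_joint(frame, joint_index=1)

# Hypothetical output of a coordinate prediction head for the masked joint.
prediction = (0.5, 1.0, 0.0)
error = mean_squared_error(prediction, frame[1])
```

In training as the abstract describes it, model parameters would be adjusted until this error converges; the sketch only shows how a single masked frame and its error term could be formed.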

Type: Grant

Filed: May 9, 2021

Date of Patent: December 26, 2023

Assignee: International Business Machines Corporation

Inventors: Bo Wu, Chuang Gan, Dakuo Wang, Kaizhi Qian

SELF-SUPERVISED SPEECH RECOGNITION

Publication number: 20230360642

Abstract: One or more computer processors obtain an initial subnetwork at a target sparsity and an initial pruning mask from a pre-trained self-supervised learning (SSL) speech model. The one or more computer processors finetune the initial subnetwork, comprising: the one or more computer processors zero out one or more masked weights in the initial subnetwork specified by the initial pruning mask; the one or more computer processors train a new subnetwork from the zeroed out subnetwork; the one or more computer processors prune one or more weights of lowest magnitude in the new subnetwork regardless of network structure to satisfy the target sparsity. The one or more computer processors classify an audio segment with the finetuned subnetwork.

Type: Application

Filed: May 9, 2022

Publication date: November 9, 2023

Inventors: Cheng-I Lai, Yang Zhang, Kaizhi Qian, Chuang Gan, James R. Glass, Alexander Haojan Liu

GLOBAL PROSODY STYLE TRANSFER WITHOUT TEXT TRANSCRIPTIONS

Publication number: 20220392429

Abstract: A computer-implemented method is provided of using a machine learning model for disentanglement of prosody in spoken natural language. The method includes encoding, by a computing device, the spoken natural language to produce content code. The method further includes resampling, by the computing device without text transcriptions, the content code to obscure the prosody by applying an unsupervised technique to the machine learning model to generate prosody-obscured content code. The method additionally includes decoding, by the computing device, the prosody-obscured content code to synthesize speech indirectly based upon the content code.

Type: Application

Filed: June 3, 2021

Publication date: December 8, 2022

Inventors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Jinjun Xiong, Chuang Gan, David Cox

SKELETON-BASED ACTION RECOGNITION USING BI-DIRECTIONAL SPATIAL-TEMPORAL TRANSFORMER

Publication number: 20220374629

Abstract: A bi-directional spatial-temporal transformer neural network (BDSTT) is trained to predict original coordinates of a skeletal joint in a specific frame through relative relationships of the skeletal joint to other joints and to the state of the skeletal joint in other frames. Obtain a plurality of frames comprising coordinates of the skeletal joint and coordinates of other joints. Produce a spatially masked frame by masking the original coordinates of the skeletal joint. Provide the specific frame, the spatially masked frame, and at least one more frame to a coordinate prediction head of the BDSTT. Obtain, from the coordinate prediction head, a prediction of coordinates for the skeletal joint. Adjust parameters of the BDSTT until a mean-squared error, between the prediction of coordinates for the skeletal joint and the original coordinates of the skeletal joint, converges.

Type: Application

Filed: May 9, 2021

Publication date: November 24, 2022

Inventors: Bo Wu, Chuang Gan, Dakuo Wang, Kaizhi Qian

Unsupervised speech decomposition

Patent number: 11295762

Abstract: A method, a structure, and a computer system for decomposing speech. The exemplary embodiments may include one or more encoders for generating one or more encodings of a speech input comprising rhythm information, pitch information, timbre information, and content information, and a decoder for decoding the one or more encodings.

Type: Grant

Filed: April 20, 2020

Date of Patent: April 5, 2022

Assignee: International Business Machines Corporation

Inventors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Chuang Gan, David Cox

UNSUPERVISED SPEECH DECOMPOSITION

Publication number: 20210327460

Abstract: A method, a structure, and a computer system for decomposing speech. The exemplary embodiments may include one or more encoders for generating one or more encodings of a speech input comprising rhythm information, pitch information, timbre information, and content information, and a decoder for decoding the one or more encodings.

Type: Application

Filed: April 20, 2020

Publication date: October 21, 2021

Inventors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Chuang Gan, David Cox

Deep learning algorithms for heartbeats detection

Patent number: 10709390

Abstract: A system that detects heartbeats includes a sensor or a transducer and algorithms based on deep learning. The algorithms employ techniques of artificial intelligence that enable the system to extract heartbeat features under low signal-to-noise-ratio (SNR) conditions when a user is exercising. The algorithms can be applied to various technologies for heart rate monitoring such as ultrasound Doppler, photoplethysmogram (PPG), electrocardiogram (EKG), acoustic, pressure/force sensing and laser/RF Doppler, among other types of sensing methods.

Type: Grant

Filed: November 21, 2017

Date of Patent: July 14, 2020

Assignee: LOGOS CARE, INC.

Inventors: Kaizhi Qian, Yang Zhang, Thomas Y. Lo

DEEP LEARNING ALGORITHMS FOR HEARTBEATS DETECTION

Publication number: 20180249964

Abstract: A system that detects heartbeats includes a sensor or a transducer and algorithms based on deep learning. The algorithms employ techniques of artificial intelligence that enable the system to extract heartbeat features under low signal-to-noise-ratio (SNR) conditions when a user is exercising. The algorithms can be applied to various technologies for heart rate monitoring such as ultrasound Doppler, photoplethysmogram (PPG), electrocardiogram (EKG), acoustic, pressure/force sensing and laser/RF Doppler, among other types of sensing methods.

Type: Application

Filed: November 21, 2017

Publication date: September 6, 2018

Inventors: Kaizhi Qian, Yang Zhang, Thomas Y. Lo