Patents by Inventor Yifan Gong

Yifan Gong has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 12205596
    Abstract: Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data for improved speech recognition models. A main model is trained with keyword independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. In some instances, the new TTS training data is obtained from a multi-speaker neural TTS system for a keyword that is underrepresented in the baseline training data. In some instances, the new TTS training data is used for pronunciation learning and normalization of keyword dependent confidence scores in keyword spotting (KWS) applications. In some instances, the new TTS training data is used for rapid speaker adaptation in speech recognition models.
    Type: Grant
    Filed: February 10, 2023
    Date of Patent: January 21, 2025
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Guoli Ye, Yan Huang, Wenning Wei, Lei He, Eva Sharma, Jian Wu, Yao Tian, Edward C. Lin, Yifan Gong, Rui Zhao, Jinyu Li, William Maxwell Gale
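The data-augmentation flow in the entry above can be pictured with a minimal sketch. The `synthesize` function and the speaker list are hypothetical stand-ins for a multi-speaker neural TTS system; the patent does not prescribe this API.

```python
# Sketch of TTS-based augmentation for an underrepresented keyword.
# `synthesize` is a hypothetical placeholder for a multi-speaker neural TTS system.

def synthesize(text: str, speaker_id: int) -> list[float]:
    """Placeholder TTS call; returns a waveform as a list of samples."""
    return [0.0] * 16000  # silence stands in for real synthesized audio

def augment_with_tts(baseline_data: list[tuple[list[float], str]],
                     keyword: str,
                     speaker_ids: list[int]) -> list[tuple[list[float], str]]:
    """Append synthetic (audio, transcript) pairs for a keyword that is
    underrepresented in the keyword-independent baseline training data."""
    synthetic = [(synthesize(keyword, spk), keyword) for spk in speaker_ids]
    return baseline_data + synthetic

augmented = augment_with_tts(baseline_data=[], keyword="cortana", speaker_ids=[1, 2, 3])
print(len(augmented))  # 3 synthetic utterances added
```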
  • Publication number: 20250005339
    Abstract: Representative embodiments disclose machine learning classifiers used in scenarios such as speech recognition, image captioning, machine translation, or other sequence-to-sequence embodiments. The machine learning classifiers have a plurality of time layers, each layer having a time processing block and a depth processing block. The time processing block is a recurrent neural network such as a Long Short Term Memory (LSTM) network. The depth processing blocks can be an LSTM network, a gated Deep Neural Network (DNN) or a maxout DNN. The depth processing blocks account for the hidden states of each time layer and use summarized layer information for final input signal feature classification. An attention layer can also be used between the top depth processing block and the output layer.
    Type: Application
    Filed: September 10, 2024
    Publication date: January 2, 2025
    Inventors: Jinyu LI, Liang LU, Changliang LIU, Yifan GONG
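A minimal PyTorch sketch of the time/depth layering described in the entry above: each time layer is an LSTM over frames, and a depth LSTM scans the per-layer hidden states before classification. Layer counts, sizes, and the omission of the optional attention layer are illustrative assumptions, not the patented configuration.

```python
import torch
import torch.nn as nn

class TimeDepthLSTM(nn.Module):
    """Sketch: time-layer LSTMs plus a depth LSTM that scans the hidden
    states of every time layer before classification."""
    def __init__(self, feat_dim=40, hidden=64, layers=3, classes=10):
        super().__init__()
        self.time_layers = nn.ModuleList(
            [nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
             for i in range(layers)])
        self.depth_block = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, classes)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        states, h = [], x
        for lstm in self.time_layers:          # recurrence over time
            h, _ = lstm(h)                     # (batch, time, hidden)
            states.append(h)
        depth_in = torch.stack(states, dim=2)  # (batch, time, layers, hidden)
        b, t, l, d = depth_in.shape
        depth_out, _ = self.depth_block(depth_in.reshape(b * t, l, d))
        summary = depth_out[:, -1, :].reshape(b, t, d)   # summarized layer information
        return self.out(summary)               # per-frame class scores

frames = torch.randn(2, 50, 40)                # 2 utterances, 50 frames of 40-dim features
print(TimeDepthLSTM()(frames).shape)           # torch.Size([2, 50, 10])
```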
  • Patent number: 12141080
    Abstract: A communication method, a related computing system, and a storage medium are described. The communication method is for a computing system that runs at least one process, the at least one process comprising a plurality of modules, and the method comprises: acquiring attribute information of each of the plurality of modules, wherein the plurality of modules at least comprise a first module and a second module; in response to determining that data is to be transmitted from the first module to the second module, comparing the attribute information of the first module with the attribute information of the second module; and selecting a communication channel for the first module and the second module according to the comparison, to transmit the data from the first module to the second module through the selected communication channel.
    Type: Grant
    Filed: November 14, 2022
    Date of Patent: November 12, 2024
    Assignee: Beijing Tusen Zhitu Technology Co., Ltd.
    Inventors: Yifan Gong, Jiangming Jin
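The channel-selection step in the entry above can be sketched with a small comparison of module attributes. The attribute fields and the three channel names below (shared memory within a process and device, an intra-process device copy, a socket across processes) are illustrative assumptions; the patent does not limit the attributes or channels to these.

```python
# Sketch of attribute-based channel selection between two modules.
from dataclasses import dataclass

@dataclass
class ModuleInfo:
    name: str
    process_id: int
    device: str          # e.g. "cpu" or "gpu" (illustrative attribute)

def select_channel(src: ModuleInfo, dst: ModuleInfo) -> str:
    """Compare attribute information and pick a channel for the transfer."""
    if src.process_id == dst.process_id and src.device == dst.device:
        return "shared_memory"       # same process and device: cheapest path
    if src.process_id == dst.process_id:
        return "device_copy"         # same process, different device
    return "socket"                  # different processes

a = ModuleInfo("perception", process_id=100, device="gpu")
b = ModuleInfo("planning", process_id=100, device="cpu")
print(select_channel(a, b))          # device_copy
```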
  • Patent number: 12086704
    Abstract: Representative embodiments disclose machine learning classifiers used in scenarios such as speech recognition, image captioning, machine translation, or other sequence-to-sequence embodiments. The machine learning classifiers have a plurality of time layers, each layer having a time processing block and a depth processing block. The time processing block is a recurrent neural network such as a Long Short Term Memory (LSTM) network. The depth processing blocks can be an LSTM network, a gated Deep Neural Network (DNN) or a maxout DNN. The depth processing blocks account for the hidden states of each time layer and use summarized layer information for final input signal feature classification. An attention layer can also be used between the top depth processing block and the output layer.
    Type: Grant
    Filed: November 3, 2021
    Date of Patent: September 10, 2024
    Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
    Inventors: Jinyu Li, Liang Lu, Changliang Liu, Yifan Gong
  • Patent number: 12014728
    Abstract: A computer implemented method classifies an input corresponding to multiple different kinds of input. The method includes obtaining a set of features from the input, providing the set of features to multiple different models to generate state predictions, generating a set of state-dependent predicted weights, and combining the state predictions from the multiple models, based on the state-dependent predicted weights for classification of the set of features.
    Type: Grant
    Filed: March 25, 2019
    Date of Patent: June 18, 2024
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Kshitiz Kumar, Yifan Gong
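The combination step in the entry above can be sketched in numpy: each model produces per-state predictions, and state-dependent weights decide how much each model contributes to each state. The softmax normalization and array shapes are assumptions made for illustration.

```python
import numpy as np

def combine_predictions(preds: np.ndarray, weight_logits: np.ndarray) -> np.ndarray:
    """preds: (models, states) state predictions from each model.
    weight_logits: (models, states) state-dependent predicted weights.
    Returns the weighted combination, one score per state."""
    w = np.exp(weight_logits - weight_logits.max(axis=0))
    w = w / w.sum(axis=0)                    # normalize per state across models
    return (w * preds).sum(axis=0)

preds = np.array([[0.7, 0.2, 0.1],           # e.g. a clean-speech model
                  [0.4, 0.4, 0.2]])          # e.g. a noisy-speech model
weight_logits = np.array([[2.0, 0.5, 0.0],
                          [0.0, 1.5, 1.0]])
print(combine_predictions(preds, weight_logits))
```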
  • Patent number: 11915686
    Abstract: Embodiments are associated with a speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-independent attention-based encoder-decoder model associated with a first output distribution, and a speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-dependent attention-based encoder-decoder model associated with a second output distribution. The second attention-based encoder-decoder model is trained to classify output tokens based on input speech frames of a target speaker and simultaneously trained to maintain a similarity between the first output distribution and the second output distribution.
    Type: Grant
    Filed: January 5, 2022
    Date of Patent: February 27, 2024
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong
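The joint objective in the entry above, adapting to a target speaker while keeping the adapted output distribution close to the speaker-independent one, is commonly written as a cross-entropy term plus a KL term. The PyTorch sketch below assumes that formulation; the weighting factor and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def adaptation_loss(sd_logits, si_logits, targets, kl_weight=0.5):
    """Speaker-dependent training loss with a similarity constraint.
    sd_logits: adapted (speaker-dependent) model outputs, (batch, tokens)
    si_logits: frozen speaker-independent model outputs, (batch, tokens)
    targets:   reference token ids, (batch,)"""
    ce = F.cross_entropy(sd_logits, targets)
    # KL(speaker-independent || speaker-dependent) keeps the two
    # output distributions similar while adapting to the target speaker.
    kl = F.kl_div(F.log_softmax(sd_logits, dim=-1),
                  F.softmax(si_logits, dim=-1),
                  reduction="batchmean")
    return ce + kl_weight * kl

sd = torch.randn(8, 100, requires_grad=True)
si = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(adaptation_loss(sd, si, labels).item())
```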
  • Patent number: 11862144
    Abstract: A computer system is provided that includes a processor configured to store a set of audio training data that includes a plurality of audio segments and metadata indicating a word or phrase associated with each audio segment. For a target training statement of a set of structured text data, the processor is configured to generate a concatenated audio signal that matches a word content of a target training statement by comparing the words or phrases of a plurality of text segments of the target training statement to respective words or phrases of audio segments of the stored set of audio training data, selecting a plurality of audio segments from the set of audio training data based on a match in the words or phrases between the plurality of text segments of the target training statement and the selected plurality of audio segments, and concatenating the selected plurality of audio segments.
    Type: Grant
    Filed: December 16, 2020
    Date of Patent: January 2, 2024
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Rui Zhao, Jinyu Li, Yifan Gong
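The segment-matching and concatenation step in the entry above can be sketched with audio stored as numpy arrays keyed by word; the lookup structure is an illustrative simplification of the metadata described in the abstract.

```python
import numpy as np

# Illustrative store: word -> list of candidate audio segments.
audio_store = {
    "turn": [np.random.randn(1600)],
    "volume": [np.random.randn(2400)],
    "up": [np.random.randn(800)],
}

def concatenated_audio(statement: str, store: dict) -> np.ndarray:
    """Select one matching audio segment per word of the target training
    statement and concatenate them into a single signal."""
    segments = []
    for word in statement.lower().split():
        matches = store.get(word)
        if not matches:
            raise KeyError(f"no audio segment matches the word {word!r}")
        segments.append(matches[0])          # pick the first match for simplicity
    return np.concatenate(segments)

print(concatenated_audio("turn volume up", audio_store).shape)   # (4800,)
```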
  • Patent number: 11823702
    Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.
    Type: Grant
    Filed: November 30, 2021
    Date of Patent: November 21, 2023
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Zhong Meng, Yong Zhao, Jinyu Li, Yifan Gong
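The min/max structure in the entry above (minimize the speaker loss, maximize the condition loss through the feature extractor, minimize the condition loss in the condition classifier) is an adversarial setup that is often implemented with a gradient-reversal layer. The sketch below assumes that implementation in PyTorch; the network sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass,
    so the feature extractor is pushed to *maximize* the condition loss."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

feat_dim, hidden, n_speakers, n_conditions = 40, 64, 10, 4
extractor = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())    # first parameters
speaker_clf = nn.Linear(hidden, n_speakers)                          # second parameters
condition_clf = nn.Linear(hidden, n_conditions)                      # third parameters

frames = torch.randn(32, feat_dim)
speakers = torch.randint(0, n_speakers, (32,))
conditions = torch.randint(0, n_conditions, (32,))

feats = extractor(frames)
speaker_loss = F.cross_entropy(speaker_clf(feats), speakers)
condition_loss = F.cross_entropy(condition_clf(GradReverse.apply(feats)), conditions)
(speaker_loss + condition_loss).backward()   # one adversarial step (optimizer omitted)
```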
  • Patent number: 11798535
    Abstract: Generally discussed herein are devices, systems, and methods for on-device detection of a wake word. A device can include a memory including model parameters that define a custom wake word detection model, the wake word detection model including a recurrent neural network transducer (RNNT) and a lookup table (LUT), the LUT indicating a hidden vector to be provided in response to a phoneme of a user-specified wake word, a microphone to capture audio, and processing circuitry to receive the audio from the microphone, determine, using the wake word detection model, whether the audio includes an utterance of the user-specified wake word, and wake up a personal assistant after determining the audio includes the utterance of the user-specified wake word.
    Type: Grant
    Filed: September 14, 2021
    Date of Patent: October 24, 2023
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Emilian Stoimenov, Rui Zhao, Kaustubh Prakash Kalgaonkar, Ivaylo Andreanov Enchev, Khuram Shahid, Anthony Phillip Stark, Guoli Ye, Mahadevan Srinivasan, Yifan Gong, Hosam Adel Khalil
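The lookup-table idea in the entry above, precomputed hidden vectors keyed by phoneme so the on-device detector does not have to run a full prediction network for the user-specified wake word, can be sketched as below. The phoneme inventory, vector size, and random placeholder vectors are illustrative assumptions; they are not the patented model.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16

# Illustrative LUT: phoneme -> precomputed hidden vector (random placeholders here;
# in practice these would come from the trained wake word detection model).
phoneme_inventory = ["K", "AO", "R", "T", "AE", "N", "AH"]
lut = {ph: rng.standard_normal(hidden_dim) for ph in phoneme_inventory}

def wake_word_hiddens(phonemes: list[str]) -> np.ndarray:
    """Assemble the hidden-vector sequence for a user-specified wake word
    by table lookup instead of running a prediction network on device."""
    return np.stack([lut[ph] for ph in phonemes])

# A hypothetical wake word spelled out as phonemes.
print(wake_word_hiddens(["K", "AO", "R", "T", "AE", "N", "AH"]).shape)   # (7, 16)
```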
  • Patent number: 11790891
    Abstract: Generally discussed herein are devices, systems, and methods for custom wake word selection assistance. A method can include receiving, at a device, data indicating a custom wake word provided by a user, determining one or more characteristics of the custom wake word, determining that use of the custom wake word will cause more than a threshold rate of false detections based on the characteristics, rejecting the custom wake word as the wake word for accessing a personal assistant in response to determining that use of the custom wake word will cause more than a threshold rate of false detections, and setting the custom wake word as the wake word in response to determining that use of the custom wake word will not cause more than the threshold rate of false detections.
    Type: Grant
    Filed: December 1, 2021
    Date of Patent: October 17, 2023
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Emilian Stoimenov, Khuram Shahid, Guoli Ye, Hosam Adel Khalil, Yifan Gong
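The accept/reject decision in the entry above can be pictured with a toy check. The characteristic used below (phoneme count) and the crude false-detection estimate are hypothetical stand-ins; the patent does not specify this estimator.

```python
# Toy sketch of accepting or rejecting a custom wake word.
# The per-phoneme false-detection estimate is a hypothetical stand-in.

def estimated_false_detection_rate(phonemes: list[str]) -> float:
    """Crude illustration: shorter wake words are assumed to false-trigger more."""
    return 1.0 / (2 ** len(phonemes))

def set_wake_word(phonemes: list[str], threshold: float = 0.01) -> bool:
    """Return True (set as wake word) only if the estimated false-detection
    rate stays below the threshold; otherwise reject the custom wake word."""
    return estimated_false_detection_rate(phonemes) <= threshold

print(set_wake_word(["HH", "EY"]))                             # False: rejected
print(set_wake_word(["K", "AO", "R", "T", "AE", "N", "AH"]))   # True: accepted
```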
  • Patent number: 11776548
    Abstract: Embodiments may include determination, for each of a plurality of speech frames associated with an acoustic feature, of a phonetic feature based on the associated acoustic feature, generation of one or more two-dimensional feature maps based on the plurality of phonetic features, input of the one or more two-dimensional feature maps to a trained neural network to generate a plurality of speaker embeddings, and aggregation of the plurality of speaker embeddings into a speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, wherein the speaker embedding is associated with an identity of the speaker.
    Type: Grant
    Filed: February 7, 2022
    Date of Patent: October 3, 2023
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Yong Zhao, Tianyan Zhou, Jinyu Li, Yifan Gong, Jian Wu, Zhuo Chen
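The aggregation step in the entry above, collapsing many partial speaker embeddings into one utterance-level embedding using per-embedding weights, can be sketched with a softmax over scores. The scores themselves would come from a learned layer; here they are given directly as an illustrative assumption.

```python
import numpy as np

def aggregate_embeddings(embeddings: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """embeddings: (n, dim) per-segment speaker embeddings.
    scores: (n,) unnormalized weights (e.g. from an attention layer).
    Returns a single (dim,) speaker embedding."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()                          # softmax weights
    return (w[:, None] * embeddings).sum(axis=0)

emb = np.random.randn(5, 128)                # 5 partial embeddings of dimension 128
scores = np.array([0.2, 1.5, 0.1, 0.9, 0.3])
print(aggregate_embeddings(emb, scores).shape)   # (128,)
```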
  • Patent number: 11735190
    Abstract: To generate substantially domain-invariant and speaker-discriminative features, embodiments may operate to extract features from input data based on a first set of parameters, generate outputs based on the extracted features and on a second set of parameters, and identify words represented by the input data based on the outputs, wherein the first set of parameters and the second set of parameters have been trained to minimize a network loss associated with the second set of parameters, wherein the first set of parameters has been trained to maximize the domain classification loss of a network comprising 1) an attention network to determine, based on a third set of parameters, relative importances of features extracted based on the first parameters to domain classification and 2) a domain classifier to classify a domain based on the extracted features, the relative importances, and a fourth set of parameters, and wherein the third set of parameters and the fourth set of parameters have been trained to minimize the domain classification loss.
    Type: Grant
    Filed: October 5, 2021
    Date of Patent: August 22, 2023
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Zhong Meng, Jinyu Li, Yifan Gong
  • Patent number: 11705117
    Abstract: Acoustic features are batched into two different batches. The second of the two batches is made in response to a detection of a word hypothesis output by a speech recognition network that received the first batch. The number of acoustic feature frames of the second batch is equal to a second batch size greater than the first batch size. The second batch is also provided to the speech recognition network for processing.
    Type: Grant
    Filed: October 13, 2021
    Date of Patent: July 18, 2023
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Hosam A. Khalil, Emilian Y. Stoimenov, Yifan Gong, Chaojun Liu, Christopher H. Basoglu, Amit K. Agarwal, Naveen Parihar, Sayan Pathak
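The two-batch-size idea in the entry above, feeding small batches until a word hypothesis appears and then switching to a larger batch, can be sketched as a simple loop. The batch sizes and the `decode` stub are illustrative assumptions standing in for the speech recognition network.

```python
# Sketch of adaptive batching of acoustic feature frames.
# `decode` is a stand-in for the speech recognition network; batch sizes are illustrative.

def decode(frames: list) -> str:
    """Placeholder recognizer: pretend a word hypothesis appears after 40 frames."""
    decode.seen = getattr(decode, "seen", 0) + len(frames)
    return "hello" if decode.seen >= 40 else ""

def stream(frames: list, small: int = 10, large: int = 80) -> None:
    batch_size, i = small, 0
    while i < len(frames):
        batch = frames[i:i + batch_size]
        hypothesis = decode(batch)            # send the batch to the recognizer
        i += len(batch)
        if hypothesis and batch_size == small:
            batch_size = large                # word detected: switch to the larger batch size
            print(f"word hypothesis at frame {i}; batch size -> {large}")

stream(list(range(400)))
```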
  • Publication number: 20230186919
    Abstract: Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data for improved speech recognition models. A main model is trained with keyword independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. In some instances, the new TTS training data is obtained from a multi-speaker neural TTS system for a keyword that is underrepresented in the baseline training data. In some instances, the new TTS training data is used for pronunciation learning and normalization of keyword dependent confidence scores in keyword spotting (KWS) applications. In some instances, the new TTS training data is used for rapid speaker adaptation in speech recognition models.
    Type: Application
    Filed: February 10, 2023
    Publication date: June 15, 2023
    Inventors: Guoli YE, Yan HUANG, Wenning WEI, Lei HE, Eva SHARMA, Jian WU, Yao TIAN, Edward C. LIN, Yifan GONG, Rui ZHAO, Jinyu LI, William Maxwell GALE
  • Patent number: 11676006
    Abstract: According to some embodiments, a universal modeling system may include a plurality of domain expert models to each receive raw input data (e.g., a stream of audio frames containing speech utterances) and provide a domain expert output based on the raw input data. A neural mixture component may then generate a weight corresponding to each domain expert model based on information created by the plurality of domain expert models (e.g., hidden features and/or row convolution). The weights might be associated with, for example, constrained scalar numbers, unconstrained scalar numbers, vectors, matrices, etc. An output layer may provide a universal modeling system output (e.g., an automatic speech recognition result) based on each domain expert output after being multiplied by the corresponding weight for that domain expert model.
    Type: Grant
    Filed: May 16, 2019
    Date of Patent: June 13, 2023
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Amit Das, Jinyu Li, Changliang Liu, Yifan Gong
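A minimal PyTorch sketch of the mixture idea in the entry above: each domain expert scores the input, a small mixture network turns the experts' hidden features into weights, and the output is the weighted sum of expert outputs. The sizes, the use of concatenated hidden features, and the softmax weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, feat_dim=40, hidden=32, classes=50, n_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.LSTM(feat_dim, hidden, batch_first=True) for _ in range(n_experts)])
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, classes) for _ in range(n_experts)])
        self.mixture = nn.Linear(hidden * n_experts, n_experts)   # neural mixture component

    def forward(self, frames):                      # frames: (batch, time, feat_dim)
        hiddens, outputs = [], []
        for lstm, head in zip(self.experts, self.heads):
            h, _ = lstm(frames)                     # (batch, time, hidden)
            hiddens.append(h)
            outputs.append(head(h))                 # per-expert output
        weights = torch.softmax(
            self.mixture(torch.cat(hiddens, dim=-1)), dim=-1)     # (batch, time, n_experts)
        outputs = torch.stack(outputs, dim=-1)      # (batch, time, classes, n_experts)
        return (outputs * weights.unsqueeze(2)).sum(dim=-1)       # weighted combination

print(MixtureOfExperts()(torch.randn(2, 20, 40)).shape)   # torch.Size([2, 20, 50])
```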
  • Patent number: 11657799
    Abstract: Techniques performed by a data processing system for training a Recurrent Neural Network Transducer (RNN-T) herein include encoder pretraining by training a neural network-based token classification model using first token-aligned training data representing a plurality of utterances, where each utterance is associated with a plurality of frames of audio data and tokens representing each utterance are aligned with frame boundaries of the plurality of audio frames; obtaining first cross-entropy (CE) criterion from the token classification model, wherein the CE criterion represent a divergence between expected outputs and reference outputs of the model; pretraining an encoder of an RNN-T based on the first CE criterion; and training the RNN-T with second training data after pretraining the encoder of the RNN-T. These techniques also include whole-network pre-training of the RNN-T.
    Type: Grant
    Filed: April 3, 2020
    Date of Patent: May 23, 2023
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong, Hu Hu
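The encoder pretraining step in the entry above can be sketched as ordinary frame-level cross-entropy training against token-aligned targets, after which the encoder would be reused inside the RNN-T. The shapes, token inventory, and two-layer LSTM are illustrative assumptions; the subsequent transducer training is only indicated in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, feat_dim, hidden = 30, 80, 128
encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
token_head = nn.Linear(hidden, vocab)          # temporary head used only for pretraining

# Token-aligned pretraining data: one token label per audio frame.
frames = torch.randn(4, 100, feat_dim)         # 4 utterances, 100 frames each
aligned_tokens = torch.randint(0, vocab, (4, 100))

enc_out, _ = encoder(frames)                   # (batch, time, hidden)
ce = F.cross_entropy(token_head(enc_out).reshape(-1, vocab),
                     aligned_tokens.reshape(-1))
ce.backward()                                  # one pretraining step (optimizer omitted)

# After pretraining, `encoder` would be plugged into the RNN-T and the whole model
# trained with a transducer loss on the second training set (not shown here).
print(f"pretraining CE: {ce.item():.3f}")
```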
  • Publication number: 20230154467
    Abstract: A computing system including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.
    Type: Application
    Filed: January 20, 2023
    Publication date: May 18, 2023
    Applicant: Microsoft Technology Licensing, LLC
    Inventors: Yashesh GAUR, Jinyu LI, Liang LU, Hirofumi INAGUMA, Yifan GONG
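The latency constraint in the entry above can be stated as a simple per-token check against the external model's alignments; the millisecond units and threshold value below are illustrative assumptions.

```python
# Sketch of the per-token latency check against an external model's alignments.

def within_latency(output_alignments: list[float],
                   external_alignments: list[float],
                   threshold_ms: float = 200.0) -> bool:
    """True if every output token is emitted within the latency threshold
    of the corresponding external-model token alignment (times in ms)."""
    return all(out - ext <= threshold_ms
               for out, ext in zip(output_alignments, external_alignments))

print(within_latency([120.0, 480.0, 1010.0], [100.0, 400.0, 900.0]))   # True
print(within_latency([120.0, 700.0, 1010.0], [100.0, 400.0, 900.0]))   # False
```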
  • Publication number: 20230153324
    Abstract: The present disclosure provides a service discovery method and apparatus, a computing device, and a storage medium, to address the problem in the prior art that node data is easily falsified or tampered with. The service discovery method comprises: in response to discovering a target node to be online or offline, creating, by a first online node, a block of the target node, and sending a data synchronization request to a second online node; in response to determining that the block is the latest block, informing, by the second online node, a plurality of third online nodes to respectively authenticate the permission of the target node; and compiling statistics on the permission authentication results of the target node from the plurality of third online nodes, and synchronizing the block to a block chain respectively maintained by each online node in response to an authentication passing rate satisfying a predetermined condition.
    Type: Application
    Filed: November 17, 2022
    Publication date: May 18, 2023
    Inventors: Yifan GONG, Jiangming JIN
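The consensus step in the entry above, where third online nodes vote on the target node's permission and the block is committed only when the pass rate clears a threshold, can be sketched as below. The vote function, pass-rate value, and block structure are illustrative assumptions.

```python
# Sketch of the permission-authentication step in the service discovery flow.
from dataclasses import dataclass, field

@dataclass
class Block:
    target_node: str
    event: str                      # "online" or "offline"

@dataclass
class OnlineNode:
    name: str
    chain: list = field(default_factory=list)

    def authenticate(self, target_node: str) -> bool:
        """Placeholder permission check; a real node would verify credentials."""
        return not target_node.startswith("untrusted")

def commit_if_authenticated(block: Block, third_nodes: list, pass_rate: float = 0.66) -> bool:
    votes = [node.authenticate(block.target_node) for node in third_nodes]
    if sum(votes) / len(votes) >= pass_rate:
        for node in third_nodes:
            node.chain.append(block)          # synchronize the block to every online node
        return True
    return False

nodes = [OnlineNode(f"node{i}") for i in range(5)]
print(commit_if_authenticated(Block("sensor-7", "online"), nodes))      # True
print(commit_if_authenticated(Block("untrusted-x", "online"), nodes))   # False
```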
  • Publication number: 20230153254
    Abstract: A communication method, a related computing system, and a storage medium are described. The communication method is for a computing system that runs at least one process, the at least one process comprising a plurality of modules, and the method comprises: acquiring attribute information of each of the plurality of modules, wherein the plurality of modules at least comprise a first module and a second module; in response to determining that data is to be transmitted from the first module to the second module, comparing the attribute information of the first module with the attribute information of the second module; and selecting a communication channel for the first module and the second module according to the comparison, to transmit the data from the first module to the second module through the selected communication channel.
    Type: Application
    Filed: November 14, 2022
    Publication date: May 18, 2023
    Inventors: Yifan GONG, Jiangming JIN
  • Patent number: 11631399
    Abstract: According to some embodiments, a machine learning model may include an input layer to receive an input signal as a series of frames representing handwriting data, speech data, audio data, and/or textual data. A plurality of time layers may be provided, and each time layer may comprise a uni-directional recurrent neural network processing block. A depth processing block may scan hidden states of the recurrent neural network processing block of each time layer, and the depth processing block may be associated with a first frame and receive context frame information of a sequence of one or more future frames relative to the first frame. An output layer may output a final classification as a classified posterior vector of the input signal. For example, the depth processing block may receive the context frame information from an output of a time layer processing block or another depth processing block of the future frame.
    Type: Grant
    Filed: May 13, 2019
    Date of Patent: April 18, 2023
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Jinyu Li, Vadim Mazalov, Changliang Liu, Liang Lu, Yifan Gong