Patents by Inventor Yifan Gong
Yifan Gong has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20210407498
Abstract: Generally discussed herein are devices, systems, and methods for on-device detection of a wake word. A device can include a memory including model parameters that define a custom wake word detection model, the wake word detection model including a recurrent neural network transducer (RNNT) and a lookup table (LUT), the LUT indicating a hidden vector to be provided in response to a phoneme of a user-specified wake word, a microphone to capture audio, and processing circuitry to receive the audio from the microphone, determine, using the wake word detection model, whether the audio includes an utterance of the user-specified wake word, and wake up a personal assistant after determining the audio includes the utterance of the user-specified wake word.
Type: Application
Filed: September 14, 2021
Publication date: December 30, 2021
Inventors: Emilian Stoimenov, Rui Zhao, Kaustubh Prakash Kalgaonkar, Ivaylo Andreanov Enchev, Khuram Shahid, Anthony Phillip Stark, Guoli Ye, Mahadevan Srinivasan, Yifan Gong, Hosam Adel Khalil
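The lookup-table idea above can be sketched minimally: precomputed hidden vectors are stored per phoneme of the user-specified wake word, so the on-device model can fetch them instead of recomputing. The phoneme labels, vectors, and function names here are invented for illustration and are not from the patent.

```python
def build_lut(phoneme_vectors):
    """Map each phoneme of the wake word to its precomputed hidden vector."""
    return dict(phoneme_vectors)

def hidden_for(phoneme, lut):
    """Return the stored hidden vector for a phoneme, or None if absent."""
    return lut.get(phoneme)

# Hypothetical wake word "hi" -> phonemes HH, AY with toy 2-d hidden vectors.
lut = build_lut([("HH", [0.1, 0.2]), ("AY", [0.3, 0.4])])
vec = hidden_for("AY", lut)
```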
-
Patent number: 11210565
Abstract: Representative embodiments disclose machine learning classifiers used in scenarios such as speech recognition, image captioning, machine translation, or other sequence-to-sequence embodiments. The machine learning classifiers have a plurality of time layers, each layer having a time processing block and a depth processing block. The time processing block is a recurrent neural network such as a Long Short Term Memory (LSTM) network. The depth processing blocks can be an LSTM network, a gated Deep Neural Network (DNN), or a maxout DNN. The depth processing blocks account for the hidden states of each time layer and use summarized layer information for final input-signal feature classification. An attention layer can also be used between the top depth processing block and the output layer.
Type: Grant
Filed: November 30, 2018
Date of Patent: December 28, 2021
Assignee: Microsoft Technology Licensing, LLC
Inventors: Jinyu Li, Liang Lu, Changliang Liu, Yifan Gong
-
Patent number: 11183178
Abstract: Embodiments may include collection of a first batch of acoustic feature frames of an audio signal, the number of acoustic feature frames of the first batch equal to a first batch size, input of the first batch to a speech recognition network, collection, in response to detection of a word hypothesis output by the speech recognition network, of a second batch of acoustic feature frames of the audio signal, the number of acoustic feature frames of the second batch equal to a second batch size greater than the first batch size, and input of the second batch to the speech recognition network.
Type: Grant
Filed: January 27, 2020
Date of Patent: November 23, 2021
Assignee: Microsoft Technology Licensing, LLC
Inventors: Hosam A. Khalil, Emilian Y. Stoimenov, Yifan Gong, Chaojun Liu, Christopher H. Basoglu, Amit K. Agarwal, Naveen Parihar, Sayan Pathak
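The two-stage batching described above can be sketched as follows. The batch sizes, the function name, and the simulation of the word-hypothesis event via a frame index are assumptions for illustration, not details from the patent.

```python
SMALL_BATCH = 4   # frames per batch before any word hypothesis is detected
LARGE_BATCH = 16  # frames per batch after a word hypothesis appears

def batch_frames(frames, hypothesis_at):
    """Yield batches of acoustic feature frames, switching to the larger
    batch size once the recognizer emits its first word hypothesis
    (simulated here by `hypothesis_at`, the frame index of that event)."""
    batches, i = [], 0
    while i < len(frames):
        size = SMALL_BATCH if i < hypothesis_at else LARGE_BATCH
        batches.append(frames[i:i + size])
        i += size
    return batches

# 40 frames; a word hypothesis appears at frame 8.
batches = batch_frames(list(range(40)), hypothesis_at=8)
```

The small initial batches keep first-word latency low; the larger later batches improve throughput once recognition is under way.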
-
Patent number: 11170789
Abstract: To generate substantially domain-invariant and speaker-discriminative features, embodiments are associated with a feature extractor to receive speech frames and extract features from the speech frames based on a first set of parameters of the feature extractor, a senone classifier to identify a senone based on the received features and on a second set of parameters of the senone classifier, an attention network capable of determining a relative importance of features extracted by the feature extractor to domain classification, based on a third set of parameters of the attention network, a domain classifier capable of classifying a domain based on the features and the relative importances, and on a fourth set of parameters of the domain classifier; and a training platform to train the first set of parameters of the feature extractor and the second set of parameters of the senone classifier to minimize the senone classification loss, and train the first set of parameters of the feature extractor to maximize the domain classification loss…
Type: Grant
Filed: July 26, 2019
Date of Patent: November 9, 2021
Assignee: Microsoft Technology Licensing, LLC
Inventors: Zhong Meng, Jinyu Li, Yifan Gong
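The adversarial training described above is commonly written as a single objective in which the feature extractor minimizes the senone-classification loss while a negatively weighted domain term pushes it to maximize the domain-classification loss. The scalar form and the trade-off weight `lam` below are illustrative assumptions, not the patent's exact objective.

```python
def extractor_objective(senone_loss, domain_loss, lam=0.5):
    """Loss minimized w.r.t. the feature extractor's parameters:
    lower senone loss, higher (i.e. confused) domain loss."""
    return senone_loss - lam * domain_loss

def domain_classifier_objective(domain_loss):
    """The domain classifier itself simply minimizes its own loss."""
    return domain_loss

obj = extractor_objective(senone_loss=2.0, domain_loss=1.0, lam=0.5)
```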
-
Patent number: 11158305
Abstract: Generally discussed herein are devices, systems, and methods for wake word verification. A method can include receiving, at a server, a message from a device indicating that an utterance of a user-defined wake word was detected at the device, the message including (a) audio samples or features extracted from the audio samples and (b) data indicating the user-defined wake word; retrieving or generating, at the server, a custom decoding graph for the user-defined wake word, wherein the decoding graph and a static portion of a wake word verification model form a custom wake word verification model for the user-defined wake word; executing the wake word verification model to determine a likelihood that the wake word was uttered; and providing a message to the device indicating whether the wake word was uttered based on the determined likelihood.
Type: Grant
Filed: July 25, 2019
Date of Patent: October 26, 2021
Assignee: Microsoft Technology Licensing, LLC
Inventors: Khuram Shahid, Kshitiz Kumar, Teng Yi, Veljko Miljanic, Huaming Wang, Yifan Gong, Hosam Adel Khalil
-
Publication number: 20210312923
Abstract: A computing system including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.
Type: Application
Filed: April 6, 2020
Publication date: October 7, 2021
Applicant: Microsoft Technology Licensing, LLC
Inventors: Yashesh Gaur, Jinyu Li, Liang Lu, Hirofumi Inaguma, Yifan Gong
-
Publication number: 20210312905
Abstract: Techniques performed by a data processing system for training a Recurrent Neural Network Transducer (RNN-T) herein include encoder pretraining by training a neural network-based token classification model using first token-aligned training data representing a plurality of utterances, where each utterance is associated with a plurality of frames of audio data and the tokens representing each utterance are aligned with frame boundaries of the plurality of audio frames; obtaining a first cross-entropy (CE) criterion from the token classification model, wherein the CE criterion represents a divergence between expected outputs and reference outputs of the model; pretraining an encoder of an RNN-T based on the first CE criterion; and training the RNN-T with second training data after pretraining the encoder of the RNN-T. These techniques also include whole-network pretraining of the RNN-T.
Type: Application
Filed: April 3, 2020
Publication date: October 7, 2021
Applicant: Microsoft Technology Licensing, LLC
Inventors: Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong, Hu Hu
-
Publication number: 20210304769
Abstract: Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data for improved speech recognition models. A main model is trained with keyword-independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. In some instances, the new TTS training data is obtained from a multi-speaker neural TTS system for a keyword that is underrepresented in the baseline training data. In some instances, the new TTS training data is used for pronunciation learning and normalization of keyword-dependent confidence scores in keyword spotting (KWS) applications. In some instances, the new TTS training data is used for rapid speaker adaptation in speech recognition models.
Type: Application
Filed: May 14, 2020
Publication date: September 30, 2021
Inventors: Guoli Ye, Yan Huang, Wenning Wei, Lei He, Eva Sharma, Jian Wu, Yao Tian, Edward C. Lin, Yifan Gong, Rui Zhao, Jinyu Li, William Maxwell Gale
-
Patent number: 11132992
Abstract: Generally discussed herein are devices, systems, and methods for on-device detection of a wake word. A device can include a memory including model parameters that define a custom wake word detection model, the wake word detection model including a recurrent neural network transducer (RNNT) and a lookup table (LUT), the LUT indicating a hidden vector to be provided in response to a phoneme of a user-specified wake word, a microphone to capture audio, and processing circuitry to receive the audio from the microphone, determine, using the wake word detection model, whether the audio includes an utterance of the user-specified wake word, and wake up a personal assistant after determining the audio includes the utterance of the user-specified wake word.
Type: Grant
Filed: July 25, 2019
Date of Patent: September 28, 2021
Assignee: Microsoft Technology Licensing, LLC
Inventors: Emilian Stoimenov, Rui Zhao, Kaustubh Prakash Kalgaonkar, Ivaylo Andreanov Enchev, Khuram Shahid, Anthony Phillip Stark, Guoli Ye, Mahadevan Srinivasan, Yifan Gong, Hosam Adel Khalil
-
Publication number: 20210272557
Abstract: A method of enhancing an automated speech recognition confidence classifier includes receiving a set of baseline confidence features from one or more decoded words, deriving word embedding confidence features from the baseline confidence features, joining the baseline confidence features with the word embedding confidence features to create a feature vector, and executing the confidence classifier to generate a confidence score, wherein the confidence classifier is trained with a set of training examples having labeled features corresponding to the feature vector.
Type: Application
Filed: March 31, 2021
Publication date: September 2, 2021
Inventors: Kshitiz Kumar, Anastasios Anastasakos, Yifan Gong
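The feature-joining step above amounts to concatenating the two feature sets into one vector that a trained classifier scores. The feature values and the logistic-regression stand-in for the trained classifier below are assumptions for illustration, not details from the application.

```python
import math

def join_features(baseline, word_embedding):
    """Concatenate baseline confidence features with word-embedding
    confidence features into a single feature vector."""
    return list(baseline) + list(word_embedding)

def confidence_score(features, weights, bias=0.0):
    """Toy linear classifier producing a confidence score in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

vec = join_features([0.9, 0.4], [0.1, 0.7, 0.2])
score = confidence_score(vec, weights=[1.0, 0.5, -0.3, 0.8, 0.2])
```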
-
Patent number: 11107460
Abstract: Embodiments are associated with a speaker-independent acoustic model capable of classifying senones based on input speech frames and on first parameters of the speaker-independent acoustic model, a speaker-dependent acoustic model capable of classifying senones based on input speech frames and on second parameters of the speaker-dependent acoustic model, and a discriminator capable of receiving data from the speaker-dependent acoustic model and data from the speaker-independent acoustic model and outputting a prediction of whether received data was generated by the speaker-dependent acoustic model based on third parameters.
Type: Grant
Filed: July 2, 2019
Date of Patent: August 31, 2021
Assignee: Microsoft Technology Licensing, LLC
Inventors: Zhong Meng, Jinyu Li, Yifan Gong
-
Publication number: 20210256383
Abstract: A privacy-preserving DNN model compression framework allows a system designer to implement a pruning scheme on a pre-trained model without access to the client's confidential dataset. Weight pruning of the DNN model is formulated without the original dataset: two sets of optimization problems, for pruning the whole model and for pruning each layer, are solved with an alternating direction method of multipliers (ADMM) optimization framework. The system allows data privacy to be preserved and real-time inference to be achieved while maintaining accuracy on large-scale DNNs.
Type: Application
Filed: February 16, 2021
Publication date: August 19, 2021
Inventors: Yanzhi Wang, Yifan Gong, Zheng Zhan
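A core subproblem in ADMM-based weight pruning is the projection onto the sparsity constraint: projecting a weight vector onto the set of vectors with at most k nonzeros keeps the k largest-magnitude entries and zeroes the rest. The standalone function below sketches only that projection step under stated assumptions; it is not the application's full framework.

```python
def project_to_sparsity(weights, k):
    """Euclidean projection onto {w : ||w||_0 <= k}: keep the k entries
    with largest magnitude, zero out the remainder."""
    if k >= len(weights):
        return list(weights)
    keep = set(sorted(range(len(weights)),
                      key=lambda i: abs(weights[i]),
                      reverse=True)[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

pruned = project_to_sparsity([0.5, -0.1, 0.8, 0.05, -0.6], k=3)
```

In ADMM, this projection alternates with a loss-minimization step until the dense weights and their sparse projection agree.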
-
Publication number: 20210217410
Abstract: Embodiments may include collection of a first batch of acoustic feature frames of an audio signal, the number of acoustic feature frames of the first batch equal to a first batch size, input of the first batch to a speech recognition network, collection, in response to detection of a word hypothesis output by the speech recognition network, of a second batch of acoustic feature frames of the audio signal, the number of acoustic feature frames of the second batch equal to a second batch size greater than the first batch size, and input of the second batch to the speech recognition network.
Type: Application
Filed: January 27, 2020
Publication date: July 15, 2021
Inventors: Hosam A. Khalil, Emilian Y. Stoimenov, Yifan Gong, Chaojun Liu, Christopher H. Basoglu, Amit K. Agarwal, Naveen Parihar, Sayan Pathak
-
Patent number: 11055144
Abstract: The present disclosure provides a method, an apparatus, and a system for multi-module scheduling, capable of solving the problem of inconsistency in data input to a computing module that arises in related-art multi-module scheduling techniques.
Type: Grant
Filed: February 14, 2019
Date of Patent: July 6, 2021
Assignee: TuSimple, Inc.
Inventors: Yifan Gong, Siyuan Liu, Dinghua Li, Jiangming Jin, Lei Su, Yixin Yang, Wei Liu, Zehua Huang
-
Patent number: 11044287
Abstract: Systems are provided for managing and coordinating STT/TTS systems and the communications between these systems when they are connected in online meetings, and for mitigating connectivity issues that may arise during the online meetings, to provide a seamless and reliable meeting experience with live captions and/or rendered audio. Initially, online meeting communications are transmitted over a lossy connectionless-type protocol/channel. Then, in response to detected connectivity problems with one or more systems involved in the online meeting, which can cause jitter or packet loss, for example, an instruction is dynamically generated and processed for causing one or more of the connected systems to transmit and/or process the online meeting content with a more reliable connection/protocol, such as a connection-oriented protocol. Codecs at the systems are used, when needed, to convert speech to text with related speech-attribute information and to convert text to speech.
Type: Grant
Filed: December 9, 2020
Date of Patent: June 22, 2021
Assignee: Microsoft Technology Licensing, LLC
Inventors: Akash Alok Mahajan, Yifan Gong
-
Patent number: 10991365
Abstract: A method of enhancing an automated speech recognition confidence classifier includes receiving a set of baseline confidence features from one or more decoded words, deriving word embedding confidence features from the baseline confidence features, joining the baseline confidence features with the word embedding confidence features to create a feature vector, and executing the confidence classifier to generate a confidence score, wherein the confidence classifier is trained with a set of training examples having labeled features corresponding to the feature vector.
Type: Grant
Filed: April 8, 2019
Date of Patent: April 27, 2021
Assignee: Microsoft Technology Licensing, LLC
Inventors: Kshitiz Kumar, Anastasios Anastasakos, Yifan Gong
-
Patent number: 10964309
Abstract: A code-switching (CS) connectionist temporal classification (CTC) model may be initialized from a major-language CTC model by keeping the network's hidden weights and replacing its output tokens with the union of the major- and secondary-language output tokens. The initialized model may be trained by updating parameters with training data from both languages, and a language identification (LID) model may also be trained with the data. During a decoding process for each of a series of audio frames, if silence dominates a current frame then a silence output token may be emitted. If silence does not dominate the frame, then the major-language output-token posterior vector from the CS CTC model may be multiplied with the LID major-language probability to create a probability vector for the major language. A similar step is performed for the secondary language, and the system may emit the output token associated with the highest probability across all tokens from both languages.
Type: Grant
Filed: May 13, 2019
Date of Patent: March 30, 2021
Assignee: Microsoft Technology Licensing, LLC
Inventors: Jinyu Li, Guoli Ye, Rui Zhao, Yifan Gong, Ke Li
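The per-frame decoding rule above can be sketched as follows: each language's CTC token posteriors are scaled by that language's LID probability, and the highest-scoring token overall is emitted, unless silence dominates the frame. The threshold, token values, and probabilities are all invented for illustration.

```python
SILENCE_THRESHOLD = 0.5  # illustrative cutoff for "silence dominates"

def decode_frame(silence_prob, major_post, secondary_post, lid_major):
    """major_post / secondary_post: dicts mapping token -> CTC posterior;
    lid_major: LID probability that the frame is the major language."""
    if silence_prob > SILENCE_THRESHOLD:
        return "<sil>"
    # Scale each language's posteriors by its LID probability, then
    # take the argmax across all tokens from both languages.
    scored = {t: p * lid_major for t, p in major_post.items()}
    scored.update({t: p * (1.0 - lid_major) for t, p in secondary_post.items()})
    return max(scored, key=scored.get)

tok = decode_frame(0.1,
                   {"hello": 0.7, "hi": 0.3},
                   {"ni": 0.9, "hao": 0.1},
                   lid_major=0.8)
```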
-
Publication number: 20210082438
Abstract: Embodiments may include reception of a plurality of speech frames, determination of a multi-dimensional acoustic feature associated with each of the plurality of speech frames, determination of a plurality of multi-dimensional phonetic features, each of the plurality of multi-dimensional phonetic features determined based on a respective one of the plurality of speech frames, generation of a plurality of two-dimensional feature maps based on the phonetic features, input of the feature maps and the plurality of acoustic features to a convolutional neural network, the convolutional neural network to generate a plurality of speaker embeddings based on the plurality of feature maps and the plurality of acoustic features, aggregation of the plurality of speaker embeddings into a first speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, and determination of a speaker associated with the plurality of speech frames based on the first speaker embedding.
Type: Application
Filed: November 13, 2019
Publication date: March 18, 2021
Inventors: Yong Zhao, Tianyan Zhou, Jinyu Li, Yifan Gong, Jian Wu, Zhuo Chen
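The weighted aggregation step above can be sketched as softmax-normalized pooling: per-frame speaker embeddings are combined into one utterance-level embedding using weights derived from per-embedding scores. The scoring mechanism and the toy 2-d embeddings below are assumptions, not details from the application.

```python
import math

def aggregate_embeddings(embeddings, scores):
    """Softmax-normalize the per-embedding scores, then form the
    weighted sum of the embeddings (attention-style pooling)."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings))
            for d in range(dim)]

# With equal scores, the aggregation reduces to a simple average.
agg = aggregate_embeddings([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
```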
-
Patent number: 10942771
Abstract: The present disclosure provides a method, an apparatus, and a system for multi-module scheduling, capable of solving at least one of the problems associated with related-art multi-module scheduling techniques: inconsistency in data input to a computing module, and significant delay or low throughput in data transmission between computing modules.
Type: Grant
Filed: February 14, 2019
Date of Patent: March 9, 2021
Assignee: TuSimple, Inc.
Inventors: Yifan Gong, Siyuan Liu, Dinghua Li, Jiangming Jin, Lei Su, Yixin Yang, Wei Liu, Zehua Huang
-
Publication number: 20210065683
Abstract: Embodiments are associated with a speaker-independent attention-based encoder-decoder model that classifies output tokens based on input speech frames and is associated with a first output distribution; a speaker-dependent attention-based encoder-decoder model that classifies output tokens based on input speech frames and is associated with a second output distribution; training of the speaker-dependent model to classify output tokens based on input speech frames of a target speaker while simultaneously training it to maintain a similarity between the first output distribution and the second output distribution; and performing automatic speech recognition on speech frames of the target speaker using the trained speaker-dependent model.
Type: Application
Filed: November 6, 2019
Publication date: March 4, 2021
Inventors: Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong
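One common way to "maintain a similarity between the first output distribution and the second output distribution" is a KL-divergence penalty between the speaker-independent and speaker-dependent token distributions, added to the adaptation loss. The scalar form and the interpolation weight `rho` below are illustrative assumptions, not the application's exact objective.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over output tokens;
    `eps` guards against log(0) for zero-probability tokens."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def adaptation_loss(task_loss, si_dist, sd_dist, rho=0.5):
    """Interpolate the speaker-adaptation task loss with a KL term that
    keeps the adapted model close to the speaker-independent one."""
    return (1.0 - rho) * task_loss + rho * kl_divergence(si_dist, sd_dist)

same = kl_divergence([0.5, 0.5], [0.5, 0.5])       # ~0: identical distributions
drift = kl_divergence([0.9, 0.1], [0.5, 0.5])      # positive: distributions differ
```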