Patents by Inventor Yifan Gong
Yifan Gong has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11563784
Abstract: Systems are provided for managing and coordinating STT/TTS systems and the communications between these systems when they are connected in online meetings, and for mitigating connectivity issues that may arise during the online meetings, to provide a seamless and reliable meeting experience with live captions and/or rendered audio. Initially, online meeting communications are transmitted over a lossy, connectionless protocol/channel. Then, in response to detected connectivity problems with one or more systems involved in the online meeting, which can cause jitter or packet loss, for example, an instruction is dynamically generated and processed for causing one or more of the connected systems to transmit and/or process the online meeting content with a more reliable connection/protocol, such as a connection-oriented protocol. Codecs at the systems are used, when needed, to convert speech to text with related speech attribute information and to convert text to speech.
Type: Grant
Filed: June 11, 2021
Date of Patent: January 24, 2023
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Akash Alok Mahajan, Yifan Gong
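The abstract above describes starting meeting traffic on a lossy connectionless transport and dynamically switching to a connection-oriented one when connectivity degrades. A minimal sketch of that fallback decision follows; the class name, thresholds, and transport labels are illustrative assumptions, not from the patent.

```python
# Illustrative sketch (not the patented implementation): a meeting client that
# starts on a lossy connectionless transport and falls back to a
# connection-oriented one when packet loss or jitter crosses a threshold.
LOSS_THRESHOLD = 0.05     # assumed: switch transports above 5% packet loss
JITTER_THRESHOLD_MS = 80  # assumed jitter limit

class MeetingLink:
    def __init__(self):
        self.transport = "udp"  # lossy, connectionless default

    def report_metrics(self, loss_rate, jitter_ms):
        """Dynamically generate a 'switch transport' instruction on bad links."""
        if self.transport == "udp" and (
            loss_rate > LOSS_THRESHOLD or jitter_ms > JITTER_THRESHOLD_MS
        ):
            self.transport = "tcp"  # reliable, connection-oriented fallback
        return self.transport

link = MeetingLink()
assert link.report_metrics(loss_rate=0.01, jitter_ms=20) == "udp"
assert link.report_metrics(loss_rate=0.12, jitter_ms=20) == "tcp"
```

Once switched, the sketch stays on the reliable transport; the patent's actual instruction generation and codec handling are far richer than this toy.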
-
Patent number: 11562745
Abstract: A computing system including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.
Type: Grant
Filed: April 6, 2020
Date of Patent: January 24, 2023
Assignee: Microsoft Technology Licensing, LLC
Inventors: Yashesh Gaur, Jinyu Li, Liang Lu, Hirofumi Inaguma, Yifan Gong
-
Publication number: 20220414024
Abstract: The present disclosure provides a communication method, a related communication apparatus, and a storage medium. The communication method includes: generating a first key by using a random sequence; encrypting data by using the first key to generate encrypted data; writing the encrypted data into a memory; encrypting the random sequence and a storage address of the encrypted data in the memory by using a public key; and sending the encrypted storage address and the encrypted random sequence from a first node to a second node.
Type: Application
Filed: June 24, 2022
Publication date: December 29, 2022
Inventors: Yifan GONG, Jiangming JIN
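The five steps in this abstract can be traced with a deliberately simplified sketch: a hash-derived keystream stands in for the real cipher, and the public-key step is a stub tuple. All helpers (`keystream`, `xor`, the envelope layout) are illustrative, not the patented scheme.

```python
# Simplified trace of the described flow; NOT real cryptography.
import hashlib
import secrets

def keystream(seed: bytes, n: int) -> bytes:
    """Derive n keystream bytes from a seed (stand-in for a real cipher)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def xor(data: bytes, ks: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, ks))

# 1. Generate a first key from a random sequence.
random_seq = secrets.token_bytes(16)
# 2. Encrypt the payload with that key, and 3. write it into "memory".
payload = b"sensor frame"
memory = {0x1000: xor(payload, keystream(random_seq, len(payload)))}
# 4. Encrypt the random sequence and storage address with a public key
#    (stubbed here), then 5. send both from the first node to the second.
envelope = ("PUBKEY-ENC", 0x1000, random_seq)  # stand-in for RSA/ECIES

# Second node: recover address + seed, read memory, regenerate the keystream.
_, addr, seed = envelope
recovered = xor(memory[addr], keystream(seed, len(memory[addr])))
assert recovered == payload
```

The design point the abstract relies on is that the bulky encrypted data travels (or rests) separately from the small public-key-protected envelope carrying the key material and address.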
-
Patent number: 11527238
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The one or more processors are further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
Type: Grant
Filed: January 21, 2021
Date of Patent: December 13, 2022
Assignee: Microsoft Technology Licensing, LLC
Inventors: Zhong Meng, Sarangarajan Parthasarathy, Xie Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong
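The score integration this abstract describes amounts to adding a weighted external LM score while subtracting the estimated internal LM score. A hedged sketch of that fusion in log space follows; the weight values are assumptions for illustration, not taken from the patent.

```python
# Hedged sketch of the integrated score: combine the E2E score with an
# external (target-domain) LM score while discounting the estimated internal
# LM score of the source-domain E2E model. Weights are illustrative.
import math

def integrated_score(e2e_logp, ext_lm_logp, est_internal_lm_logp,
                     lm_weight=0.6, ilm_weight=0.3):
    """Log-domain fusion: boost target-domain LM, subtract source-domain ILM."""
    return e2e_logp + lm_weight * ext_lm_logp - ilm_weight * est_internal_lm_logp

score = integrated_score(e2e_logp=-4.0, ext_lm_logp=-2.0, est_internal_lm_logp=-3.0)
assert math.isclose(score, -4.0 + 0.6 * (-2.0) - 0.3 * (-3.0))
```

Subtracting the internal LM term avoids double-counting the linguistic prior the E2E model absorbed from its source-domain transcripts when an external target-domain LM is added.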
-
Patent number: 11429860
Abstract: Systems and methods are provided for generating a DNN classifier by "learning" a "student" DNN model from a larger, more accurate "teacher" DNN model. The student DNN may be trained from un-labeled training data because its supervised signal is obtained by passing the un-labeled training data through the teacher DNN. In one embodiment, an iterative process is applied to train the student DNN by minimizing the divergence between the output distributions of the teacher and student DNN models. For each iteration until convergence, the difference in the output distributions is used to update the student DNN model, and output distributions are determined again using the unlabeled training data. The resulting trained student model may be suitable for providing accurate signal processing applications on devices having limited computational or storage resources, such as mobile or wearable devices. In an embodiment, the teacher DNN model comprises an ensemble of DNN models.
Type: Grant
Filed: September 14, 2015
Date of Patent: August 30, 2022
Assignee: Microsoft Technology Licensing, LLC
Inventors: Jinyu Li, Rui Zhao, Jui-Ting Huang, Yifan Gong
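The teacher-student iteration described above can be shown numerically with toy logits: one gradient step on the student reduces the KL divergence from the teacher's output distribution. This is a pure-Python sketch of the general distillation idea, not the patented system.

```python
# Toy teacher-student step: update student logits to shrink KL(teacher || student).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = softmax([2.0, 0.5, -1.0])   # supervision signal from the teacher DNN
student_logits = [0.0, 0.0, 0.0]

# One gradient step on the student logits; d KL / d logit = q - p for softmax.
lr = 1.0
before = kl(teacher, softmax(student_logits))
q = softmax(student_logits)
student_logits = [z - lr * (qi - pi)
                  for z, qi, pi in zip(student_logits, q, teacher)]
after = kl(teacher, softmax(student_logits))
assert after < before  # the student moved toward the teacher's distribution
```

Because the supervision is the teacher's output distribution rather than a label, this loop runs on unlabeled data, which is exactly what makes the approach practical for compressing models onto resource-limited devices.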
-
Publication number: 20220254334
Abstract: Generally discussed herein are devices, systems, and methods for custom wake word selection assistance. A method can include receiving, at a device, data indicating a custom wake word provided by a user, determining one or more characteristics of the custom wake word, determining that use of the custom wake word will cause more than a threshold rate of false detections based on the characteristics, rejecting the custom wake word as the wake word for accessing a personal assistant in response to determining that use of the custom wake word will cause more than a threshold rate of false detections, and setting the custom wake word as the wake word in response to determining that use of the custom wake word will not cause more than the threshold rate of false detections.
Type: Application
Filed: December 1, 2021
Publication date: August 11, 2022
Inventors: Emilian Stoimenov, Khuram Shahid, Guoli Ye, Hosam Adel Khalil, Yifan Gong
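The accept/reject flow in this abstract can be sketched with a stand-in scoring function; the real method derives characteristics the patent does not enumerate here, so the heuristic below (syllable and length counts) and the threshold are purely hypothetical.

```python
# Illustrative selection flow: estimate a false-detection rate from simple
# characteristics of a proposed wake word, then accept or reject it.
# The heuristic and threshold are invented for illustration only.
VOWELS = set("aeiou")
FALSE_DETECT_THRESHOLD = 0.1  # assumed maximum tolerable false-detection rate

def estimated_false_rate(word: str) -> float:
    """Toy heuristic: short, few-syllable words fire spuriously more often."""
    syllables = sum(ch in VOWELS for ch in word.lower()) or 1
    return min(1.0, 0.45 / syllables / max(len(word) / 4, 1))

def set_wake_word(candidate: str, current: str) -> str:
    if estimated_false_rate(candidate) > FALSE_DETECT_THRESHOLD:
        return current      # rejected: too many predicted false detections
    return candidate        # accepted as the new wake word

assert set_wake_word("ok", "cortana") == "cortana"   # short word rejected
assert set_wake_word("abracadabra", "cortana") == "abracadabra"
```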
-
Publication number: 20220189461
Abstract: A computer system is provided that includes a processor configured to store a set of audio training data that includes a plurality of audio segments and metadata indicating a word or phrase associated with each audio segment. For a target training statement of a set of structured text data, the processor is configured to generate a concatenated audio signal that matches a word content of a target training statement by comparing the words or phrases of a plurality of text segments of the target training statement to respective words or phrases of audio segments of the stored set of audio training data, selecting a plurality of audio segments from the set of audio training data based on a match in the words or phrases between the plurality of text segments of the target training statement and the selected plurality of audio segments, and concatenating the selected plurality of audio segments.
Type: Application
Filed: December 16, 2020
Publication date: June 16, 2022
Applicant: Microsoft Technology Licensing, LLC
Inventors: Rui ZHAO, Jinyu LI, Yifan GONG
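The match-and-concatenate step this abstract describes can be sketched with a toy audio bank keyed by word metadata; the data, integer "waveforms", and helper name are illustrative assumptions.

```python
# Hedged sketch: for each word of a target training statement, select a
# stored audio segment whose metadata matches, then concatenate the segments.
# Toy integer lists stand in for real audio samples.
audio_bank = {
    "turn": [1, 2], "on": [3], "the": [4], "lights": [5, 6],
}

def synthesize(statement: str):
    """Concatenate matching audio segments for a structured-text statement."""
    pieces = []
    for word in statement.lower().split():
        segment = audio_bank.get(word)
        if segment is None:
            raise KeyError(f"no stored segment matches {word!r}")
        pieces.extend(segment)  # concatenate the selected segment
    return pieces

assert synthesize("turn on the lights") == [1, 2, 3, 4, 5, 6]
```

The payoff is cheap, targeted training audio: structured text that never had a recording can still yield an utterance assembled from existing labeled segments.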
-
Publication number: 20220165290
Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.
Type: Application
Filed: November 30, 2021
Publication date: May 26, 2022
Inventors: Zhong MENG, Yong ZHAO, Jinyu LI, Yifan GONG
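The three-way objective in this abstract (minimize speaker loss, maximize condition loss for the extractor, minimize condition loss for the condition classifier) is the gradient-reversal pattern from adversarial training. A scalar toy update illustrates the sign flip; the weight, learning rate, and values are assumptions, and real networks replace the scalars.

```python
# Toy parameter update showing the adversarial sign pattern described above.
import math

GRL_WEIGHT = 0.5  # assumed gradient-reversal weight

def update(theta_f, theta_s, theta_c, g_speaker_f, g_speaker_s,
           g_cond_f, g_cond_c, lr=0.1):
    # Extractor: descend speaker loss, *ascend* condition loss (reversed grad).
    theta_f -= lr * (g_speaker_f - GRL_WEIGHT * g_cond_f)
    theta_s -= lr * g_speaker_s   # speaker classifier: minimize speaker loss
    theta_c -= lr * g_cond_c      # condition classifier: minimize condition loss
    return theta_f, theta_s, theta_c

f, s, c = update(1.0, 1.0, 1.0, g_speaker_f=0.2, g_speaker_s=0.4,
                 g_cond_f=0.6, g_cond_c=0.8)
# The reversed sign pushes the extractor to *increase* condition-classification
# loss, squeezing noise-condition information out of the features.
assert math.isclose(f, 1.0 - 0.1 * (0.2 - 0.5 * 0.6))
assert math.isclose(s, 0.96) and math.isclose(c, 0.92)
```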
-
Publication number: 20220159047
Abstract: Systems are provided for managing and coordinating STT/TTS systems and the communications between these systems when they are connected in online meetings, and for mitigating connectivity issues that may arise during the online meetings, to provide a seamless and reliable meeting experience with live captions and/or rendered audio. Initially, online meeting communications are transmitted over a lossy, connectionless protocol/channel. Then, in response to detected connectivity problems with one or more systems involved in the online meeting, which can cause jitter or packet loss, for example, an instruction is dynamically generated and processed for causing one or more of the connected systems to transmit and/or process the online meeting content with a more reliable connection/protocol, such as a connection-oriented protocol. Codecs at the systems are used, when needed, to convert speech to text with related speech attribute information and to convert text to speech.
Type: Application
Filed: June 11, 2021
Publication date: May 19, 2022
Inventors: Akash Alok MAHAJAN, Yifan GONG
-
Publication number: 20220157324
Abstract: Embodiments may include determination, for each of a plurality of speech frames associated with an acoustic feature, of a phonetic feature based on the associated acoustic feature, generation of one or more two-dimensional feature maps based on the plurality of phonetic features, input of the one or more two-dimensional feature maps to a trained neural network to generate a plurality of speaker embeddings, and aggregation of the plurality of speaker embeddings into a speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, wherein the speaker embedding is associated with an identity of the speaker.
Type: Application
Filed: February 7, 2022
Publication date: May 19, 2022
Inventors: Yong ZHAO, Tianyan ZHOU, Jinyu LI, Yifan GONG, Jian WU, Zhuo CHEN
-
Publication number: 20220139380
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The one or more processors are further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
Type: Application
Filed: January 21, 2021
Publication date: May 5, 2022
Applicant: Microsoft Technology Licensing, LLC
Inventors: Zhong MENG, Sarangarajan PARTHASARATHY, Xie SUN, Yashesh GAUR, Naoyuki KANDA, Liang LU, Xie CHEN, Rui ZHAO, Jinyu LI, Yifan GONG
-
Publication number: 20220130376
Abstract: Embodiments are associated with a speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-independent attention-based encoder-decoder model associated with a first output distribution, and a speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-dependent attention-based encoder-decoder model associated with a second output distribution. The second attention-based encoder-decoder model is trained to classify output tokens based on input speech frames of a target speaker and simultaneously trained to maintain a similarity between the first output distribution and the second output distribution.
Type: Application
Filed: January 5, 2022
Publication date: April 28, 2022
Inventors: Zhong MENG, Yashesh GAUR, Jinyu LI, Yifan GONG
-
Patent number: 11276410
Abstract: Embodiments may include reception of a plurality of speech frames, determination of a multi-dimensional acoustic feature associated with each of the plurality of speech frames, determination of a plurality of multi-dimensional phonetic features, each of the plurality of multi-dimensional phonetic features determined based on a respective one of the plurality of speech frames, generation of a plurality of two-dimensional feature maps based on the phonetic features, input of the feature maps and the plurality of acoustic features to a convolutional neural network, the convolutional neural network to generate a plurality of speaker embeddings based on the plurality of feature maps and the plurality of acoustic features, aggregation of the plurality of speaker embeddings into a first speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, and determination of a speaker associated with the plurality of speech frames based on the first speaker embedding.
Type: Grant
Filed: November 13, 2019
Date of Patent: March 15, 2022
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Yong Zhao, Tianyan Zhou, Jinyu Li, Yifan Gong, Jian Wu, Zhuo Chen
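The aggregation step at the end of this abstract, combining per-frame speaker embeddings into one utterance-level embedding via learned weights, can be sketched as a softmax-weighted sum. The scores and vectors below are toy values; in the patented system the weights come from the network itself.

```python
# Sketch of attentive aggregation: per-frame speaker embeddings are combined
# into a single embedding using softmax-normalized weights.
import math

def aggregate(embeddings, scores):
    """Weighted sum of embeddings using softmax(scores) as the weights."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    w = [x / total for x in w]
    dim = len(embeddings[0])
    return [sum(w[i] * embeddings[i][d] for i in range(len(embeddings)))
            for d in range(dim)]

emb = aggregate([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])  # equal weights
assert all(math.isclose(x, 0.5) for x in emb)
```

Weighting lets the model lean on frames that are informative about speaker identity and discount silence or noise, rather than averaging all frames uniformly.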
-
Publication number: 20220068269
Abstract: Embodiments may include collection of a first batch of acoustic feature frames of an audio signal, the number of acoustic feature frames of the first batch equal to a first batch size, input of the first batch to a speech recognition network, collection, in response to detection of a word hypothesis output by the speech recognition network, of a second batch of acoustic feature frames of the audio signal, the number of acoustic feature frames of the second batch equal to a second batch size greater than the first batch size, and input of the second batch to the speech recognition network.
Type: Application
Filed: October 13, 2021
Publication date: March 3, 2022
Inventors: Hosam A. KHALIL, Emilian Y. STOIMENOV, Yifan GONG, Chaojun LIU, Christopher H. BASOGLU, Amit K. AGARWAL, Naveen PARIHAR, Sayan PATHAK
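The batching policy this abstract describes, small batches until the recognizer emits its first word hypothesis and larger batches afterward, can be sketched with a stubbed hypothesis trigger; the batch sizes and the trigger are assumptions for illustration.

```python
# Illustrative dynamic batching: small batches before the first word
# hypothesis (low latency), larger batches after (high throughput).
SMALL_BATCH = 4   # assumed pre-hypothesis batch size
LARGE_BATCH = 16  # assumed post-hypothesis batch size

def batch_frames(frames, first_hypothesis_at):
    """Return batch sizes used while consuming a stream of feature frames;
    `first_hypothesis_at` stubs the recognizer's first word hypothesis."""
    sizes, consumed, hypothesized = [], 0, False
    while consumed < len(frames):
        size = LARGE_BATCH if hypothesized else SMALL_BATCH
        batch = frames[consumed:consumed + size]
        consumed += len(batch)
        sizes.append(len(batch))
        if not hypothesized and consumed >= first_hypothesis_at:
            hypothesized = True  # word hypothesis detected: grow the batch
    return sizes

# 40 frames, first word hypothesis after frame 8: two small batches, then large.
assert batch_frames(list(range(40)), first_hypothesis_at=8) == [4, 4, 16, 16]
```

The trade this encodes: small batches keep the first visible caption fast, while larger batches amortize compute once words are already flowing.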
-
Publication number: 20220058442
Abstract: Representative embodiments disclose machine learning classifiers used in scenarios such as speech recognition, image captioning, machine translation, or other sequence-to-sequence embodiments. The machine learning classifiers have a plurality of time layers, each layer having a time processing block and a depth processing block. The time processing block is a recurrent neural network such as a Long Short-Term Memory (LSTM) network. The depth processing blocks can be an LSTM network, a gated Deep Neural Network (DNN) or a maxout DNN. The depth processing blocks account for the hidden states of each time layer and use summarized layer information for final input signal feature classification. An attention layer can also be used between the top depth processing block and the output layer.
Type: Application
Filed: November 3, 2021
Publication date: February 24, 2022
Inventors: Jinyu Li, Liang Lu, Changliang Liu, Yifan Gong
-
Patent number: 11244673
Abstract: Streaming machine learning unidirectional models is facilitated by the use of embedding vectors. Processing blocks in the models apply embedding vectors as input. The embedding vectors utilize context of future data (e.g., data that is temporally offset into the future within a data stream) to improve the accuracy of the outputs generated by the processing blocks. The embedding vectors cause a temporal shift between the outputs of the processing blocks and the inputs to which the outputs correspond. This temporal shift enables the processing blocks to apply the embedding vector inputs from processing blocks that are associated with future data.
Type: Grant
Filed: July 19, 2019
Date of Patent: February 8, 2022
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Jinyu Li, Amit Kumar Agarwal, Yifan Gong, Harini Kesavamoorthy, Changliang Liu, Liang Lu
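The temporal shift this abstract describes, where each output also consumes a few future frames, can be illustrated with a fixed lookahead window; simple averaging stands in for the learned embedding-vector combination, and the lookahead value is an assumption.

```python
# Sketch of the lookahead idea: output[t] is computed from frames t..t+k,
# so outputs lag inputs by k frames (the temporal shift described above).
LOOKAHEAD = 2  # assumed number of future frames folded into each output

def shifted_outputs(stream):
    """Each output summarizes a frame plus LOOKAHEAD future frames; the
    final LOOKAHEAD outputs are deferred until their context arrives."""
    outs = []
    for t in range(len(stream) - LOOKAHEAD):
        window = stream[t:t + LOOKAHEAD + 1]
        outs.append(sum(window) / len(window))  # stand-in for the model block
    return outs

outs = shifted_outputs([0, 3, 6, 9, 12])
# The output for t=0 uses frames 0..2, so here it equals their mean.
assert outs == [3.0, 6.0, 9.0]
```

A unidirectional streaming model gains some of a bidirectional model's accuracy this way, at the cost of a small, fixed latency rather than waiting for the whole utterance.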
-
Publication number: 20220028399
Abstract: To generate substantially domain-invariant and speaker-discriminative features, embodiments may operate to extract features from input data based on a first set of parameters, generate outputs based on the extracted features and on a second set of parameters, and identify words represented by the input data based on the outputs, wherein the first set of parameters and the second set of parameters have been trained to minimize a network loss associated with the second set of parameters, wherein the first set of parameters has been trained to maximize the domain classification loss of a network comprising 1) an attention network to determine, based on a third set of parameters, relative importances of features extracted based on the first parameters to domain classification and 2) a domain classifier to classify a domain based on the extracted features, the relative importances, and a fourth set of parameters, and wherein the third set of parameters and the fourth set of parameters have been trained to minimize …
Type: Application
Filed: October 5, 2021
Publication date: January 27, 2022
Inventors: Zhong MENG, Jinyu LI, Yifan GONG
-
Patent number: 11232782
Abstract: Embodiments are associated with a speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-independent attention-based encoder-decoder model associated with a first output distribution, a speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-dependent attention-based encoder-decoder model associated with a second output distribution, training of the second attention-based encoder-decoder model to classify output tokens based on input speech frames of a target speaker and simultaneously training the speaker-dependent attention-based encoder-decoder model to maintain a similarity between the first output distribution and the second output distribution, and performing automatic speech recognition on speech frames of the target speaker using the trained speaker-dependent attention-based encoder-decoder model.
Type: Grant
Filed: November 6, 2019
Date of Patent: January 25, 2022
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong
-
Patent number: 11222622
Abstract: Generally discussed herein are devices, systems, and methods for custom wake word selection assistance. A method can include receiving, at a device, data indicating a custom wake word provided by a user, determining one or more characteristics of the custom wake word, determining that use of the custom wake word will cause more than a threshold rate of false detections based on the characteristics, rejecting the custom wake word as the wake word for accessing a personal assistant in response to determining that use of the custom wake word will cause more than a threshold rate of false detections, and setting the custom wake word as the wake word in response to determining that use of the custom wake word will not cause more than the threshold rate of false detections.
Type: Grant
Filed: July 25, 2019
Date of Patent: January 11, 2022
Assignee: Microsoft Technology Licensing, LLC
Inventors: Emilian Stoimenov, Khuram Shahid, Guoli Ye, Hosam Adel Khalil, Yifan Gong
-
Patent number: 11217265
Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.
Type: Grant
Filed: June 7, 2019
Date of Patent: January 4, 2022
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Zhong Meng, Yong Zhao, Jinyu Li, Yifan Gong