Patents by Inventor Yifan Gong
Yifan Gong has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11563784
Abstract: Systems are provided for managing and coordinating STT/TTS systems and the communications between these systems when they are connected in online meetings, and for mitigating connectivity issues that may arise during the online meetings, to provide a seamless and reliable meeting experience with live captions and/or rendered audio. Initially, online meeting communications are transmitted over a lossy, connectionless protocol/channel. Then, in response to detected connectivity problems with one or more systems involved in the online meeting, which can cause jitter or packet loss, for example, an instruction is dynamically generated and processed for causing one or more of the connected systems to transmit and/or process the online meeting content with a more reliable connection/protocol, such as a connection-oriented protocol. Codecs at the systems are used, when needed, to convert speech to text with related speech attribute information and to convert text to speech.
Type: Grant
Filed: June 11, 2021
Date of Patent: January 24, 2023
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Akash Alok Mahajan, Yifan Gong
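The abstract above describes starting meeting traffic on a lossy connectionless transport and dynamically switching to a connection-oriented one when connectivity degrades. A minimal sketch of that fallback decision follows; the class name, thresholds, and transport labels are illustrative assumptions, not from the patent.

```python
# Illustrative sketch (not the patented implementation): a meeting client that
# starts on a lossy connectionless transport and falls back to a
# connection-oriented one when packet loss or jitter crosses a threshold.
LOSS_THRESHOLD = 0.05     # assumed: switch transports above 5% packet loss
JITTER_THRESHOLD_MS = 80  # assumed jitter limit

class MeetingLink:
    def __init__(self):
        self.transport = "udp"  # lossy, connectionless default

    def report_metrics(self, loss_rate, jitter_ms):
        """Dynamically generate a 'switch transport' instruction on bad links."""
        if self.transport == "udp" and (
            loss_rate > LOSS_THRESHOLD or jitter_ms > JITTER_THRESHOLD_MS
        ):
            self.transport = "tcp"  # reliable, connection-oriented fallback
        return self.transport

link = MeetingLink()
assert link.report_metrics(loss_rate=0.01, jitter_ms=20) == "udp"
assert link.report_metrics(loss_rate=0.12, jitter_ms=20) == "tcp"
```

Once switched, the sketch stays on the reliable transport; the patent's actual instruction generation and codec handling are far richer than this toy.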
-
Patent number: 11562745
Abstract: A computing system including one or more processors configured to receive an audio input. The one or more processors may generate a text transcription of the audio input at a sequence-to-sequence speech recognition model, which may assign a respective plurality of external-model text tokens to a plurality of frames included in the audio input. Each external-model text token may have an external-model alignment within the audio input. Based on the audio input, the one or more processors may generate a plurality of hidden states. Based on the plurality of hidden states, the one or more processors may generate a plurality of output text tokens. Each output text token may have a corresponding output alignment within the audio input. For each output text token, a latency between the output alignment and the external-model alignment may be below a predetermined latency threshold. The one or more processors may output the text transcription.
Type: Grant
Filed: April 6, 2020
Date of Patent: January 24, 2023
Assignee: Microsoft Technology Licensing, LLC
Inventors: Yashesh Gaur, Jinyu Li, Liang Lu, Hirofumi Inaguma, Yifan Gong
-
Publication number: 20220414024
Abstract: The present disclosure provides a communication method, a related communication apparatus, and a storage medium. The communication method includes: generating a first key by using a random sequence; encrypting data by using the first key to generate encrypted data; writing the encrypted data into a memory; encrypting the random sequence and a storage address of the encrypted data in the memory by using a public key; and sending the encrypted storage address and the encrypted random sequence from a first node to a second node.
Type: Application
Filed: June 24, 2022
Publication date: December 29, 2022
Inventors: Yifan GONG, Jiangming JIN
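The five steps in this abstract can be traced with a deliberately simplified sketch: a hash-derived keystream stands in for the real cipher, and the public-key step is a stub tuple. All helpers (`keystream`, `xor`, the envelope layout) are illustrative, not the patented scheme.

```python
# Simplified trace of the described flow; NOT real cryptography.
import hashlib
import secrets

def keystream(seed: bytes, n: int) -> bytes:
    """Derive n keystream bytes from a seed (stand-in for a real cipher)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def xor(data: bytes, ks: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, ks))

# 1. Generate a first key from a random sequence.
random_seq = secrets.token_bytes(16)
# 2. Encrypt the payload with that key, and 3. write it into "memory".
payload = b"sensor frame"
memory = {0x1000: xor(payload, keystream(random_seq, len(payload)))}
# 4. Encrypt the random sequence and storage address with a public key
#    (stubbed here), then 5. send both from the first node to the second.
envelope = ("PUBKEY-ENC", 0x1000, random_seq)  # stand-in for RSA/ECIES

# Second node: recover address + seed, read memory, regenerate the keystream.
_, addr, seed = envelope
recovered = xor(memory[addr], keystream(seed, len(memory[addr])))
assert recovered == payload
```

The design point the abstract relies on is that the bulky encrypted data travels (or rests) separately from the small public-key-protected envelope carrying the key material and address.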
-
Patent number: 11527238
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The one or more processors are further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
Type: Grant
Filed: January 21, 2021
Date of Patent: December 13, 2022
Assignee: Microsoft Technology Licensing, LLC
Inventors: Zhong Meng, Sarangarajan Parthasarathy, Xie Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong
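The score integration this abstract describes amounts to adding a weighted external LM score while subtracting the estimated internal LM score. A hedged sketch of that fusion in log space follows; the weight values are assumptions for illustration, not taken from the patent.

```python
# Hedged sketch of the integrated score: combine the E2E score with an
# external (target-domain) LM score while discounting the estimated internal
# LM score of the source-domain E2E model. Weights are illustrative.
import math

def integrated_score(e2e_logp, ext_lm_logp, est_internal_lm_logp,
                     lm_weight=0.6, ilm_weight=0.3):
    """Log-domain fusion: boost target-domain LM, subtract source-domain ILM."""
    return e2e_logp + lm_weight * ext_lm_logp - ilm_weight * est_internal_lm_logp

score = integrated_score(e2e_logp=-4.0, ext_lm_logp=-2.0, est_internal_lm_logp=-3.0)
assert math.isclose(score, -4.0 + 0.6 * (-2.0) - 0.3 * (-3.0))
```

Subtracting the internal LM term avoids double-counting the linguistic prior the E2E model absorbed from its source-domain transcripts when an external target-domain LM is added.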
-
Patent number: 11429860
Abstract: Systems and methods are provided for generating a DNN classifier by "learning" a "student" DNN model from a larger, more accurate "teacher" DNN model. The student DNN may be trained from un-labeled training data because its supervised signal is obtained by passing the un-labeled training data through the teacher DNN. In one embodiment, an iterative process is applied to train the student DNN by minimizing the divergence between the output distributions of the teacher and student DNN models. For each iteration until convergence, the difference in the output distributions is used to update the student DNN model, and output distributions are determined again using the unlabeled training data. The resulting trained student model may be suitable for providing accurate signal processing applications on devices having limited computational or storage resources, such as mobile or wearable devices. In an embodiment, the teacher DNN model comprises an ensemble of DNN models.
Type: Grant
Filed: September 14, 2015
Date of Patent: August 30, 2022
Assignee: Microsoft Technology Licensing, LLC
Inventors: Jinyu Li, Rui Zhao, Jui-Ting Huang, Yifan Gong
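The teacher-student iteration described above can be shown numerically with toy logits: one gradient step on the student reduces the KL divergence from the teacher's output distribution. This is a pure-Python sketch of the general distillation idea, not the patented system.

```python
# Toy teacher-student step: update student logits to shrink KL(teacher || student).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = softmax([2.0, 0.5, -1.0])   # supervision signal from the teacher DNN
student_logits = [0.0, 0.0, 0.0]

# One gradient step on the student logits; d KL / d logit = q - p for softmax.
lr = 1.0
before = kl(teacher, softmax(student_logits))
q = softmax(student_logits)
student_logits = [z - lr * (qi - pi)
                  for z, qi, pi in zip(student_logits, q, teacher)]
after = kl(teacher, softmax(student_logits))
assert after < before  # the student moved toward the teacher's distribution
```

Because the supervision is the teacher's output distribution rather than a label, this loop runs on unlabeled data, which is exactly what makes the approach practical for compressing models onto resource-limited devices.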
-
Publication number: 20220254334
Abstract: Generally discussed herein are devices, systems, and methods for custom wake word selection assistance. A method can include receiving, at a device, data indicating a custom wake word provided by a user, determining one or more characteristics of the custom wake word, determining that use of the custom wake word will cause more than a threshold rate of false detections based on the characteristics, rejecting the custom wake word as the wake word for accessing a personal assistant in response to determining that use of the custom wake word will cause more than a threshold rate of false detections, and setting the custom wake word as the wake word in response to determining that use of the custom wake word will not cause more than the threshold rate of false detections.
Type: Application
Filed: December 1, 2021
Publication date: August 11, 2022
Inventors: Emilian Stoimenov, Khuram Shahid, Guoli Ye, Hosam Adel Khalil, Yifan Gong
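The accept/reject flow in this abstract can be sketched with a stand-in scoring function; the real method derives characteristics the patent does not enumerate here, so the heuristic below (syllable and length counts) and the threshold are purely hypothetical.

```python
# Illustrative selection flow: estimate a false-detection rate from simple
# characteristics of a proposed wake word, then accept or reject it.
# The heuristic and threshold are invented for illustration only.
VOWELS = set("aeiou")
FALSE_DETECT_THRESHOLD = 0.1  # assumed maximum tolerable false-detection rate

def estimated_false_rate(word: str) -> float:
    """Toy heuristic: short, few-syllable words fire spuriously more often."""
    syllables = sum(ch in VOWELS for ch in word.lower()) or 1
    return min(1.0, 0.45 / syllables / max(len(word) / 4, 1))

def set_wake_word(candidate: str, current: str) -> str:
    if estimated_false_rate(candidate) > FALSE_DETECT_THRESHOLD:
        return current      # rejected: too many predicted false detections
    return candidate        # accepted as the new wake word

assert set_wake_word("ok", "cortana") == "cortana"   # short word rejected
assert set_wake_word("abracadabra", "cortana") == "abracadabra"
```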
-
Publication number: 20220189461
Abstract: A computer system is provided that includes a processor configured to store a set of audio training data that includes a plurality of audio segments and metadata indicating a word or phrase associated with each audio segment. For a target training statement of a set of structured text data, the processor is configured to generate a concatenated audio signal that matches a word content of a target training statement by comparing the words or phrases of a plurality of text segments of the target training statement to respective words or phrases of audio segments of the stored set of audio training data, selecting a plurality of audio segments from the set of audio training data based on a match in the words or phrases between the plurality of text segments of the target training statement and the selected plurality of audio segments, and concatenating the selected plurality of audio segments.
Type: Application
Filed: December 16, 2020
Publication date: June 16, 2022
Applicant: Microsoft Technology Licensing, LLC
Inventors: Rui ZHAO, Jinyu LI, Yifan GONG
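The match-and-concatenate step this abstract describes can be sketched with a toy audio bank keyed by word metadata; the data, integer "waveforms", and helper name are illustrative assumptions.

```python
# Hedged sketch: for each word of a target training statement, select a
# stored audio segment whose metadata matches, then concatenate the segments.
# Toy integer lists stand in for real audio samples.
audio_bank = {
    "turn": [1, 2], "on": [3], "the": [4], "lights": [5, 6],
}

def synthesize(statement: str):
    """Concatenate matching audio segments for a structured-text statement."""
    pieces = []
    for word in statement.lower().split():
        segment = audio_bank.get(word)
        if segment is None:
            raise KeyError(f"no stored segment matches {word!r}")
        pieces.extend(segment)  # concatenate the selected segment
    return pieces

assert synthesize("turn on the lights") == [1, 2, 3, 4, 5, 6]
```

The payoff is cheap, targeted training audio: structured text that never had a recording can still yield an utterance assembled from existing labeled segments.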
-
Publication number: 20220165290
Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.
Type: Application
Filed: November 30, 2021
Publication date: May 26, 2022
Inventors: Zhong MENG, Yong ZHAO, Jinyu LI, Yifan GONG
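The three-way objective in this abstract (minimize speaker loss, maximize condition loss for the extractor, minimize condition loss for the condition classifier) is the gradient-reversal pattern from adversarial training. A scalar toy update illustrates the sign flip; the weight, learning rate, and values are assumptions, and real networks replace the scalars.

```python
# Toy parameter update showing the adversarial sign pattern described above.
import math

GRL_WEIGHT = 0.5  # assumed gradient-reversal weight

def update(theta_f, theta_s, theta_c, g_speaker_f, g_speaker_s,
           g_cond_f, g_cond_c, lr=0.1):
    # Extractor: descend speaker loss, *ascend* condition loss (reversed grad).
    theta_f -= lr * (g_speaker_f - GRL_WEIGHT * g_cond_f)
    theta_s -= lr * g_speaker_s   # speaker classifier: minimize speaker loss
    theta_c -= lr * g_cond_c      # condition classifier: minimize condition loss
    return theta_f, theta_s, theta_c

f, s, c = update(1.0, 1.0, 1.0, g_speaker_f=0.2, g_speaker_s=0.4,
                 g_cond_f=0.6, g_cond_c=0.8)
# The reversed sign pushes the extractor to *increase* condition-classification
# loss, squeezing noise-condition information out of the features.
assert math.isclose(f, 1.0 - 0.1 * (0.2 - 0.5 * 0.6))
assert math.isclose(s, 0.96) and math.isclose(c, 0.92)
```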
-
Publication number: 20220159047
Abstract: Systems are provided for managing and coordinating STT/TTS systems and the communications between these systems when they are connected in online meetings, and for mitigating connectivity issues that may arise during the online meetings, to provide a seamless and reliable meeting experience with live captions and/or rendered audio. Initially, online meeting communications are transmitted over a lossy, connectionless protocol/channel. Then, in response to detected connectivity problems with one or more systems involved in the online meeting, which can cause jitter or packet loss, for example, an instruction is dynamically generated and processed for causing one or more of the connected systems to transmit and/or process the online meeting content with a more reliable connection/protocol, such as a connection-oriented protocol. Codecs at the systems are used, when needed, to convert speech to text with related speech attribute information and to convert text to speech.
Type: Application
Filed: June 11, 2021
Publication date: May 19, 2022
Inventors: Akash Alok MAHAJAN, Yifan GONG
-
Publication number: 20220157324
Abstract: Embodiments may include determination, for each of a plurality of speech frames associated with an acoustic feature, of a phonetic feature based on the associated acoustic feature, generation of one or more two-dimensional feature maps based on the plurality of phonetic features, input of the one or more two-dimensional feature maps to a trained neural network to generate a plurality of speaker embeddings, and aggregation of the plurality of speaker embeddings into a speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, wherein the speaker embedding is associated with an identity of the speaker.
Type: Application
Filed: February 7, 2022
Publication date: May 19, 2022
Inventors: Yong ZHAO, Tianyan ZHOU, Jinyu LI, Yifan GONG, Jian WU, Zhuo CHEN
-
Publication number: 20220139380
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The one or more processors are further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
Type: Application
Filed: January 21, 2021
Publication date: May 5, 2022
Applicant: Microsoft Technology Licensing, LLC
Inventors: Zhong MENG, Sarangarajan PARTHASARATHY, Xie SUN, Yashesh GAUR, Naoyuki KANDA, Liang LU, Xie CHEN, Rui ZHAO, Jinyu LI, Yifan GONG
-
Publication number: 20220130376
Abstract: Embodiments are associated with a speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-independent attention-based encoder-decoder model associated with a first output distribution, and a speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-dependent attention-based encoder-decoder model associated with a second output distribution. The second attention-based encoder-decoder model is trained to classify output tokens based on input speech frames of a target speaker and simultaneously trained to maintain a similarity between the first output distribution and the second output distribution.
Type: Application
Filed: January 5, 2022
Publication date: April 28, 2022
Inventors: Zhong MENG, Yashesh GAUR, Jinyu LI, Yifan GONG
-
Patent number: 11276410
Abstract: Embodiments may include reception of a plurality of speech frames, determination of a multi-dimensional acoustic feature associated with each of the plurality of speech frames, determination of a plurality of multi-dimensional phonetic features, each of the plurality of multi-dimensional phonetic features determined based on a respective one of the plurality of speech frames, generation of a plurality of two-dimensional feature maps based on the phonetic features, input of the feature maps and the plurality of acoustic features to a convolutional neural network, the convolutional neural network to generate a plurality of speaker embeddings based on the plurality of feature maps and the plurality of acoustic features, aggregation of the plurality of speaker embeddings into a first speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, and determination of a speaker associated with the plurality of speech frames based on the first speaker embedding.
Type: Grant
Filed: November 13, 2019
Date of Patent: March 15, 2022
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Yong Zhao, Tianyan Zhou, Jinyu Li, Yifan Gong, Jian Wu, Zhuo Chen
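The aggregation step at the end of this abstract, combining per-frame speaker embeddings into one utterance-level embedding via learned weights, can be sketched as a softmax-weighted sum. The scores and vectors below are toy values; in the patented system the weights come from the network itself.

```python
# Sketch of attentive aggregation: per-frame speaker embeddings are combined
# into a single embedding using softmax-normalized weights.
import math

def aggregate(embeddings, scores):
    """Weighted sum of embeddings using softmax(scores) as the weights."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    w = [x / total for x in w]
    dim = len(embeddings[0])
    return [sum(w[i] * embeddings[i][d] for i in range(len(embeddings)))
            for d in range(dim)]

emb = aggregate([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])  # equal weights
assert all(math.isclose(x, 0.5) for x in emb)
```

Weighting lets the model lean on frames that are informative about speaker identity and discount silence or noise, rather than averaging all frames uniformly.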
-
Publication number: 20220068269
Abstract: Embodiments may include collection of a first batch of acoustic feature frames of an audio signal, the number of acoustic feature frames of the first batch equal to a first batch size, input of the first batch to a speech recognition network, collection, in response to detection of a word hypothesis output by the speech recognition network, of a second batch of acoustic feature frames of the audio signal, the number of acoustic feature frames of the second batch equal to a second batch size greater than the first batch size, and input of the second batch to the speech recognition network.
Type: Application
Filed: October 13, 2021
Publication date: March 3, 2022
Inventors: Hosam A. KHALIL, Emilian Y. STOIMENOV, Yifan GONG, Chaojun LIU, Christopher H. BASOGLU, Amit K. AGARWAL, Naveen PARIHAR, Sayan PATHAK
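The batching policy this abstract describes, small batches until the recognizer emits its first word hypothesis and larger batches afterward, can be sketched with a stubbed hypothesis trigger; the batch sizes and the trigger are assumptions for illustration.

```python
# Illustrative dynamic batching: small batches before the first word
# hypothesis (low latency), larger batches after (high throughput).
SMALL_BATCH = 4   # assumed pre-hypothesis batch size
LARGE_BATCH = 16  # assumed post-hypothesis batch size

def batch_frames(frames, first_hypothesis_at):
    """Return batch sizes used while consuming a stream of feature frames;
    `first_hypothesis_at` stubs the recognizer's first word hypothesis."""
    sizes, consumed, hypothesized = [], 0, False
    while consumed < len(frames):
        size = LARGE_BATCH if hypothesized else SMALL_BATCH
        batch = frames[consumed:consumed + size]
        consumed += len(batch)
        sizes.append(len(batch))
        if not hypothesized and consumed >= first_hypothesis_at:
            hypothesized = True  # word hypothesis detected: grow the batch
    return sizes

# 40 frames, first word hypothesis after frame 8: two small batches, then large.
assert batch_frames(list(range(40)), first_hypothesis_at=8) == [4, 4, 16, 16]
```

The trade this encodes: small batches keep the first visible caption fast, while larger batches amortize compute once words are already flowing.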
-
Publication number: 20220058442
Abstract: Representative embodiments disclose machine learning classifiers used in scenarios such as speech recognition, image captioning, machine translation, or other sequence-to-sequence embodiments. The machine learning classifiers have a plurality of time layers, each layer having a time processing block and a depth processing block. The time processing block is a recurrent neural network such as a Long Short-Term Memory (LSTM) network. The depth processing blocks can be an LSTM network, a gated Deep Neural Network (DNN) or a maxout DNN. The depth processing blocks account for the hidden states of each time layer and use summarized layer information for final input signal feature classification. An attention layer can also be used between the top depth processing block and the output layer.
Type: Application
Filed: November 3, 2021
Publication date: February 24, 2022
Inventors: Jinyu Li, Liang Lu, Changliang Liu, Yifan Gong
-
Patent number: 11244673
Abstract: Streaming machine learning unidirectional models is facilitated by the use of embedding vectors. Processing blocks in the models apply embedding vectors as input. The embedding vectors utilize context of future data (e.g., data that is temporally offset into the future within a data stream) to improve the accuracy of the outputs generated by the processing blocks. The embedding vectors cause a temporal shift between the outputs of the processing blocks and the inputs to which the outputs correspond. This temporal shift enables the processing blocks to apply the embedding vector inputs from processing blocks that are associated with future data.
Type: Grant
Filed: July 19, 2019
Date of Patent: February 8, 2022
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Jinyu Li, Amit Kumar Agarwal, Yifan Gong, Harini Kesavamoorthy, Changliang Liu, Liang Lu
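The temporal shift this abstract describes, where each output also consumes a few future frames, can be illustrated with a fixed lookahead window; simple averaging stands in for the learned embedding-vector combination, and the lookahead value is an assumption.

```python
# Sketch of the lookahead idea: output[t] is computed from frames t..t+k,
# so outputs lag inputs by k frames (the temporal shift described above).
LOOKAHEAD = 2  # assumed number of future frames folded into each output

def shifted_outputs(stream):
    """Each output summarizes a frame plus LOOKAHEAD future frames; the
    final LOOKAHEAD outputs are deferred until their context arrives."""
    outs = []
    for t in range(len(stream) - LOOKAHEAD):
        window = stream[t:t + LOOKAHEAD + 1]
        outs.append(sum(window) / len(window))  # stand-in for the model block
    return outs

outs = shifted_outputs([0, 3, 6, 9, 12])
# The output for t=0 uses frames 0..2, so here it equals their mean.
assert outs == [3.0, 6.0, 9.0]
```

A unidirectional streaming model gains some of a bidirectional model's accuracy this way, at the cost of a small, fixed latency rather than waiting for the whole utterance.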
-
Publication number: 20220028399
Abstract: To generate substantially domain-invariant and speaker-discriminative features, embodiments may operate to extract features from input data based on a first set of parameters, generate outputs based on the extracted features and on a second set of parameters, and identify words represented by the input data based on the outputs, wherein the first set of parameters and the second set of parameters have been trained to minimize a network loss associated with the second set of parameters, wherein the first set of parameters has been trained to maximize the domain classification loss of a network comprising 1) an attention network to determine, based on a third set of parameters, relative importances of features extracted based on the first parameters to domain classification and 2) a domain classifier to classify a domain based on the extracted features, the relative importances, and a fourth set of parameters, and wherein the third set of parameters and the fourth set of parameters have been trained to minimize …
Type: Application
Filed: October 5, 2021
Publication date: January 27, 2022
Inventors: Zhong MENG, Jinyu LI, Yifan GONG
-
Patent number: 11232782
Abstract: Embodiments are associated with a speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-independent attention-based encoder-decoder model associated with a first output distribution, a speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-dependent attention-based encoder-decoder model associated with a second output distribution, training of the second attention-based encoder-decoder model to classify output tokens based on input speech frames of a target speaker and simultaneously training the speaker-dependent attention-based encoder-decoder model to maintain a similarity between the first output distribution and the second output distribution, and performing automatic speech recognition on speech frames of the target speaker using the trained speaker-dependent attention-based encoder-decoder model.
Type: Grant
Filed: November 6, 2019
Date of Patent: January 25, 2022
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong
-
Patent number: 11222622
Abstract: Generally discussed herein are devices, systems, and methods for custom wake word selection assistance. A method can include receiving, at a device, data indicating a custom wake word provided by a user, determining one or more characteristics of the custom wake word, determining that use of the custom wake word will cause more than a threshold rate of false detections based on the characteristics, rejecting the custom wake word as the wake word for accessing a personal assistant in response to determining that use of the custom wake word will cause more than a threshold rate of false detections, and setting the custom wake word as the wake word in response to determining that use of the custom wake word will not cause more than the threshold rate of false detections.
Type: Grant
Filed: July 25, 2019
Date of Patent: January 11, 2022
Assignee: Microsoft Technology Licensing, LLC
Inventors: Emilian Stoimenov, Khuram Shahid, Guoli Ye, Hosam Adel Khalil, Yifan Gong
-
Patent number: 11217265
Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.
Type: Grant
Filed: June 7, 2019
Date of Patent: January 4, 2022
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Zhong Meng, Yong Zhao, Jinyu Li, Yifan Gong