Patents by Inventor Yifan Gong

Yifan Gong has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 10580432
    Abstract: Generally discussed herein are devices, systems, and methods for speech recognition. Processing circuitry can implement a connectionist temporal classification (CTC) neural network (NN) including an encode NN to receive an audio frame and generate a current encoded hidden feature vector, an attend NN to generate, based on the current encoded hidden feature vector and a first context vector from a previous time slice, a weight vector indicating how much the current encoded hidden feature vector, a previous encoded hidden feature vector, and a future encoded hidden feature vector from a future time slice contribute to a current, second context vector, an annotate NN to generate the current, second context vector based on the weight vector, the current encoded hidden feature vector, the previous encoded hidden feature vector, and the future encoded hidden feature vector, and a normal NN to generate a normalized output vector based on the context vector.
    Type: Grant
    Filed: February 28, 2018
    Date of Patent: March 3, 2020
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Amit Das, Jinyu Li, Rui Zhao, Yifan Gong
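The attend/annotate step described in the abstract above can be illustrated with a minimal numpy sketch. This is a toy, not the patented implementation: the bilinear scoring function, the dimensions, and all variable names are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_annotate(h_prev, h_cur, h_future, c_prev, W):
    """Score each encoded hidden vector (previous, current, future time
    slices) against the previous context vector, then form the new context
    vector as the attention-weighted sum of the three hidden vectors."""
    H = np.stack([h_prev, h_cur, h_future])   # (3, D) candidate vectors
    scores = H @ W @ c_prev                   # one scalar score per candidate
    alpha = softmax(scores)                   # weight vector, sums to 1
    return alpha @ H, alpha                   # (D,) new context vector

rng = np.random.default_rng(0)
D = 8
h_prev, h_cur, h_fut = rng.normal(size=(3, D))
c_prev = rng.normal(size=D)                  # context from the previous slice
W = rng.normal(size=(D, D))                  # hypothetical scoring parameter
c_new, alpha = attend_and_annotate(h_prev, h_cur, h_fut, c_prev, W)
assert c_new.shape == (D,)
assert abs(alpha.sum() - 1.0) < 1e-9
```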
  • Patent number: 10515301
    Abstract: Conversion of a large-footprint DNN to a small-footprint DNN is performed using a variety of techniques, including split-vector quantization. The small-footprint DNN may be distributed to a variety of devices, including mobile devices. Further, the small-footprint DNN may aid a digital assistant on a device in interpreting speech input.
    Type: Grant
    Filed: January 19, 2016
    Date of Patent: December 24, 2019
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Jinyu Li, Yifan Gong, Yongqiang Wang
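Split-vector quantization, named in the abstract above, can be sketched as follows: split each weight row into short sub-vectors and replace each sub-vector with the nearest entry of a small k-means codebook. This is a generic illustration of the technique, not the patent's specific procedure; the sub-vector size, codebook size, and function names are assumptions.

```python
import numpy as np

def split_vq(weights, sub_dim=2, n_codewords=4, iters=20, seed=0):
    """Compress a weight matrix via split-vector quantization: reshape it
    into sub-vectors, learn a small codebook with k-means, and store one
    codeword index per sub-vector instead of raw floats."""
    rng = np.random.default_rng(seed)
    subs = weights.reshape(-1, sub_dim)                      # all sub-vectors
    codebook = subs[rng.choice(len(subs), n_codewords, replace=False)].copy()
    for _ in range(iters):
        # Assign each sub-vector to its nearest codeword, then re-center.
        d = np.linalg.norm(subs[:, None, :] - codebook[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(n_codewords):
            if np.any(assign == k):
                codebook[k] = subs[assign == k].mean(axis=0)
    # Storage drops from one float per weight to one small index per
    # sub-vector plus the shared codebook.
    return codebook[assign].reshape(weights.shape), assign, codebook

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))
W_q, assign, codebook = split_vq(W)
assert W_q.shape == W.shape
assert len(np.unique(assign)) <= 4   # every sub-vector maps to a codeword
```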
  • Publication number: 20190341053
    Abstract: A computerized conference assistant includes a camera and a microphone. A face location machine of the computerized conference assistant finds a physical location of a human, based on a position of a candidate face in digital video captured by the camera. A beamforming machine of the computerized conference assistant outputs a beamformed signal isolating sounds originating from the physical location of the human. A diarization machine of the computerized conference assistant attributes information encoded in the beamformed signal to the human.
    Type: Application
    Filed: June 26, 2018
    Publication date: November 7, 2019
    Applicant: Microsoft Technology Licensing, LLC
    Inventors: Shixiong ZHANG, Lingfeng WU, Eyal KRUPKA, Xiong XIAO, Yifan GONG
  • Publication number: 20190317804
    Abstract: The present disclosure provides a method, an apparatus, and a system for multi-module scheduling, capable of solving at least one of the problems of multi-module scheduling techniques in the related art: inconsistency in the data input to a computing module, and significant delay or low throughput in data transmission between computing modules.
    Type: Application
    Filed: February 14, 2019
    Publication date: October 17, 2019
    Inventors: Yifan GONG, Zehua HUANG, Jiangming JIN, Dinghua LI, Siyuan LIU, Wei LIU, Lei SU, YiXin YANG
  • Publication number: 20190287515
    Abstract: Methods, systems, and computer programs are presented for training, with adversarial constraints, a student model for speech recognition based on a teacher model. One method includes operations for training a teacher model based on teacher speech data, initializing a student model with parameters obtained from the teacher model, and training the student model with adversarial teacher-student learning based on the teacher speech data and student speech data. Training the student model with adversarial teacher-student learning further includes minimizing a teacher-student loss that measures a divergence of outputs between the teacher model and the student model; minimizing a classifier condition loss with respect to parameters of a condition classifier; and maximizing the classifier condition loss with respect to parameters of a feature extractor. The classifier condition loss measures errors caused by acoustic condition classification. Further, speech is recognized with the trained student model.
    Type: Application
    Filed: March 16, 2018
    Publication date: September 19, 2019
    Inventors: Jinyu Li, Zhong Meng, Yifan Gong
  • Publication number: 20190286489
    Abstract: The present disclosure provides a method, an apparatus, and a system for multi-module scheduling, capable of solving the problem of inconsistency in the data input to a computing module in multi-module scheduling techniques in the related art.
    Type: Application
    Filed: February 14, 2019
    Publication date: September 19, 2019
    Inventors: Yifan GONG, Zehua HUANG, Jiangming JIN, Dinghua LI, Siyuan LIU, Wei LIU, Lei SU, YiXin YANG
  • Publication number: 20190279614
    Abstract: Non-limiting examples of the present disclosure describe advancements in acoustic-to-word modeling that improve accuracy in speech recognition processing through the replacement of out-of-vocabulary (OOV) tokens. During the decoding of speech signals, better accuracy in speech recognition processing is achieved through training and implementation of multiple different solutions that present enhanced speech recognition models. In one example, a hybrid neural network model for speech recognition processing combines a word-based neural network model as a primary model and a character-based neural network model as an auxiliary model. The primary word-based model emits a word sequence, and an output of the character-based auxiliary model is consulted at a segment where the word-based model emits an OOV token. In another example, a mixed unit speech recognition model is developed and trained to generate a mixed word and character sequence during decoding of a speech signal without requiring generation of OOV tokens.
    Type: Application
    Filed: March 9, 2018
    Publication date: September 12, 2019
    Inventors: Guoli YE, James DROPPO, Jinyu LI, Rui ZHAO, Yifan GONG
  • Publication number: 20190267023
    Abstract: Generally discussed herein are devices, systems, and methods for speech recognition. Processing circuitry can implement a connectionist temporal classification (CTC) neural network (NN) including an encode NN to receive an audio frame and generate a current encoded hidden feature vector, an attend NN to generate, based on the current encoded hidden feature vector and a first context vector from a previous time slice, a weight vector indicating how much the current encoded hidden feature vector, a previous encoded hidden feature vector, and a future encoded hidden feature vector from a future time slice contribute to a current, second context vector, an annotate NN to generate the current, second context vector based on the weight vector, the current encoded hidden feature vector, the previous encoded hidden feature vector, and the future encoded hidden feature vector, and a normal NN to generate a normalized output vector based on the context vector.
    Type: Application
    Filed: February 28, 2018
    Publication date: August 29, 2019
    Inventors: Amit Das, Jinyu Li, Rui Zhao, Yifan Gong
  • Patent number: 10354656
    Abstract: Improvements in speaker identification and verification are provided via an attention model for speaker recognition and the end-to-end training thereof. A speaker discriminative convolutional neural network (CNN) is used to directly extract frame-level speaker features that are weighted and combined to form an utterance-level speaker recognition vector via the attention model. The CNN and attention model are jointly optimized via an end-to-end training algorithm that imitates the speaker recognition process and uses the most-similar utterances from imposters for each speaker.
    Type: Grant
    Filed: June 23, 2017
    Date of Patent: July 16, 2019
    Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
    Inventors: Yong Zhao, Jinyu Li, Yifan Gong, Shixiong Zhang, Zhuo Chen
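The attention-based pooling of frame-level features into a single utterance-level vector, described in the abstract above, can be sketched as a softmax-weighted sum. This is a minimal illustration under assumed dimensions; the scoring parameter `w` stands in for the learned attention model and is not from the patent.

```python
import numpy as np

def attentive_pooling(frame_feats, w):
    """Combine frame-level speaker features into one utterance-level
    embedding using softmax attention weights (one scalar per frame)."""
    scores = frame_feats @ w                        # (T,) attention logits
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # (T,) weights, sum to 1
    return alpha @ frame_feats                      # (D,) utterance vector

rng = np.random.default_rng(2)
frames = rng.normal(size=(50, 16))   # 50 frames of 16-dim CNN features
w = rng.normal(size=16)              # hypothetical learned attention weights
emb = attentive_pooling(frames, w)
assert emb.shape == (16,)
```

Because the weights form a convex combination, the embedding stays within the per-dimension range of the frame features, unlike plain summation.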
  • Patent number: 10347241
    Abstract: Systems and methods can be implemented to conduct speaker-invariant training for speech recognition in a variety of applications. An adversarial multi-task learning scheme for speaker-invariant training can be implemented, aiming at actively curtailing the inter-talker feature variability, while maximizing its senone discriminability to enhance the performance of a deep neural network (DNN) based automatic speech recognition system. In speaker-invariant training, a DNN acoustic model and a speaker classifier network can be jointly optimized to minimize the senone (triphone state) classification loss, and simultaneously mini-maximize the speaker classification loss. A speaker invariant and senone-discriminative intermediate feature is learned through this adversarial multi-task learning, which can be applied to an automatic speech recognition system. Additional systems and methods are disclosed.
    Type: Grant
    Filed: March 23, 2018
    Date of Patent: July 9, 2019
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Zhong Meng, Vadim Aleksandrovich Mazalov, Yifan Gong, Yong Zhao, Zhuo Chen, Jinyu Li
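The minimize-the-senone-loss / maximize-the-speaker-loss objective in the abstract above is commonly realized with a gradient-reversal trick. The sketch below shows one such update step on plain gradient arrays; it is a generic illustration of adversarial multi-task learning, not the patented training procedure, and the learning rate, weighting factor, and names are assumptions.

```python
import numpy as np

def adversarial_update(feat_params, clf_params, grad_senone,
                       grad_speaker_wrt_feat, grad_speaker_wrt_clf,
                       lr=0.01, lam=0.5):
    """One update step: the speaker classifier descends its own loss,
    while the feature extractor ascends it (the speaker-loss gradient is
    sign-flipped before reaching the extractor), so the intermediate
    feature becomes senone-discriminative yet speaker-invariant."""
    clf_params = clf_params - lr * grad_speaker_wrt_clf            # minimize
    feat_params = feat_params - lr * (grad_senone
                                      - lam * grad_speaker_wrt_feat)  # reversed
    return feat_params, clf_params

feat = np.zeros(4)
clf = np.zeros(4)
g_senone = np.ones(4)      # toy gradients for one step
g_spk_feat = np.ones(4)
g_spk_clf = np.ones(4)
feat2, clf2 = adversarial_update(feat, clf, g_senone, g_spk_feat, g_spk_clf)
# Classifier moved against its gradient; the extractor's senone step was
# partially cancelled by the reversed speaker gradient.
assert np.allclose(clf2, -0.01)
assert np.allclose(feat2, -0.005)
```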
  • Publication number: 20190147854
    Abstract: A method includes obtaining a source domain having labels for source domain speech input features, obtaining a target domain having target domain speech input features without labels, extracting private components from each of the source and target domain speech input features, extracting shared components from the source and target domain speech input features using a shared component extractor, and reconstructing the source and target input features as a regularization of private component extraction.
    Type: Application
    Filed: November 16, 2017
    Publication date: May 16, 2019
    Inventors: Jinyu Li, Vadim A. Mazalov, Yifan Gong, Zhong Meng, Zhuo Chen
  • Publication number: 20190139563
    Abstract: Representative embodiments disclose mechanisms to separate and recognize multiple audio sources (e.g., picking out individual speakers) in an environment where they overlap and interfere with each other. The architecture uses a microphone array to spatially separate out the audio signals. The spatially filtered signals are then input into a plurality of separators, so that each signal is input into a corresponding separator. The separators use neural networks to separate out audio sources. Each separator typically produces multiple output signals for its single input signal. A post selection processor then assesses the separator outputs to pick the signals with the highest quality output. These signals can be used in a variety of systems such as speech recognition, meeting transcription and enhancement, hearing aids, music information retrieval, speech enhancement, and so forth.
    Type: Application
    Filed: November 6, 2017
    Publication date: May 9, 2019
    Inventors: Zhuo Chen, Jinyu Li, Xiong Xiao, Takuya Yoshioka, Huaming Wang, Zhenghao Wang, Yifan Gong
  • Patent number: 10235994
    Abstract: The technology described herein uses a modular model to process speech. A deep learning based acoustic model comprises a stack of different types of neural network layers. The sub-modules of a deep learning based acoustic model can be used to represent distinct non-phonetic acoustic factors, such as accent origins (e.g. native, non-native), speech channels (e.g. mobile, bluetooth, desktop etc.), speech application scenario (e.g. voice search, short message dictation etc.), and speaker variation (e.g. individual speakers or clustered speakers), etc. The technology described herein uses certain sub-modules in a first context and a second group of sub-modules in a second context.
    Type: Grant
    Filed: June 30, 2016
    Date of Patent: March 19, 2019
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Yan Huang, Chaojun Liu, Kshitiz Kumar, Kaustubh Prakash Kalgaonkar, Yifan Gong
  • Publication number: 20190051290
    Abstract: Improvements in speech recognition in a new domain are provided via the student/teacher training of models for different speech domains. A student model for a new domain is created based on the teacher model trained in an existing domain. The student model is trained in parallel to the operation of the teacher model, with inputs in the new and existing domains respectively, to develop a neural network that is adapted to recognize speech in the new domain. The data in the new domain may lack transcription labels; instead, they are parallelized with the existing-domain data analyzed by the teacher model. The outputs from the teacher model are compared with the outputs of the student model, and the differences are used to adjust the parameters of the student model to better recognize speech in the second domain.
    Type: Application
    Filed: August 11, 2017
    Publication date: February 14, 2019
    Applicant: Microsoft Technology Licensing, LLC
    Inventors: Jinyu Li, Michael Lewis Seltzer, Xi Wang, Rui Zhao, Yifan Gong
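The comparison of teacher and student outputs described in the abstract above is typically a KL-divergence loss between the two models' posteriors. The sketch below shows that loss in numpy; the temperature parameter and all names are illustrative assumptions, not details from the patent.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_student_loss(teacher_logits, student_logits, T=2.0):
    """Mean per-frame KL divergence from the teacher's soft posteriors to
    the student's. Minimizing it drives the student toward the teacher's
    frame-level behavior without needing transcription labels."""
    p_t = softmax(teacher_logits, T)   # teacher soft targets
    p_s = softmax(student_logits, T)   # student predictions
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))   # 4 frames, 10 output classes
# A student that matches the teacher exactly incurs zero loss;
# a mismatched student incurs a positive loss.
assert abs(teacher_student_loss(logits, logits)) < 1e-9
assert teacher_student_loss(logits, -logits) > 0.0
```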
  • Publication number: 20180374486
    Abstract: Improvements in speaker identification and verification are provided via an attention model for speaker recognition and the end-to-end training thereof. A speaker discriminative convolutional neural network (CNN) is used to directly extract frame-level speaker features that are weighted and combined to form an utterance-level speaker recognition vector via the attention model. The CNN and attention model are jointly optimized via an end-to-end training algorithm that imitates the speaker recognition process and uses the most-similar utterances from imposters for each speaker.
    Type: Application
    Filed: June 23, 2017
    Publication date: December 27, 2018
    Applicant: Microsoft Technology Licensing, LLC
    Inventors: Yong Zhao, Jinyu Li, Yifan Gong, Shixiong Zhang, Zhuo Chen
  • Patent number: 10115393
    Abstract: A computer-readable speaker-adapted speech engine acoustic model can be generated. The generating of the acoustic model can include performing speaker-specific adaptation of one or more layers of the model to produce one or more adaptive layers comprising layer weights, with the speaker-specific adaptation comprising a data size reduction technique. The data size reduction technique can be threshold value adaptation, corner area adaptation, diagonal-based quantization, adaptive matrix reduction, or a combination of these reduction techniques. The speaker-adapted speech engine model can be accessed and used in performing speech recognition on computer-readable audio speech input via a computerized speech recognition engine.
    Type: Grant
    Filed: October 31, 2016
    Date of Patent: October 30, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Kshitiz Kumar, Chaojun Liu, Yifan Gong
  • Patent number: 10019990
    Abstract: Systems and methods for speech recognition incorporating environmental variables are provided. The systems and methods capture speech to be recognized. The speech is then recognized utilizing a variable component deep neural network (DNN). The variable component DNN processes the captured speech by incorporating an environment variable. The environment variable may be any variable that is dependent on environmental conditions or the relation of the user, the client device, and the environment. For example, the environment variable may be based on noise of the environment and represented as a signal-to-noise ratio. The variable component DNN may incorporate the environment variable in different ways. For instance, the environment variable may be incorporated into weighting matrices and biases of the DNN, the outputs of the hidden layers of the DNN, or the activation functions of the nodes of the DNN.
    Type: Grant
    Filed: September 9, 2014
    Date of Patent: July 10, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Jinyu Li, Rui Zhao, Yifan Gong
  • Patent number: 10019984
    Abstract: Techniques and technologies for diagnosing speech recognition errors are described. In an example implementation, a system for diagnosing speech recognition errors may include an error detection module configured to determine that a speech recognition result is at least partially erroneous, and a recognition error diagnostics module. The recognition error diagnostics module may be configured to (a) perform a first error analysis of the at least partially erroneous speech recognition result to provide a first error analysis result; (b) perform a second error analysis of the at least partially erroneous speech recognition result to provide a second error analysis result; and (c) determine at least one category of recognition error associated with the at least partially erroneous speech recognition result based on a combination of the first error analysis result and the second error analysis result.
    Type: Grant
    Filed: February 27, 2015
    Date of Patent: July 10, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Shiun-Zu Kuo, Thomas Reutter, Yifan Gong, Mark T. Hanson, Ye Tian, Shuangyu Chang, Jonathan Hamaker, Qi Miao, Yuancheng Tu
  • Patent number: 9997161
    Abstract: The described technology provides normalization of speech recognition confidence classifier (CC) scores that maintains the accuracy of acceptance metrics. A speech recognition CC score quantitatively represents the correctness of decoded utterances in a defined range (e.g., [0,1]). An operating threshold is associated with a confidence classifier, such that utterance recognitions having scores exceeding the operating threshold are deemed acceptable. However, when a speech recognition engine, an acoustic model, and/or other parameters are updated by the platform, the correct-accept (CA) versus false-accept (FA) profile can change such that the application software's operating threshold is no longer valid or as accurate.
    Type: Grant
    Filed: September 11, 2015
    Date of Patent: June 12, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Yifan Gong, Chaojun Liu, Kshitiz Kumar
  • Patent number: 9842585
    Abstract: Described herein are various technologies pertaining to a multilingual deep neural network (MDNN). The MDNN includes a plurality of hidden layers, wherein values for weight parameters of the plurality of hidden layers are learned during a training phase based upon training data in terms of acoustic raw features for multiple languages. The MDNN further includes softmax layers that are trained for each target language separately, making use of the hidden layer values trained jointly with multiple source languages. The MDNN is adaptable, such that a new softmax layer may be added on top of the existing hidden layers, where the new softmax layer corresponds to a new target language.
    Type: Grant
    Filed: March 11, 2013
    Date of Patent: December 12, 2017
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, Yifan Gong
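The shared-hidden-layers / per-language-softmax structure described in the multilingual DNN abstract above can be sketched as follows. This is an illustrative toy with randomly initialized weights, not the patented model; the layer sizes, class names, and methods are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

class MultilingualDNN:
    """Minimal sketch: hidden layers shared across all languages, one
    softmax output layer per language. Adapting to a new target language
    adds only a new softmax layer on top of the existing hidden stack."""
    def __init__(self, dims, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden = [rng.normal(size=(a, b)) * 0.1
                       for a, b in zip(dims[:-1], dims[1:])]
        self.softmax_layers = {}

    def add_language(self, lang, n_senones, seed=0):
        rng = np.random.default_rng(seed)
        self.softmax_layers[lang] = 0.1 * rng.normal(
            size=(self.hidden[-1].shape[1], n_senones))

    def posteriors(self, x, lang):
        for W in self.hidden:                # shared, jointly trained layers
            x = relu(x @ W)
        z = x @ self.softmax_layers[lang]    # language-specific softmax
        e = np.exp(z - z.max())
        return e / e.sum()

net = MultilingualDNN(dims=[40, 64, 64])     # 40-dim acoustic input
net.add_language("en", n_senones=100)
net.add_language("fr", n_senones=80)         # new target: only a new softmax
p = net.posteriors(np.ones(40), "fr")
assert p.shape == (80,) and abs(p.sum() - 1.0) < 1e-9
```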