Patents by Inventor Yifan Gong

Yifan Gong has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 10580432
    Abstract: Generally discussed herein are devices, systems, and methods for speech recognition. Processing circuitry can implement a connectionist temporal classification (CTC) neural network (NN) including an encode NN to receive an audio frame and generate a current encoded hidden feature vector, an attend NN to generate, based on the current encoded hidden feature vector and a first context vector from a previous time slice, a weight vector indicating how much the current encoded hidden feature vector, a previous encoded hidden feature vector, and a future encoded hidden feature vector from a future time slice contribute to a current, second context vector, an annotate NN to generate the current, second context vector based on the weight vector, the current encoded hidden feature vector, the previous encoded hidden feature vector, and the future encoded hidden feature vector, and a normal NN to generate a normalized output vector based on the context vector.
    Type: Grant
    Filed: February 28, 2018
    Date of Patent: March 3, 2020
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Amit Das, Jinyu Li, Rui Zhao, Yifan Gong
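The attend/annotate step described in the abstract above can be illustrated with a minimal numpy sketch. This is a toy, not the patented implementation: the bilinear scoring function, the dimensions, and all variable names are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_annotate(h_prev, h_cur, h_future, c_prev, W):
    """Score each encoded hidden vector (previous, current, future time
    slices) against the previous context vector, then form the new context
    vector as the attention-weighted sum of the three hidden vectors."""
    H = np.stack([h_prev, h_cur, h_future])   # (3, D) candidate vectors
    scores = H @ W @ c_prev                   # one scalar score per candidate
    alpha = softmax(scores)                   # weight vector, sums to 1
    return alpha @ H, alpha                   # (D,) new context vector

rng = np.random.default_rng(0)
D = 8
h_prev, h_cur, h_fut = rng.normal(size=(3, D))
c_prev = rng.normal(size=D)                  # context from the previous slice
W = rng.normal(size=(D, D))                  # hypothetical scoring parameter
c_new, alpha = attend_and_annotate(h_prev, h_cur, h_fut, c_prev, W)
assert c_new.shape == (D,)
assert abs(alpha.sum() - 1.0) < 1e-9
```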
  • Patent number: 10515301
    Abstract: Conversion of a large-footprint DNN to a small-footprint DNN is performed using a variety of techniques, including split-vector quantization. The small-footprint DNN may be distributed to a variety of devices, including mobile devices. Further, the small-footprint DNN may aid a digital assistant on a device in interpreting speech input.
    Type: Grant
    Filed: January 19, 2016
    Date of Patent: December 24, 2019
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Jinyu Li, Yifan Gong, Yongqiang Wang
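Split-vector quantization, named in the abstract above, can be sketched as follows: split each weight row into short sub-vectors and replace each sub-vector with the nearest entry of a small k-means codebook. This is a generic illustration of the technique, not the patent's specific procedure; the sub-vector size, codebook size, and function names are assumptions.

```python
import numpy as np

def split_vq(weights, sub_dim=2, n_codewords=4, iters=20, seed=0):
    """Compress a weight matrix via split-vector quantization: reshape it
    into sub-vectors, learn a small codebook with k-means, and store one
    codeword index per sub-vector instead of raw floats."""
    rng = np.random.default_rng(seed)
    subs = weights.reshape(-1, sub_dim)                      # all sub-vectors
    codebook = subs[rng.choice(len(subs), n_codewords, replace=False)].copy()
    for _ in range(iters):
        # Assign each sub-vector to its nearest codeword, then re-center.
        d = np.linalg.norm(subs[:, None, :] - codebook[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(n_codewords):
            if np.any(assign == k):
                codebook[k] = subs[assign == k].mean(axis=0)
    # Storage drops from one float per weight to one small index per
    # sub-vector plus the shared codebook.
    return codebook[assign].reshape(weights.shape), assign, codebook

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))
W_q, assign, codebook = split_vq(W)
assert W_q.shape == W.shape
assert len(np.unique(assign)) <= 4   # every sub-vector maps to a codeword
```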
  • Publication number: 20190341053
    Abstract: A computerized conference assistant includes a camera and a microphone. A face location machine of the computerized conference assistant finds a physical location of a human, based on a position of a candidate face in digital video captured by the camera. A beamforming machine of the computerized conference assistant outputs a beamformed signal isolating sounds originating from the physical location of the human. A diarization machine of the computerized conference assistant attributes information encoded in the beamformed signal to the human.
    Type: Application
    Filed: June 26, 2018
    Publication date: November 7, 2019
    Applicant: Microsoft Technology Licensing, LLC
    Inventors: Shixiong ZHANG, Lingfeng WU, Eyal KRUPKA, Xiong XIAO, Yifan GONG
  • Publication number: 20190317804
    Abstract: The present disclosure provides a method, an apparatus, and a system for multi-module scheduling, capable of solving at least one of the problems of multi-module scheduling techniques in the related art: inconsistency in the data input to a computing module, and significant delay or low throughput in data transmission between computing modules.
    Type: Application
    Filed: February 14, 2019
    Publication date: October 17, 2019
    Inventors: Yifan GONG, Zehua HUANG, Jiangming JIN, Dinghua LI, Siyuan LIU, Wei LIU, Lei SU, YiXin YANG
  • Publication number: 20190287515
    Abstract: Methods, systems, and computer programs are presented for training, with adversarial constraints, a student model for speech recognition based on a teacher model. One method includes operations for training a teacher model based on teacher speech data, initializing a student model with parameters obtained from the teacher model, and training the student model with adversarial teacher-student learning based on the teacher speech data and student speech data. Training the student model with adversarial teacher-student learning further includes minimizing a teacher-student loss that measures a divergence of outputs between the teacher model and the student model; minimizing a classifier condition loss with respect to parameters of a condition classifier; and maximizing the classifier condition loss with respect to parameters of a feature extractor. The classifier condition loss measures errors caused by acoustic condition classification. Further, speech is recognized with the trained student model.
    Type: Application
    Filed: March 16, 2018
    Publication date: September 19, 2019
    Inventors: Jinyu Li, Zhong Meng, Yifan Gong
  • Publication number: 20190286489
    Abstract: The present disclosure provides a method, an apparatus, and a system for multi-module scheduling, capable of solving the problem of inconsistency in the data input to a computing module in multi-module scheduling techniques in the related art.
    Type: Application
    Filed: February 14, 2019
    Publication date: September 19, 2019
    Inventors: Yifan GONG, Zehua HUANG, Jiangming JIN, Dinghua LI, Siyuan LIU, Wei LIU, Lei SU, YiXin YANG
  • Publication number: 20190279614
    Abstract: Non-limiting examples of the present disclosure describe advancements in acoustic-to-word modeling that improve accuracy in speech recognition processing through the replacement of out-of-vocabulary (OOV) tokens. During the decoding of speech signals, better accuracy in speech recognition processing is achieved through training and implementation of multiple different solutions that present enhanced speech recognition models. In one example, a hybrid neural network model for speech recognition processing combines a word-based neural network model as a primary model and a character-based neural network model as an auxiliary model. The primary word-based model emits a word sequence, and an output of the character-based auxiliary model is consulted at a segment where the word-based model emits an OOV token. In another example, a mixed unit speech recognition model is developed and trained to generate a mixed word and character sequence during decoding of a speech signal without requiring generation of OOV tokens.
    Type: Application
    Filed: March 9, 2018
    Publication date: September 12, 2019
    Inventors: Guoli YE, James DROPPO, Jinyu LI, Rui ZHAO, Yifan GONG
  • Publication number: 20190267023
    Abstract: Generally discussed herein are devices, systems, and methods for speech recognition. Processing circuitry can implement a connectionist temporal classification (CTC) neural network (NN) including an encode NN to receive an audio frame and generate a current encoded hidden feature vector, an attend NN to generate, based on the current encoded hidden feature vector and a first context vector from a previous time slice, a weight vector indicating how much the current encoded hidden feature vector, a previous encoded hidden feature vector, and a future encoded hidden feature vector from a future time slice contribute to a current, second context vector, an annotate NN to generate the current, second context vector based on the weight vector, the current encoded hidden feature vector, the previous encoded hidden feature vector, and the future encoded hidden feature vector, and a normal NN to generate a normalized output vector based on the context vector.
    Type: Application
    Filed: February 28, 2018
    Publication date: August 29, 2019
    Inventors: Amit Das, Jinyu Li, Rui Zhao, Yifan Gong
  • Patent number: 10354656
    Abstract: Improvements in speaker identification and verification are provided via an attention model for speaker recognition and the end-to-end training thereof. A speaker discriminative convolutional neural network (CNN) is used to directly extract frame-level speaker features that are weighted and combined to form an utterance-level speaker recognition vector via the attention model. The CNN and attention model are jointly optimized via an end-to-end training algorithm that imitates the speaker recognition process and uses the most-similar utterances from imposters for each speaker.
    Type: Grant
    Filed: June 23, 2017
    Date of Patent: July 16, 2019
    Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
    Inventors: Yong Zhao, Jinyu Li, Yifan Gong, Shixiong Zhang, Zhuo Chen
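The attention-based pooling of frame-level features into a single utterance-level vector, described in the abstract above, can be sketched as a softmax-weighted sum. This is a minimal illustration under assumed dimensions; the scoring parameter `w` stands in for the learned attention model and is not from the patent.

```python
import numpy as np

def attentive_pooling(frame_feats, w):
    """Combine frame-level speaker features into one utterance-level
    embedding using softmax attention weights (one scalar per frame)."""
    scores = frame_feats @ w                        # (T,) attention logits
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # (T,) weights, sum to 1
    return alpha @ frame_feats                      # (D,) utterance vector

rng = np.random.default_rng(2)
frames = rng.normal(size=(50, 16))   # 50 frames of 16-dim CNN features
w = rng.normal(size=16)              # hypothetical learned attention weights
emb = attentive_pooling(frames, w)
assert emb.shape == (16,)
```

Because the weights form a convex combination, the embedding stays within the per-dimension range of the frame features, unlike plain summation.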
  • Patent number: 10347241
    Abstract: Systems and methods can be implemented to conduct speaker-invariant training for speech recognition in a variety of applications. An adversarial multi-task learning scheme for speaker-invariant training can be implemented, aiming at actively curtailing the inter-talker feature variability, while maximizing its senone discriminability to enhance the performance of a deep neural network (DNN) based automatic speech recognition system. In speaker-invariant training, a DNN acoustic model and a speaker classifier network can be jointly optimized to minimize the senone (triphone state) classification loss, and simultaneously mini-maximize the speaker classification loss. A speaker invariant and senone-discriminative intermediate feature is learned through this adversarial multi-task learning, which can be applied to an automatic speech recognition system. Additional systems and methods are disclosed.
    Type: Grant
    Filed: March 23, 2018
    Date of Patent: July 9, 2019
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Zhong Meng, Vadim Aleksandrovich Mazalov, Yifan Gong, Yong Zhao, Zhuo Chen, Jinyu Li
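The minimize-the-senone-loss / maximize-the-speaker-loss objective in the abstract above is commonly realized with a gradient-reversal trick. The sketch below shows one such update step on plain gradient arrays; it is a generic illustration of adversarial multi-task learning, not the patented training procedure, and the learning rate, weighting factor, and names are assumptions.

```python
import numpy as np

def adversarial_update(feat_params, clf_params, grad_senone,
                       grad_speaker_wrt_feat, grad_speaker_wrt_clf,
                       lr=0.01, lam=0.5):
    """One update step: the speaker classifier descends its own loss,
    while the feature extractor ascends it (the speaker-loss gradient is
    sign-flipped before reaching the extractor), so the intermediate
    feature becomes senone-discriminative yet speaker-invariant."""
    clf_params = clf_params - lr * grad_speaker_wrt_clf            # minimize
    feat_params = feat_params - lr * (grad_senone
                                      - lam * grad_speaker_wrt_feat)  # reversed
    return feat_params, clf_params

feat = np.zeros(4)
clf = np.zeros(4)
g_senone = np.ones(4)      # toy gradients for one step
g_spk_feat = np.ones(4)
g_spk_clf = np.ones(4)
feat2, clf2 = adversarial_update(feat, clf, g_senone, g_spk_feat, g_spk_clf)
# Classifier moved against its gradient; the extractor's senone step was
# partially cancelled by the reversed speaker gradient.
assert np.allclose(clf2, -0.01)
assert np.allclose(feat2, -0.005)
```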
  • Publication number: 20190147854
    Abstract: A method includes obtaining a source domain having labels for source domain speech input features, obtaining a target domain having target domain speech input features without labels, extracting private components from each of the source and target domain speech input features, extracting shared components from the source and target domain speech input features using a shared component extractor, and reconstructing the source and target input features as a regularization of private component extraction.
    Type: Application
    Filed: November 16, 2017
    Publication date: May 16, 2019
    Inventors: Jinyu Li, Vadim A. Mazalov, Yifan Gong, Zhong Meng, Zhuo Chen
  • Publication number: 20190139563
    Abstract: Representative embodiments disclose mechanisms to separate and recognize multiple audio sources (e.g., picking out individual speakers) in an environment where they overlap and interfere with each other. The architecture uses a microphone array to spatially separate out the audio signals. The spatially filtered signals are then input into a plurality of separators, so that each signal is input into a corresponding separator. The separators use neural networks to separate out audio sources. Each separator typically produces multiple output signals for its single input signal. A post selection processor then assesses the separator outputs to pick the signals with the highest quality output. These signals can be used in a variety of systems such as speech recognition, meeting transcription and enhancement, hearing aids, music information retrieval, speech enhancement, and so forth.
    Type: Application
    Filed: November 6, 2017
    Publication date: May 9, 2019
    Inventors: Zhuo Chen, Jinyu Li, Xiong Xiao, Takuya Yoshioka, Huaming Wang, Zhenghao Wang, Yifan Gong
  • Patent number: 10235994
    Abstract: The technology described herein uses a modular model to process speech. A deep learning based acoustic model comprises a stack of different types of neural network layers. The sub-modules of a deep learning based acoustic model can be used to represent distinct non-phonetic acoustic factors, such as accent origins (e.g. native, non-native), speech channels (e.g. mobile, bluetooth, desktop etc.), speech application scenario (e.g. voice search, short message dictation etc.), and speaker variation (e.g. individual speakers or clustered speakers), etc. The technology described herein uses certain sub-modules in a first context and a second group of sub-modules in a second context.
    Type: Grant
    Filed: June 30, 2016
    Date of Patent: March 19, 2019
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Yan Huang, Chaojun Liu, Kshitiz Kumar, Kaustubh Prakash Kalgaonkar, Yifan Gong
  • Publication number: 20190051290
    Abstract: Improvements in speech recognition in a new domain are provided via the student/teacher training of models for different speech domains. A student model for a new domain is created based on the teacher model trained in an existing domain. The student model is trained in parallel to the operation of the teacher model, with inputs in the new and existing domains respectively, to develop a neural network that is adapted to recognize speech in the new domain. The data in the new domain may lack transcription labels; instead, they are parallelized with the existing-domain data analyzed by the teacher model. The outputs from the teacher model are compared with the outputs of the student model, and the differences are used to adjust the parameters of the student model to better recognize speech in the second domain.
    Type: Application
    Filed: August 11, 2017
    Publication date: February 14, 2019
    Applicant: Microsoft Technology Licensing, LLC
    Inventors: Jinyu Li, Michael Lewis Seltzer, Xi Wang, Rui Zhao, Yifan Gong
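The comparison of teacher and student outputs described in the abstract above is typically a KL-divergence loss between the two models' posteriors. The sketch below shows that loss in numpy; the temperature parameter and all names are illustrative assumptions, not details from the patent.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_student_loss(teacher_logits, student_logits, T=2.0):
    """Mean per-frame KL divergence from the teacher's soft posteriors to
    the student's. Minimizing it drives the student toward the teacher's
    frame-level behavior without needing transcription labels."""
    p_t = softmax(teacher_logits, T)   # teacher soft targets
    p_s = softmax(student_logits, T)   # student predictions
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))   # 4 frames, 10 output classes
# A student that matches the teacher exactly incurs zero loss;
# a mismatched student incurs a positive loss.
assert abs(teacher_student_loss(logits, logits)) < 1e-9
assert teacher_student_loss(logits, -logits) > 0.0
```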
  • Publication number: 20180374486
    Abstract: Improvements in speaker identification and verification are provided via an attention model for speaker recognition and the end-to-end training thereof. A speaker discriminative convolutional neural network (CNN) is used to directly extract frame-level speaker features that are weighted and combined to form an utterance-level speaker recognition vector via the attention model. The CNN and attention model are jointly optimized via an end-to-end training algorithm that imitates the speaker recognition process and uses the most-similar utterances from imposters for each speaker.
    Type: Application
    Filed: June 23, 2017
    Publication date: December 27, 2018
    Applicant: Microsoft Technology Licensing, LLC
    Inventors: Yong Zhao, Jinyu Li, Yifan Gong, Shixiong Zhang, Zhuo Chen
  • Patent number: 10115393
    Abstract: A computer-readable speaker-adapted speech engine acoustic model can be generated. The generating of the acoustic model can include performing speaker-specific adaptation of one or more layers of the model to produce one or more adaptive layers comprising layer weights, with the speaker-specific adaptation comprising a data size reduction technique. The data size reduction technique can be threshold value adaptation, corner area adaptation, diagonal-based quantization, adaptive matrix reduction, or a combination of these reduction techniques. The speaker-adapted speech engine model can be accessed and used in performing speech recognition on computer-readable audio speech input via a computerized speech recognition engine.
    Type: Grant
    Filed: October 31, 2016
    Date of Patent: October 30, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Kshitiz Kumar, Chaojun Liu, Yifan Gong
  • Patent number: 10019990
    Abstract: Systems and methods for speech recognition incorporating environmental variables are provided. The systems and methods capture speech to be recognized. The speech is then recognized utilizing a variable component deep neural network (DNN). The variable component DNN processes the captured speech by incorporating an environment variable. The environment variable may be any variable that is dependent on environmental conditions or the relation of the user, the client device, and the environment. For example, the environment variable may be based on noise of the environment and represented as a signal-to-noise ratio. The variable component DNN may incorporate the environment variable in different ways. For instance, the environment variable may be incorporated into weighting matrices and biases of the DNN, the outputs of the hidden layers of the DNN, or the activation functions of the nodes of the DNN.
    Type: Grant
    Filed: September 9, 2014
    Date of Patent: July 10, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Jinyu Li, Rui Zhao, Yifan Gong
  • Patent number: 10019984
    Abstract: Techniques and technologies for diagnosing speech recognition errors are described. In an example implementation, a system for diagnosing speech recognition errors may include an error detection module configured to determine that a speech recognition result is at least partially erroneous, and a recognition error diagnostics module. The recognition error diagnostics module may be configured to (a) perform a first error analysis of the at least partially erroneous speech recognition result to provide a first error analysis result; (b) perform a second error analysis of the at least partially erroneous speech recognition result to provide a second error analysis result; and (c) determine at least one category of recognition error associated with the at least partially erroneous speech recognition result based on a combination of the first error analysis result and the second error analysis result.
    Type: Grant
    Filed: February 27, 2015
    Date of Patent: July 10, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Shiun-Zu Kuo, Thomas Reutter, Yifan Gong, Mark T. Hanson, Ye Tian, Shuangyu Chang, Jonathan Hamaker, Qi Miao, Yuancheng Tu
  • Patent number: 9997161
    Abstract: The described technology provides normalization of speech recognition confidence classifier (CC) scores that maintains the accuracy of acceptance metrics. A speech recognition CC score quantitatively represents the correctness of decoded utterances in a defined range (e.g., [0,1]). An operating threshold is associated with a confidence classifier, such that utterance recognitions having scores exceeding the operating threshold are deemed acceptable. However, when a speech recognition engine, an acoustic model, and/or other parameters are updated by the platform, the correct-accept (CA) versus false-accept (FA) profile can change such that the application software's operating threshold is no longer valid or as accurate.
    Type: Grant
    Filed: September 11, 2015
    Date of Patent: June 12, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Yifan Gong, Chaojun Liu, Kshitiz Kumar
  • Patent number: 9842585
    Abstract: Described herein are various technologies pertaining to a multilingual deep neural network (MDNN). The MDNN includes a plurality of hidden layers, wherein values for weight parameters of the plurality of hidden layers are learned during a training phase based upon training data in terms of acoustic raw features for multiple languages. The MDNN further includes softmax layers that are trained for each target language separately, making use of the hidden layer values trained jointly with multiple source languages. The MDNN is adaptable, such that a new softmax layer may be added on top of the existing hidden layers, where the new softmax layer corresponds to a new target language.
    Type: Grant
    Filed: March 11, 2013
    Date of Patent: December 12, 2017
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, Yifan Gong
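The shared-hidden-layers / per-language-softmax structure described in the multilingual DNN abstract above can be sketched as follows. This is an illustrative toy with randomly initialized weights, not the patented model; the layer sizes, class names, and methods are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

class MultilingualDNN:
    """Minimal sketch: hidden layers shared across all languages, one
    softmax output layer per language. Adapting to a new target language
    adds only a new softmax layer on top of the existing hidden stack."""
    def __init__(self, dims, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden = [rng.normal(size=(a, b)) * 0.1
                       for a, b in zip(dims[:-1], dims[1:])]
        self.softmax_layers = {}

    def add_language(self, lang, n_senones, seed=0):
        rng = np.random.default_rng(seed)
        self.softmax_layers[lang] = 0.1 * rng.normal(
            size=(self.hidden[-1].shape[1], n_senones))

    def posteriors(self, x, lang):
        for W in self.hidden:                # shared, jointly trained layers
            x = relu(x @ W)
        z = x @ self.softmax_layers[lang]    # language-specific softmax
        e = np.exp(z - z.max())
        return e / e.sum()

net = MultilingualDNN(dims=[40, 64, 64])     # 40-dim acoustic input
net.add_language("en", n_senones=100)
net.add_language("fr", n_senones=80)         # new target: only a new softmax
p = net.posteriors(np.ones(40), "fr")
assert p.shape == (80,) and abs(p.sum() - 1.0) < 1e-9
```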