Patents by Inventor Yifan Gong

Yifan Gong has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20150310858
    Abstract: A framework for merging automatic speech recognition (ASR) systems that share a deep neural network (DNN) feature transformation is provided. A received utterance may be evaluated to generate a DNN-derived feature from the top hidden layer of the DNN. The top hidden layer output may then be utilized to generate a network including a bottleneck layer and an output layer. Weights representing a feature dimension reduction may then be extracted between the top hidden layer and the bottleneck layer. Scores may then be generated and combined to merge the ASR systems that share the DNN feature transformation.
    Type: Application
    Filed: April 29, 2014
    Publication date: October 29, 2015
    Applicant: MICROSOFT CORPORATION
    Inventors: JINYU LI, JIAN XUE, YIFAN GONG
  • Publication number: 20150255061
    Abstract: The adaptation and personalization of a deep neural network (DNN) model for automatic speech recognition is provided. An utterance which includes speech features for one or more speakers may be received in ASR tasks such as voice search or short message dictation. A decomposition approach may then be applied to an original matrix in the DNN model. In response to applying the decomposition approach, the original matrix may be converted into multiple new matrices which are smaller than the original matrix. A square matrix may then be added to the new matrices. Speaker-specific parameters may then be stored in the square matrix. The DNN model may then be adapted by updating the square matrix. This process may be applied to each of the original matrices in the DNN model. The adapted DNN model may include fewer parameters than the original DNN model.
    Type: Application
    Filed: March 7, 2014
    Publication date: September 10, 2015
    Applicant: MICROSOFT CORPORATION
    Inventors: Jian Xue, Jinyu Li, Dong Yu, Michael L. Seltzer, Yifan Gong
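The decomposition described in the abstract above can be sketched with a toy example. The layer sizes, the retained rank, and the use of truncated SVD as the decomposition are illustrative assumptions, not details taken from the patent text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 512x512 weight matrix from one layer of the original DNN.
W = rng.standard_normal((512, 512))

# Decompose W into two smaller matrices, W ~= A @ B (here via truncated SVD).
k = 64
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]        # 512 x k
B = Vt[:k, :]               # k x 512

# Add a k x k square matrix between the factors, initialized to identity.
# Speaker-specific parameters live only in S, so adapting the model means
# updating k*k = 4096 values instead of the full 512*512 = 262144.
S = np.eye(k)
W_speaker = A @ S @ B

# With S = I the adapted layer equals the speaker-independent low-rank layer.
assert np.allclose(W_speaker, A @ B)
```

Because only S is updated per speaker, the per-speaker storage cost scales with k squared rather than with the full matrix size.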
  • Publication number: 20140372112
    Abstract: A Deep Neural Network (DNN) model used in an Automatic Speech Recognition (ASR) system is restructured. A restructured DNN model may include fewer parameters compared to the original DNN model. The restructured DNN model may include a monophone state output layer in addition to the senone output layer of the original DNN model. Singular value decomposition (SVD) can be applied to one or more weight matrices of the DNN model to reduce the size of the DNN model. The output layer of the DNN model may be restructured to include monophone states in addition to the senones (tied triphone states) which are included in the original DNN model. When the monophone states are included in the restructured DNN model, the posteriors of the monophone states are used to select a small subset of senones to be evaluated.
    Type: Application
    Filed: June 18, 2013
    Publication date: December 18, 2014
    Inventors: Jian Xue, Emilian Stoimenov, Jinyu Li, Yifan Gong
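As a rough sketch of the SVD step described above, one layer's weight matrix can be replaced by two low-rank factors. The layer size and retained rank below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((2048, 2048))   # hypothetical layer weight matrix

# Keep only the k largest singular values: W ~= (U_k * s_k) @ V_k.
k = 256
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]                    # 2048 x k
B = Vt[:k, :]                           # k x 2048

# Storing the two factors replaces 2048*2048 = 4194304 parameters
# with 2*2048*256 = 1048576, a 4x reduction for this layer.
assert W.size == 4194304
assert A.size + B.size == 1048576
```

In a real acoustic model the weight matrices are far from random, so a small k typically preserves most of the layer's behavior.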
  • Publication number: 20140257814
    Abstract: A high-dimensional posterior-based feature with partial distance elimination may be utilized for speech recognition. The log likelihood values of a large number of Gaussians are needed to generate the high-dimensional posterior feature. Gaussians with very small log likelihoods are associated with zero posterior values. Log likelihoods of the Gaussians for a speech frame may be evaluated with a partial distance elimination method. If the partial distance of a Gaussian already indicates a negligible likelihood, the Gaussian is assigned a zero posterior value. The partial distance may be calculated by sequentially adding individual dimensions in a group of dimensions. Elimination occurs when fewer than all of the dimensions in the group have been added.
    Type: Application
    Filed: March 5, 2013
    Publication date: September 11, 2014
    Applicant: MICROSOFT CORPORATION
    Inventors: Jinyu Li, Zhijie Yan, Qiang Huo, Yifan Gong
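The early-stopping idea above can be sketched as follows. The diagonal-covariance distance and the fixed threshold are illustrative assumptions:

```python
import numpy as np

def partial_distance(x, mean, inv_var, threshold):
    """Accumulate the squared distance one dimension at a time and stop
    as soon as it exceeds `threshold`: beyond that point the Gaussian's
    log likelihood is so small that its posterior rounds to zero."""
    d = 0.0
    for i in range(len(x)):
        d += inv_var[i] * (x[i] - mean[i]) ** 2
        if d > threshold:
            return None                 # eliminated after i+1 dimensions
    return d

x = np.zeros(39)                        # one speech frame (e.g., 39-dim MFCC)
near = partial_distance(x, np.full(39, 0.1), np.ones(39), threshold=25.0)
far = partial_distance(x, np.full(39, 5.0), np.ones(39), threshold=25.0)
assert near is not None                 # full distance computed
assert far is None                      # eliminated after two dimensions
```

The savings come from the eliminated case: a distant Gaussian is rejected after a handful of dimensions instead of all 39.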
  • Publication number: 20140257805
    Abstract: Described herein are various technologies pertaining to a multilingual deep neural network (MDNN). The MDNN includes a plurality of hidden layers, wherein values for the weight parameters of the hidden layers are learned during a training phase based upon raw acoustic features from multiple languages. The MDNN further includes softmax layers that are trained for each target language separately, making use of the hidden layers trained jointly across the multiple source languages. The MDNN is adaptable: a new softmax layer may be added on top of the existing hidden layers, where the new softmax layer corresponds to a new target language.
    Type: Application
    Filed: March 11, 2013
    Publication date: September 11, 2014
    Applicant: MICROSOFT CORPORATION
    Inventors: Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, Yifan Gong
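The shared-hidden-layers/per-language-softmax structure above can be sketched as a tiny forward pass. All layer sizes and the random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hidden layers shared across languages, trained jointly on multilingual data.
hidden = [rng.standard_normal((40, 64)), rng.standard_normal((64, 64))]

# One softmax output layer per target language, trained separately on top of
# the shared hidden stack; the output sizes here are arbitrary.
heads = {"en": rng.standard_normal((64, 120)),
         "fr": rng.standard_normal((64, 100))}

def forward(features, language):
    h = features
    for W in hidden:                    # shared multilingual representation
        h = relu(h @ W)
    return softmax(h @ heads[language])

# Adapting to a new target language only requires a new softmax layer.
heads["de"] = rng.standard_normal((64, 110))
probs = forward(rng.standard_normal(40), "de")
assert probs.shape == (110,) and np.isclose(probs.sum(), 1.0)
```

The key point is that `hidden` is reused untouched when `heads["de"]` is added, which is what makes the adaptation cheap.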
  • Publication number: 20140257804
    Abstract: Technologies pertaining to training a deep neural network (DNN) for use in a recognition system are described herein. The DNN is trained using heterogeneous data, the heterogeneous data including narrowband signals and wideband signals. The DNN, subsequent to being trained, receives an input signal that can be either a wideband or a narrowband signal. The DNN estimates the class posterior probability of the input signal regardless of whether the input signal is wideband or narrowband.
    Type: Application
    Filed: March 7, 2013
    Publication date: September 11, 2014
    Applicant: Microsoft Corporation
    Inventors: Jinyu Li, Dong Yu, Yifan Gong
  • Publication number: 20140214420
    Abstract: Personalization for Automatic Speech Recognition (ASR) is associated with a particular device. A generalized i-vector clustering method is used to train i-vector parameters on utterances received from a device and to classify test utterances from the same device. A sub-loading matrix and a residual noise term may be used when determining the personalization. A Universal Background Model (UBM) is trained using the utterances. The UBM is applied to obtain i-vectors of training utterances received from a device, and a Gaussian Mixture Model (GMM) is trained using the i-vectors. During testing, the i-vector for each utterance received from the device is estimated using the device's UBM, and the utterance is assigned to the cluster with the closest centroid in the GMM. For each utterance, i-vector and residual noise estimation is performed, followed by hyperparameter estimation; the i-vector and hyperparameter estimations alternate until convergence.
    Type: Application
    Filed: January 25, 2013
    Publication date: July 31, 2014
    Applicant: MICROSOFT CORPORATION
    Inventors: Kaisheng Yao, Yifan Gong
  • Publication number: 20140207448
    Abstract: A speech recognition system adaptively estimates a warping factor used to reduce speaker variability. The warping factor is estimated using a small window (e.g. 100 ms) of speech. The warping factor is adaptively adjusted as more speech is obtained, until the warping factor converges or a pre-defined maximum number of adaptations is reached. The speaker may be placed into a group selected from two or more groups based on characteristics that are associated with the speaker's window of speech. Different step sizes may be used within the different groups when estimating the warping factor. VTLN is applied to the speech input using the estimated warping factor. A linear transformation, including a bias term, may also be computed to assist in normalizing the speech along with the application of the VTLN.
    Type: Application
    Filed: January 23, 2013
    Publication date: July 24, 2014
    Applicant: Microsoft Corporation
    Inventors: Shizhen Wang, Yifan Gong, Fileno Alleva
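The adaptive estimation loop above can be sketched as a simple hill climb over the warping factor. The search strategy and the quadratic stand-in for the acoustic score are illustrative assumptions, not the patent's actual estimator:

```python
def adapt_warping_factor(score_fn, alpha=1.0, step=0.02, max_adaptations=10):
    """Move the warping factor in whichever direction improves the score,
    stopping at convergence or after `max_adaptations` adjustments."""
    for _ in range(max_adaptations):
        best = max((alpha - step, alpha, alpha + step), key=score_fn)
        if best == alpha:
            break                       # converged: no neighbor scores higher
        alpha = best
    return alpha

# Toy score peaking at alpha = 0.92, standing in for an acoustic likelihood
# computed over the speech collected so far.
estimated = adapt_warping_factor(lambda a: -(a - 0.92) ** 2)
assert abs(estimated - 0.92) < 0.03
```

In the patent's setting, `step` would differ per speaker group, and `score_fn` would be re-evaluated as more speech accumulates.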
  • Patent number: 8731916
    Abstract: Noise and channel distortion parameters in the vectorized logarithmic or the cepstral domain for an utterance may be estimated, and subsequently the distorted speech parameters in the same domain may be updated using an unscented transformation framework during online automatic speech recognition. An utterance, including speech generated from a transmission source for delivery to a receiver, may be received by a computing device. The computing device may execute instructions for applying the unscented transformation framework to speech feature vectors, representative of the speech, in order to estimate, in a sequential or online manner, static noise and channel distortion parameters and dynamic noise distortion parameters in the unscented transformation framework. The static and dynamic parameters for the distorted speech in the utterance may then be updated from clean speech parameters and the noise and channel distortion parameters using non-linear mapping.
    Type: Grant
    Filed: November 18, 2010
    Date of Patent: May 20, 2014
    Assignee: Microsoft Corporation
    Inventors: Li Deng, Jinyu Li, Dong Yu, Yifan Gong
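The core of an unscented transformation, the framework named in the abstract above, can be sketched as follows. This is a generic sigma-point formulation, not the specific parameterization used in the patent:

```python
import numpy as np

def unscented_transform(f, mean, cov, kappa=1.0):
    """Propagate a Gaussian (mean, cov) through a non-linear function f by
    pushing 2n+1 deterministically chosen sigma points through f and
    re-estimating the output mean and covariance from the results."""
    n = len(mean)
    L = np.linalg.cholesky((n + kappa) * cov)   # columns set the spread
    points = np.vstack([mean, mean + L.T, mean - L.T])
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    y = np.array([f(p) for p in points])
    y_mean = w @ y
    diff = y - y_mean
    y_cov = (w[:, None] * diff).T @ diff
    return y_mean, y_cov

# Sanity check: a linear map should be propagated exactly.
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.3], [0.3, 1.0]])
A = np.array([[0.5, 0.1], [-0.2, 1.5]])
m, c = unscented_transform(lambda x: A @ x, mean, cov)
assert np.allclose(m, A @ mean) and np.allclose(c, A @ cov @ A.T)
```

In the patent's setting, `f` would be the non-linear mapping from clean speech plus noise and channel parameters to distorted speech features.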
  • Patent number: 8700400
    Abstract: Subspace speech adaptation may be utilized for facilitating the recognition of speech containing short utterances. Speech training data may be received in a speech model by a computer. A first matrix may be determined for preconditioning speech statistics based on the speech training data. A second matrix may be determined for representing a basis for the speech to be recognized. A set of basis matrices may then be determined from the first matrix and the second matrix. Speech test data including a short utterance may then be received by the computer. The computer may then apply the set of basis matrices to the speech test data to produce a transcription. The transcription may represent speech recognition of the short utterance.
    Type: Grant
    Filed: December 30, 2010
    Date of Patent: April 15, 2014
    Assignee: Microsoft Corporation
    Inventors: Daniel Povey, Kaisheng Yao, Yifan Gong
  • Publication number: 20140067387
    Abstract: Scalar operations for model adaptation or feature enhancement may be utilized for recognizing an utterance during automatic speech recognition in a noisy environment. An utterance including distorted speech, generated from a transmission source for delivery to a receiver, may be received by a computer. The distortion may be caused by the noisy environment and by channel distortion. An algorithm using only scalar operations may then be executed to recognize the utterance. Because all of the computations are performed with scalar operations, the computational cost is very small in comparison to matrix and vector operations. A Vector Taylor Series with diagonal Jacobian approximation may also be utilized as a distortion-model-based noise-robust algorithm with scalar operations.
    Type: Application
    Filed: September 5, 2012
    Publication date: March 6, 2014
    Applicant: MICROSOFT CORPORATION
    Inventors: Jinyu Li, Michael Lewis Seltzer, Yifan Gong
  • Patent number: 8457946
    Abstract: Architecture for correcting incorrect recognition results in an Asian language speech recognition system. A spelling mode can be launched in response to receiving speech input, the spelling mode for correcting incorrect spelling of the recognition results or generating new words. Correction can be obtained using speech and/or manual selection and entry. The architecture facilitates correction in a single pass, rather than in multiple passes as in conventional systems. Words corrected using the spelling mode are corrected as a unit and treated as a word. The spelling mode applies to Asian languages such as Simplified Chinese, Traditional Chinese, and Japanese.
    Type: Grant
    Filed: April 26, 2007
    Date of Patent: June 4, 2013
    Assignee: Microsoft Corporation
    Inventors: Shiun-Zu Kuo, Kevin E. Feige, Yifan Gong, Taro Miwa, Arun Chitrapu
  • Publication number: 20130080165
    Abstract: Online histogram equalization may be provided. Upon receiving a spoken phrase from a user, a histogram/frequency distribution may be estimated for the spoken phrase according to a prior distribution. The histogram distribution may be equalized and then provided to a spoken language understanding application.
    Type: Application
    Filed: September 24, 2011
    Publication date: March 28, 2013
    Applicant: Microsoft Corporation
    Inventors: Shizhen Wang, Yifan Gong
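Histogram equalization of this kind can be sketched as a rank-based mapping onto a prior distribution. The standard normal prior and the midpoint CDF estimate below are illustrative choices:

```python
import numpy as np
from statistics import NormalDist

def histogram_equalize(values, prior=NormalDist()):
    """Replace each value by the prior's quantile at that value's empirical
    CDF position, so the output histogram matches the prior distribution."""
    ranks = np.argsort(np.argsort(values))      # 0..n-1 rank of each value
    cdf = (ranks + 0.5) / len(values)           # midpoint empirical CDF
    return np.array([prior.inv_cdf(p) for p in cdf])

x = np.random.default_rng(0).exponential(size=1000)   # heavily skewed input
y = histogram_equalize(x)
assert abs(y.mean()) < 1e-6                    # equalized toward N(0, 1)
assert abs(y.std() - 1.0) < 0.05
```

An online variant would update the empirical CDF incrementally as frames of the spoken phrase arrive, rather than sorting a complete batch.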
  • Patent number: 8306819
    Abstract: Techniques for enhanced automatic speech recognition are described. An enhanced ASR system may be operative to generate an error correction function. The error correction function may represent a mapping between a supervised set of parameters and an unsupervised training set of parameters generated using a same set of acoustic training data, and apply the error correction function to an unsupervised testing set of parameters to form a corrected set of parameters used to perform speaker adaptation. Other embodiments are described and claimed.
    Type: Grant
    Filed: March 9, 2009
    Date of Patent: November 6, 2012
    Assignee: Microsoft Corporation
    Inventors: Chaojun Liu, Yifan Gong
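One way to realize an error-correction function like the one above is a least-squares linear map between the two parameter sets. The linear form, the dimensions, and the synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical adaptation parameters estimated two ways on the same acoustic
# training data: with reference transcripts (supervised) and with recognized
# transcripts (unsupervised).
unsup = rng.standard_normal((200, 8))
true_map = rng.standard_normal((8, 8))
sup = unsup @ true_map + 0.01 * rng.standard_normal((200, 8))

# Learn a linear error-correction function mapping unsupervised estimates
# toward their supervised counterparts.
C, *_ = np.linalg.lstsq(unsup, sup, rcond=None)

# Apply the learned correction to unsupervised parameters from test data,
# producing the corrected set used for speaker adaptation.
unsup_test = rng.standard_normal((10, 8))
corrected = unsup_test @ C
assert np.allclose(corrected, unsup_test @ true_map, atol=0.1)
```

The appeal of this scheme is that the correction is trained once, offline, and then applied to unsupervised estimates where no reference transcripts exist.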
  • Patent number: 8239195
    Abstract: A speech recognition system includes a receiver component that receives a distorted speech utterance. The speech recognition system also includes an adaptor component that selectively adapts parameters of a compressed model used to recognize at least a portion of the distorted speech utterance, wherein the adaptor component selectively adapts the parameters of the compressed model based at least in part upon the received distorted speech utterance.
    Type: Grant
    Filed: September 23, 2008
    Date of Patent: August 7, 2012
    Assignee: Microsoft Corporation
    Inventors: Jinyu Li, Li Deng, Dong Yu, Jian Wu, Yifan Gong, Alejandro Acero
  • Publication number: 20120173240
    Abstract: Subspace speech adaptation may be utilized for facilitating the recognition of speech containing short utterances. Speech training data may be received in a speech model by a computer. A first matrix may be determined for preconditioning speech statistics based on the speech training data. A second matrix may be determined for representing a basis for the speech to be recognized. A set of basis matrices may then be determined from the first matrix and the second matrix. Speech test data including a short utterance may then be received by the computer. The computer may then apply the set of basis matrices to the speech test data to produce a transcription. The transcription may represent speech recognition of the short utterance.
    Type: Application
    Filed: December 30, 2010
    Publication date: July 5, 2012
    Applicant: MICROSOFT CORPORATION
    Inventors: Daniel Povey, Kaisheng YAO, Yifan Gong
  • Patent number: 8214215
    Abstract: A speech recognition system described herein includes a receiver component that receives a distorted speech utterance. The speech recognition system also includes an updater component that is in communication with a first model and a second model, wherein the updater component automatically updates parameters of the second model based at least in part upon joint estimates of additive and convolutive distortions output by the first model, wherein the joint estimates of additive and convolutive distortions are estimates of distortions based on a phase-sensitive model in the speech utterance received by the receiver component. Further, distortions other than additive and convolutive distortions, including other stationary and nonstationary sources, can also be estimated and used to update the parameters of the second model.
    Type: Grant
    Filed: September 24, 2008
    Date of Patent: July 3, 2012
    Assignee: Microsoft Corporation
    Inventors: Jinyu Li, Li Deng, Dong Yu, Yifan Gong, Alejandro Acero
  • Publication number: 20120130710
    Abstract: Noise and channel distortion parameters in the vectorized logarithmic or the cepstral domain for an utterance may be estimated, and subsequently the distorted speech parameters in the same domain may be updated using an unscented transformation framework during online automatic speech recognition. An utterance, including speech generated from a transmission source for delivery to a receiver, may be received by a computing device. The computing device may execute instructions for applying the unscented transformation framework to speech feature vectors, representative of the speech, in order to estimate, in a sequential or online manner, static noise and channel distortion parameters and dynamic noise distortion parameters in the unscented transformation framework. The static and dynamic parameters for the distorted speech in the utterance may then be updated from clean speech parameters and the noise and channel distortion parameters using non-linear mapping.
    Type: Application
    Filed: November 18, 2010
    Publication date: May 24, 2012
    Applicant: Microsoft Corporation
    Inventors: Li Deng, Jinyu Li, Dong Yu, Yifan Gong
  • Patent number: 8185376
    Abstract: The language of origin of a word is determined by analyzing non-uniform letter sequence portions of the word.
    Type: Grant
    Filed: March 20, 2006
    Date of Patent: May 22, 2012
    Assignee: Microsoft Corporation
    Inventors: Min Chu, Yi Ning Chen, Shiun-Zu Kuo, Xiaodong He, Megan Riley, Kevin E. Feige, Yifan Gong
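The letter-sequence analysis above can be sketched as a character n-gram classifier. The tiny training lists are toy examples (not real lexica), and counting n-gram overlap is a simple stand-in for a probabilistic model:

```python
from collections import Counter

def letter_ngrams(word, nmin=2, nmax=4):
    """Non-uniform letter sequences: all substrings of length nmin..nmax,
    with ^ and $ marking the word boundaries."""
    w = f"^{word.lower()}$"
    return [w[i:i + n] for n in range(nmin, nmax + 1)
            for i in range(len(w) - n + 1)]

# Toy illustrative training sets, one per candidate language of origin.
training = {"German": ["schmidt", "schneider", "schulz"],
            "Italian": ["rossi", "ricci", "russo"]}
models = {lang: Counter(g for w in words for g in letter_ngrams(w))
          for lang, words in training.items()}

def origin(word):
    """Pick the language whose n-gram model best covers the word."""
    grams = letter_ngrams(word)
    return max(models, key=lambda lang: sum(models[lang][g] for g in grams))

assert origin("schubert") == "German"    # "sch", "^sc", ... match German
```

A production system would smooth the counts and weight longer sequences more heavily, since they are more language-specific.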
  • Patent number: 8185389
    Abstract: Described is noise reduction technology, generally for speech input, in which a noise-suppression-related gain value for each frame is determined based upon the noise level associated with that frame in addition to the signal-to-noise ratios (SNRs). In one implementation, the noise reduction mechanism is based upon minimum mean square error, Mel-frequency cepstral noise reduction technology. A high gain value (e.g., one) is set to accomplish little or no noise suppression when the noise level is below a low threshold, and a low gain value is set or computed to accomplish large noise suppression when the noise level is above a high threshold. A noise-power-dependent function, e.g., a log-linear interpolation, is used to compute the gain between the thresholds. Smoothing may be performed by modifying the gain value based upon a prior frame's gain value. Also described is learning the parameters used in noise reduction via a step-adaptive discriminative learning algorithm.
    Type: Grant
    Filed: December 16, 2008
    Date of Patent: May 22, 2012
    Assignee: Microsoft Corporation
    Inventors: Dong Yu, Li Deng, Yifan Gong, Jian Wu, Alejandro Acero
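The thresholded, log-linearly interpolated gain described above can be sketched as follows. The threshold values, minimum gain, and smoothing weight are illustrative assumptions:

```python
import math

def suppression_gain(noise_power, low=1e-4, high=1e-1, g_min=0.1):
    """Gain as a function of frame noise power: no suppression below the
    low threshold, maximum suppression above the high threshold, and a
    log-linear interpolation in between."""
    if noise_power <= low:
        return 1.0
    if noise_power >= high:
        return g_min
    t = (math.log(noise_power) - math.log(low)) / (math.log(high) - math.log(low))
    return 1.0 + t * (g_min - 1.0)     # linear in log(noise_power)

def smooth(gain, prev_gain, beta=0.7):
    """Temporal smoothing: blend the gain with the previous frame's gain."""
    return beta * prev_gain + (1.0 - beta) * gain

assert suppression_gain(1e-5) == 1.0               # quiet frame: no suppression
assert suppression_gain(1.0) == 0.1                # noisy frame: max suppression
mid = math.sqrt(1e-4 * 1e-1)                       # geometric mean of thresholds
assert abs(suppression_gain(mid) - 0.55) < 1e-6    # halfway in the log domain
```

Interpolating in the log domain matches how noise power is perceived and measured (in dB), which is why the midpoint in gain falls at the geometric mean of the two thresholds.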