Patents by Inventor Kaisheng Yao

Kaisheng Yao has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20220122580
    Abstract: An example intent-recognition system comprises a processor and memory storing instructions. The instructions cause the processor to receive speech input comprising spoken words. The instructions cause the processor to generate text results based on the speech input and generate acoustic feature annotations based on the speech input. The instructions also cause the processor to apply an intent model to the text results and the acoustic feature annotations to recognize an intent based on the speech input. An example system for adapting an emotional text-to-speech model comprises a processor and memory. The memory stores instructions that cause the processor to receive training examples comprising speech input and receive labelling data comprising emotion information associated with the speech input. The instructions also cause the processor to extract audio signal vectors from the training examples and generate an emotion-adapted voice font model based on the audio signal vectors and the labelling data.
    Type: Application
    Filed: December 24, 2021
    Publication date: April 21, 2022
    Inventors: Pei ZHAO, Kaisheng YAO, Max LEUNG, Bo YAN, Jian LUAN, Yu SHI, Malone MA, Mei-Yuh HWANG
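
For illustration, the intent-recognition pipeline in the abstract above can be sketched in a few lines of Python. The toy vocabulary, the two acoustic annotations, and the untrained linear intent model are all assumptions standing in for the patented components:

    # Minimal sketch: combine ASR text results with acoustic feature
    # annotations and apply an intent model to both. Vocabulary, features,
    # and random (untrained) weights are illustrative assumptions.
    import numpy as np

    VOCAB = {"play": 0, "stop": 1, "music": 2, "alarm": 3}
    INTENTS = ["play_music", "set_alarm", "stop"]

    def text_features(words):
        # Bag-of-words vector over the toy vocabulary.
        v = np.zeros(len(VOCAB))
        for w in words:
            if w in VOCAB:
                v[VOCAB[w]] += 1.0
        return v

    def acoustic_annotations(samples):
        # Two toy annotations: log energy and zero-crossing rate.
        energy = np.log(np.mean(samples ** 2) + 1e-8)
        zcr = np.mean(np.abs(np.diff(np.sign(samples)))) / 2.0
        return np.array([energy, zcr])

    rng = np.random.default_rng(0)
    W = rng.normal(size=(len(INTENTS), len(VOCAB) + 2))  # untrained intent model

    def recognize_intent(words, samples):
        x = np.concatenate([text_features(words), acoustic_annotations(samples)])
        logits = W @ x
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return INTENTS[int(np.argmax(probs))], probs

    intent, probs = recognize_intent(["play", "music"], rng.normal(size=16000))
    print(intent, np.round(probs, 3))
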
  • Patent number: 11244689
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining voice characteristics are provided. One of the methods includes: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
    Type: Grant
    Filed: March 22, 2021
    Date of Patent: February 8, 2022
    Assignee: ALIPAY (HANGZHOU) INFORMATION TECHNOLOGY CO., LTD.
    Inventors: Zhiming Wang, Kaisheng Yao, Xiaolong Li
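
The joint objective in the abstract above (a non-sampling-based loss plus a Gaussian mixture loss with non-unit covariance) can be illustrated with a minimal numpy sketch. The shapes, the diagonal covariance, and the 0.5 weighting between the two losses are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    num_speakers, dim = 4, 8
    embeddings = rng.normal(size=(16, dim))       # model outputs for a batch
    labels = rng.integers(0, num_speakers, size=16)

    W = rng.normal(size=(num_speakers, dim))      # classifier weights
    means = rng.normal(size=(num_speakers, dim))  # per-class Gaussian means
    log_vars = rng.normal(scale=0.1, size=(num_speakers, dim))  # non-unit (diagonal) covariance

    def softmax_xent(x, y):
        # Non-sampling-based loss: full softmax cross-entropy over all classes.
        logits = x @ W.T
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(y)), y].mean()

    def gm_loss(x, y):
        # Gaussian mixture loss: negative log density of each embedding
        # under its class Gaussian with a diagonal, non-unit covariance.
        diff = x - means[y]
        var = np.exp(log_vars[y])
        nll = 0.5 * (diff ** 2 / var + log_vars[y]).sum(axis=1)
        return nll.mean()

    # Jointly minimized objective; the 0.5 weighting is an assumption.
    total = softmax_xent(embeddings, labels) + 0.5 * gm_loss(embeddings, labels)
    print("joint loss:", round(float(total), 4))
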
  • Patent number: 11238842
    Abstract: An example intent-recognition system comprises a processor and memory storing instructions. The instructions cause the processor to receive speech input comprising spoken words. The instructions cause the processor to generate text results based on the speech input and generate acoustic feature annotations based on the speech input. The instructions also cause the processor to apply an intent model to the text results and the acoustic feature annotations to recognize an intent based on the speech input. An example system for adapting an emotional text-to-speech model comprises a processor and memory. The memory stores instructions that cause the processor to receive training examples comprising speech input and receive labelling data comprising emotion information associated with the speech input. The instructions also cause the processor to extract audio signal vectors from the training examples and generate an emotion-adapted voice font model based on the audio signal vectors and the labelling data.
    Type: Grant
    Filed: June 7, 2017
    Date of Patent: February 1, 2022
    Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
    Inventors: Pei Zhao, Kaisheng Yao, Max Leung, Bo Yan, Jian Luan, Yu Shi, Malone Ma, Mei-Yuh Hwang
  • Patent number: 11100412
    Abstract: Implementations of the present specification provide a method and an apparatus for extending question-and-answer samples. According to the method, a random number is generated for each existing sample, and the question of any sample whose random number falls within the sample-extension range is blurred to generate an extended sample, so that the overall rate of extension by question blurring can be effectively controlled. In addition, for a sample selected for blurring extension, the question is extended by deleting a word with a predetermined part of speech, and an extended sample is then generated from the extended question, so that more ways of phrasing a question are accommodated. A question-and-answer model trained on a sample set augmented with these extended samples can therefore provide answers to users more effectively.
    Type: Grant
    Filed: March 13, 2020
    Date of Patent: August 24, 2021
    Assignee: Advanced New Technologies Co., Ltd.
    Inventors: Kaisheng Yao, Jiaxing Zhang, Jia Liu, Xiaolong Li
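
A minimal Python sketch of the extension scheme described above follows. The toy part-of-speech table and the choice of adverbs as the predetermined part of speech to delete are assumptions:

    import random

    STOP_POS = {"ADV"}  # predetermined part of speech to delete (assumption)
    TOY_POS = {"really": "ADV", "quickly": "ADV", "do": "AUX",
               "reset": "VERB", "my": "DET", "password": "NOUN"}

    def blur_question(question):
        # Delete words whose (toy) part of speech is in STOP_POS.
        kept = [w for w in question.split()
                if TOY_POS.get(w, "NOUN") not in STOP_POS]
        return " ".join(kept)

    def extend_samples(samples, extension_rate, seed=0):
        # The per-sample random number controls the overall extension rate.
        random.seed(seed)
        extended = list(samples)
        for question, answer in samples:
            if random.random() < extension_rate:
                blurred = blur_question(question)
                if blurred != question:
                    extended.append((blurred, answer))
        return extended

    samples = [("how do I really quickly reset my password", "Use the reset link.")]
    for q, a in extend_samples(samples, extension_rate=0.9):
        print(q, "->", a)
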
  • Publication number: 20210225357
    Abstract: An example intent-recognition system comprises a processor and memory storing instructions. The instructions cause the processor to receive speech input comprising spoken words. The instructions cause the processor to generate text results based on the speech input and generate acoustic feature annotations based on the speech input. The instructions also cause the processor to apply an intent model to the text results and the acoustic feature annotations to recognize an intent based on the speech input. An example system for adapting an emotional text-to-speech model comprises a processor and memory. The memory stores instructions that cause the processor to receive training examples comprising speech input and receive labelling data comprising emotion information associated with the speech input. The instructions also cause the processor to extract audio signal vectors from the training examples and generate an emotion-adapted voice font model based on the audio signal vectors and the labelling data.
    Type: Application
    Filed: June 7, 2017
    Publication date: July 22, 2021
    Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
    Inventors: Pei ZHAO, Kaisheng YAO, Max LEUNG, Bo YAN, Jian LUAN, Yu SHI, Malone MA, Mei-Yuh HWANG
  • Publication number: 20210210101
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining voice characteristics are provided. One of the methods includes: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
    Type: Application
    Filed: March 22, 2021
    Publication date: July 8, 2021
    Inventors: Zhiming WANG, Kaisheng YAO, Xiaolong LI
  • Patent number: 11031018
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for personalized speaker verification are provided. One of the methods includes: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample; feeding the positive sample and the negative sample to a first model for determining voice characteristics to correspondingly output a positive voice characteristic and a negative voice characteristic of the speaker; obtaining a gradient based at least on the positive voice characteristic and the negative voice characteristic; and feeding the gradient to the first model to update one or more parameters of the first model to obtain a second model for personalized speaker verification.
    Type: Grant
    Filed: December 22, 2020
    Date of Patent: June 8, 2021
    Assignee: ALIPAY (HANGZHOU) INFORMATION TECHNOLOGY CO., LTD.
    Inventors: Zhiming Wang, Kaisheng Yao, Xiaolong Li
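
The personalization loop above can be sketched as follows: a positive and a negative sample yield two voice characteristics, a gradient is computed from them, and the update turns the first model into a personalized second model. The linear embedding model and the separation objective are assumptions; the patent does not specify either:

    import numpy as np

    rng = np.random.default_rng(0)
    dim_in, dim_emb = 20, 8
    W1 = rng.normal(size=(dim_emb, dim_in))    # "first model" parameters

    pos = rng.normal(size=dim_in)              # speaker's speech features
    neg = rng.normal(size=dim_in)              # other entity's speech features

    def loss_and_grad(W, pos, neg):
        # Push the positive and negative voice characteristics apart
        # (assumed objective); gradient derived by hand for the linear model.
        diff = W @ pos - W @ neg
        loss = -0.5 * float(diff @ diff)
        grad = -np.outer(diff, pos - neg)
        return loss, grad

    loss, grad = loss_and_grad(W1, pos, neg)
    W2 = W1 - 0.01 * grad                      # personalized "second model"
    print("loss before update:", round(loss, 4))
    print("loss after update:", round(loss_and_grad(W2, pos, neg)[0], 4))
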
  • Patent number: 10997980
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining voice characteristics are provided. One of the methods includes: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
    Type: Grant
    Filed: October 27, 2020
    Date of Patent: May 4, 2021
    Assignee: ALIPAY (HANGZHOU) INFORMATION TECHNOLOGY CO., LTD.
    Inventors: Zhiming Wang, Kaisheng Yao, Xiaolong Li
  • Publication number: 20210110833
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for personalized speaker verification are provided. One of the methods includes: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample; feeding the positive sample and the negative sample to a first model for determining voice characteristics to correspondingly output a positive voice characteristic and a negative voice characteristic of the speaker; obtaining a gradient based at least on the positive voice characteristic and the negative voice characteristic; and feeding the gradient to the first model to update one or more parameters of the first model to obtain a second model for personalized speaker verification.
    Type: Application
    Filed: December 22, 2020
    Publication date: April 15, 2021
    Inventors: Zhiming WANG, Kaisheng YAO, Xiaolong LI
  • Publication number: 20210043216
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining voice characteristics are provided. One of the methods includes: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
    Type: Application
    Filed: October 27, 2020
    Publication date: February 11, 2021
    Inventors: Zhiming WANG, Kaisheng YAO, Xiaolong LI
  • Publication number: 20210027177
    Abstract: Implementations of the present specification provide a method and an apparatus for extending question-and-answer samples. According to the method, a random number is generated for each existing sample, and the question of any sample whose random number falls within the sample-extension range is blurred to generate an extended sample, so that the overall rate of extension by question blurring can be effectively controlled. In addition, for a sample selected for blurring extension, the question is extended by deleting a word with a predetermined part of speech, and an extended sample is then generated from the extended question, so that more ways of phrasing a question are accommodated. A question-and-answer model trained on a sample set augmented with these extended samples can therefore provide answers to users more effectively.
    Type: Application
    Filed: March 13, 2020
    Publication date: January 28, 2021
    Inventors: Kaisheng Yao, Jiaxing Zhang, Jia Liu, Xiaolong Li
  • Patent number: 10867597
    Abstract: Technologies pertaining to slot filling are described herein. A deep neural network, a recurrent neural network, and/or a spatio-temporally deep neural network are configured to assign labels to words in a word sequence set forth in natural language. At least one label is a semantic label that is assigned to at least one word in the word sequence.
    Type: Grant
    Filed: September 2, 2013
    Date of Patent: December 15, 2020
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Anoop Deoras, Kaisheng Yao, Xiaodong He, Li Deng, Geoffrey Gerson Zweig, Ruhi Sarikaya, Dong Yu, Mei-Yuh Hwang, Gregoire Mesnil
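
The slot-filling idea above can be sketched with a tiny Elman-style RNN that assigns a semantic label to each word in the sequence. The vocabulary, the label set, and the untrained random weights are assumptions; no training loop is shown:

    import numpy as np

    VOCAB = {"flights": 0, "from": 1, "boston": 2, "to": 3, "denver": 4}
    LABELS = ["O", "B-from_city", "B-to_city"]

    rng = np.random.default_rng(0)
    emb = rng.normal(size=(len(VOCAB), 16))    # word embeddings
    W_xh = rng.normal(size=(16, 16))           # input-to-hidden weights
    W_hh = rng.normal(size=(16, 16))           # hidden-to-hidden (recurrent)
    W_hy = rng.normal(size=(len(LABELS), 16))  # hidden-to-label weights

    def tag(words):
        h = np.zeros(16)
        tags = []
        for w in words:
            x = emb[VOCAB[w]]
            h = np.tanh(W_xh @ x + W_hh @ h)   # recurrent state carries context
            tags.append(LABELS[int(np.argmax(W_hy @ h))])
        return tags

    words = ["flights", "from", "boston", "to", "denver"]
    print(list(zip(words, tag(words))))
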
  • Publication number: 20200167388
    Abstract: Methods and devices for storing and calling data are provided. One of the methods includes: receiving first motion data and business data; establishing an association between the first motion data and the business data and storing the association; receiving second motion data; and determining first motion data that matches the second motion data, and returning, to a sender of the second motion data, business data associated with the matched first motion data.
    Type: Application
    Filed: January 28, 2020
    Publication date: May 28, 2020
    Inventors: Kaisheng YAO, Peng Xu, Yuan Qi, Xiaofu Chang
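
A minimal sketch of the store-and-call flow above follows. The vector representation of motion data, the Euclidean-distance matcher, and the match threshold are assumptions:

    import numpy as np

    class MotionStore:
        def __init__(self, threshold=1.0):
            self.entries = []          # (motion_vector, business_data) pairs
            self.threshold = threshold

        def store(self, motion, business_data):
            # Associate first motion data with business data and keep it.
            self.entries.append((np.asarray(motion, dtype=float), business_data))

        def call(self, motion):
            # Return the business data whose stored motion best matches.
            query = np.asarray(motion, dtype=float)
            best, best_dist = None, float("inf")
            for stored, data in self.entries:
                dist = float(np.linalg.norm(stored - query))
                if dist < best_dist:
                    best, best_dist = data, dist
            return best if best_dist <= self.threshold else None

    store = MotionStore()
    store.store([0.1, 0.9, 0.2], {"account": "alice", "action": "pay"})
    print(store.call([0.12, 0.88, 0.21]))   # close match -> business data
    print(store.call([5.0, 5.0, 5.0]))      # no match -> None
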
  • Publication number: 20190294632
    Abstract: Methods and devices for storing and calling data are provided. One of the methods includes: receiving first motion data and business data; establishing an association between the first motion data and the business data and storing the association; receiving second motion data; and determining first motion data that matches the second motion data, and returning, to a sender of the second motion data, business data associated with the matched first motion data.
    Type: Application
    Filed: June 12, 2019
    Publication date: September 26, 2019
    Inventors: Kaisheng YAO, Peng Xu, Yuan Qi, Xiaofu Chang
  • Patent number: 10127901
    Abstract: The technology relates to converting text to speech utilizing recurrent neural networks (RNNs). The recurrent neural networks may be implemented as multiple modules for determining properties of the text. In embodiments, a part-of-speech RNN module, a letter-to-sound RNN module, a linguistic prosody tagger RNN module, and a context awareness and semantic mining RNN module may all be utilized. The properties from the RNN modules are processed by a hyper-structure RNN module that determines the phonetic properties of the input text based on the outputs of the other RNN modules. The hyper-structure RNN module may generate a generation sequence that is capable of being converted to audible speech by a speech synthesizer. The generation sequence may also be optimized by a global optimization module prior to being synthesized into audible speech.
    Type: Grant
    Filed: June 13, 2014
    Date of Patent: November 13, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Pei Zhao, Max Leung, Kaisheng Yao, Bo Yan, Sheng Zhao, Fileno A. Alleva
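
The multi-module pipeline above can be sketched with stub modules feeding a combining module that emits a generation sequence. Every module body here is a stand-in assumption, not the patented RNNs:

    def pos_module(words):
        # Stand-in part-of-speech tagger.
        return ["NOUN" if w.istitle() else "VERB" for w in words]

    def letter_to_sound_module(words):
        # Stand-in letter-to-sound conversion: letters as pseudo-phones.
        return [list(w.lower()) for w in words]

    def prosody_module(words):
        # Stand-in linguistic prosody tagger.
        return ["H*" if i == 0 else "L-" for i in range(len(words))]

    def hyper_structure_module(words):
        # Combine the per-module properties into a generation sequence.
        pos = pos_module(words)
        phones = letter_to_sound_module(words)
        prosody = prosody_module(words)
        return [{"word": w, "pos": p, "phones": ph, "tone": t}
                for w, p, ph, t in zip(words, pos, phones, prosody)]

    for unit in hyper_structure_module(["Hello", "world"]):
        print(unit)
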
  • Patent number: 10089974
    Abstract: An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
    Type: Grant
    Filed: March 31, 2016
    Date of Patent: October 2, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Pei Zhao, Kaisheng Yao, Max Leung, Bo Yan
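
A minimal sketch of deriving a conversion model from pronunciation-sequence differences over training pairs follows. The position-wise alignment and the majority-vote substitution table are simplifying assumptions; the patent does not specify either:

    from collections import Counter, defaultdict

    def sequence_difference(seq_speech, seq_text):
        # Collect position-wise phone substitutions between the two sequences.
        diffs = []
        for a, b in zip(seq_speech, seq_text):
            if a != b:
                diffs.append((b, a))   # text phone -> observed speech phone
        return diffs

    def build_conversion_model(training_pairs):
        counts = defaultdict(Counter)
        for seq_speech, seq_text in training_pairs:
            for src, dst in sequence_difference(seq_speech, seq_text):
                counts[src][dst] += 1
        # Keep the most frequent observed substitution per text phone.
        return {src: c.most_common(1)[0][0] for src, c in counts.items()}

    pairs = [(["t", "ah", "m", "ey", "t", "ow"],    # from speech
              ["t", "ah", "m", "aa", "t", "ow"])]   # from text
    print(build_conversion_model(pairs))            # -> {'aa': 'ey'}
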
  • Publication number: 20170287465
    Abstract: An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
    Type: Application
    Filed: March 31, 2016
    Publication date: October 5, 2017
    Applicant: Microsoft Technology Licensing, LLC
    Inventors: Pei Zhao, Kaisheng Yao, Max Leung, Bo Yan
  • Publication number: 20160307565
    Abstract: Aspects of the technology described herein relate to a new type of deep neural network (DNN). The new DNN is described herein as a deep neural support vector machine (DNSVM). Traditional DNNs use multinomial logistic regression (a softmax activation) at the top layer when training the top and underlying layers. The new DNN instead uses a support vector machine (SVM) as one or more layers, including the top layer. The technology described herein can use one of two training algorithms to train the DNSVM, learning the parameters of the SVM and the DNN under the maximum-margin criterion. The first is frame-level training, in which the new model is shown to be related to the multi-class SVM with DNN features. The second is sequence-level training, which is related to the structured SVM with DNN features and HMM state-transition features.
    Type: Application
    Filed: February 16, 2016
    Publication date: October 20, 2016
    Inventors: CHAOJUN LIU, KAISHENG YAO, YIFAN GONG, SHIXIONG ZHANG
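
The frame-level variant above can be illustrated by putting a multi-class hinge (max-margin) loss on top of DNN features. The layer sizes and random weights are assumptions, and only the forward pass and loss are shown:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(32, 40))                 # acoustic frames
    y = rng.integers(0, 3, size=32)               # HMM-state style targets

    W1, b1 = rng.normal(size=(40, 64)), np.zeros(64)
    W_svm = rng.normal(size=(64, 3))              # SVM top layer, one column per class

    def dnn_features(x):
        # One ReLU hidden layer standing in for the deep network.
        return np.maximum(x @ W1 + b1, 0.0)

    def multiclass_hinge(scores, y, margin=1.0):
        # Max-margin loss: penalize classes scoring within the margin
        # of the true class.
        correct = scores[np.arange(len(y)), y][:, None]
        loss = np.maximum(0.0, scores - correct + margin)
        loss[np.arange(len(y)), y] = 0.0          # no penalty on true class
        return loss.sum(axis=1).mean()

    scores = dnn_features(x) @ W_svm
    print("frame-level max-margin loss:", round(float(multiclass_hinge(scores, y)), 4))
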
  • Publication number: 20160091965
    Abstract: A “Natural Motion Controller” identifies various motions of one or more parts of a user's body to interact with electronic devices, thereby enabling various natural user interface (NUI) scenarios. The Natural Motion Controller constructs composite motion recognition windows by concatenating an adjustable number of sequential periods of inertial sensor data received from a plurality of separate sets of inertial sensors. Each of these separate sets of inertial sensors is coupled to, or otherwise provides sensor data relating to, a separate user-worn, carried, or held mobile computing device. Each composite motion recognition window is then passed to a motion recognition model trained by one or more machine-based deep learning processes. This motion recognition model is then applied to the composite motion recognition windows to identify a sequence of one or more predefined motions. Identified motions are then used as the basis for triggering execution of one or more application commands.
    Type: Application
    Filed: September 30, 2014
    Publication date: March 31, 2016
    Inventors: Jiaping Wang, Yujia Li, Xuedong Huang, Lingfeng Wu, Wei Xiong, Kaisheng Yao, Geoffrey Zweig
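
A minimal sketch of composite-window construction from several devices' inertial streams follows. The stub classifier and the command table are assumptions standing in for the deep-learned motion recognition model:

    import numpy as np

    def composite_window(device_streams, num_periods):
        # Concatenate the last num_periods samples from each device's
        # inertial sensor stream into one composite recognition window.
        parts = [np.asarray(s[-num_periods:]).ravel() for s in device_streams]
        return np.concatenate(parts)

    def stub_motion_model(window):
        # Stand-in for the trained motion recognition model.
        return "raise_arm" if window.mean() > 0 else "rest"

    COMMANDS = {"raise_arm": "answer_call", "rest": "noop"}

    rng = np.random.default_rng(0)
    wrist = rng.normal(loc=0.5, size=(50, 3))     # worn device accelerometer
    phone = rng.normal(loc=0.2, size=(50, 3))     # carried device accelerometer

    window = composite_window([wrist, phone], num_periods=20)
    motion = stub_motion_model(window)
    print(motion, "->", COMMANDS[motion])          # motion triggers a command
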
  • Patent number: 9280969
    Abstract: Techniques and systems for training an acoustic model are described. In an embodiment, a technique for training an acoustic model includes dividing a corpus of training data that includes transcription errors into N parts and, on each part, decoding an utterance with an incremental acoustic model and an incremental language model to produce a decoded transcription. The technique may further include inserting silence between pairs of words in the decoded transcription and aligning an original transcription corresponding to the utterance with the decoded transcription according to time for each part. The technique may further include selecting a segment from the utterance having at least Q contiguous matching aligned words, and training the incremental acoustic model with the selected segment. The trained incremental acoustic model may then be used on a subsequent part of the training data. Other embodiments are described and claimed.
    Type: Grant
    Filed: June 10, 2009
    Date of Patent: March 8, 2016
    Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
    Inventors: Jinyu Li, Yifan Gong, Chaojun Liu, Kaisheng Yao
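
The segment-selection step above can be sketched as follows; the position-wise word alignment is a simplification of the patent's time-based alignment:

    def matching_segments(original, decoded, q):
        # Yield (start, end) spans with at least q contiguous matching words.
        spans, start = [], None
        for i, (a, b) in enumerate(zip(original, decoded)):
            if a == b:
                start = i if start is None else start
            else:
                if start is not None and i - start >= q:
                    spans.append((start, i))
                start = None
        if start is not None and len(original) - start >= q:
            spans.append((start, len(original)))
        return spans

    orig = "the cat sat on the mat today".split()
    deco = "the cat sat on a mat today".split()
    for s, e in matching_segments(orig, deco, q=3):
        print(" ".join(orig[s:e]))   # -> "the cat sat on"
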