Patents by Inventor Kaisheng Yao
Kaisheng Yao has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20220122580
Abstract: An example intent-recognition system comprises a processor and memory storing instructions. The instructions cause the processor to receive speech input comprising spoken words, generate text results based on the speech input, and generate acoustic feature annotations based on the speech input. The instructions also cause the processor to apply an intent model to the text results and the acoustic feature annotations to recognize an intent based on the speech input. An example system for adapting an emotional text-to-speech model comprises a processor and memory. The memory stores instructions that cause the processor to receive training examples comprising speech input and receive labelling data comprising emotion information associated with the speech input. The instructions also cause the processor to extract audio signal vectors from the training examples and generate an emotion-adapted voice font model based on the audio signal vectors and the labelling data.
Type: Application
Filed: December 24, 2021
Publication date: April 21, 2022
Inventors: Pei Zhao, Kaisheng Yao, Max Leung, Bo Yan, Jian Luan, Yu Shi, Malone Ma, Mei-Yuh Hwang
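The claimed system fuses two signals, the recognized text and acoustic feature annotations, before intent classification. As a rough illustration of that fusion (not the patented implementation; the vocabulary, the three acoustic features, and the random weights are all invented for the example), a minimal sketch might concatenate a bag-of-words text vector with an acoustic feature vector and score intents with a linear softmax layer:

```python
import numpy as np

VOCAB = ["turn", "on", "off", "the", "lights", "play", "music"]
INTENTS = ["lights_on", "lights_off", "play_music"]

def text_vector(words):
    """Bag-of-words encoding of the ASR text results."""
    v = np.zeros(len(VOCAB))
    for w in words:
        if w in VOCAB:
            v[VOCAB.index(w)] += 1.0
    return v

def intent_probs(words, acoustic, W, b):
    """Concatenate text and acoustic annotations, then score intents."""
    x = np.concatenate([text_vector(words), acoustic])
    logits = W @ x + b
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # softmax over intents

rng = np.random.default_rng(0)
# Acoustic feature annotations, e.g. [mean pitch Hz, energy, speech rate].
acoustic = np.array([180.0, 0.6, 3.2])
W = rng.normal(size=(len(INTENTS), len(VOCAB) + acoustic.size))
b = np.zeros(len(INTENTS))

probs = intent_probs(["turn", "on", "the", "lights"], acoustic, W, b)
print(dict(zip(INTENTS, probs.round(3))))
```

In a real system the weights would be learned and the annotations would come from a dedicated acoustic front end, but the shape of the computation is the same: one joint feature vector in, a distribution over intents out.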
-
Patent number: 11244689
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining voice characteristics are provided. One of the methods includes: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
Type: Grant
Filed: March 22, 2021
Date of Patent: February 8, 2022
Assignee: Alipay (Hangzhou) Information Technology Co., Ltd.
Inventors: Zhiming Wang, Kaisheng Yao, Xiaolong Li
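The claim language pairs a non-sampling-based loss with a Gaussian mixture loss using a non-unit covariance matrix. The sketch below is one plausible reading of that combination, assuming a full-softmax cross-entropy as the non-sampling-based term and a per-class diagonal (hence non-unit) covariance for the Gaussian term; the weighting factor and all shapes are illustrative, not taken from the patent:

```python
import numpy as np

def joint_loss(emb, labels, W, means, log_var, weight=0.1):
    """Joint loss sketch: full-softmax cross-entropy (a non-sampling-based
    loss) plus a Gaussian-mixture term with per-class diagonal (non-unit)
    covariance pulling embeddings toward their class means."""
    # Cross-entropy over ALL classes -- no negative sampling involved.
    logits = emb @ W
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()

    # Gaussian-mixture loss: negative log-density of each embedding under
    # its class Gaussian with a learned diagonal covariance.
    mu = means[labels]                    # (batch, dim) class means
    var = np.exp(log_var)[labels]         # (batch, dim) diagonal covariances
    nll = 0.5 * (((emb - mu) ** 2) / var + np.log(var)).sum(axis=1).mean()
    return ce + weight * nll

rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 8))             # speaker embeddings from the model
labels = np.array([0, 1, 0, 2])
W = rng.normal(size=(8, 3))
means = rng.normal(size=(3, 8))
log_var = np.zeros((3, 8))                # learned; initialized to unit variance
print("joint loss:", joint_loss(emb, labels, W, means, log_var))
```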
-
Patent number: 11238842
Abstract: An example intent-recognition system comprises a processor and memory storing instructions. The instructions cause the processor to receive speech input comprising spoken words, generate text results based on the speech input, and generate acoustic feature annotations based on the speech input. The instructions also cause the processor to apply an intent model to the text results and the acoustic feature annotations to recognize an intent based on the speech input. An example system for adapting an emotional text-to-speech model comprises a processor and memory. The memory stores instructions that cause the processor to receive training examples comprising speech input and receive labelling data comprising emotion information associated with the speech input. The instructions also cause the processor to extract audio signal vectors from the training examples and generate an emotion-adapted voice font model based on the audio signal vectors and the labelling data.
Type: Grant
Filed: June 7, 2017
Date of Patent: February 1, 2022
Assignee: Microsoft Technology Licensing, LLC
Inventors: Pei Zhao, Kaisheng Yao, Max Leung, Bo Yan, Jian Luan, Yu Shi, Malone Ma, Mei-Yuh Hwang
-
Patent number: 11100412
Abstract: Implementations of the present specification provide a method and an apparatus for extending question-and-answer samples. According to the method, a random number is generated for each existing sample, and the question of any sample whose random number falls within the sample-extension range is blurred to generate an extended sample, so that the overall blurring-extension rate can be effectively controlled. For a sample selected for blurring extension, the question is extended by deleting words with a predetermined part of speech, and an extended sample is then generated from the extended question, making more ways of expressing a question compatible. A question-and-answer model trained on a sample set augmented with these extended samples can therefore answer users more effectively.
Type: Grant
Filed: March 13, 2020
Date of Patent: August 24, 2021
Assignee: Advanced New Technologies Co., Ltd.
Inventors: Kaisheng Yao, Jiaxing Zhang, Jia Liu, Xiaolong Li
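A minimal sketch of the extension procedure described above, using a toy hard-coded part-of-speech table in place of a real tagger (the POS entries, the droppable tag set, and the extension rate are all invented for illustration):

```python
import random

# Toy POS lookup; a real system would use a proper POS tagger.
POS = {"the": "DET", "a": "DET", "please": "INTJ", "very": "ADV",
       "how": "ADV", "do": "VERB", "i": "PRON", "reset": "VERB",
       "my": "PRON", "password": "NOUN", "quickly": "ADV"}
DROPPABLE = {"DET", "ADV", "INTJ"}        # predetermined parts of speech

def extend_samples(samples, extension_rate=0.5, seed=42):
    """Blur a controlled fraction of question samples by deleting words
    whose part of speech is in DROPPABLE, keeping the original answer."""
    rng = random.Random(seed)
    extended = []
    for question, answer in samples:
        # A random number per sample gates whether it gets blurred,
        # keeping the overall extension rate controllable.
        if rng.random() >= extension_rate:
            continue
        kept = [w for w in question.split()
                if POS.get(w.lower()) not in DROPPABLE]
        blurred = " ".join(kept)
        if blurred and blurred != question:
            extended.append((blurred, answer))
    return extended

samples = [("how do i reset my password quickly", "Open Settings > Security."),
           ("please reset the password", "Open Settings > Security.")]
print(extend_samples(samples, extension_rate=1.0))
```

The extended pairs would then be appended to the original set before training, so the model sees both the full and the blurred phrasings of each question.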
-
Publication number: 20210225357
Abstract: An example intent-recognition system comprises a processor and memory storing instructions. The instructions cause the processor to receive speech input comprising spoken words, generate text results based on the speech input, and generate acoustic feature annotations based on the speech input. The instructions also cause the processor to apply an intent model to the text results and the acoustic feature annotations to recognize an intent based on the speech input. An example system for adapting an emotional text-to-speech model comprises a processor and memory. The memory stores instructions that cause the processor to receive training examples comprising speech input and receive labelling data comprising emotion information associated with the speech input. The instructions also cause the processor to extract audio signal vectors from the training examples and generate an emotion-adapted voice font model based on the audio signal vectors and the labelling data.
Type: Application
Filed: June 7, 2017
Publication date: July 22, 2021
Applicant: Microsoft Technology Licensing, LLC
Inventors: Pei Zhao, Kaisheng Yao, Max Leung, Bo Yan, Jian Luan, Yu Shi, Malone Ma, Mei-Yuh Hwang
-
Publication number: 20210210101
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining voice characteristics are provided. One of the methods includes: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
Type: Application
Filed: March 22, 2021
Publication date: July 8, 2021
Inventors: Zhiming Wang, Kaisheng Yao, Xiaolong Li
-
Patent number: 11031018
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for personalized speaker verification are provided. One of the methods includes: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample; feeding the positive sample and the negative sample to a first model for determining voice characteristics to correspondingly output a positive voice characteristic and a negative voice characteristic of the speaker; obtaining a gradient based at least on the positive voice characteristic and the negative voice characteristic; and feeding the gradient to the first model to update one or more parameters of the first model to obtain a second model for personalized speaker verification.
Type: Grant
Filed: December 22, 2020
Date of Patent: June 8, 2021
Assignee: Alipay (Hangzhou) Information Technology Co., Ltd.
Inventors: Zhiming Wang, Kaisheng Yao, Xiaolong Li
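As a sketch of the personalization step described in the claim, the code below stands in a linear map for the first model, contrasts a positive and a negative embedding, and applies one gradient update to produce the personalized second model. The loss, the numerical gradient, and all dimensions are illustrative assumptions, not the patented method:

```python
import numpy as np

def embed(W, x):
    """First model: a linear map standing in for the voice-characteristics
    network (purely illustrative)."""
    v = W @ x
    return v / np.linalg.norm(v)

def personalize(W, pos, neg, lr=0.1):
    """One gradient step: push the speaker's positive embedding toward the
    enrolled voiceprint and the impostor's negative embedding away,
    yielding a second, speaker-personalized model."""
    voiceprint = embed(W, pos)   # treat the positive embedding as the target
    def loss(Wx):
        p, n = embed(Wx, pos), embed(Wx, neg)
        return -voiceprint @ p + voiceprint @ n   # simple contrastive term
    # Numerical gradient for brevity; a real system backpropagates.
    g = np.zeros_like(W)
    eps = 1e-5
    for i in np.ndindex(W.shape):
        d = np.zeros_like(W)
        d[i] = eps
        g[i] = (loss(W + d) - loss(W - d)) / (2 * eps)
    return W - lr * g

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 16))             # first model's parameters
pos = rng.normal(size=16)                # speech from the target speaker
neg = rng.normal(size=16)                # speech from a different entity
W2 = personalize(W, pos, neg)            # second, personalized model
print("parameter change:", np.abs(W2 - W).max())
```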
-
Patent number: 10997980
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining voice characteristics are provided. One of the methods includes: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
Type: Grant
Filed: October 27, 2020
Date of Patent: May 4, 2021
Assignee: Alipay (Hangzhou) Information Technology Co., Ltd.
Inventors: Zhiming Wang, Kaisheng Yao, Xiaolong Li
-
Publication number: 20210110833
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for personalized speaker verification are provided. One of the methods includes: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample; feeding the positive sample and the negative sample to a first model for determining voice characteristics to correspondingly output a positive voice characteristic and a negative voice characteristic of the speaker; obtaining a gradient based at least on the positive voice characteristic and the negative voice characteristic; and feeding the gradient to the first model to update one or more parameters of the first model to obtain a second model for personalized speaker verification.
Type: Application
Filed: December 22, 2020
Publication date: April 15, 2021
Inventors: Zhiming Wang, Kaisheng Yao, Xiaolong Li
-
Publication number: 20210043216
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining voice characteristics are provided. One of the methods includes: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
Type: Application
Filed: October 27, 2020
Publication date: February 11, 2021
Inventors: Zhiming Wang, Kaisheng Yao, Xiaolong Li
-
Publication number: 20210027177
Abstract: Implementations of the present specification provide a method and an apparatus for extending question-and-answer samples. According to the method, a random number is generated for each existing sample, and the question of any sample whose random number falls within the sample-extension range is blurred to generate an extended sample, so that the overall blurring-extension rate can be effectively controlled. For a sample selected for blurring extension, the question is extended by deleting words with a predetermined part of speech, and an extended sample is then generated from the extended question, making more ways of expressing a question compatible. A question-and-answer model trained on a sample set augmented with these extended samples can therefore answer users more effectively.
Type: Application
Filed: March 13, 2020
Publication date: January 28, 2021
Inventors: Kaisheng Yao, Jiaxing Zhang, Jia Liu, Xiaolong Li
-
Patent number: 10867597
Abstract: Technologies pertaining to slot filling are described herein. A deep neural network, a recurrent neural network, and/or a spatio-temporally deep neural network are configured to assign labels to words in a word sequence set forth in natural language. At least one label is a semantic label that is assigned to at least one word in the word sequence.
Type: Grant
Filed: September 2, 2013
Date of Patent: December 15, 2020
Assignee: Microsoft Technology Licensing, LLC
Inventors: Anoop Deoras, Kaisheng Yao, Xiaodong He, Li Deng, Geoffrey Gerson Zweig, Ruhi Sarikaya, Dong Yu, Mei-Yuh Hwang, Gregoire Mesnil
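A minimal sketch of RNN-based slot filling, assuming a toy vocabulary, an IOB-style tag set borrowed from the ATIS convention, and untrained random weights; it shows the shape of the computation (one hidden state and one semantic label per word), not the patented network:

```python
import numpy as np

WORDS = {"book": 0, "flight": 1, "to": 2, "boston": 3}
TAGSET = ["O", "B-toloc.city_name"]          # semantic slot labels

def rnn_tag(tokens, E, Wx, Wh, Wy):
    """Elman-style RNN tagger: update one hidden state per word and emit
    a label at every time step."""
    h = np.zeros(Wh.shape[0])
    tags = []
    for w in tokens:
        x = E[WORDS[w]]                      # word embedding lookup
        h = np.tanh(Wx @ x + Wh @ h)         # recurrent state update
        tags.append(TAGSET[int(np.argmax(Wy @ h))])
    return tags

rng = np.random.default_rng(3)
E = rng.normal(size=(len(WORDS), 8))         # untrained word embeddings
Wx = rng.normal(size=(16, 8))
Wh = rng.normal(size=(16, 16))
Wy = rng.normal(size=(len(TAGSET), 16))
print(rnn_tag(["book", "flight", "to", "boston"], E, Wx, Wh, Wy))
```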
-
Publication number: 20200167388
Abstract: Data storage and calling methods and devices are provided. One of the methods includes: receiving first motion data and business data; establishing an association relationship between the first motion data and the business data and storing the association relationship; receiving second motion data; and determining first motion data that matches the second motion data, and returning, to a sender of the second motion data, business data associated with the matched first motion data.
Type: Application
Filed: January 28, 2020
Publication date: May 28, 2020
Inventors: Kaisheng Yao, Peng Xu, Yuan Qi, Xiaofu Chang
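One plausible reading of the store-and-call flow, sketched as an in-memory class: motion traces are stored alongside their business data, and a query trace is matched by cosine similarity against the stored traces. The similarity threshold and vector representation are assumptions for illustration:

```python
import numpy as np

class MotionStore:
    """In-memory sketch: associate motion traces with business data, then
    answer queries by nearest-neighbor matching over stored traces."""
    def __init__(self, threshold=0.9):
        self.motions, self.payloads = [], []
        self.threshold = threshold

    def store(self, motion, business_data):
        # Establish and keep the motion -> business-data association.
        self.motions.append(motion / np.linalg.norm(motion))
        self.payloads.append(business_data)

    def call(self, motion):
        # Find the stored first-motion entry that best matches the query.
        q = motion / np.linalg.norm(motion)
        sims = [q @ m for m in self.motions]
        best = int(np.argmax(sims))
        return self.payloads[best] if sims[best] >= self.threshold else None

rng = np.random.default_rng(4)
wave = rng.normal(size=32)        # e.g. inertial-sensor trace of a gesture
store = MotionStore()
store.store(wave, {"action": "pay", "account": "demo"})
print(store.call(wave + 0.01 * rng.normal(size=32)))  # near-match returns data
```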
-
Publication number: 20190294632
Abstract: Data storage and calling methods and devices are provided. One of the methods includes: receiving first motion data and business data; establishing an association relationship between the first motion data and the business data and storing the association relationship; receiving second motion data; and determining first motion data that matches the second motion data, and returning, to a sender of the second motion data, business data associated with the matched first motion data.
Type: Application
Filed: June 12, 2019
Publication date: September 26, 2019
Inventors: Kaisheng Yao, Peng Xu, Yuan Qi, Xiaofu Chang
-
Patent number: 10127901
Abstract: The technology relates to converting text to speech utilizing recurrent neural networks (RNNs). The recurrent neural networks may be implemented as multiple modules for determining properties of the text. In embodiments, a part-of-speech RNN module, a letter-to-sound RNN module, a linguistic prosody tagger RNN module, and a context awareness and semantic mining RNN module may all be utilized. The properties from the RNN modules are processed by a hyper-structure RNN module that determines the phonetic properties of the input text based on the outputs of the other RNN modules. The hyper-structure RNN module may generate a generation sequence that is capable of being converted to audible speech by a speech synthesizer. The generation sequence may also be optimized by a global optimization module prior to being synthesized into audible speech.
Type: Grant
Filed: June 13, 2014
Date of Patent: November 13, 2018
Assignee: Microsoft Technology Licensing, LLC
Inventors: Pei Zhao, Max Leung, Kaisheng Yao, Bo Yan, Sheng Zhao, Fileno A. Alleva
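A shape-level sketch of the module arrangement described above, assuming three interchangeable property RNNs whose per-step outputs are concatenated and fed to a combining RNN standing in for the hyper-structure module; all dimensions and weights are invented, and no module here is trained:

```python
import numpy as np

rng = np.random.default_rng(5)

def rnn_module(seq, W, U, dim=8):
    """Stand-in for one property RNN (POS, letter-to-sound, prosody, ...):
    maps an input sequence to one property vector per time step."""
    h = np.zeros(dim)
    out = []
    for x in seq:
        h = np.tanh(W @ x + U @ h)
        out.append(h)
    return np.stack(out)

def hyper_structure(property_seqs, V, U):
    """Combine per-module property sequences into a generation sequence."""
    joint = np.concatenate(property_seqs, axis=1)   # per-step concatenation
    h = np.zeros(U.shape[0])
    gen = []
    for step in joint:
        h = np.tanh(V @ step + U @ h)
        gen.append(h)
    return np.stack(gen)   # would be consumed by a speech synthesizer

text = rng.normal(size=(5, 4))           # 5 characters, toy embeddings
modules = [(rng.normal(size=(8, 4)), rng.normal(size=(8, 8)))
           for _ in range(3)]
props = [rnn_module(text, W, U) for W, U in modules]
V = rng.normal(size=(8, 8 * len(modules)))
U = rng.normal(size=(8, 8))
print("generation sequence shape:", hyper_structure(props, V, U).shape)
```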
-
Patent number: 10089974
Abstract: An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
Type: Grant
Filed: March 31, 2016
Date of Patent: October 2, 2018
Assignee: Microsoft Technology Licensing, LLC
Inventors: Pei Zhao, Kaisheng Yao, Max Leung, Bo Yan
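The pronunciation-difference idea lends itself to a small sketch: align the phoneme sequence recognized from speech against the one predicted from text, collect the disagreements as rewrite rules, and apply them as a toy conversion model. The phoneme strings and the single-rule conversion are illustrative simplifications:

```python
import difflib
from collections import Counter

def pronunciation_difference(from_speech, from_text):
    """Align the two phoneme sequences and collect where they disagree;
    these differences would drive the conversion model."""
    rules = Counter()
    sm = difflib.SequenceMatcher(a=from_text, b=from_speech)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            rules[(tuple(from_text[i1:i2]), tuple(from_speech[j1:j2]))] += 1
    return rules

def convert(seq, rules):
    """Toy conversion model: apply the most frequent non-empty rewrite."""
    candidates = [(s, d) for (s, d), _ in rules.most_common() if s]
    if not candidates:
        return seq
    src, dst = candidates[0]
    out, i = [], 0
    while i < len(seq):
        if tuple(seq[i:i + len(src)]) == src:
            out.extend(dst)
            i += len(src)
        else:
            out.append(seq[i])
            i += 1
    return out

# Training pair: the speech side realizes the vowel as /aa/, the text side /ey/.
speech_pron = ["t", "ah", "m", "aa", "t", "ow"]
text_pron   = ["t", "ah", "m", "ey", "t", "ow"]
rules = pronunciation_difference(speech_pron, text_pron)
print(convert(["t", "ey", "m", "ey", "t", "ow"], rules))
```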
-
Publication number: 20170287465
Abstract: An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
Type: Application
Filed: March 31, 2016
Publication date: October 5, 2017
Applicant: Microsoft Technology Licensing, LLC
Inventors: Pei Zhao, Kaisheng Yao, Max Leung, Bo Yan
-
Publication number: 20160307565
Abstract: Aspects of the technology described herein relate to a new type of deep neural network (DNN), described herein as a deep neural support vector machine (DNSVM). Traditional DNNs use multinomial logistic regression (softmax activation) at the top layer to train that layer and the underlying layers. The new DNN instead uses a support vector machine (SVM) as one or more layers, including the top layer. The technology described herein can use one of two training algorithms to train the DNSVM, learning the parameters of the SVM and the DNN under the maximum-margin criterion. The first training method is frame-level training, in which the new model is shown to be related to the multi-class SVM with DNN features. The second training method is sequence-level training, which is related to the structured SVM with DNN features and HMM state-transition features.
Type: Application
Filed: February 16, 2016
Publication date: October 20, 2016
Inventors: Chaojun Liu, Kaisheng Yao, Yifan Gong, Shixiong Zhang
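A frame-level sketch of the idea, assuming a small random MLP as the lower DNN layers and a Crammer-Singer-style multi-class hinge as the SVM top layer's criterion; the dimensions (13 as a stand-in for an MFCC frame) and weights are invented:

```python
import numpy as np

def dnn_features(x, W1, W2):
    """Lower DNN layers producing features for the SVM top layer."""
    return np.tanh(W2 @ np.tanh(W1 @ x))

def multiclass_hinge(scores, y, margin=1.0):
    """Frame-level criterion: Crammer-Singer multi-class hinge, the
    max-margin objective an SVM top layer optimizes."""
    wrong = np.delete(scores, y)
    return max(0.0, margin + wrong.max() - scores[y])

rng = np.random.default_rng(6)
W1 = rng.normal(size=(32, 13))    # e.g. a 13-dim MFCC frame in
W2 = rng.normal(size=(16, 32))
svm_W = rng.normal(size=(3, 16))  # SVM weights, one row per state class

frame = rng.normal(size=13)
feats = dnn_features(frame, W1, W2)
scores = svm_W @ feats
print("hinge loss for class 0:", multiclass_hinge(scores, 0))
```

Sequence-level training would replace the per-frame hinge with a structured margin over whole label sequences, adding HMM state-transition scores to the per-frame SVM scores.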
-
Publication number: 20160091965
Abstract: A "Natural Motion Controller" identifies various motions of one or more parts of a user's body to interact with electronic devices, thereby enabling various natural user interface (NUI) scenarios. The Natural Motion Controller constructs composite motion recognition windows by concatenating an adjustable number of sequential periods of inertial sensor data received from a plurality of separate sets of inertial sensors. Each of these separate sets of inertial sensors is coupled to, or otherwise provides sensor data relating to, a separate mobile computing device that the user wears, carries, or holds. Each composite motion recognition window is then passed to a motion recognition model trained by one or more machine-based deep learning processes. This motion recognition model is applied to the composite motion recognition windows to identify a sequence of one or more predefined motions. Identified motions are then used as the basis for triggering execution of one or more application commands.
Type: Application
Filed: September 30, 2014
Publication date: March 31, 2016
Inventors: Jiaping Wang, Yujia Li, Xuedong Huang, Lingfeng Wu, Wei Xiong, Kaisheng Yao, Geoffrey Zweig
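A minimal sketch of the composite-window construction, assuming two devices (a watch and a phone), four sequential periods per window, and six-dimensional inertial samples; the stub recognizer with random weights stands in for the deep-learned motion model:

```python
import numpy as np
from collections import deque

class CompositeWindow:
    """Concatenate the last `periods` inertial samples from every device
    into one composite motion-recognition window."""
    def __init__(self, devices, periods=4, dims=6):  # 6 = 3-axis accel + gyro
        self.buffers = {d: deque(maxlen=periods) for d in devices}
        self.periods = periods
        self.dims = dims

    def push(self, device, sample):
        self.buffers[device].append(np.asarray(sample))

    def window(self):
        if any(len(b) < self.periods for b in self.buffers.values()):
            return None                              # not enough data yet
        return np.concatenate([np.concatenate(list(b))
                               for b in self.buffers.values()])

def recognize(window, W):
    """Stub for the deep-learned motion recognition model."""
    motions = ["none", "raise_wrist", "shake"]
    return motions[int(np.argmax(W @ window))]

rng = np.random.default_rng(7)
cw = CompositeWindow(["watch", "phone"])
for _ in range(4):                                   # four sequential periods
    cw.push("watch", rng.normal(size=6))
    cw.push("phone", rng.normal(size=6))
w = cw.window()
W = rng.normal(size=(3, w.size))
print(recognize(w, W))   # an identified motion could trigger an app command
```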
-
Patent number: 9280969
Abstract: Techniques and systems for training an acoustic model are described. In an embodiment, a technique for training an acoustic model includes dividing a corpus of training data that includes transcription errors into N parts and, on each part, decoding an utterance with an incremental acoustic model and an incremental language model to produce a decoded transcription. The technique may further include inserting silence between pairs of words in the decoded transcription and aligning the original transcription corresponding to the utterance with the decoded transcription according to time for each part. The technique may further include selecting a segment from the utterance having at least Q contiguous matching aligned words and training the incremental acoustic model with the selected segment. The trained incremental acoustic model may then be used on a subsequent part of the training data. Other embodiments are described and claimed.
Type: Grant
Filed: June 10, 2009
Date of Patent: March 8, 2016
Assignee: Microsoft Technology Licensing, LLC
Inventors: Jinyu Li, Yifan Gong, Chaojun Liu, Kaisheng Yao
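The segment-selection step has a natural small-scale analogue: align the original and decoded transcriptions and keep only stretches of at least Q contiguous matching words. The sketch below uses difflib for the alignment and invented transcriptions; it ignores the silence-insertion and timing details of the actual technique:

```python
import difflib

def select_segments(original, decoded, q=3):
    """Keep stretches where the original and decoded transcriptions agree
    for at least q contiguous words; those segments are trusted enough to
    train the next incremental acoustic model."""
    sm = difflib.SequenceMatcher(a=original, b=decoded)
    segments = []
    for op, i1, i2, _, _ in sm.get_opcodes():
        if op == "equal" and i2 - i1 >= q:
            segments.append(original[i1:i2])
    return segments

original = "we will meet at noon on friday in the main hall".split()
decoded  = "we will meet at new on friday in the main hall".split()
for seg in select_segments(original, decoded):
    print(" ".join(seg))   # feed these segments to the next training pass
```

The mismatched word ("noon" decoded as "new") splits the utterance, and only the two reliable stretches on either side of it survive as training material.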