Patents by Inventor Frank Kao-Ping Soong

Frank Kao-Ping Soong has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20110071835
    Abstract: Embodiments of small footprint text-to-speech engine are disclosed. In operation, the small footprint text-to-speech engine generates a set of feature parameters for an input text. The set of feature parameters includes static feature parameters and delta feature parameters. The small footprint text-to-speech engine then derives a saw-tooth stochastic trajectory that represents the speech characteristics of the input text based on the static feature parameters and the delta parameters. Finally, the small footprint text-to-speech engine produces a smoothed trajectory from the saw-tooth stochastic trajectory, and generates synthesized speech based on the smoothed trajectory.
    Type: Application
    Filed: September 22, 2009
    Publication date: March 24, 2011
    Applicant: MICROSOFT CORPORATION
    Inventors: Yi-Ning Chen, Zhi-Jie Yan, Frank Kao-Ping Soong
  • Publication number: 20110054903
    Abstract: Embodiments of rich context modeling for speech synthesis are disclosed. In operation, a text-to-speech engine refines a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models. The text-to-speech engine then generates synthesized speech for an input text based at least on some of the plurality of refined rich context models.
    Type: Application
    Filed: December 2, 2009
    Publication date: March 3, 2011
    Applicant: MICROSOFT CORPORATION
    Inventors: Zhi-Jie Yan, Yao Qian, Frank Kao-Ping Soong
  • Patent number: 7885456
    Abstract: A forward pass through a sequence of strokes representing a handwritten equation is performed from the first stroke to the last stroke in the sequence. At each stroke, a path score is determined for a plurality of symbol-relation pairs that each represents a symbol and its spatial relation to a predecessor symbol. A symbol graph having nodes and links is constructed by backtracking through the strokes from the last stroke to the first stroke and assigning scores to the links based on the path scores for the symbol-relation pairs. The symbol graph is used to recognize a mathematical expression based in part on the scores for the links and the mathematical expression is stored.
    Type: Grant
    Filed: March 29, 2007
    Date of Patent: February 8, 2011
    Assignee: Microsoft Corporation
    Inventors: Yu Shi, Frank Kao-Ping Soong, Jian-Lai Zhou, Dongmei Zhang, legal representative
  • Patent number: 7844457
    Abstract: Methods are disclosed for automatic accent labeling without manually labeled data. The methods are designed to exploit accent distribution between function and content words.
    Type: Grant
    Filed: February 20, 2007
    Date of Patent: November 30, 2010
    Assignee: Microsoft Corporation
    Inventors: YiNing Chen, Frank Kao-ping Soong, Min Chu
  • Patent number: 7805301
    Abstract: A reliable full covariance matrix estimation algorithm for a pattern unit's state output distribution in a pattern recognition system is discussed. An intermediate hierarchical tree structure is built to relate models for pattern units. Full covariance matrices of a pattern unit's state output distribution are estimated based on all the related nodes in the tree.
    Type: Grant
    Filed: July 1, 2005
    Date of Patent: September 28, 2010
    Assignee: Microsoft Corporation
    Inventors: Ye Tian, Frank Kao-Ping Soong, Jian-Lai Zhou
  • Publication number: 20100198577
    Abstract: Creation of sub-phonemic Hidden Markov Model (HMM) states and the mapping of those states result in improved cross-language speaker adaptation. The smaller sub-phonemic mapping provides improvements in usability and intelligibility, particularly between languages with few common phonemes. HMM states of different languages may be mapped to one another using a distance between the HMM states in acoustic space. This distance may be calculated using Kullback-Leibler divergence and multi-space probability distribution. By combining distance mapping and context mapping for different speakers of the same language, improved cross-language speaker adaptation is possible.
    Type: Application
    Filed: February 3, 2009
    Publication date: August 5, 2010
    Applicant: Microsoft Corporation
    Inventors: Yi-Ning Chen, Yao Qian, Frank Kao-Ping Soong
  • Publication number: 20100166314
    Abstract: Methods and apparatuses for generating, by a computing device configured to interpret a handwritten expression, a symbol graph to represent strokes associated with the handwritten expression, are described herein. The symbol graph may include nodes, each node corresponding to a combination of a stroke and a candidate symbol for that stroke. The computing device may also generate a segment graph based on the symbol graph by combining nodes associated with a same stroke if strokes of their preceding nodes are the same. Also the computing device may perform a structure analysis on at least a subset of segment sequences represented by the segment graph to determine hypotheses for the handwritten expression. In other embodiments, rather than generate a segment graph, the computing device may determine segment sequences by selecting a number of symbol sequences from the symbol graph and combining symbol sequences having the same segmentation.
    Type: Application
    Filed: December 30, 2008
    Publication date: July 1, 2010
    Applicant: MICROSOFT CORPORATION
    Inventors: Yu Shi, Frank Kao-Ping Soong
  • Publication number: 20100082345
    Abstract: An “Animation Synthesizer” uses trainable probabilistic models, such as Hidden Markov Models (HMM), Artificial Neural Networks (ANN), etc., to provide speech and text driven body animation synthesis. Probabilistic models are trained using synchronized motion and speech inputs (e.g., live or recorded audio/video feeds) at various speech levels, such as sentences, phrases, words, phonemes, sub-phonemes, etc., depending upon the available data, and the motion type or body part being modeled. The Animation Synthesizer then uses the trainable probabilistic model for selecting animation trajectories for one or more different body parts (e.g., face, head, hands, arms, etc.) based on an arbitrary text and/or speech input. These animation trajectories are then used to synthesize a sequence of animations for digital avatars, cartoon characters, computer generated anthropomorphic persons or creatures, actual motions for physical robots, etc.
    Type: Application
    Filed: September 26, 2008
    Publication date: April 1, 2010
    Applicant: MICROSOFT CORPORATION
    Inventors: Lijuan Wang, Lei Ma, Frank Kao-Ping Soong
  • Patent number: 7689408
    Abstract: The language of origin of a word or named entity is predicted using estimates of frequency of occurrence of the word or named entity in different languages. In one embodiment, the normalized frequency of occurrence of the word or named entity in a variety of different languages is estimated and the values are used as features in a feature vector which is scored and used to identify language of origin.
    Type: Grant
    Filed: September 1, 2006
    Date of Patent: March 30, 2010
    Assignee: Microsoft Corporation
    Inventors: Yi Ning Chen, Min Chu, Jiali You, Frank Kao-Ping Soong
  • Patent number: 7689421
    Abstract: Described is a voice persona service by which users convert text into speech waveforms, based on user-provided parameters and voice data from a service data store. The service may be remotely accessed, such as via the Internet. The user may provide text tagged with parameters, with the text sent to a text-to-speech engine along with base or custom voice data, and the resulting waveform morphed based on the tags. The user may also provide speech. Once created, a voice persona corresponding to the speech waveform may be persisted, exchanged, made public, shared and so forth. In one example, the voice persona service receives user input and parameters, and retrieves a base or custom voice that may be edited by the user via a morphing algorithm. The service outputs a waveform, such as a .wav file for embedding in a software program, and persists the voice persona corresponding to that waveform.
    Type: Grant
    Filed: June 27, 2007
    Date of Patent: March 30, 2010
    Assignee: Microsoft Corporation
    Inventors: Yusheng Li, Min Chu, Xin Zou, Frank Kao-ping Soong
  • Publication number: 20100066742
    Abstract: Described is a technology by which the prosody of synthesized speech may be changed by varying data associated with that speech. An interface displays a visual representation of synthesized speech as one or more waveforms, along with the corresponding text from which the speech was synthesized. The user may interact with the visual representation to change data corresponding to the prosody, e.g., to change duration, pitch and/or loudness data, with respect to a part (or all) of the speech. The part of the speech that may be varied may comprise a phoneme, a morpheme, a syllable, a word, a phrase, and/or a sentence. The changed speech can be played back to hear the change in prosody resulting from the interactive changes. The user can also change the text and hear/see newly synthesized speech, which may then be similarly edited to change data that corresponds to that speech's prosody.
    Type: Application
    Filed: September 18, 2008
    Publication date: March 18, 2010
    Applicant: MICROSOFT CORPORATION
    Inventors: Yao Qian, Frank Kao-ping Soong
  • Patent number: 7680657
    Abstract: Possible segmentations for an audio signal are scored based on distortions for feature vectors of the audio signal and the total number of segments in the segmentation. The scores are used to select a segmentation and the selected segmentation is used to identify a starting point and an ending point for a speech signal in the audio signal.
    Type: Grant
    Filed: August 15, 2006
    Date of Patent: March 16, 2010
    Assignee: Microsoft Corporation
    Inventors: Yu Shi, Frank Kao-ping Soong, Jian-Lai Zhou
  • Patent number: 7680664
    Abstract: A multi-state pattern recognition model with non-uniform kernel allocation is formed by setting a number of states for a multi-state pattern recognition model and assigning different numbers of kernels to different states. The kernels are then trained using training data to form the multi-state pattern recognition model.
    Type: Grant
    Filed: August 16, 2006
    Date of Patent: March 16, 2010
    Assignee: Microsoft Corporation
    Inventors: Peng Liu, Jian-Lai Zhou, Frank Kao-ping Soong
  • Publication number: 20090324082
    Abstract: An exemplary method includes receiving stroke information for a partially written East Asian character, the East Asian character representable by one or more radicals; based on the stroke information, selecting a radical on a prefix tree wherein the prefix tree branches to East Asian characters as end states; identifying one or more East Asian characters as end states that correspond to the selected radical for the partially written East Asian character; and receiving user input to verify that one of the identified one or more East Asian characters is the end state for the partially written East Asian character. In such a method, the selection of a radical can occur using radical-based hidden Markov models. Various other exemplary methods, devices, systems, etc., are also disclosed.
    Type: Application
    Filed: June 26, 2008
    Publication date: December 31, 2009
    Applicant: Microsoft Corporation
    Inventors: Peng Liu, Lei Ma, Frank Kao-Ping Soong
  • Publication number: 20090245646
    Abstract: One way of recognizing online handwritten mathematical expressions is to use a one-pass dynamic programming based symbol decoding generation algorithm. This method embeds segmentation into symbol identification to form a unified framework for symbol recognition. Along with decoding, a symbol graph is produced. Besides accurately recognizing handwritten mathematical expressions, this method can produce high quality symbol graphs. This method uses six knowledge source models to help search for possible symbol hypotheses during the decoding process. Here, knowledge source exponential weights and a symbol insertion penalty are used to weigh the various knowledge source model probabilities to increase accuracy.
    Type: Application
    Filed: March 28, 2008
    Publication date: October 1, 2009
    Applicant: Microsoft Corporation
    Inventors: Yu Shi, Frank Kao-Ping Soong
  • Publication number: 20090240501
    Abstract: Described is a technology by which artificial words are generated based on seed words, and then used with a letter-to-sound conversion model. To generate an artificial word, a stressed syllable of a seed word is replaced with a different syllable, such as a candidate (artificial) syllable, when the phonemic structure and/or graphonemic structure of the stressed syllable and the candidate syllable match one another. In one aspect, the artificial words are provided for use with a letter-to-sound conversion model, which may be used to generate artificial phonemes from a source of words, such as in conjunction with other models. If the phonemes provided by the various models for a selected source word are in agreement relative to one another, the selected source word and an associated artificial phoneme may be added to a training set which may then be used to retrain the letter-to-sound conversion model.
    Type: Application
    Filed: March 19, 2008
    Publication date: September 24, 2009
    Applicant: MICROSOFT CORPORATION
    Inventors: Yi Ning Chen, Jia Li You, Frank Kao-ping Soong
  • Publication number: 20090228273
    Abstract: A speech recognition result is displayed for review by a user. If it is incorrect, the user provides pen-based editing marks. An error type and location (within the speech recognition result) are identified based on the pen-based editing marks. An alternative result template is generated, and an N-best alternative list is also generated by applying the template to intermediate recognition results from an automatic speech recognizer. The N-best alternative list is output for use in correcting the speech recognition results.
    Type: Application
    Filed: March 5, 2008
    Publication date: September 10, 2009
    Applicant: MICROSOFT CORPORATION
    Inventors: Lijuan Wang, Frank Kao-Ping Soong
  • Publication number: 20090214117
    Abstract: Described is a bimodal data input technology by which handwriting recognition results are combined with speech recognition results to improve overall recognition accuracy. Handwriting data and speech data corresponding to mathematical symbols are received and processed (including being recognized) into respective graphs. A fusion mechanism uses the speech graph to enhance the handwriting graph, e.g., to better distinguish between similar handwritten symbols that are often misrecognized. The graphs include nodes representing symbols, and arcs between the nodes representing probability scores. When arcs in the first and second graphs are determined to match one another, such as aligned in time and associated with corresponding symbols, the probability score in the second graph for that arc is used to adjust the matching probability score in the first graph. Normalization and smoothing may be performed to correspond the graphs to one another and to control the influence of one graph on the other.
    Type: Application
    Filed: February 26, 2008
    Publication date: August 27, 2009
    Applicant: MICROSOFT CORPORATION
    Inventors: Lei Ma, Yu Shi, Frank Kao-ping Soong
  • Publication number: 20090177601
    Abstract: Described is a technology by which personal information that comes into a computer system is intelligently managed according to current state data including user presence and/or user attention data. Incoming information is processed against the state data to determine whether corresponding data is to be output, and if so, what output modality or modalities to use. For example, if a user is present and busy, a notification may be blocked or deferred to avoid disturbing the user. Cost analysis may be used to determine the cost of outputting the data. In addition to user state data, the importance of the information, other state data, the cost of converting data to another format for output (e.g., text-to-speech), and/or user preference data, may factor into the decision. The output data may be modified (e.g., audio made louder) based on a current output environment as determined via the state data.
    Type: Application
    Filed: January 8, 2008
    Publication date: July 9, 2009
    Applicant: MICROSOFT CORPORATION
    Inventors: Chao Huang, Chunhui Zhang, Frank Kao-ping Soong, Zhengyou Zhang, Yuan Kong
  • Publication number: 20090083036
    Abstract: Described is a technology by which synthesized speech generated from text is evaluated against a prosody model (trained offline) to determine whether the speech will sound unnatural. If so, the speech is regenerated with modified data. The evaluation and regeneration may be iterative until deemed natural sounding. For example, text is built into a lattice that is then (e.g., Viterbi) searched to find a best path. The sections (e.g., units) of data on the path are evaluated via a prosody model. If the evaluation deems a section to correspond to unnatural prosody, that section is replaced, e.g., by modifying/pruning the lattice and re-performing the search. Replacement may be iterative until all sections pass the evaluation. Unnatural prosody detection may be biased such that during evaluation, unnatural prosody is falsely detected at a higher rate relative to a rate at which unnatural prosody is missed.
    Type: Application
    Filed: September 20, 2007
    Publication date: March 26, 2009
    Applicant: Microsoft Corporation
    Inventors: Yong Zhao, Frank Kao-ping Soong, Min Chu, Lijuan Wang
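
Illustrative note on publication 20100198577: the abstract says HMM states of different languages may be mapped to one another using a distance in acoustic space, calculated with Kullback-Leibler divergence. The sketch below is one possible reading of that idea and is not taken from the patent text. It assumes each state is a single diagonal-covariance Gaussian, uses a symmetrised KL divergence as the distance, and leaves out the multi-space probability distribution that the abstract also mentions. The function names (kl_diag_gauss, symmetric_kl, map_states) and the toy state names and values are hypothetical.

```python
import math


def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for two diagonal-covariance Gaussians (closed form)."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total


def symmetric_kl(state_a, state_b):
    """Symmetrised KL divergence, used here as the distance between two states."""
    (mean_a, var_a), (mean_b, var_b) = state_a, state_b
    return (kl_diag_gauss(mean_a, var_a, mean_b, var_b)
            + kl_diag_gauss(mean_b, var_b, mean_a, var_a))


def map_states(source_states, target_states):
    """Map each source-language state to its nearest target-language state."""
    mapping = {}
    for src_name, src in source_states.items():
        mapping[src_name] = min(
            target_states,
            key=lambda tgt_name: symmetric_kl(src, target_states[tgt_name]),
        )
    return mapping


# Hypothetical toy states: name -> (mean vector, variance vector).
source = {"en_s1": ([0.0, 0.0], [1.0, 1.0]), "en_s2": ([5.0, 5.0], [1.0, 1.0])}
target = {"zh_s1": ([4.5, 5.2], [1.0, 1.0]), "zh_s2": ([0.2, -0.1], [1.0, 1.0])}

print(map_states(source, target))
```

With these toy values, each source state is paired with the target state whose mean lies closest to its own, since all variances are equal. A real system would work with trained sub-phonemic states and, per the abstract, would combine this distance mapping with context mapping.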