Patents by Inventor Frank Kao-Ping Soong

Frank Kao-Ping Soong has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

ROBUST AUTHENTICATION OF DIGITAL AUDIO

Publication number: 20240203431

Abstract: Solutions for authenticating digital audio include: generating a first band-limited watermark using a first key, generating a second band-limited watermark using a second key, wherein the bandwidth of the second watermark does not overlap with the bandwidth of the first watermark; and embedding the first watermark and the second watermark into a segment of the digital audio file. Solutions also include determining a first watermark score of a segment of the digital audio file for the first watermark using the first key; determining a second watermark score of the segment of the digital audio file for the second watermark using the second key; based on at least the first watermark score and the second watermark score, determining a probability that the digital audio file is watermarked; and generating a report indicating whether the digital audio file is watermarked. In some examples, solutions may also embed and decode messages.

Type: Application

Filed: May 8, 2021

Publication date: June 20, 2024

Inventors: Yang CUI, Ke WANG, Lei HE, Frank Kao-Ping SOONG
Audio human interactive proof based on text-to-speech and semantics

Patent number: 10319363

Abstract: The text-to-speech audio HIP technique described herein in some embodiments uses different correlated or uncorrelated words or sentences generated via a text-to-speech engine as audio HIP challenges. The technique can apply different effects in the text-to-speech synthesizer speaking a sentence to be used as a HIP challenge string. The different effects can include, for example, spectral frequency warping; vowel duration warping; background addition; echo addition; and varying the time duration between words, among others. In some embodiments the technique varies the set of parameters to prevent using Automated Speech Recognition tools from using previously used audio HIP challenges to learn a model which can then be used to recognize future audio HIP challenges generated by the technique. Additionally, in some embodiments the technique introduces the requirement of semantic understanding in HIP challenges.

Type: Grant

Filed: February 17, 2012

Date of Patent: June 11, 2019

Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC

Inventors: Yao Qian, Frank Kao-Ping Soong, Bin Benjamin Zhu
Minimum converted trajectory error (MCTE) audio-to-video engine

Patent number: 8751228

Abstract: Embodiments of an audio-to-video engine are disclosed. In operation, the audio-to-video engine generates facial movement (e.g., a virtual talking head) based on an input speech. The audio-to-video engine receives the input speech and recognizes the input speech as a source feature vector. The audio-to-video engine then determines a Maximum A Posterior (MAP) mixture sequence based on the source feature vector. The MAP mixture sequence may be a function of a refined Gaussian Mixture Model (GMM). The audio-to-video engine may then use the MAP to estimate video feature parameters. The video feature parameters are then interpreted as facial movement. The facial movement may be stored as data to a storage module and/or it may be displayed as video to a display device.

Type: Grant

Filed: November 4, 2010

Date of Patent: June 10, 2014

Assignee: Microsoft Corporation

Inventors: Lijuan Wang, Frank Kao-Ping Soong
Name synthesis

Patent number: 8719027

Abstract: An automated method of providing a pronunciation of a word to a remote device is disclosed. The method includes receiving an input indicative of the word to be pronounced. The method further includes searching a database having a plurality of records. Each of the records has an indication of a textual representation and an associated indication of an audible representation. At least one output is provided to the remote device of an audible representation of the word to be pronounced.

Type: Grant

Filed: February 28, 2007

Date of Patent: May 6, 2014

Assignee: Microsoft Corporation

Inventors: Yining Chen, Yusheng Li, Min Chu, Frank Kao-Ping Soong
EVALUATING TEXT-TO-SPEECH INTELLIGIBILITY USING TEMPLATE CONSTRAINED GENERALIZED POSTERIOR PROBABILITY

Publication number: 20140025381

Abstract: Instead of relying on humans to subjectively evaluate speech intelligibility of a subject, a system objectively evaluates the speech intelligibility. The system receives speech input and calculates confidence scores at multiple different levels using a Template Constrained Generalized Posterior Probability algorithm. One or multiple intelligibility classifiers are utilized to classify the desired entities on an intelligibility scale. A specific intelligibility classifier utilizes features such as the various confidence scores. The scale of the intelligibility classification can be adjusted to suit the application scenario. Based on the confidence score distributions and the intelligibility classification results at multiple levels an overall objective intelligibility score is calculated. The objective intelligibility scores can be used to rank different subjects or systems being assessed according to their intelligibility levels. The speech that is below a predetermined intelligibility (e.g.

Type: Application

Filed: July 20, 2012

Publication date: January 23, 2014

Applicant: MICROSOFT CORPORATION

Inventors: Linfang Wang, Yan Teng, Lijuan Wang, Frank Kao-Ping Soong, Zhe Geng, William Brad Waller, Mark Tillman Hanson
Frame mapping approach for cross-lingual voice transformation

Patent number: 8594993

Abstract: Frame mapping-based cross-lingual voice transformation may transform a target speech corpus in a particular language into a transformed target speech corpus that remains recognizable, and has the voice characteristics of a target speaker that provided the target speech corpus. A formant-based frequency warping is performed on the fundamental frequencies and the linear predictive coding (LPC) spectrums of source speech waveforms in a first language to produce transformed fundamental frequencies and transformed LPC spectrums. The transformed fundamental frequencies and the transformed LPC spectrums are then used to generate warped parameter trajectories. The warped parameter trajectories are further used to transform the target speech waveforms in the second language to produce transformed target speech waveform with voice characteristics of the first language that nevertheless retain at least some voice characteristics of the target speaker.

Type: Grant

Filed: April 4, 2011

Date of Patent: November 26, 2013

Assignee: Microsoft Corporation

Inventors: Yao Qian, Frank Kao-Ping Soong
Unnatural prosody detection in speech synthesis

Patent number: 8583438

Abstract: Described is a technology by which synthesized speech generated from text is evaluated against a prosody model (trained offline) to determine whether the speech will sound unnatural. If so, the speech is regenerated with modified data. The evaluation and regeneration may be iterative until deemed natural sounding. For example, text is built into a lattice that is then (e.g., Viterbi) searched to find a best path. The sections (e.g., units) of data on the path are evaluated via a prosody model. If the evaluation deems a section to correspond to unnatural prosody, that section is replaced, e.g., by modifying/pruning the lattice and re-performing the search. Replacement may be iterative until all sections pass the evaluation. Unnatural prosody detection may be biased such that during evaluation, unnatural prosody is falsely detected at a higher rate relative to a rate at which unnatural prosody is missed.

Type: Grant

Filed: September 20, 2007

Date of Patent: November 12, 2013

Assignee: Microsoft Corporation

Inventors: Yong Zhao, Frank Kao-ping Soong, Min Chu, Lijuan Wang
Character auto-completion for online east asian handwriting input

Patent number: 8542927

Abstract: An exemplary method includes receiving stroke information for a partially written East Asian character, the East Asian character representable by one or more radicals; based on the stroke information, selecting a radical on a prefix tree wherein the prefix tree branches to East Asian characters as end states; identifying one or more East Asian characters as end states that correspond to the selected radical for the partially written East Asian character; and receiving user input to verify that one of the identified one or more East Asian characters is the end state for the partially written East Asian character. In such a method, the selection of a radical can occur using radical-based hidden Markov models. Various other exemplary methods, devices, systems, etc., are also disclosed.

Type: Grant

Filed: June 26, 2008

Date of Patent: September 24, 2013

Assignee: Microsoft Corporation

Inventors: Peng Liu, Lei Ma, Frank Kao-Ping Soong
AUDIO HUMAN INTERACTIVE PROOF BASED ON TEXT-TO-SPEECH AND SEMANTICS

Publication number: 20130218566

Abstract: The text-to-speech audio HIP technique described herein in some embodiments uses different correlated or uncorrelated words or sentences generated via a text-to-speech engine as audio HIP challenges. The technique can apply different effects in the text-to-speech synthesizer speaking a sentence to be used as a HIP challenge string. The different effects can include, for example, spectral frequency warping; vowel duration warping; background addition; echo addition; and varying the time duration between words, among others. In some embodiments the technique varies the set of parameters to prevent using Automated Speech Recognition tools from using previously used audio HIP challenges to learn a model which can then be used to recognize future audio HIP challenges generated by the technique. Additionally, in some embodiments the technique introduces the requirement of semantic understanding in HIP challenges.

Type: Application

Filed: February 17, 2012

Publication date: August 22, 2013

Applicant: MICROSOFT CORPORATION

Inventors: Yao Qian, Frank Kao-Ping Soong, Bin Benjamin Zhu
Position-dependent phonetic models for reliable pronunciation identification

Patent number: 8355917

Abstract: A representation of a speech signal is received and is decoded to identify a sequence of position-dependent phonetic tokens wherein each token comprises a phone and a position indicator that indicates the position of the phone within a syllable.

Type: Grant

Filed: February 1, 2012

Date of Patent: January 15, 2013

Assignee: Microsoft Corporation

Inventors: Peng Liu, Yu Shi, Frank Kao-ping Soong
Rich context modeling for text-to-speech engines

Patent number: 8340965

Abstract: Embodiments of rich context modeling for speech synthesis are disclosed. In operation, a text-to-speech engine refines a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models. The text-to-speech engine then generates synthesized speech for an input text based at least on some of the plurality of refined rich context models.

Type: Grant

Filed: December 2, 2009

Date of Patent: December 25, 2012

Assignee: Microsoft Corporation

Inventors: Zhi-Jie Yan, Yao Qian, Frank Kao-Ping Soong
Talking Teacher Visualization for Language Learning

Publication number: 20120276504

Abstract: A representation of a virtual language teacher assists in language learning. The virtual language teacher may appear as a “talking head” in a video that a student views to practice pronunciation of a foreign language. A system for generating a virtual language teacher receives input text. The system may generate a video showing the virtual language teacher as a talking head having a mouth that moves in synchronization with speech generated from the input text. The video of the virtual language teacher may then be presented to the student.

Type: Application

Filed: April 29, 2011

Publication date: November 1, 2012

Applicant: Microsoft Corporation

Inventors: Gang Chen, Weijiang Xu, Lijuan Wang, Matthew Robert Scott, Frank Kao-Ping Soong, Hao Wei
FRAME MAPPING APPROACH FOR CROSS-LINGUAL VOICE TRANSFORMATION

Publication number: 20120253781

Abstract: Frame mapping-based cross-lingual voice transformation may transform a target speech corpus in a particular language into a transformed target speech corpus that remains recognizable, and has the voice characteristics of a target speaker that provided the target speech corpus. A formant-based frequency warping is performed on the fundamental frequencies and the linear predictive coding (LPC) spectrums of source speech waveforms in a first language to produce transformed fundamental frequencies and transformed LPC spectrums. The transformed fundamental frequencies and the transformed LPC spectrums are then used to generate warped parameter trajectories. The warped parameter trajectories are further used to transform the target speech waveforms in the second language to produce transformed target speech waveform with voice characteristics of the first language that nevertheless retain at least some voice characteristics of the target speaker.

Type: Application

Filed: April 4, 2011

Publication date: October 4, 2012

Applicant: MICROSOFT CORPORATION

Inventors: Yao Qian, Frank Kao-Ping Soong
POSITION-DEPENDENT PHONETIC MODELS FOR RELIABLE PRONUNCIATION IDENTIFICATION

Publication number: 20120191456

Abstract: A representation of a speech signal is received and is decoded to identify a sequence of position-dependent phonetic tokens wherein each token comprises a phone and a position indicator that indicates the position of the phone within a syllable.

Type: Application

Filed: February 1, 2012

Publication date: July 26, 2012

Applicant: MICROSOFT CORPORATION

Inventors: Peng Liu, Yu Shi, Frank Kao-ping Soong
Speech and text driven HMM-based body animation synthesis

Patent number: 8224652

Abstract: An “Animation Synthesizer” uses trainable probabilistic models, such as Hidden Markov Models (HMM), Artificial Neural Networks (ANN), etc., to provide speech and text driven body animation synthesis. Probabilistic models are trained using synchronized motion and speech inputs (e.g., live or recorded audio/video feeds) at various speech levels, such as sentences, phrases, words, phonemes, sub-phonemes, etc., depending upon the available data, and the motion type or body part being modeled. The Animation Synthesizer then uses the trainable probabilistic model for selecting animation trajectories for one or more different body parts (e.g., face, head, hands, arms, etc.) based on an arbitrary text and/or speech input. These animation trajectories are then used to synthesize a sequence of animations for digital avatars, cartoon characters, computer generated anthropomorphic persons or creatures, actual motions for physical robots, etc.

Type: Grant

Filed: September 26, 2008

Date of Patent: July 17, 2012

Assignee: Microsoft Corporation

Inventors: Lijuan Wang, Lei Ma, Frank Kao-Ping Soong
Trajectory Tiling Approach for Text-to-Speech

Publication number: 20120143611

Abstract: Hidden Markov Models HMM trajectory tiling (HTT)-based approaches may be used to synthesize speech from text. In operation, a set of Hidden Markov Models (HMMs) and a set of waveform units may be obtained from a speech corpus. The set of HMMs are further refined via minimum generation error (MGE) training to generate a refined set of HMMs. Subsequently, a speech parameter trajectory may be generated by applying the refined set of HMMs to an input text. A unit lattice of candidate waveform units may be selected from the set of waveform units based at least on the speech parameter trajectory. A normalized cross-correlation (NCC)-based search on the unit lattice may be performed to obtain a minimal concatenation cost sequence of candidate waveform units, which are concatenated into a concatenated waveform sequence that is synthesized into speech.

Type: Application

Filed: December 7, 2010

Publication date: June 7, 2012

Applicant: Microsoft Corporation

Inventors: Yao Qian, Zhi-Jie Yan, Yi-Jian Wu, Frank Kao-Ping Soong
Real-time Animation for an Expressive Avatar

Publication number: 20120130717

Abstract: Techniques for providing real-time animation for a personalized cartoon avatar are described. In one example, a process trains one or more animated models to provide a set of probabilistic motions of one or more upper body parts based on speech and motion data. The process links one or more predetermined phrases that represent emotional states to the one or more animated models. After creation of the models, the process receives real-time speech input. Next, the process identifies an emotional state to be expressed based on the one or more predetermined phrases matching in context to the real-time speech input. The process then generates an animated sequence of motions of the one or more upper body parts by applying the one or more animated models in response to the real-time speech input.

Type: Application

Filed: November 19, 2010

Publication date: May 24, 2012

Applicant: Microsoft Corporation

Inventors: Ning Xu, Lijuan Wang, Frank Kao-Ping Soong, Xiao Liang, Qi Luo, Ying-Qing Xu, Xin Zou
Minimum Converted Trajectory Error (MCTE) Audio-to-Video Engine

Publication number: 20120116761

Abstract: Embodiments of an audio-to-video engine are disclosed. In operation, the audio-to-video engine generates facial movement (e.g., a virtual talking head) based on an input speech. The audio-to-video engine receives the input speech and recognizes the input speech as a source feature vector. The audio-to-video engine then determines a Maximum A Posterior (MAP) mixture sequence based on the source feature vector. The MAP mixture sequence may be a function of a refined Gaussian Mixture Model (GMM). The audio-to-video engine may then use the MAP to estimate video feature parameters. The video feature parameters are then interpreted as facial movement. The facial movement may be stored as data to a storage module and/or it may be displayed as video to a display device.

Type: Application

Filed: November 4, 2010

Publication date: May 10, 2012

Applicant: MICROSOFT CORPORATION

Inventors: Lijuan Wang, Frank Kao-Ping Soong
Position-dependent phonetic models for reliable pronunciation identification

Patent number: 8135590

Abstract: A representation of a speech signal is received and is decoded to identify a sequence of position-dependent phonetic tokens wherein each token comprises a phone and a position indicator that indicates the position of the phone within a syllable.

Type: Grant

Filed: January 11, 2007

Date of Patent: March 13, 2012

Assignee: Microsoft Corporation

Inventors: Peng Liu, Yu Shi, Frank Kao-ping Soong
Handwriting symbol recognition accuracy using speech input

Patent number: 8077975

Abstract: Described is a bimodal data input technology by which handwriting recognition results are combined with speech recognition results to improve overall recognition accuracy. Handwriting data and speech data corresponding to mathematical symbols are received and processed (including being recognized) into respective graphs. A fusion mechanism uses the speech graph to enhance the handwriting graph, e.g., to better distinguish between similar handwritten symbols that are often misrecognized. The graphs include nodes representing symbols, and arcs between the nodes representing probability scores. When arcs in the first and second graphs are determined to match one another, such as aligned in time and associated with corresponding symbols, the probability score in the second graph for that arc is used to adjust the matching probability score in the first graph. Normalization and smoothing may be performed to correspond the graphs to one another and to control the influence of one graph on the other.

Type: Grant

Filed: February 26, 2008

Date of Patent: December 13, 2011

Assignee: Microsoft Corporation

Inventors: Lei Ma, Yu Shi, Frank Kao-ping Soong

1 2 3 next