Patents by Inventor Frank Torsten Bernd Seide

Frank Torsten Bernd Seide has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 8825481
    Abstract: Techniques are described for training a speech recognition model for accented speech. A subword parse table is employed that models mispronunciations at multiple subword levels, such as the syllable, position-specific cluster, and/or phone levels. Mispronunciation probability data is then generated at each level based on inputted training data, such as phone-level annotated transcripts of accented speech. Data from different levels of the subword parse table may then be combined to determine the accented speech model. Mispronunciation probability data at each subword level is based at least in part on context at that level. In some embodiments, phone-level annotated transcripts are generated using a semi-supervised method.
    Type: Grant
    Filed: January 20, 2012
    Date of Patent: September 2, 2014
    Assignee: Microsoft Corporation
    Inventors: Albert Joseph Kishan Thambiratnam, Timo Pascal Mertens, Frank Torsten Bernd Seide
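The multi-level combination this abstract describes can be pictured as a simple interpolation across subword levels. The level names, weights, and probabilities below are illustrative assumptions; the patent does not specify this formula.

```python
# Hypothetical sketch: combining mispronunciation probabilities from
# multiple subword levels (syllable, cluster, phone) by linear
# interpolation. All values here are invented for illustration.

def combine_levels(level_probs, weights):
    """Interpolate per-level mispronunciation probabilities.

    level_probs: dict mapping level name -> P(mispronunciation | context at that level)
    weights:     dict mapping level name -> interpolation weight (sums to 1)
    """
    return sum(weights[lvl] * p for lvl, p in level_probs.items())

# Example: probability that an accented speaker mispronounces a unit,
# estimated independently at three subword levels, then combined.
probs = {"syllable": 0.30, "cluster": 0.25, "phone": 0.40}
weights = {"syllable": 0.2, "cluster": 0.3, "phone": 0.5}
p_mispronounce = combine_levels(probs, weights)
```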
  • Publication number: 20140149468
    Abstract: Described is a technology by which a playback list comprising similar songs is automatically built based on automatically detected/generated song attributes, such as by extracting numeric features of each song. The attributes may be downloaded from a remote connection, and/or may be locally generated on the playback device. To build a playlist, a seed song's attributes may be compared against attributes of other songs to determine which other songs are similar to the seed song and thus included in the playlist. Another way to build a playlist is based on similarity of songs to a set of user-provided attributes, such as those corresponding to moods or usage modes like “resting,” “reading,” “jogging,” or “driving.” The playlist may be dynamically adjusted based on user interaction with the device, such as when a user skips a song, queues a song, or dequeues a song.
    Type: Application
    Filed: February 3, 2014
    Publication date: May 29, 2014
    Applicant: MICROSOFT CORPORATION
    Inventors: Lie Lu, Frank Torsten Bernd Seide, Gabriel White
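The seed-song comparison can be sketched with a toy similarity measure. Cosine similarity over invented feature vectors stands in here for whatever attribute comparison the patent actually claims; song titles and features are made up.

```python
import math

# Toy sketch of attribute-based playlist building (not the patented
# implementation): each song is reduced to a numeric feature vector, and
# the songs closest to the seed song (by cosine similarity) form the list.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def build_playlist(seed, library, size):
    """Rank library songs by similarity to the seed's attribute vector."""
    ranked = sorted(library,
                    key=lambda s: cosine(seed["features"], s["features"]),
                    reverse=True)
    return [s["title"] for s in ranked[:size]]

library = [
    {"title": "Calm Piano", "features": [0.1, 0.9, 0.2]},
    {"title": "Fast Rock",  "features": [0.9, 0.1, 0.8]},
    {"title": "Soft Jazz",  "features": [0.2, 0.8, 0.3]},
]
seed = {"title": "Quiet Strings", "features": [0.15, 0.85, 0.25]}
playlist = build_playlist(seed, library, size=2)
```

The same ranking could be driven by user-provided mood attributes instead of a seed song's features.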
  • Publication number: 20140142929
    Abstract: The use of a pipelined algorithm that performs parallelized computations to train deep neural networks (DNNs) for performing data analysis may reduce training time. The DNNs may be either context-independent or context-dependent. The training may include partitioning training data into sample batches of a specific batch size. The partitioning may be performed based on rates of data transfers between processors that execute the pipelined algorithm, considerations of accuracy and convergence, and the execution speed of each processor. Other techniques for training may include grouping layers of the DNNs for processing on a single processor, distributing a layer of the DNNs to multiple processors for processing, or modifying an execution order of steps in the pipelined algorithm.
    Type: Application
    Filed: November 20, 2012
    Publication date: May 22, 2014
    Applicant: MICROSOFT CORPORATION
    Inventors: Frank Torsten Bernd Seide, Gang Li, Dong Yu, Adam C. Eversole, Xie Chen
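The layer-grouping idea mentioned in the abstract can be sketched as follows. The greedy cost-balancing heuristic, layer costs, and processor count are illustrative assumptions, not the patented partitioning method.

```python
# Illustrative sketch (not the patented algorithm): greedily assign
# consecutive DNN layers to processors so each processor's share of the
# computation cost is roughly balanced, one concern the abstract raises
# for pipelined training.

def group_layers(layer_costs, num_procs):
    """Assign consecutive layer indices to processors, balancing total cost."""
    target = sum(layer_costs) / num_procs
    groups = [[] for _ in range(num_procs)]
    g, running = 0, 0.0
    for i, cost in enumerate(layer_costs):
        # Move to the next processor once this one has its share,
        # unless it is already the last processor.
        if running >= target and g < num_procs - 1:
            g += 1
            running = 0.0
        groups[g].append(i)
        running += cost
    return groups

# Six layers with assumed relative costs, split across two processors.
assignment = group_layers([4, 4, 2, 2, 2, 2], num_procs=2)
```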
  • Patent number: 8700552
    Abstract: Deep Neural Network (DNN) training technique embodiments are presented that train a DNN while exploiting the sparseness of non-zero hidden layer interconnection weight values. Generally, a fully connected DNN is initially trained by sweeping through a full training set a number of times. Then, for the most part, only the interconnections whose weight magnitudes exceed a minimum weight threshold are considered in further training. This minimum weight threshold can be established as a value that results in only a prescribed maximum number of interconnections being considered when setting interconnection weight values via an error back-propagation procedure during the training. It is noted that the continued DNN training tends to converge much faster than the initial training.
    Type: Grant
    Filed: November 28, 2011
    Date of Patent: April 15, 2014
    Assignee: Microsoft Corporation
    Inventors: Dong Yu, Li Deng, Frank Torsten Bernd Seide, Gang Li
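A minimal sketch of the sparseness idea with toy dimensions (the patent does not supply code; names and sizes here are illustrative): keep only the largest-magnitude interconnections, with the threshold chosen so at most a prescribed number survive, and restrict further weight updates to them.

```python
import random

# Toy sketch: prune a trained hidden-layer weight matrix down to its
# largest-magnitude interconnections, then mask further updates so
# pruned connections stay at zero.

random.seed(0)
n = 8
weights = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
max_connections = 16                    # prescribed maximum to keep

# Threshold = magnitude of the max_connections-th largest weight.
magnitudes = sorted(abs(w) for row in weights for w in row)
threshold = magnitudes[-max_connections]
mask = [[abs(w) >= threshold for w in row] for row in weights]

def sparse_update(weights, grad, mask, lr=0.1):
    """One back-propagation-style step applied only to surviving links;
    masked-out entries are forced to zero (this also performs the pruning)."""
    return [[w - lr * g if keep else 0.0
             for w, g, keep in zip(wr, gr, mr)]
            for wr, gr, mr in zip(weights, grad, mask)]

grad = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
weights = sparse_update(weights, grad, mask)
kept = sum(m for row in mask for m in row)
```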
  • Patent number: 8650029
    Abstract: A voice activity detection (VAD) module analyzes a media file, such as an audio file or a video file, to determine whether one or more frames of the media file include speech. A speech recognizer generates feedback relating to an accuracy of the VAD determination. The VAD module leverages the feedback to improve subsequent VAD determinations. The VAD module also utilizes a look-ahead window associated with the media file to adjust estimated probabilities or VAD decisions for previously processed frames.
    Type: Grant
    Filed: February 25, 2011
    Date of Patent: February 11, 2014
    Assignee: Microsoft Corporation
    Inventors: Albert Joseph Kishan Thambiratnam, Weiwu Zhu, Frank Torsten Bernd Seide
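The look-ahead adjustment can be illustrated with a toy smoother. The averaging window, threshold, and probabilities below are assumptions for illustration, not the patented method.

```python
# Illustrative sketch: revise each frame's speech probability using the
# frames that follow it, so an isolated spike inside a silent stretch is
# smoothed away while a sustained speech run survives.

def smooth_with_lookahead(frame_probs, window=2):
    """Average each frame's probability with up to `window` future frames."""
    smoothed = []
    for i in range(len(frame_probs)):
        future = frame_probs[i:i + window + 1]
        smoothed.append(sum(future) / len(future))
    return smoothed

raw = [0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.9]   # per-frame P(speech)
adjusted = smooth_with_lookahead(raw, window=2)
decisions = [p >= 0.5 for p in adjusted]     # final VAD decisions
```

Note how the lone 0.9 at frame 1 no longer triggers a speech decision once future frames are consulted.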
  • Patent number: 8642872
    Abstract: Described is a technology by which a playback list comprising similar songs is automatically built based on automatically detected/generated song attributes, such as by extracting numeric features of each song. The attributes may be downloaded from a remote connection, and/or may be locally generated on the playback device. To build a playlist, a seed song's attributes may be compared against attributes of other songs to determine which other songs are similar to the seed song and thus included in the playlist. Another way to build a playlist is based on similarity of songs to a set of user-provided attributes, such as those corresponding to moods or usage modes like “resting,” “reading,” “jogging,” or “driving.” The playlist may be dynamically adjusted based on user interaction with the device, such as when a user skips a song, queues a song, or dequeues a song.
    Type: Grant
    Filed: June 4, 2008
    Date of Patent: February 4, 2014
    Assignee: Microsoft Corporation
    Inventors: Lie Lu, Frank Torsten Bernd Seide, Gabriel White
  • Publication number: 20130191126
    Abstract: Techniques are described for training a speech recognition model for accented speech. A subword parse table is employed that models mispronunciations at multiple subword levels, such as the syllable, position-specific cluster, and/or phone levels. Mispronunciation probability data is then generated at each level based on inputted training data, such as phone-level annotated transcripts of accented speech. Data from different levels of the subword parse table may then be combined to determine the accented speech model. Mispronunciation probability data at each subword level is based at least in part on context at that level. In some embodiments, phone-level annotated transcripts are generated using a semi-supervised method.
    Type: Application
    Filed: January 20, 2012
    Publication date: July 25, 2013
    Applicant: Microsoft Corporation
    Inventors: Albert Joseph Kishan Thambiratnam, Timo Pascal Mertens, Frank Torsten Bernd Seide
  • Publication number: 20130138589
    Abstract: Deep Neural Network (DNN) training technique embodiments are presented that train a DNN while exploiting the sparseness of non-zero hidden layer interconnection weight values. Generally, a fully connected DNN is initially trained by sweeping through a full training set a number of times. Then, for the most part, only the interconnections whose weight magnitudes exceed a minimum weight threshold are considered in further training. This minimum weight threshold can be established as a value that results in only a prescribed maximum number of interconnections being considered when setting interconnection weight values via an error back-propagation procedure during the training. It is noted that the continued DNN training tends to converge much faster than the initial training.
    Type: Application
    Filed: November 28, 2011
    Publication date: May 30, 2013
    Applicant: MICROSOFT CORPORATION
    Inventors: Dong Yu, Li Deng, Frank Torsten Bernd Seide, Gang Li
  • Publication number: 20130138436
    Abstract: Discriminative pretraining technique embodiments are presented that pretrain the hidden layers of a Deep Neural Network (DNN). In general, a one-hidden-layer neural network is trained first using labels discriminatively with error back-propagation (BP). Then, after discarding an output layer in the previous one-hidden-layer neural network, another randomly initialized hidden layer is added on top of the previously trained hidden layer along with a new output layer that represents the targets for classification or recognition. The resulting multiple-hidden-layer DNN is then discriminatively trained using the same strategy, and so on until the desired number of hidden layers is reached. This produces a pretrained DNN. The discriminative pretraining technique embodiments have the advantage of bringing the DNN layer weights close to a good local optimum, while still leaving them in a range with a high gradient so that they can be fine-tuned effectively.
    Type: Application
    Filed: November 26, 2011
    Publication date: May 30, 2013
    Applicant: MICROSOFT CORPORATION
    Inventors: Dong Yu, Li Deng, Frank Torsten Bernd Seide, Gang Li
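The greedy layer-by-layer procedure can be sketched schematically. Here `train_bp` is only a placeholder for discriminative error back-propagation, and all names and sizes are illustrative, not from the patent.

```python
import random

# Schematic sketch of discriminative pretraining as described above:
# train a one-hidden-layer network, discard its output layer, stack a new
# random hidden layer plus a fresh output layer, retrain, and repeat.

def random_layer(n_in, n_out):
    return [[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def train_bp(layers, labels):
    """Placeholder for discriminative error back-propagation training."""
    return layers  # a real implementation would update the weights here

def discriminative_pretrain(layer_sizes, n_labels, labels=None):
    hidden, n_in = [], layer_sizes[0]
    for n_out in layer_sizes[1:]:
        hidden.append(random_layer(n_in, n_out))   # new random hidden layer
        output = random_layer(n_out, n_labels)     # fresh output layer
        # Train the whole stack discriminatively, then discard the output layer.
        hidden = train_bp(hidden + [output], labels)[:-1]
        n_in = n_out
    return hidden  # pretrained hidden stack, ready for fine-tuning

random.seed(0)
stack = discriminative_pretrain([10, 8, 8, 8], n_labels=4)
```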
  • Publication number: 20120226696
    Abstract: In various embodiments, a transcript that represents a media file is created. Keyword candidates that may represent topics and/or content associated with the media content are then extracted from the transcript. Furthermore, a keyword set may be generated for the media content utilizing a mutual information criterion. In other embodiments, one or more queries may be generated based at least in part on the transcript, and a plurality of web documents may be retrieved based at least in part on the one or more queries. Additional keyword candidates may be extracted from each web document and then ranked. A subset of the keyword candidates may then be selected to form a keyword set associated with the media content.
    Type: Application
    Filed: March 4, 2011
    Publication date: September 6, 2012
    Applicant: MICROSOFT CORPORATION
    Inventors: Albert Joseph Kishan Thambiratnam, Sha Meng, Gang Li, Frank Torsten Bernd Seide
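One way to picture a mutual-information criterion: rank each candidate by how strongly it is associated with the media's topic rather than with language in general. The abstract does not give the exact formulation; the pointwise-MI scoring and probabilities below are invented for illustration.

```python
import math

# Toy sketch (assumed formulation, not the patent's): score each keyword
# candidate by pointwise mutual information between the candidate term
# and the media's topic, from illustrative probability estimates.

def pmi(p_joint, p_term, p_topic):
    """Pointwise mutual information: log2 P(term, topic) / (P(term) P(topic))."""
    return math.log2(p_joint / (p_term * p_topic))

# term: (P(term, topic), P(term), P(topic)) -- illustrative estimates
candidates = {
    "neural": (2e-4, 5e-4, 1e-2),   # rare term, concentrated in the topic
    "the":    (9e-3, 5e-2, 1e-2),   # frequent everywhere, weak association
}
ranked = sorted(candidates, key=lambda t: pmi(*candidates[t]), reverse=True)
```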
  • Publication number: 20120221330
    Abstract: A voice activity detection (VAD) module analyzes a media file, such as an audio file or a video file, to determine whether one or more frames of the media file include speech. A speech recognizer generates feedback relating to an accuracy of the VAD determination. The VAD module leverages the feedback to improve subsequent VAD determinations. The VAD module also utilizes a look-ahead window associated with the media file to adjust estimated probabilities or VAD decisions for previously processed frames.
    Type: Application
    Filed: February 25, 2011
    Publication date: August 30, 2012
    Applicant: MICROSOFT CORPORATION
    Inventors: Albert Joseph Kishan Thambiratnam, Weiwu Zhu, Frank Torsten Bernd Seide
  • Publication number: 20100268534
    Abstract: Described is a technology that provides highly accurate speech-recognized text transcripts of conversations, particularly telephone or meeting conversations. Speech is received for recognition while it is still at high quality and separately for each user, that is, independent of any transmission. Moreover, because the speech is received separately, a personalized recognition model adapted to each user's voice and vocabulary may be used. The separately recognized text is then merged into a transcript of the communication. The transcript may be labeled with the identity of each user that spoke the corresponding speech. The output of the transcript may be dynamic as the conversation takes place, or may occur later, such as contingent upon each user agreeing to release his or her text. The transcript may be incorporated into the text or data of another program, such as to insert it as a thread in a larger email conversation or the like.
    Type: Application
    Filed: April 17, 2009
    Publication date: October 21, 2010
    Applicant: MICROSOFT CORPORATION
    Inventors: Albert Joseph Kishan Thambiratnam, Frank Torsten Bernd Seide, Peng Yu, Roy Geoffrey Wallace
  • Patent number: 7693267
    Abstract: Improved systems and methods are provided for transcribing audio files of voice mails sent over a unified messaging system. Customized grammars specific to a voice mail recipient are created and utilized to transcribe a received voice mail by comparing the audio file to commonly utilized words, names, acronyms, and phrases used by the recipient. Key elements are identified from the resulting text transcription to aid the recipient in processing received voice mails based on the significant content contained in the voice mail.
    Type: Grant
    Filed: December 30, 2005
    Date of Patent: April 6, 2010
    Assignee: Microsoft Corporation
    Inventors: David Andrew Howell, Sridhar Sundararaman, David T. Fong, Frank Torsten Bernd Seide
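A toy illustration of biasing recognition toward a recipient's personal vocabulary. The rescoring scheme, vocabulary, and scores are assumptions for illustration; the patent's customized-grammar mechanism is not specified here.

```python
# Toy sketch (assumed scoring, not the patented grammar mechanism): boost
# recognition hypotheses that contain words from the recipient's personal
# vocabulary of names, acronyms, and frequent terms.

def rescore(hypotheses, personal_vocab, boost=0.2):
    """Add `boost` per personal-vocabulary word, then pick the best hypothesis."""
    rescored = []
    for text, score in hypotheses:
        bonus = sum(boost for w in text.lower().split() if w in personal_vocab)
        rescored.append((text, score + bonus))
    return max(rescored, key=lambda h: h[1])

vocab = {"contoso", "quarterly", "anjali"}          # recipient-specific terms
hyps = [("call me about the court early report", 0.50),
        ("call me about the quarterly report", 0.45)]
best = rescore(hyps, vocab)
```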
  • Patent number: 7617104
    Abstract: A method of speech recognition is provided that determines a production-related value, vocal-tract resonance frequencies in particular, for a state at a particular frame based on the production-related values associated with two preceding frames using a recursion. The production-related value is used to determine a probability distribution of the observed feature vector for the state. A probability for an observed value received for the frame is then determined from the probability distribution. Under one embodiment, the production-related value is determined using a noise-free recursive definition for the value. Use of the recursion substantially improves the decoding speed. When the decoding algorithm is applied to training data with known phonetic transcripts, forced alignment is created which improves the phone segmentation obtained from the prior art.
    Type: Grant
    Filed: January 21, 2003
    Date of Patent: November 10, 2009
    Assignee: Microsoft Corporation
    Inventors: Li Deng, Jian-lai Zhou, Frank Torsten Bernd Seide
  • Publication number: 20090217804
    Abstract: Described is a technology by which a playback list comprising similar songs is automatically built based on automatically detected/generated song attributes, such as by extracting numeric features of each song. The attributes may be downloaded from a remote connection, and/or may be locally generated on the playback device. To build a playlist, a seed song's attributes may be compared against attributes of other songs to determine which other songs are similar to the seed song and thus included in the playlist. Another way to build a playlist is based on similarity of songs to a set of user-provided attributes, such as those corresponding to moods or usage modes like “resting,” “reading,” “jogging,” or “driving.” The playlist may be dynamically adjusted based on user interaction with the device, such as when a user skips a song, queues a song, or dequeues a song.
    Type: Application
    Filed: June 4, 2008
    Publication date: September 3, 2009
    Applicant: MICROSOFT CORPORATION
    Inventors: Lie Lu, Frank Torsten Bernd Seide, Gabriel White
  • Patent number: 7206741
    Abstract: A speech signal is decoded by determining a production-related value for a current state based on an optimal production-related value at the end of a preceding state, the optimal production-related value being selected from a set of continuous values. The production-related value is used to determine a likelihood of a phone being represented by a set of observation vectors that are aligned with a path between the preceding state and the current state. The likelihood of the phone is combined with a score from the preceding state to determine a score for the current state, the score from the preceding state being associated with a discrete class of production-related values wherein the class matches the class of the optimal production-related value.
    Type: Grant
    Filed: December 6, 2005
    Date of Patent: April 17, 2007
    Assignee: Microsoft Corporation
    Inventors: Li Deng, Jian-lai Zhou, Frank Torsten Bernd Seide, Asela J. R. Gunawardana, Hagai Attias, Alejandro Acero, Xuedong Huang
  • Patent number: 7050975
    Abstract: A method of speech recognition is provided that identifies a production-related dynamics value by performing a linear interpolation between a production-related dynamics value at a previous time and a production-related target using a time-dependent interpolation weight. The hidden production-related dynamics value is used to compute a predicted value that is compared to an observed value of acoustics to determine the likelihood of the observed acoustics given a sequence of hidden phonological units. In some embodiments, the production-related dynamics value at the previous time is selected from a set of continuous values. In addition, the likelihood of the observed acoustics given a sequence of hidden phonological units is combined with a score associated with a discrete class of production-related dynamic values at the previous time to determine a score for a current phonological state.
    Type: Grant
    Filed: October 9, 2002
    Date of Patent: May 23, 2006
    Assignee: Microsoft Corporation
    Inventors: Li Deng, Jian-lai Zhou, Frank Torsten Bernd Seide, Asela J. R. Gunawardana, Hagai Attias, Alejandro Acero, Xuedong Huang
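The interpolation step this abstract describes is concrete enough to sketch directly: the hidden dynamics value moves from its previous value toward a phone-dependent target under a time-dependent weight. The weight schedule and target value below are illustrative.

```python
# Sketch of the linear interpolation the abstract describes:
#   z_t = w_t * z_{t-1} + (1 - w_t) * target
# All numeric values here are invented for illustration.

def interpolate_dynamics(z_prev, target, weight_t):
    """One step of time-dependent linear interpolation toward the target."""
    return weight_t * z_prev + (1.0 - weight_t) * target

# Trajectory from 500 toward a target of 1500 (formant-like values in Hz),
# with a decaying weight so later frames move faster toward the target.
z, target = 500.0, 1500.0
trajectory = [z]
for w in [0.9, 0.8, 0.7, 0.6]:
    z = interpolate_dynamics(z, target, w)
    trajectory.append(z)
```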
  • Patent number: 6829578
    Abstract: Robust acoustic tone features are achieved first by the introduction of on-line, look-ahead trace-back of the fundamental frequency (F0) contour with adaptive pruning; this F0 contour serves as the signal preprocessing front-end. The F0 contour is subsequently decomposed into lexical tone effect, phrase intonation effect, and random effect by means of a time-variant, weighted moving average (MA) filter in conjunction with weighted (placing more emphasis on vowels) least squares of the F0 contour. The intonation effect is removed by subtraction from the F0 contour under a superposition assumption. The acoustic tone features are defined in two parts. The first is the coefficients of the second-order weighted regression of the de-intonation of the F0 contour over neighbouring frames. The second part captures the degree of periodicity of the signal: the coefficients of the second-order regression of the auto-correlation.
    Type: Grant
    Filed: July 9, 2001
    Date of Patent: December 7, 2004
    Assignee: Koninklijke Philips Electronics, N.V.
    Inventors: Chang-Han Huang, Frank Torsten Bernd Seide
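The de-intonation step can be illustrated with an unweighted moving average. The patent uses a time-variant, weighted MA filter; the simplification, window length, and F0 values here are for illustration only.

```python
# Rough sketch: estimate the slow phrase-intonation component of the F0
# contour with a moving average, then subtract it so the faster
# lexical-tone movement remains (the de-intonation residual).

def moving_average(f0, half_window):
    out = []
    for i in range(len(f0)):
        lo, hi = max(0, i - half_window), min(len(f0), i + half_window + 1)
        out.append(sum(f0[lo:hi]) / (hi - lo))
    return out

f0 = [220, 230, 210, 225, 205, 215, 200]        # Hz, a toy F0 contour
intonation = moving_average(f0, half_window=2)  # slow phrase component
tone = [f - m for f, m in zip(f0, intonation)]  # de-intonation residual
```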
  • Publication number: 20040143435
    Abstract: A method of speech recognition is provided that determines a production-related value, vocal-tract resonance frequencies in particular, for a state at a particular frame based on the production-related values associated with two preceding frames using a recursion. The production-related value is used to determine a probability distribution of the observed feature vector for the state. A probability for an observed value received for the frame is then determined from the probability distribution. Under one embodiment, the production-related value is determined using a noise-free recursive definition for the value. Use of the recursion substantially improves the decoding speed. When the decoding algorithm is applied to training data with known phonetic transcripts, forced alignment is created which improves the phone segmentation obtained from the prior art.
    Type: Application
    Filed: January 21, 2003
    Publication date: July 22, 2004
    Inventors: Li Deng, Jian-lai Zhou, Frank Torsten Bernd Seide
  • Publication number: 20040019483
    Abstract: A method of speech recognition is provided that identifies a production-related dynamics value by performing a linear interpolation between a production-related dynamics value at a previous time and a production-related target using a time-dependent interpolation weight. The hidden production-related dynamics value is used to compute a predicted value that is compared to an observed value of acoustics to determine the likelihood of the observed acoustics given a sequence of hidden phonological units. In some embodiments, the production-related dynamics value at the previous time is selected from a set of continuous values. In addition, the likelihood of the observed acoustics given a sequence of hidden phonological units is combined with a score associated with a discrete class of production-related dynamic values at the previous time to determine a score for a current phonological state.
    Type: Application
    Filed: October 9, 2002
    Publication date: January 29, 2004
    Inventors: Li Deng, Jian-lai Zhou, Frank Torsten Bernd Seide, Asela J. R. Gunawardana, Hagai Attias, Alejandro Acero, Xuedong Huang