Neural Network Patents (Class 704/259)
-
Patent number: 12118323
Abstract: An approach for generating an optimized video of a speaker, translated from a source language into a target language, with the speaker's lips synchronized to the translated speech, while balancing the quality of the translation against the quality of the lip synchronization. A source video may be fed into a neural machine translation model. The model may synthesize a plurality of potential translations. The translations may be received by a generative adversarial network, which generates video for each translation and classifies the translations as in-sync or out-of-sync. A lip-syncing score may be generated for each of the generated videos that are classified as in-sync.
Type: Grant
Filed: September 23, 2021
Date of Patent: October 15, 2024
Assignee: International Business Machines Corporation
Inventors: Sathya Santhar, Sridevi Kannan, Sarbajit K. Rakshit, Samuel Mathew Jawaharlal
-
Patent number: 12033644
Abstract: Captured vocals may be automatically transformed using advanced digital signal processing techniques that provide captivating applications, and even purpose-built devices, in which mere novice user-musicians may generate, audibly render and share musical performances. In some cases, the automated transformations allow spoken vocals to be segmented, arranged, temporally aligned with a target rhythm, meter or accompanying backing tracks and pitch corrected in accord with a score or note sequence. Speech-to-song music applications are one such example. In some cases, spoken vocals may be transformed in accord with musical genres such as rap using automated segmentation and temporal alignment techniques, often without pitch correction. Such applications, which may employ different signal processing and different automated transformations, may nonetheless be understood as speech-to-rap variations on the theme.
Type: Grant
Filed: September 20, 2021
Date of Patent: July 9, 2024
Assignee: SMULE, INC.
Inventors: Parag Chordia, Mark Godfrey, Alexander Rae, Prerna Gupta, Perry R. Cook
-
Patent number: 11847561
Abstract: Computer-implemented techniques can include obtaining, by a client computing device, a digital media item and a request for a processing task on the digital media item, and determining a set of operating parameters based on (i) available computing resources at the client computing device and (ii) a condition of a network. Based on the set of operating parameters, the client computing device or a server computing device can select one of a plurality of artificial neural networks (ANNs), each ANN defining which portions of the processing task are to be performed by the client and server computing devices. The client and server computing devices can coordinate processing of the processing task according to the selected ANN. The client computing device can also obtain final processing results corresponding to a final evaluation of the processing task and generate an output based on the final processing results.
Type: Grant
Filed: November 25, 2020
Date of Patent: December 19, 2023
Assignee: GOOGLE LLC
Inventors: Matthew Sharifi, Jakob Nicolaus Foerster
-
Patent number: 11842722
Abstract: Disclosed is a speech synthesis method including: acquiring fundamental frequency information and acoustic feature information from original speech; generating an impulse train from the fundamental frequency information and inputting it to a harmonic time-varying filter; inputting the acoustic feature information into a neural network filter estimator to obtain corresponding impulse response information; generating a noise signal by a noise generator; determining, by the harmonic time-varying filter, harmonic component information through filtering processing on the impulse train and the impulse response information; determining, by a noise time-varying filter, noise component information based on the impulse response information and the noise signal; and generating a synthesized speech from the harmonic component information and the noise component information.
Type: Grant
Filed: June 9, 2021
Date of Patent: December 12, 2023
Assignee: AI SPEECH CO., LTD.
Inventors: Kai Yu, Zhijun Liu, Kuan Chen
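A minimal NumPy sketch of the harmonic-plus-noise source-filter idea this abstract describes: an impulse train is built from F0, filtered for the harmonic component, and summed with filtered noise. The function names, filter lengths, and toy impulse responses are hypothetical stand-ins for the patent's neural filter estimator, not its actual implementation.

```python
import numpy as np

def impulse_train(f0_hz: float, duration_s: float, sr: int = 16000) -> np.ndarray:
    """Place unit impulses at pitch-period intervals derived from F0."""
    n = int(duration_s * sr)
    train = np.zeros(n)
    period = int(sr / f0_hz)              # samples per pitch period
    train[::period] = 1.0
    return train

def synthesize(f0_hz, duration_s, harmonic_ir, noise_ir, sr=16000, rng=None):
    """Harmonic part: filtered impulse train. Noise part: filtered white noise.
    In the patent the impulse responses come from a neural estimator conditioned
    on acoustic features; here they are simply given arrays."""
    rng = rng or np.random.default_rng(0)
    excitation = impulse_train(f0_hz, duration_s, sr)
    noise = rng.standard_normal(excitation.shape[0])
    harmonic = np.convolve(excitation, harmonic_ir, mode="same")
    noisy = np.convolve(noise, noise_ir, mode="same")
    return harmonic + noisy               # summed components form the waveform

# Toy impulse responses standing in for the neural filter estimator's output.
speech = synthesize(120.0, 0.5, harmonic_ir=np.hanning(64), noise_ir=np.ones(8) / 8)
```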
-
Patent number: 11816565
Abstract: Methods and apparatus are disclosed for interpreting a deep neural network (DNN) using a Semantic Coherence Analysis (SCA)-based interpretation technique. In embodiments, a multi-layered DNN that was trained for one task is analyzed using the SCA technique to select one layer in the DNN that produces salient features for another task. In embodiments, the DNN layers are tested with test samples labeled with a set of concept labels. The output features of a DNN layer are gathered and analyzed according to the concepts. In embodiments, the output is scored with a semantic coherence score, which indicates how well the layer separates the concepts, and one layer is selected from the DNN based on its semantic coherence score. In some embodiments, a support vector machine (SVM) or additional neural network may be added to the selected layer and trained to generate classification results based on the outputs of the selected layer.
Type: Grant
Filed: February 17, 2020
Date of Patent: November 14, 2023
Assignee: Apple Inc.
Inventors: Moussa Doumbouya, Xavier Suau Cuadros, Luca Zappella, Nicholas E. Apostoloff
-
Patent number: 11769482
Abstract: The present disclosure provides a method and apparatus of synthesizing a speech, a method and apparatus of training a speech synthesis model, an electronic device, and a storage medium. The method of synthesizing a speech includes: acquiring style information of the speech to be synthesized, tone information of the speech to be synthesized, and content information of a text to be processed; generating acoustic feature information of the text to be processed, by using a pre-trained speech synthesis model, based on the style information, the tone information, and the content information; and synthesizing the speech for the text to be processed based on the acoustic feature information.
Type: Grant
Filed: September 29, 2021
Date of Patent: September 26, 2023
Assignee: Beijing Baidu Netcom Science Technology Co., Ltd.
Inventors: Wenfu Wang, Tao Sun, Xilei Wang, Junteng Zhang, Zhengkun Gao, Lei Jia
-
Patent number: 11670292
Abstract: An electronic device comprising circuitry configured to perform voice enhancement based on a transcript to obtain an enhanced audio signal.
Type: Grant
Filed: February 6, 2020
Date of Patent: June 6, 2023
Assignee: SONY CORPORATION
Inventors: Fabien Cardinaux, Marc Ferras Font
-
Patent number: 11625880
Abstract: According to a first aspect of this specification, there is described a computer-implemented method of tagging video frames. The method comprises generating, using a frame tagging model, a tag for each of a plurality of frames of an animation sequence. The frame tagging model comprises a first neural network portion configured to process, for each frame of the plurality of frames, a plurality of features associated with the frame and generate an encoded representation for the frame. The frame tagging model further comprises a second neural network portion configured to receive input comprising the encoded representations of each frame and generate output indicative of a tag for each of the plurality of frames.
Type: Grant
Filed: February 9, 2021
Date of Patent: April 11, 2023
Assignee: Electronic Arts Inc.
Inventor: Elaheh Akhoundi
-
Patent number: 11449790
Abstract: A computer-implemented method for controlling a device based on an ensemble model can include: receiving sensing information associated with a user's biometric state; inputting first sensing information to a first model, determining a first uncertainty of the first model, and generating a first weight value for weighting a first result value; inputting second sensing information into a second model, determining a second uncertainty of the second model, and generating a second weight value for weighting a second result value; generating a final result value based on combining the first result value weighted by the first weight value and the second result value weighted by the second weight value; generating a predicted biometric state of the user based on the final result value; and executing an operation of the device based on the predicted biometric state.
Type: Grant
Filed: October 23, 2018
Date of Patent: September 20, 2022
Assignee: LG ELECTRONICS INC.
Inventors: Gyuseog Hong, Taehwan Kim, Byunghun Choi
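The weighting scheme described above can be illustrated with a small sketch. This is illustrative only: the abstract does not specify how weights are derived from uncertainties, so the inverse-uncertainty weighting below (and every function name) is an assumption.

```python
import numpy as np

def ensemble_predict(x1, x2, model1, model2, uncertainty1, uncertainty2):
    """Weight each model's result by the inverse of its estimated uncertainty,
    so the more confident model dominates the combined prediction."""
    r1, u1 = model1(x1), uncertainty1(x1)
    r2, u2 = model2(x2), uncertainty2(x2)
    w1, w2 = 1.0 / u1, 1.0 / u2
    return (w1 * r1 + w2 * r2) / (w1 + w2)   # normalized weighted combination

# Toy stand-ins: two "biometric" regressors with fixed uncertainty estimates.
final = ensemble_predict(
    x1=np.array([0.2]), x2=np.array([0.7]),
    model1=lambda x: 60 + 40 * x, model2=lambda x: 58 + 44 * x,
    uncertainty1=lambda x: 0.5, uncertainty2=lambda x: 2.0,
)
```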
-
Patent number: 11393453
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
Type: Grant
Filed: January 13, 2021
Date of Patent: July 19, 2022
Assignee: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
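A minimal sketch of the duration-to-frames step: a predicted syllable duration is quantized into fixed-length pitch frames, each sampling the predicted contour. The 5 ms frame length and all names are assumptions, not values from the patent.

```python
import numpy as np

FRAME_MS = 5.0  # each predicted pitch frame covers a fixed 5 ms (assumed value)

def pitch_frames(duration_ms: float, contour) -> np.ndarray:
    """Turn a syllable's predicted duration and pitch contour into a sequence
    of fixed-length pitch frames: one contour sample per frame."""
    n_frames = max(1, round(duration_ms / FRAME_MS))
    # Evaluate the contour at each frame's relative position in [0, 1].
    positions = (np.arange(n_frames) + 0.5) / n_frames
    return contour(positions)

# Toy contour: pitch falls linearly from 220 Hz to 180 Hz across the syllable.
frames = pitch_frames(120.0, lambda t: 220.0 - 40.0 * t)  # -> 24 frames
```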
-
Patent number: 11335322
Abstract: The present technology relates to a learning device, a learning method, a voice synthesis device, and a voice synthesis method configured so that information can be provided via voice whose contents are easy for the user being addressed to understand. A learning device according to one embodiment of the present technology performs voice recognition of the speech of a plurality of users, estimates the statuses under which each utterance is made, and learns, on the basis of the speech voice data, the voice recognition results, and the statuses when the speech is made, voice synthesis data to be used for generating synthesized voice according to the statuses at synthesis time. Moreover, a voice synthesis device estimates statuses, and uses the voice synthesis data to generate synthesized voice that conveys the contents of predetermined text data in a manner suited to the estimated statuses. The present technology can be applied to an agent device.
Type: Grant
Filed: February 27, 2018
Date of Patent: May 17, 2022
Assignee: SONY CORPORATION
Inventors: Hiro Iwase, Mari Saito, Shinichi Kawano
-
Patent number: 11322135
Abstract: An example system includes a processor to receive a linguistic sequence and a prosody info offset. The processor can generate, via a trained prosody info predictor, combined prosody info including a number of observations based on the linguistic sequence. The observations include linear combinations of statistical measures evaluating a prosodic component over a predetermined period of time. The processor can generate, via a trained neural network, an acoustic sequence based on the combined prosody info, the prosody info offset, and the linguistic sequence.
Type: Grant
Filed: September 12, 2019
Date of Patent: May 3, 2022
Assignee: International Business Machines Corporation
Inventor: Vyacheslav Shechtman
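A hypothetical sketch of what "linear combinations of statistical measures of a prosodic component" plus an offset might look like for F0. The specific statistics, combination weights, and offset semantics are assumptions for illustration.

```python
import numpy as np

def prosody_observations(f0: np.ndarray) -> np.ndarray:
    """Statistical measures of a prosodic component (here F0) over a window,
    plus one linear combination of them, as the 'observations'."""
    mean, std = f0.mean(), f0.std()
    slope = np.polyfit(np.arange(len(f0)), f0, 1)[0]   # linear trend per frame
    combined = 0.5 * mean + 0.3 * std + 0.2 * slope    # example combination
    return np.array([mean, std, slope, combined])

f0_window = np.array([210.0, 214.0, 220.0, 218.0, 212.0])
obs = prosody_observations(f0_window)
offset = np.array([5.0, 0.0, 0.0, 0.0])   # user-supplied prosody info offset
conditioned = obs + offset                 # what the acoustic model would see
```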
-
Patent number: 11289068
Abstract: The disclosure provides a method, an apparatus, a device, and a computer-readable storage medium for speech synthesis in parallel. The method includes: splitting a piece of text into a plurality of segments; and, based on the piece of text, obtaining a plurality of initial hidden states of the plurality of segments for a recurrent neural network. The method further includes synthesizing the plurality of segments in parallel based on the plurality of initial hidden states and input features of the plurality of segments.
Type: Grant
Filed: May 14, 2020
Date of Patent: March 29, 2022
Assignee: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) CO., LTD.
Inventors: Wenfu Wang, Chenxi Sun, Tao Sun, Xi Chen, Guibin Wang, Lei Jia
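The key idea is that if each segment gets its own predicted initial hidden state, the recurrent network no longer needs the previous segment's final state, so segments can run concurrently. A toy sketch, assuming a plain tanh RNN cell and randomly generated stand-in states:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

H = 8  # hidden size (toy)

def rnn_segment(inputs: np.ndarray, h0: np.ndarray, W, U) -> np.ndarray:
    """Run a simple tanh RNN over one segment, starting from its predicted
    initial hidden state instead of the previous segment's final state."""
    h, outs = h0, []
    for x in inputs:
        h = np.tanh(W @ x + U @ h)
        outs.append(h)
    return np.stack(outs)

rng = np.random.default_rng(0)
W, U = rng.normal(size=(H, 4)) * 0.1, rng.normal(size=(H, H)) * 0.1
segments = [rng.normal(size=(6, 4)) for _ in range(3)]      # 3 text segments
init_states = [rng.normal(size=H) * 0.1 for _ in segments]  # predicted h0 per segment

# Because each segment has its own initial state, they can run in parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda s: rnn_segment(s[0], s[1], W, U),
                            zip(segments, init_states)))
```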
-
Patent number: 11264010
Abstract: A method for providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word, and selecting a mel spectral embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. For each phoneme, using the selected mel spectral embedding, the method also includes: predicting a duration of the corresponding phoneme by encoding linguistic features of the corresponding phoneme with a corresponding syllable embedding for the syllable that includes the corresponding phoneme; and generating a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme. Each fixed-length predicted mel-frequency spectrogram frame represents mel-spectral information of the corresponding phoneme.
Type: Grant
Filed: November 8, 2019
Date of Patent: March 1, 2022
Assignee: Google LLC
Inventors: Robert Andrew James Clark, Chun-an Chan, Vincent Ping Leung Wan
-
Patent number: 11238899
Abstract: A computer system configured to generate an audio description of a media file is provided. The system includes a display, a memory, and a processor coupled to the display and the memory. The memory stores a media file, including video data that is accessible via a time index and audio data synchronized with the video data by the time index, as well as a transcript of the audio data, including transcription data synchronized with the video data via the time index. The processor is configured to: render, via the display, images from portions of the video data; render text from portions of the transcription data in synchrony with the images; receive input identifying a point within the time index; receive input specifying audio description data to associate with the point; store, in the memory, the audio description data; and store an association between the audio description data and the point.
Type: Grant
Filed: January 10, 2020
Date of Patent: February 1, 2022
Assignee: 3Play Media Inc.
Inventors: Joshua Miller, Christopher S. Antunes, Lily Megan Berthold-Bond, Andrew H. Schwartz, Kelly J. Savietta, Lanya Lee Butler, Sharon Lee Tomasulo, Jeremy E. Barron, Christopher E. Johnson, Roger S. Zimmerman
-
Patent number: 11080520
Abstract: Computer-implemented techniques are provided for machine recognition of gestures and transformation of recognized gestures to text or speech, for gestures communicated in a sign language such as American Sign Language (ASL). In an embodiment, a computer-implemented method comprises: storing a training dataset comprising a plurality of digital images of sign language gestures and an alphabetical letter assigned to each digital image of the plurality of digital images; training a neural network using the plurality of digital images of sign language gestures as input and the alphabetical letter assigned to each digital image as output; receiving a particular digital image comprising a particular sign language gesture; and using the trained neural network to classify the particular digital image as a particular alphabetical letter.
Type: Grant
Filed: June 26, 2019
Date of Patent: August 3, 2021
Assignees: ATLASSIAN PTY LTD., ATLASSIAN INC.
Inventor: Xiaoyuan Gu
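To make the image-to-letter training loop concrete, here is a deliberately tiny sketch: a single softmax layer trained with cross-entropy on flattened toy "images" stands in for the patent's neural network, and all data and dimensions are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LETTERS, IMG = 26, 16 * 16          # 26 letter classes, flattened 16x16 images

# Toy training set standing in for labeled gesture photos.
images = rng.normal(size=(520, IMG))
labels = np.repeat(np.arange(N_LETTERS), 20)

W = np.zeros((IMG, N_LETTERS))        # single softmax layer as the "network"
for _ in range(100):                  # plain gradient-descent training loop
    logits = images @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(labels)), labels] -= 1.0          # dL/dlogits for CE loss
    W -= 0.01 * images.T @ p / len(labels)

def classify(img: np.ndarray) -> str:
    """Map one gesture image to its predicted alphabetical letter."""
    return chr(ord("A") + int(np.argmax(img @ W)))
```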
-
Patent number: 11030394
Abstract: A keyphrase extraction service implements techniques for determining a set of keyphrases associated with a set of words. A word is selected from the set of words, and a neural model is used to determine a label for the word based on features of the word and labels corresponding to other words of the set of words. The set of keyphrases is determined from the labels associated with the set of words.
Type: Grant
Filed: May 4, 2017
Date of Patent: June 8, 2021
Assignee: Amazon Technologies, Inc.
Inventors: Zornitsa Petrova Kozareva, Sheng Zha, Hyokun Yun
-
Patent number: 10978042
Abstract: The present disclosure discloses a method and apparatus for generating a speech synthesis model. A specific embodiment of the method comprises: acquiring a text characteristic of a text and an acoustic characteristic of a speech corresponding to the text, both used for training a neural network corresponding to a speech synthesis model, where fundamental frequency data in the acoustic characteristic is extracted through a fundamental frequency data extraction model, and the fundamental frequency data extraction model is generated by pre-training a neural network corresponding to the fundamental frequency data extraction model on speech in which each frame has corresponding fundamental frequency data; and training the neural network corresponding to the speech synthesis model using the text characteristic of the text and the acoustic characteristic of the speech corresponding to the text.
Type: Grant
Filed: August 3, 2018
Date of Patent: April 13, 2021
Assignee: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) CO., LTD.
Inventor: Hao Li
-
Patent number: 10971170
Abstract: Methods, systems, and computer program products for generating, from an input character sequence, an output sequence of audio data representing the input character sequence. The output sequence of audio data includes a respective audio output sample for each of a number of time steps. One example method includes, for each of the time steps: generating a mel-frequency spectrogram for the time step by processing a representation of a respective portion of the input character sequence using a decoder neural network; generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and selecting the audio output sample for the time step from the possible audio output samples in accordance with the probability distribution.
Type: Grant
Filed: August 8, 2018
Date of Patent: April 6, 2021
Assignee: Google LLC
Inventors: Yonghui Wu, Jonathan Shen, Ruoming Pang, Ron J. Weiss, Michael Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Russell John Wyatt Skerry-Ryan, Ryan M. Rifkin, Ioannis Agiomyrgiannakis
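A sketch of the per-time-step pipeline: decoder produces a mel frame, vocoder turns that frame into a distribution over quantized samples, and one sample is drawn. Both networks are replaced here by trivial stubs; dimensions, names, and the 256-level quantization are assumptions, not the patent's models.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MELS, N_LEVELS = 80, 256   # mel bins; 8-bit-style quantized sample levels

def decoder_step(char_repr: np.ndarray) -> np.ndarray:
    """Stub decoder network: map a character representation to one mel frame."""
    return np.tanh(char_repr.mean()) * np.ones(N_MELS)

def vocoder_step(mel: np.ndarray) -> np.ndarray:
    """Stub vocoder network: map a mel frame to a distribution over samples."""
    logits = rng.normal(size=N_LEVELS) + mel.mean()
    p = np.exp(logits - logits.max())
    return p / p.sum()

audio = []
for t in range(100):                           # one audio sample per time step
    char_repr = rng.normal(size=16)            # stand-in for encoded characters
    dist = vocoder_step(decoder_step(char_repr))
    audio.append(rng.choice(N_LEVELS, p=dist)) # sample from the distribution
```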
-
Patent number: 10971131
Abstract: The present disclosure discloses a method and apparatus for generating a speech synthesis model. A specific embodiment of the method comprises: acquiring a plurality of types of training samples, each type including a text of that type and a speech of the text, read by an announcer corresponding to that type in the style of speech corresponding to that type; and training a neural network corresponding to a speech synthesis model using the plurality of types of training samples and an annotation of the style of speech in each of the training samples to obtain the speech synthesis model, the speech synthesis model being used to synthesize speech of the announcer corresponding to each of the plurality of types in a plurality of styles.
Type: Grant
Filed: August 3, 2018
Date of Patent: April 6, 2021
Assignee: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) CO., LTD.
Inventor: Yongguo Kang
-
Patent number: 10963781
Abstract: In one embodiment, an audio signal for an audio track is received and segmented into a plurality of segments of the audio signal. The plurality of segments of audio are input into a classification network that is configured to predict output values based on a plurality of genre and mood combinations formed from different combinations of a plurality of genres and a plurality of moods. The classification network predicts a set of output values for the plurality of segments, each of the set of output values corresponding to one or more of the plurality of genre and mood combinations. One or more of the plurality of genre and mood combinations are assigned to the audio track based on the set of output values for one or more of the plurality of segments.
Type: Grant
Filed: August 14, 2017
Date of Patent: March 30, 2021
Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventors: Oren Barkan, Noam Koenigstein, Nir Nice
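A minimal sketch of the segment-then-pool flow: the track is split into fixed-length segments, a (stubbed) network scores each segment against every genre-mood combination, and combinations whose pooled score clears a threshold are assigned to the track. The genres, moods, pooling rule, and threshold are all illustrative assumptions.

```python
import numpy as np

GENRES = ["rock", "jazz", "hiphop"]
MOODS = ["happy", "calm", "dark"]
COMBOS = [(g, m) for g in GENRES for m in MOODS]   # 9 genre-mood classes

def segment(signal: np.ndarray, seg_len: int) -> np.ndarray:
    """Split the track into fixed-length segments for the classifier."""
    n = len(signal) // seg_len
    return signal[: n * seg_len].reshape(n, seg_len)

def assign_combos(seg_scores: np.ndarray, threshold: float = 0.5):
    """Average per-segment scores over the track, then keep every genre-mood
    combination whose pooled score clears the threshold."""
    pooled = seg_scores.mean(axis=0)
    return [c for c, s in zip(COMBOS, pooled) if s >= threshold]

rng = np.random.default_rng(0)
segs = segment(rng.normal(size=44100 * 30), seg_len=44100)  # 30 s track, 1 s segments
scores = rng.random(size=(len(segs), len(COMBOS)))          # stub network output
print(assign_combos(scores))
```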
-
Patent number: 10923107
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
Type: Grant
Filed: April 12, 2019
Date of Patent: February 16, 2021
Assignee: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
-
Patent number: 10770063
Abstract: Techniques for a recursive deep-learning approach to performing speech synthesis using a repeatable structure that splits an input tensor into a left half and a right half, similar to the operation of the Fast Fourier Transform, performs a 1-D convolution on each respective half, performs a summation, and then applies a post-processing function. The repeatable structure may be utilized in a series configuration to operate as a vocoder or perform other speech processing functions.
Type: Grant
Filed: August 22, 2018
Date of Patent: September 8, 2020
Assignees: Adobe Inc., The Trustees of Princeton University
Inventors: Zeyu Jin, Gautham J. Mysore, Jingwan Lu, Adam Finkelstein
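A toy sketch of one such repeatable block, stacked in series: split the input in half as an FFT butterfly stage does, convolve each half, sum, and post-process. The tanh post-processing, kernel sizes, and random weights are stand-in assumptions.

```python
import numpy as np

def conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """1-D convolution that keeps the input length."""
    return np.convolve(x, kernel, mode="same")

def fft_like_block(x: np.ndarray, k_left, k_right) -> np.ndarray:
    """One repeatable unit: split the input into halves (as the FFT does),
    convolve each half, sum the results, then apply a nonlinearity."""
    left, right = x[: len(x) // 2], x[len(x) // 2 :]
    y = conv1d(left, k_left) + conv1d(right, k_right)
    return np.tanh(y)                      # stand-in post-processing function

rng = np.random.default_rng(0)
x = rng.normal(size=1024)
for _ in range(4):                         # stack blocks in series, vocoder-style
    x = fft_like_block(x, rng.normal(size=9) * 0.3, rng.normal(size=9) * 0.3)
```

Note the FFT-like halving: each stage shrinks the signal by half (1024 to 512 to 256, and so on), mirroring the divide-and-combine recursion the abstract alludes to.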
-
Patent number: 10643109
Abstract: Disclosed are a method and a system for automatically classifying data expressed as a plurality of factors with values of a text word and a symbol sequence by using deep learning. The method comprises the steps of: inputting the data expressed by the plurality of factors into a first model which, for each factor constituting the data, expresses a word vector including sequence information of the factor through sequence learning of the words corresponding to the factor; inputting an output of the first model into a second model which calculates a score for each category for classifying the data by using the word vector including the sequence information of the factor; and determining at least one category for the data by using the score for each category.
Type: Grant
Filed: April 2, 2018
Date of Patent: May 5, 2020
Assignee: NAVER Corporation
Inventors: Jung Woo Ha, Hyuna Pyo, Jeong Hee Kim
-
Patent number: 10235991
Abstract: A hybrid approach using frame-, phone-, diphone-, morpheme-, and word-level Deep Neural Networks (DNNs) in model training and applications is based on training a regular ASR system, which can be based on Gaussian Mixture Models (GMMs) or DNNs. All the training data (in the form of features) are aligned with the transcripts in terms of phonemes and words with timing information, and new features are formed in terms of phonemes, diphones, morphemes, and up to words. Regular ASR produces a result lattice with timing information for each word. A feature is then extracted and sent to the word-level DNN for scoring. Phoneme features are sent to corresponding DNNs for training. Scores are combined to form word-level scores, a rescored lattice, and a new recognition result.
Type: Grant
Filed: August 9, 2017
Date of Patent: March 19, 2019
Assignee: AppTek, Inc.
Inventors: Jintao Jiang, Hassan Sawaf, Mudar Yaghi
-
Patent number: 9626575
Abstract: In an approach for visual liveness detection, a video-audio signal related to a speaker speaking a text is obtained. The video-audio signal is split into a video signal, which records images of the speaker, and an audio signal, which records the speech spoken by the speaker. Then a first sequence indicating visual mouth openness is obtained from the video signal, and a second sequence indicating acoustic mouth openness is obtained based on the text and the audio signal. Synchrony between the first and second sequences is measured, and the liveness of the speaker is determined based on the synchrony.
Type: Grant
Filed: August 7, 2015
Date of Patent: April 18, 2017
Assignee: International Business Machines Corporation
Inventors: Min Li, Wen Liu, Yong Qin, Zhong Su, Shi Lei Zhang, Shiwan Zhao
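One simple way to measure synchrony between the two openness sequences is a correlation score; the Pearson correlation and the 0.7 threshold below are assumptions for illustration, not the patent's actual measure.

```python
import numpy as np

def synchrony(visual_openness: np.ndarray, acoustic_openness: np.ndarray) -> float:
    """Pearson correlation between the mouth-openness sequence seen in video
    and the one inferred from audio; high correlation suggests a live speaker."""
    v = (visual_openness - visual_openness.mean()) / visual_openness.std()
    a = (acoustic_openness - acoustic_openness.mean()) / acoustic_openness.std()
    return float(np.mean(v * a))

def is_live(v: np.ndarray, a: np.ndarray, threshold: float = 0.7) -> bool:
    return synchrony(v, a) >= threshold

t = np.linspace(0, 2 * np.pi, 50)
noise = 0.1 * np.random.default_rng(0).normal(size=50)
live = is_live(np.sin(t), np.sin(t) + noise)   # in-sync sequences -> True
```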
-
Patent number: 9542929
Abstract: Systems and methods for providing non-lexical cues in synthesized speech are described herein. Original text is analyzed to determine characteristics of the text and/or to derive or augment an intent (e.g., an intent code). Non-lexical cue insertion points are determined based on the characteristics of the text and/or the intent. One or more non-lexical cues are inserted at insertion points to generate augmented text. The augmented text is synthesized into speech, including converting the non-lexical cues to speech output.
Type: Grant
Filed: September 26, 2014
Date of Patent: January 10, 2017
Assignee: INTEL CORPORATION
Inventors: Jessica M. Christian, Peter Graff, Crystal A. Nakatsu, Beth Ann Hockey
-
Patent number: 9460711
Abstract: Methods and systems for processing multilingual DNN acoustic models are described. An example method may include receiving training data that includes a respective training data set for each of two or more languages. A multilingual deep neural network (DNN) acoustic model may be processed based on the training data. The multilingual DNN acoustic model may include a feedforward neural network having multiple layers of one or more nodes. Each node of a given layer may connect with a respective weight to each node of a subsequent layer, and the multiple layers of one or more nodes may include one or more shared hidden layers of nodes and a language-specific output layer of nodes corresponding to each of the two or more languages. Additionally, weights associated with the multiple layers of one or more nodes of the processed multilingual DNN acoustic model may be stored in a database.
Type: Grant
Filed: April 15, 2013
Date of Patent: October 4, 2016
Assignee: Google Inc.
Inventors: Vincent Olivier Vanhoucke, Jeffrey Adgate Dean, Georg Heigold, Marc'aurelio Ranzato, Matthieu Devin, Patrick An Phu Nguyen, Andrew William Senior
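The shared-trunk / per-language-head layout can be sketched in a few lines. Layer widths, phone-inventory sizes, and random weights below are fabricated for illustration; only the topology (shared hidden layers, language-specific output layers) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID = 40, 64                      # feature dim; shared hidden width

# Shared hidden layers: trained on data from every language.
shared = [rng.normal(size=(D_IN, D_HID)) * 0.1,
          rng.normal(size=(D_HID, D_HID)) * 0.1]

# One output layer per language, each with its own phone inventory size.
heads = {"en": rng.normal(size=(D_HID, 42)) * 0.1,
         "fr": rng.normal(size=(D_HID, 38)) * 0.1}

def forward(features: np.ndarray, language: str) -> np.ndarray:
    """Run the shared trunk, then the language-specific softmax head."""
    h = features
    for W in shared:
        h = np.maximum(h @ W, 0.0)        # ReLU hidden layers
    logits = h @ heads[language]
    p = np.exp(logits - logits.max())
    return p / p.sum()

posterior = forward(rng.normal(size=D_IN), "fr")
```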
-
Patent number: 9135231
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for customizing the punctuation style of a transcription. A method includes: receiving an utterance from a user; obtaining an unpunctuated transcription of the utterance; identifying an instance within the unpunctuated transcription where a punctuation mark may be placed; identifying, using data associated with the user, one or more past instances that are similar to the identified instance; punctuating the unpunctuated transcription based at least on the one or more past instances; and presenting the punctuated transcription to the user.
Type: Grant
Filed: January 31, 2013
Date of Patent: September 15, 2015
Assignee: Google Inc.
Inventors: Hugo B. Barra, Robert W. Hamilton
-
Publication number: 20150073804
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for providing a representation based on structured data in resources. The methods, systems, and apparatus include actions of receiving target acoustic features output from a neural network that has been trained to predict acoustic features given linguistic features. Additional actions include determining a distance between the target acoustic features and acoustic features of a stored acoustic sample. Further actions include selecting the acoustic sample to be used in speech synthesis based at least on the determined distance, and synthesizing speech based on the selected acoustic sample.
Type: Application
Filed: September 6, 2013
Publication date: March 12, 2015
Applicant: Google Inc.
Inventors: Andrew W. Senior, Javier Gonzalvo Fructuoso
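The selection step reduces to a nearest-neighbor search: pick the stored sample whose features lie closest to the network's predicted target. A minimal sketch, assuming Euclidean distance (the abstract does not name the metric) and random stand-in features:

```python
import numpy as np

def select_sample(target: np.ndarray, stored: np.ndarray) -> int:
    """Pick the stored acoustic sample whose features lie closest (Euclidean)
    to the network-predicted target features."""
    distances = np.linalg.norm(stored - target, axis=1)
    return int(np.argmin(distances))

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 13))   # features of stored acoustic samples
target = rng.normal(size=13)             # neural network's predicted features
best = select_sample(target, database)   # index of the sample to synthesize with
```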
-
Patent number: 8949129
Abstract: A method and apparatus are provided for processing a set of communicated signals associated with a set of muscles, such as the muscles near the larynx of the person, or any other muscles the person uses to achieve a desired response. The method includes the steps of attaching a single integrated sensor, for example, near the throat of the person proximate to the larynx, and detecting an electrical signal through the sensor. The method further includes the steps of extracting features from the detected electrical signal and continuously transforming them into speech sounds without the need for further modulation. The method also includes comparing the extracted features to a set of prototype features and selecting the prototype feature that provides the smallest relative difference.
Type: Grant
Filed: August 12, 2013
Date of Patent: February 3, 2015
Assignee: Ambient Corporation
Inventors: Michael Callahan, Thomas Coleman
-
Publication number: 20140358547
Abstract: Systems and methods for prosody prediction include extracting features from runtime data using a parametric model. The features from runtime data are compared with features from training data using an exemplar-based model to predict prosody of the runtime data. The features from the training data are paired with exemplars from the training data and stored on a computer-readable storage medium.
Type: Application
Filed: September 17, 2013
Publication date: December 4, 2014
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Raul Fernandez, Asaf Rendel
-
Publication number: 20140358546
Abstract: Systems and methods for prosody prediction include extracting features from runtime data using a parametric model. The features from runtime data are compared with features from training data using an exemplar-based model to predict prosody of the runtime data. The features from the training data are paired with exemplars from the training data and stored on a computer-readable storage medium.
Type: Application
Filed: August 28, 2013
Publication date: December 4, 2014
Applicant: International Business Machines Corporation
Inventors: Raul Fernandez, Asaf Rendel
-
Method and apparatus of transforming speech feature vectors using an auto-associative neural network
Patent number: 8838446
Abstract: Provided is a method and apparatus for transforming a speech feature vector. The method includes extracting a feature vector required for speech recognition from a speech signal and transforming the extracted feature vector using an auto-associative neural network (AANN).
Type: Grant
Filed: August 31, 2007
Date of Patent: September 16, 2014
Assignee: Samsung Electronics Co., Ltd.
Inventors: So-young Jeong, Kwang-cheol Oh, Jae-hoon Jeong, Jeong-su Kim
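An auto-associative neural network is trained to reconstruct its own input through a narrow bottleneck, and the bottleneck activations serve as the transformed features. A minimal one-hidden-layer NumPy sketch, assuming MSE training on random stand-in MFCCs; dimensions and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, B = 13, 5                              # MFCC dim in, bottleneck dim out
W_enc = rng.normal(size=(D, B)) * 0.1
W_dec = rng.normal(size=(B, D)) * 0.1

def train_step(x: np.ndarray, lr: float = 0.01) -> float:
    """One reconstruction step: the AANN is trained to output its own input,
    forcing the bottleneck to capture the structure of the feature space."""
    global W_enc, W_dec
    h = np.tanh(x @ W_enc)                        # bottleneck activation
    x_hat = h @ W_dec                             # reconstruction
    err = x_hat - x                               # d(MSE)/d(x_hat)
    W_dec -= lr * np.outer(h, err)
    W_enc -= lr * np.outer(x, (err @ W_dec.T) * (1 - h ** 2))
    return float((err ** 2).mean())

def transform(x: np.ndarray) -> np.ndarray:
    """The transformed feature vector is the bottleneck activation."""
    return np.tanh(x @ W_enc)

for _ in range(200):
    train_step(rng.normal(size=D))
compressed = transform(rng.normal(size=D))
```
-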
Patent number: 8788256
Abstract: Computer-implemented speech processing generates one or more pronunciations of an input word in a first language by a non-native speaker of the first language who is a native speaker of a second language. The input word is converted into one or more pronunciations. Each pronunciation includes one or more phonemes selected from a set of phonemes associated with the second language. Each pronunciation is associated with the input word in an entry in a computer database. Each pronunciation in the database is associated with information identifying a pronunciation language and/or a phoneme language.
Type: Grant
Filed: February 2, 2010
Date of Patent: July 22, 2014
Assignee: Sony Computer Entertainment Inc.
Inventors: Ruxin Chen, Gustavo Hernandez-Abrego, Masanori Omote, Xavier Menendez-Pidal
-
Patent number: 8527276
Abstract: A method and system are disclosed for speech synthesis using deep neural networks. A neural network may be trained to map input phonetic transcriptions of training-time text strings into sequences of acoustic feature vectors, which yield predefined speech waveforms when processed by a signal generation module. The training-time text strings may correspond to written transcriptions of speech carried in the predefined speech waveforms. Subsequent to training, a run-time text string may be translated to a run-time phonetic transcription, which may include a run-time sequence of phonetic-context descriptors, each of which contains a phonetic speech unit, data indicating phonetic context, and data indicating time duration of the respective phonetic speech unit. The trained neural network may then map the run-time sequence of phonetic-context descriptors to run-time predicted feature vectors, which may in turn be translated into synthesized speech by the signal generation module.
Type: Grant
Filed: October 25, 2012
Date of Patent: September 3, 2013
Assignee: Google Inc.
Inventors: Andrew William Senior, Byungha Chun, Michael Schuster
-
Patent number: 8527273
Abstract: Systems and methods for identifying the N-best strings of a weighted automaton. A potential for each state of an input automaton to a set of destination states of the input automaton is first determined. Then, the N-best paths are found in the result of an on-the-fly determinization of the input automaton. Only the portion of the input automaton needed to identify the N-best paths is determinized. As the input automaton is determinized, a potential for each new state of the partially determinized automaton is determined and is used in identifying the N-best paths of the determinized automaton, which correspond exactly to the N-best strings of the input automaton.
Type: Grant
Filed: July 30, 2012
Date of Patent: September 3, 2013
Assignee: AT&T Intellectual Property II, L.P.
Inventors: Mehryar Mohri, Michael Dennis Riley
-
Patent number: 8433573
Abstract: A prosody modification device includes: a real voice prosody input part that receives real voice prosody information extracted from a human utterance; a regular prosody generating part that generates regular prosody information having a regular phoneme boundary, which determines the boundary between phonemes, and a regular phoneme length, by using data representing regular or statistical phoneme lengths in human utterances, for a section including at least a phoneme or phoneme string to be modified in the real voice prosody information; and a real voice prosody modification part that resets the real voice phoneme boundary by using the generated regular prosody information, so that the real voice phoneme boundary and the real voice phoneme length of the phoneme or phoneme string to be modified approximate the actual phoneme boundary and phoneme length of the utterance, thereby modifying the real voice prosody information.
Type: Grant
Filed: February 11, 2008
Date of Patent: April 30, 2013
Assignee: Fujitsu Limited
Inventors: Kentaro Murase, Nobuyuki Katae
-
Patent number: 8392184
Abstract: The invention relates to speech signal processing that detects a speech signal from more than one microphone and obtains microphone signals that are processed by a beamformer to obtain a beamformed signal. The beamformed signal is post-filtered with a filter that employs adaptable filter weights to obtain an enhanced beamformed signal, the post-filter adapting its filter weights with previously learned filter weights.
Type: Grant
Filed: January 21, 2009
Date of Patent: March 5, 2013
Assignee: Nuance Communications, Inc.
Inventors: Markus Buck, Klaus Scheufele
-
Patent number: 8280739
Abstract: The present invention provides a speech analysis method comprising the steps of: obtaining a speech signal and a corresponding DEGG/EGG signal; regarding the speech signal as the output of a vocal tract filter in a source-filter model taking the DEGG/EGG signal as the input; and estimating the features of the vocal tract filter from the speech signal as the output and the DEGG/EGG signal as the input, wherein the features of the vocal tract filter are expressed by the state vectors of the vocal tract filter at selected time points, and the step of estimating is performed using Kalman filtering.
Type: Grant
Filed: April 3, 2008
Date of Patent: October 2, 2012
Assignee: Nuance Communications, Inc.
Inventors: Dan Ning Jiang, Fan Ping Meng, Yong Qin, Zhi Wei Shuang
-
Patent number: 8160880
Abstract: Techniques for operating a reading machine are disclosed. The techniques include: forming an N-dimensional features vector based on features of an image, the features corresponding to characteristics of at least one object depicted in the image; representing the features vector as a point in n-dimensional space, where n corresponds to N, the number of features in the features vector; and comparing the point in n-dimensional space to a centroid that represents a cluster of points in the n-dimensional space corresponding to a class of objects, to determine whether the point belongs in the class of objects corresponding to the centroid.
Type: Grant
Filed: April 28, 2008
Date of Patent: April 17, 2012
Assignee: K-NFB Reading Technology, Inc.
Inventors: Paul Albrecht, Rafael Maya Zetune, Lucy Gibson, Raymond C. Kurzweil
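The point-versus-centroid comparison is a nearest-centroid classification; a minimal sketch follows, assuming Euclidean distance and made-up object classes and centroids.

```python
import numpy as np

def classify(point: np.ndarray, centroids: dict) -> str:
    """Compare the N-dimensional feature point against each class centroid
    and return the class whose centroid is nearest."""
    return min(centroids, key=lambda c: np.linalg.norm(point - centroids[c]))

centroids = {                    # each centroid summarizes a cluster of points
    "text": np.array([0.9, 0.1, 0.2]),
    "face": np.array([0.1, 0.8, 0.7]),
    "currency": np.array([0.4, 0.5, 0.1]),
}
label = classify(np.array([0.85, 0.15, 0.25]), centroids)   # -> "text"
```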
-
Patent number: 7991616
Abstract: The present invention is a speech synthesizer that generates speech data for text including a fixed part and a variable part by combining recorded speech with rule-based synthetic speech. The speech synthesizer is a high-quality one in which recorded speech and synthetic speech are concatenated without perceptible discontinuity in timbre or prosody.
Type: Grant
Filed: October 22, 2007
Date of Patent: August 2, 2011
Assignee: Hitachi, Ltd.
Inventors: Yusuke Fujita, Ryota Kamoshida, Kenji Nagamatsu
-
Patent number: 7916848
Abstract: Indications of which participant is providing information during a multi-party conference. Each participant has equipment to display information being transferred during the conference. A sourcing signaler residing in the participant equipment provides a signal that indicates the identity of its participant when this participant is providing information to the conference. The source indicators of the other participant equipment receive the signal and cause a UI to indicate that the participant identified by the received signal is providing information (e.g., the UI can cause the identifier to change appearance). An audio discriminator is used to distinguish between an acoustic signal generated by a person speaking and one generated in a band-limited manner. The audio discriminator analyzes the spectrum of detected audio signals and generates several parameters from the spectrum and from past determinations to determine the source of an audio signal on a frame-by-frame basis.
Type: Grant
Filed: October 1, 2003
Date of Patent: March 29, 2011
Assignee: Microsoft Corporation
Inventors: Yong Rui, Anoop Gupta
-
Patent number: 7836002
Abstract: A system that can automatically narrow the search space or recognition scope within an activity-centric environment based upon a current activity or set of activities is provided. In addition, the activity and context data can also be used to rank the results of the recognition or search activity. In accordance with the domain scoping, natural language processing (NLP) as well as other types of conversion and recognition systems can dynamically adjust to the scope of the activity or group of activities, thereby increasing the recognition system's accuracy and usefulness. In operation, a user context, activity context, environment context and/or device profile can be employed to effectuate the scoping. As well, the system can combine context with extrinsic data, including but not limited to calendar, profile, and historical activity data, in order to define the parameters for an appropriate scoping.
Type: Grant
Filed: June 27, 2006
Date of Patent: November 16, 2010
Assignee: Microsoft Corporation
Inventors: Steven W. Macbeth, Roland L. Fernandez, Brian R. Meyers, Desney S. Tan, George G. Robertson, Nuria M. Oliver, Oscar E. Murillo
-
Patent number: 7765103
Abstract: A rule-based speech synthesis apparatus in which concatenation distortion may be kept below a preset value independent of the utterance. A parameter correction unit reads out a target parameter for a vowel from a target parameter storage, responsive to the phonemes at the leading and trailing ends of a speech element and the acoustic feature parameters output from a speech element selector, and accordingly corrects the acoustic feature parameters of the speech element. The parameter correction unit corrects the parameters so that the parameters ahead of and behind the speech element are equal to the target parameter for the vowel of the corresponding phoneme, and outputs the corrected parameters.
Type: Grant
Filed: June 9, 2004
Date of Patent: July 27, 2010
Assignee: Sony Corporation
Inventor: Nobuhide Yamazaki
-
Publication number: 20090157409
Abstract: A method includes: generating, for each parameter of the prosody vector, an initial parameter prediction model with a plurality of attributes related to difference prosody prediction and at least part of the attribute combinations of the plurality of attributes, in which each of the plurality of attributes and the attribute combinations is included as an item; calculating the importance of each item in the parameter prediction model; deleting the item having the lowest calculated importance; re-generating a parameter prediction model with the remaining items; determining whether the re-generated parameter prediction model is an optimal model; and repeating the step of calculating importance and the steps following it with the re-generated parameter prediction model if the re-generated parameter prediction model is determined not to be an optimal model, wherein the difference prosody vector and all parameter prediction models of the difference prosody vector constitute the difference prosody prediction model.
Type: Application
Filed: December 4, 2008
Publication date: June 18, 2009
Inventors: Yi Lifu, Li Jian, Lou Xiaoyan, Hao Jie
-
Patent number: 7444282
Abstract: A method of automatic labeling using an optimum-partitioned classified neural network includes: searching for the neural networks having minimum errors with respect to a number of L phoneme combinations from a number of K neural network combinations generated at an initial stage or updated; updating weights during learning of the K neural networks by the K phoneme combination groups searched with the same neural networks; composing an optimum-partitioned classified neural network combination using the K neural networks whose total error sum has converged; and tuning a phoneme boundary of a first label file by using the phoneme combination group classification result and the optimum-partitioned classified neural network combination, and generating a final label file reflecting the tuning result.
Type: Grant
Filed: March 1, 2004
Date of Patent: October 28, 2008
Assignee: Samsung Electronics Co., Ltd.
Inventors: Ki-hyun Choo, Jeong-su Kim, Jae-won Lee, Ki-seung Lee
-
Publication number: 20080243510
Abstract: Embodiments of the present invention address deficiencies of the art in respect to screen reading non-sequential text and provide a method, system and computer program product for overlapping screen reading of non-sequential text, such as a tag cloud or Web page header. In an embodiment of the invention, an overlapping screen reading method for a non-sequential list of words can include computing different speech synthesis parameters for different words in a non-sequential list of words, generating different audio forms for each of the different words according to the different speech synthesis parameters, and overlappingly merging the generated different audio forms into a single audio stream. The speech synthesis parameters can include, for instance, separation, volume, tone and location speech synthesis parameters. Thereafter, the method can include playing back the single audio stream to simulate a natural visual scanning of the non-sequential list of words.
Type: Application
Filed: March 28, 2007
Publication date: October 2, 2008
Inventor: Lawrence C. Smith
-
Patent number: 7409340
Abstract: A neural network is used to obtain more robust performance in determining prosodic markers on the basis of linguistic categories.
Type: Grant
Filed: January 27, 2003
Date of Patent: August 5, 2008
Assignee: Siemens Aktiengesellschaft
Inventors: Martin Holzapfel, Achim Mueller
-
Patent number: 7343289
Abstract: A system and method for detecting speech utilizing audio and video inputs. In one aspect, the invention collects audio data generated from a microphone device. In another aspect, the invention collects video data and processes the data to determine a mouth location for a given speaker. The audio and video are input into a time-delay neural network that processes the data to determine which target is speaking. The neural network processing is based upon a correlation between detected mouth movement from the video data and audio sounds detected by the microphone.
Type: Grant
Filed: June 25, 2003
Date of Patent: March 11, 2008
Assignee: Microsoft Corp.
Inventors: Ross Cutler, Ashish Kapoor