Image To Speech Patents (Class 704/260)
-
Patent number: 12367867
Abstract: A method for generating voice in an ongoing call session based on artificial intelligence techniques is provided. The method includes extracting a plurality of features from a voice input through an artificial neural network (ANN); identifying one or more lost audio frames within the voice input; predicting, by the ANN, for each of the one or more lost audio frames, one or more features of the respective lost audio frame; and superposing the predicted features upon the voice input to generate an updated voice input.
Type: Grant
Filed: October 25, 2022
Date of Patent: July 22, 2025
Assignee: SAMSUNG ELECTRONICS CO., LTD.
Inventors: Sandeep Singh Spall, Tarun Gupta, Narang Lucky Manoharlal
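A minimal Python sketch of the frame-concealment flow this abstract describes; the all-zero test for lost frames, the 10 ms framing, and the interpolating stand-in for the trained ANN are illustrative assumptions, not the patented method:

```python
import numpy as np

FRAME = 160  # 10 ms frames at 16 kHz (assumed framing)

def conceal_lost_frames(frames, predict):
    """Fill frames flagged as lost with predicted ones.

    `predict` stands in for the trained ANN of the abstract; any callable
    mapping the neighbouring frames to a replacement frame works here.
    """
    out = frames.copy()
    for i, frame in enumerate(frames):
        if not frame.any():  # all-zero frame treated as lost (assumption)
            prev = out[i - 1] if i > 0 else np.zeros(FRAME)
            nxt = frames[i + 1] if i + 1 < len(frames) else np.zeros(FRAME)
            out[i] = predict(prev, nxt)  # superpose predicted features
    return out

frames = np.random.randn(5, FRAME)
frames[2] = 0.0  # simulate a dropped packet
restored = conceal_lost_frames(frames, lambda a, b: 0.5 * (a + b))
```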
-
Patent number: 12361925
Abstract: A speech recognition module receives training data of speech and creates a representation for individual words, non-words, phonemes, and any combination of these. A set of speech processing detectors analyzes the training data of speech from humans communicating. The detectors detect speech parameters that are indicative of paralinguistic effects on top of the enunciated words, phonemes, and non-words in the audio stream. One or more machine learning models undergo supervised training of their neural networks to learn how to associate one or more mark-up markers with the textual representation of each individual word, non-word, phoneme, or combination of these that was enunciated with a particular paralinguistic effect. Each mark-up marker can correspond to its own paralinguistic effect.
Type: Grant
Filed: December 29, 2020
Date of Patent: July 15, 2025
Assignee: SRI International
Inventors: Harry Bratt, Colleen Richey, Maneesh Yadav
-
Patent number: 12361946
Abstract: A speech interaction method, speech interaction system, and storage medium are provided. The method includes: receiving interactive speech input by a user; determining an emotional tag corresponding to the interactive speech based on the interactive speech and the interactive text corresponding to it; determining, based on the emotional tag, a response text corresponding to the interactive text, together with a first prosodic feature and a second prosodic feature corresponding to the response text, where the first prosodic feature characterizes the whole-sentence prosody of the response text and the second prosodic feature characterizes the local prosody of each character in the response text; and generating and outputting a response speech corresponding to the interactive speech based on the response text, the first prosodic feature, and the second prosodic feature.
Type: Grant
Filed: January 14, 2025
Date of Patent: July 15, 2025
Inventors: Huapeng Sima, Zhengkun Mei, Yiping Tang
-
Patent number: 12340788
Abstract: A system for use in video game development to generate expressive speech audio comprises a user interface configured to receive user-input text data and a user selection of a speech style. The system includes a machine-learned synthesizer comprising a text encoder, a speech style encoder, and a decoder. The synthesizer is configured to: generate one or more text encodings derived from the user-input text data, using the text encoder; generate a speech style encoding by processing a set of speech style features associated with the selected speech style using the speech style encoder; combine the one or more text encodings and the speech style encoding to generate one or more combined encodings; and decode the one or more combined encodings with the decoder to generate predicted acoustic features.
Type: Grant
Filed: May 7, 2024
Date of Patent: June 24, 2025
Assignee: ELECTRONIC ARTS INC.
Inventors: Siddharth Gururani, Kilol Gupta, Dhaval Shah, Zahra Shakeri, Jervis Pinto, Mohsen Sardari, Navid Aghdaie, Kazi Zaman
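The text-encoder / style-encoder / decoder split can be illustrated with a toy PyTorch module; every dimension, layer choice, and name below is an assumption made for illustration, not the patented architecture:

```python
import torch
import torch.nn as nn

class StyleTTS(nn.Module):
    """Toy synthesizer: encode text and style separately, combine, decode."""
    def __init__(self, vocab=64, txt_dim=32, style_feats=8, style_dim=16, mel_dim=80):
        super().__init__()
        self.text_enc = nn.Embedding(vocab, txt_dim)             # text encoder
        self.style_enc = nn.Linear(style_feats, style_dim)       # speech style encoder
        self.decoder = nn.Linear(txt_dim + style_dim, mel_dim)   # decoder

    def forward(self, token_ids, style_features):
        txt = self.text_enc(token_ids)                           # (T, txt_dim)
        sty = self.style_enc(style_features).expand(txt.size(0), -1)
        combined = torch.cat([txt, sty], dim=-1)                 # combined encodings
        return self.decoder(combined)                            # predicted acoustic features

model = StyleTTS()
mels = model(torch.randint(0, 64, (12,)), torch.randn(1, 8))     # (12, 80)
```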
-
Patent number: 12334056
Abstract: Broadly speaking, the present techniques provide methods for conditioning a neural network that not only improve the generalization performance of conditional neural networks, but also significantly reduce model size and latency. The resulting conditioned neural network is suitable for on-device deployment because of its significantly lower model size, dynamic memory requirement, and latency.
Type: Grant
Filed: July 27, 2022
Date of Patent: June 17, 2025
Assignee: Samsung Electronics Co., Ltd.
Inventors: Sourav Bhattacharya, Abhinav Mehrotra, Alberto Gil C. P. Ramos
-
Patent number: 12333176
Abstract: Systems and methods for a novel architecture that supports always-on applications and/or models suffering drift in their results. The system may comprise one or more servers configured to track the historical behavior of incoming request data for a model and/or redirect requests as needed using parallel data domains. The one or more servers may maintain and update a catalog of potential data domains that partitions the historically received data. One partition may comprise data output from a current model; another may comprise detected outliers in the data. In the case of drift, outliers, and/or anomalies in the incoming data, the system may return an error signal that causes the data to be duplicated into a new data domain.
Type: Grant
Filed: July 14, 2023
Date of Patent: June 17, 2025
Assignee: Capital One Services, LLC
Inventors: Trijeet Sethi, Muralikumar Venkatasubramaniam
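A rough sketch of routing incoming requests into parallel data domains, with a z-score outlier test standing in for whatever drift detection the system actually uses; the catalog structure and threshold are assumptions:

```python
import numpy as np

class DomainCatalog:
    """Partition historically received data into parallel domains."""
    def __init__(self):
        self.domains = {"current": [], "outliers": []}
        self.history = []

    def route(self, value, z_thresh=3.0):
        self.history.append(value)
        hist = np.asarray(self.history)
        if len(hist) > 10 and abs(value - hist.mean()) > z_thresh * (hist.std() + 1e-9):
            self.domains["outliers"].append(value)  # duplicate into a new domain
            return "error"                          # error signal on drift/anomaly
        self.domains["current"].append(value)
        return "ok"

catalog = DomainCatalog()
results = [catalog.route(x) for x in list(np.random.randn(50)) + [25.0]]
```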
-
Patent number: 12334049
Abstract: Implementations are directed to receiving unstructured free-form natural language input, generating a chatbot based on (and in response to receiving) that input, and causing the chatbot to perform task(s) associated with an entity on behalf of the user. In various implementations, the unstructured free-form natural language input conveys details of the task(s) to be performed but does not define any corresponding dialog state map (e.g., it defines no dialog states or dialog state transitions). Nonetheless, the input may be utilized to fine-tune and/or prime a machine learning model that is already capable of conducting generalized conversations. As a result, the chatbot can be generated and deployed quickly and efficiently to perform the task(s) on behalf of the user.
Type: Grant
Filed: December 5, 2022
Date of Patent: June 17, 2025
Assignee: GOOGLE LLC
Inventors: Asaf Aharoni, Eyal Segalis, Sasha Goldshtein, Ofer Ron, Yaniv Leviathan, Yoav Tzur
-
Patent number: 12327491
Abstract: An information processing apparatus includes an audio information obtaining unit that obtains audio information on a user learning a first language, an analysis processing unit that estimates a voice production condition of the user by analyzing the audio information, and a display processing unit that displays voice production condition images as an animation following the time-series change of the voice production condition. The display processing unit superimposes a first feature point and a second feature point on the voice production condition images. The first feature point is identified from the voice production condition observed when a first sound in the first language is pronounced. The second feature point is identified from the voice production condition observed when a second sound, similar to the first sound, is pronounced in a second language different from the first language.
Type: Grant
Filed: May 8, 2023
Date of Patent: June 10, 2025
Inventors: Nic Schumann, Daniel Brenners, Bryan Ma, Mateus Rezende, Justin Chen
-
Patent number: 12322374
Abstract: The present disclosure provides methods and apparatuses for phrase-based end-to-end text-to-speech (TTS) synthesis. A text may be obtained. A target phrase in the text may be identified. A phrase context of the target phrase may be determined. An acoustic feature corresponding to the target phrase may be generated based at least on the target phrase and the phrase context. A speech waveform corresponding to the target phrase may be generated based on the acoustic feature.
Type: Grant
Filed: March 19, 2021
Date of Patent: June 3, 2025
Assignee: Microsoft Technology Licensing, LLC
Inventors: Ran Zhang, Jian Luan, Yahuan Cong
-
Patent number: 12323377
Abstract: Device(s) and computer program products for creating custom music/video messages to facilitate and/or improve social interaction. The created music/video messages include at least portions of music, video, pictures, slideshows, and/or text. The music/video messages enable feelings or emotions to be communicated by the user of the device to one or more recipient device(s).
Type: Grant
Filed: July 7, 2023
Date of Patent: June 3, 2025
Assignee: Ameritech Solutions, Inc.
Inventors: Nader Asghari Kamrani, Kamran Asghari Kamrani
-
Patent number: 12315490
Abstract: The present disclosure relates generally to speech processing. Humans change their speech patterns in noisy environments. The systems and devices described herein compensate for noisy environments in order to sound more human-like. The configurations and implementations herein determine a sound profile for the sound environment in which the user is listening. Based on the sound profile, the devices determine a transform to apply to speech output from the device. This transform is applied to the wake word, to speech recognition, and to the output speech, compensating for the noise level of the environment by mimicking the Lombard effect.
Type: Grant
Filed: December 30, 2021
Date of Patent: May 27, 2025
Assignee: Spotify AB
Inventors: Daniel Bromand, Björn Erik Roth, Kåre Sjölander
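One way to picture a Lombard-style transform is a noise-dependent gain, as in the sketch below; the target SNR and the gain-only compensation are assumptions (real systems also shift pitch and spectral tilt):

```python
import numpy as np

def lombard_gain(noise, speech, target_snr_db=15.0):
    """Boost output speech so it clears the measured noise floor."""
    noise_rms = np.sqrt(np.mean(noise ** 2) + 1e-12)
    speech_rms = np.sqrt(np.mean(speech ** 2) + 1e-12)
    snr_db = 20 * np.log10(speech_rms / noise_rms)
    gain_db = max(0.0, target_snr_db - snr_db)  # boost only when it is noisy
    return speech * 10 ** (gain_db / 20)

noise = 0.1 * np.random.randn(16000)                            # measured environment
speech = 0.05 * np.sin(np.linspace(0, 440 * 2 * np.pi, 16000))  # device output
louder = lombard_gain(noise, speech)
```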
-
Patent number: 12314300
Abstract: Systems and methods are provided for a device to obtain a query, such as from a user. The query is vectorized to obtain a numerical representation of the query, which is provided to a vector database to find the nearest vectors corresponding to the most relevant context, such as for a particular domain or subject matter. The query, query vector, and context vectors (and optionally past query history and past query responses) are provided to an artificial intelligence, such as a large language model (LLM), to receive a response to the query without providing the context to the LLM.
Type: Grant
Filed: December 28, 2023
Date of Patent: May 27, 2025
Assignee: Open Text Inc.
Inventors: Vikash Sharma, Laxman Singh Chauhan, Raja Parshotam Lalwani
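The retrieve-then-prompt flow reads roughly like this sketch; the hash-based toy embedding, the in-memory index, and the top-2 cutoff are stand-ins for a real encoder and vector database:

```python
import numpy as np

def embed(text, dim=64):
    """Deterministic toy embedding standing in for a real encoder."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

docs = ["refund policy ...", "shipping times ...", "warranty terms ..."]
index = np.stack([embed(d) for d in docs])     # the vector database

query = "how long does shipping take?"
q = embed(query)
top = np.argsort(index @ q)[::-1][:2]          # nearest context vectors
prompt = "\n".join(docs[i] for i in top) + "\nQ: " + query
# `prompt` would now be sent to the LLM along with the query
```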
-
Patent number: 12296265
Abstract: This specification describes a computer-implemented method of generating context-dependent speech audio in a video game. The method comprises obtaining contextual information relating to a state of the video game. The contextual information is input into a prosody prediction module, which comprises a trained machine learning model configured to generate predicted prosodic features based on the contextual information. Input data comprising the predicted prosodic features and speech content data associated with the state of the video game is input into a speech audio generation module. An encoded representation of the speech content data, dependent on the predicted prosodic features, is generated using one or more encoders of the speech audio generation module. Context-dependent speech audio is generated, based on the encoded representation, using a decoder of the speech audio generation module.
Type: Grant
Filed: January 9, 2024
Date of Patent: May 13, 2025
Assignee: ELECTRONIC ARTS INC.
Inventors: Kilol Gupta, Zahra Shakeri, Gordon Durity, Mohsen Sardari, Harold Chaput, Navid Aghdaie
-
Patent number: 12288567
Abstract: A neural network, a system using this neural network, and a method for training the neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device. The method includes: obtaining audio and image training signals of a scene showing an environment with objects generating sounds; obtaining a target description of the environment seen in the image training signal; inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment; and comparing the target description of the environment with the training description of the environment.
Type: Grant
Filed: January 10, 2020
Date of Patent: April 29, 2025
Assignees: TOYOTA JIDOSHA KABUSHIKI KAISHA, ETH ZÜRICH
Inventors: Wim Abbeloos, Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
-
Patent number: 12272350
Abstract: During text-to-speech processing, a speech model creates output audio data, including speech, that corresponds to input text data that includes a representation of the speech. A spectrogram estimator estimates a frequency spectrogram of the speech; the corresponding frequency-spectrogram data is used to condition the speech model. A plurality of acoustic features corresponding to different segments of the input text data, such as phonemes, syllable-level features, and/or word-level features, may be separately encoded into context vectors; the spectrogram estimator uses these separate context vectors to create the frequency spectrogram.
Type: Grant
Filed: May 15, 2024
Date of Patent: April 8, 2025
Assignee: Amazon Technologies, Inc.
Inventors: Jaime Lorenzo Trueba, Thomas Renaud Drugman, Viacheslav Klimkov, Srikanth Ronanki, Thomas Edward Merritt, Andrew Paul Breen, Roberto Barra-Chicote
-
Patent number: 12266340
Abstract: Systems and methods receive, in real time from a user via a user device, input audio data comprising communication element(s), and apply trained model(s) to categorize the communication element(s), the categorizing comprising assigning a contextual category to a communication element. Text is generated that includes a response to the communication element(s), the response including individualized and contextualized qualities predicted to provide an optimal outcome based on (i) the assigned contextual category and (ii) the user. Text-to-speech processing of the text produces an audio output comprising (a) the response and (b) a speech pattern predicted to facilitate the optimal outcome. The audio output is provided to the user via the user device, and the user's reaction to it is measured according to a quantifiable quality score that is used to modify future iterations of text-to-speech processing so that future audio output(s) include a revised speech pattern.
Type: Grant
Filed: December 7, 2022
Date of Patent: April 1, 2025
Assignee: TRUIST BANK
Inventor: Bjorn Austraat
-
Patent number: 12260260
Abstract: Systems, apparatus, methods, and articles of manufacture for a digital delegate computer system architecture that provides improved multi-agent LLM implementations.
Type: Grant
Filed: March 29, 2024
Date of Patent: March 25, 2025
Assignee: The Travelers Indemnity Company
Inventors: Matthew J. Gorman, Vincent E. Haines, Girish A. Modgil, Brad E. Gawron
-
Patent number: 12249314
Abstract: Techniques are described for invoking and switching between chatbots of a chatbot system. In some embodiments, the chatbot system is capable of routing an utterance received while a user is already interacting with a first chatbot in the system. For instance, the chatbot system may identify a second chatbot based on determining that (i) the utterance is an invalid input to the first chatbot or (ii) the first chatbot is attempting to route the utterance to a destination associated with the first chatbot. Identifying the second chatbot can involve computing, using a predictive model, separate confidence scores for the first chatbot and the second chatbot, and then determining that the confidence score for the second chatbot satisfies one or more confidence score thresholds. The utterance is then routed to the second chatbot.
Type: Grant
Filed: April 19, 2023
Date of Patent: March 11, 2025
Assignee: Oracle International Corporation
Inventors: Vishal Vishnoi, Xin Xu, Srinivasa Phani Kumar Gadde, Fen Wang, Muruganantham Chinnananchi, Manish Parekh, Stephen Andrew McRitchie, Jae Min John, Crystal C. Pan, Gautam Singaraju, Saba Amsalu Teserra
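The confidence-score routing step might look like the following, where the per-bot scoring callables and the switching margin are assumptions standing in for the predictive model and its thresholds:

```python
def route_utterance(utterance, bots, current, margin=0.15):
    """Switch chatbots only when another bot clearly beats the current one."""
    scores = {name: score(utterance) for name, score in bots.items()}
    best = max(scores, key=scores.get)
    if best != current and scores[best] >= scores[current] + margin:
        return best   # confidence threshold satisfied: route to second chatbot
    return current

bots = {"billing": lambda u: 0.2 + 0.6 * ("invoice" in u),
        "support": lambda u: 0.5}
print(route_utterance("where is my invoice?", bots, current="support"))  # billing
```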
-
Patent number: 12238451
Abstract: Embodiments are disclosed for predicting, using neural networks, editing operations to apply to a video sequence based on conversational messages processed by a video editing system. In one or more embodiments, the disclosed systems and methods comprise: receiving an input including a video sequence and text sentences describing a modification to the video sequence; mapping, by a first neural network, the content of the text sentences to a candidate editing operation; processing, by a second neural network, the video sequence to predict parameter values for the candidate editing operation; and generating a modified video sequence by applying the candidate editing operation with the predicted parameter values to the video sequence.
Type: Grant
Filed: November 14, 2022
Date of Patent: February 25, 2025
Assignee: Adobe Inc.
Inventors: Uttaran Bhattacharya, Gang Wu, Viswanathan Swaminathan, Stefano Petrangeli
-
Patent number: 12229516
Abstract: A data processing system receives a natural language communication including an ordered sequence of a plurality of word spellings of a natural human language. A processor of the data processing system parses the plurality of word spellings utilizing constraint-based parsing to identify a plurality of satisfied parsing constraints. The processor performs semantic analysis based on the plurality of satisfied parsing constraints. Performing semantic analysis includes obtaining at least mid-level comprehension of the natural language communication by identifying in it, utilizing constraints, at least one of the following: a clausal structure within the communication, a sentence structure of a sentence in the communication, an implied topic of the communication, and a classical linguistic role in the communication.
Type: Grant
Filed: June 19, 2023
Date of Patent: February 18, 2025
Inventor: Thomas A. Visel
-
Patent number: 12225284
Abstract: Systems, methods, devices, and non-transitory, computer-readable storage mediums are disclosed for a wearable multimedia device and a cloud computing platform with an application ecosystem for processing multimedia data captured by the wearable multimedia device. In an embodiment, a method comprises: receiving, by one or more processors of a cloud computing platform, context data from a wearable multimedia device, the wearable multimedia device including at least one data capture device for capturing the context data; creating a data processing pipeline with one or more applications based on one or more characteristics of the context data and a user request; processing the context data through the data processing pipeline; and sending output of the data processing pipeline to the wearable multimedia device or another device for presentation of the output.
Type: Grant
Filed: March 27, 2023
Date of Patent: February 11, 2025
Assignee: Humane, Inc.
Inventors: Imran A. Chaudhri, Bethany Bongiorno, Shahzad Chaudhri
-
Patent number: 12217755
Abstract: A voice conversion device is provided with a linguistic information extraction unit that extracts linguistic information corresponding to utterance content from a conversion source voice signal, an appearance feature extraction unit that extracts appearance features expressing characteristics of a person's face from a captured image of the person, and a converted voice generation unit that generates a converted voice on the basis of the linguistic information and the appearance features.
Type: Grant
Filed: September 4, 2020
Date of Patent: February 4, 2025
Assignee: Nippon Telegraph and Telephone Corporation
Inventors: Hirokazu Kameoka, Ko Tanaka, Yasunori Oishi, Takuhiro Kaneko, Aaron Valero Puche
-
Patent number: 12217002
Abstract: Apparatuses, systems, and techniques to parse textual data using parallel computing devices. In at least one embodiment, text is parsed by a plurality of parallel processing units using a finite state machine and a logical stack to convert the text to a tree data structure. Data is then extracted from the tree by the plurality of parallel processing units and stored.
Type: Grant
Filed: May 11, 2022
Date of Patent: February 4, 2025
Assignee: NVIDIA Corporation
Inventors: Elias Stehle, Gregory Michael Kimball
-
Patent number: 12197812
Abstract: Systems, methods, and devices may generate speech files that reflect the emotion of text-based content. An example process includes selecting a first text from a first source of text content and a second text from a second source of text content. The first text and the second text are aggregated into an aggregated text, which carries a first emotion associated with the content of the first text and a second emotion associated with the content of the second text. The aggregated text is converted into speech stored in an audio file. The speech replicates human expression of the first emotion and of the second emotion.
Type: Grant
Filed: April 13, 2023
Date of Patent: January 14, 2025
Assignee: DISH Technologies L.L.C.
Inventor: John C. Calef, III
-
Patent number: 12197862
Abstract: Disclosed is a method for identifying a word corresponding to a target word in text information, performed by one or more processors of a computing device. The method may include: determining a target word; determining a threshold for the edit distance associated with the target word; determining, among words included in the text information, a word whose edit distance from the target word is equal to or less than the threshold; and identifying the word corresponding to the target word based on the determined word.
Type: Grant
Filed: November 3, 2022
Date of Patent: January 14, 2025
Assignee: ActionPower Corp.
Inventor: Seongmin Park
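The edit-distance step is the classic Levenshtein dynamic program; a self-contained sketch with an assumed threshold of 2:

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def find_matches(target, words, threshold=2):
    return [w for w in words if edit_distance(target, w) <= threshold]

print(find_matches("recieve", ["receive", "recipe", "relieve", "deceive"]))
```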
-
Patent number: 12190872
Abstract: An electronic apparatus includes: a memory storing one or more instructions; and a processor connected to the memory and configured to control the electronic apparatus. By executing the one or more instructions, the processor is configured to: identify a first intention word and a first target word from first speech; acquire second speech received after the first speech when at least one of the identified first intention word or the identified first target word does not match a word stored in the memory; acquire a similarity between the first speech and the second speech; and acquire response information based on the first speech and the second speech when the similarity is equal to or greater than a threshold value.
Type: Grant
Filed: April 5, 2022
Date of Patent: January 7, 2025
Assignee: Samsung Electronics Co., Ltd.
Inventors: Ali Ahmad Adel Al Massaeed, Abdelrahman Mustafa Mahmoud Yaseen, Shrouq Walid Jamil Sabah
-
Patent number: 12190018
Abstract: A system and method for generating, triggering, and playing a sequence of audio files with cues for delivering a presentation, for a presenter using a personal audio device coupled to a computing device. The system comprises a computing device that is coupled to a presentation data analysis server through a network. The method includes (i) generating a sequence of audio files with cues for delivering a presentation, (ii) triggering playback of an audio file from the sequence, and (iii) playing the audio files one by one on the computing device, through the personal audio device coupled to it, to enable the presenter to recall and speak the content based on the sequence of audio files.
Type: Grant
Filed: January 4, 2021
Date of Patent: January 7, 2025
Inventor: Arjun Karthik Bala
-
Patent number: 12183320
Abstract: A method for generating synthetic speech for text through a user interface is provided. The method may include receiving one or more sentences, determining a speech style characteristic for the received sentences, and outputting a synthetic speech for the sentences that reflects the determined speech style characteristic. The one or more sentences and the determined speech style characteristic may be input to an artificial neural network text-to-speech synthesis model, and the synthetic speech may be generated based on the speech data output from that model.
Type: Grant
Filed: January 20, 2021
Date of Patent: December 31, 2024
Assignee: NEOSAPIENCE, INC.
Inventors: Taesu Kim, Younggun Lee
-
Patent number: 12183340
Abstract: Systems and methods of intent identification for customized dialogue support in virtual environments are provided. Dialogue intent models stored in memory may each specify one or more intents, each associated with a dialogue filter. Input data may be received from a user device of a user, captured during an interactive session of an interactive title that provides a virtual environment to the user device. In response to a detected dialogue trigger, the input data may be analyzed based on the intent models and determined to correspond to one of the stored intents. The dialogue filter associated with the determined intent may be applied to a plurality of available dialogue outputs associated with the detected dialogue trigger. A customized dialogue output may be generated in accordance with the filtered subset of the available dialogue outputs.
Type: Grant
Filed: July 21, 2022
Date of Patent: December 31, 2024
Assignees: SONY INTERACTIVE ENTERTAINMENT LLC, SONY INTERACTIVE ENTERTAINMENT INC.
Inventors: Benaisha Patel, Alessandra Luizello, Olga Rudi
-
Patent number: 12175394
Abstract: The present invention is directed to interactive training, and in particular, to methods and systems for computerized interactive skill training. An example embodiment provides a method and system for providing skill training using a computerized system. The computerized system receives a selection of a first training subject. Several related training components can be invoked, such as reading, watching, performing, and/or reviewing components. In addition, a scored challenge session is provided, wherein a training challenge is provided to a user via a terminal, optionally in video form.
Type: Grant
Filed: April 12, 2023
Date of Patent: December 24, 2024
Assignee: Breakthrough PerformanceTech, LLC
Inventors: Martin L. Cohen, Edward G. Brown
-
Patent number: 12165647
Abstract: A computer-implemented method is disclosed. A search query of a text transcription is received. The search query includes a word or words having a specified spelling. A sequence of search phonemes corresponding to the specified spelling is generated. A sequence of transcript phonemes corresponding to the text transcription is generated from the text transcription. A search alignment in which the sequence of search phonemes is aligned to a transcript phoneme fragment is generated. Based at least on the search alignment having a quality score exceeding a quality score threshold, the transcript phoneme fragment and an associated portion of the text transcription are determined to result from an utterance of the specified spelling in an audio session corresponding to the text transcription. A search result indicating this determination is output.
Type: Grant
Filed: May 27, 2022
Date of Patent: December 10, 2024
Assignee: Microsoft Technology Licensing, LLC
Inventor: Yuchen Li
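A toy version of the alignment-and-quality-score idea, using difflib's similarity ratio over a sliding phoneme window in place of the patent's alignment procedure; the 0.8 threshold is an assumption:

```python
from difflib import SequenceMatcher

def best_alignment(search_phones, transcript_phones):
    """Return (quality score, offset) of the best-matching transcript fragment."""
    n = len(search_phones)
    best = (0.0, 0)
    for start in range(len(transcript_phones) - n + 1):
        window = transcript_phones[start:start + n]
        score = SequenceMatcher(None, search_phones, window).ratio()
        best = max(best, (score, start))
    return best

score, offset = best_alignment(["HH", "AH", "L", "OW"],
                               ["DH", "AH", "HH", "AH", "L", "OW", "W", "ER"])
if score >= 0.8:  # quality score threshold
    print("utterance of the search spelling found at phoneme", offset)
```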
-
Patent number: 12154563
Abstract: An electronic apparatus, based on a text sentence being input: obtains prosody information of the text sentence; segments the text sentence into a plurality of sentence elements; obtains, in parallel, speech in which the prosody information is reflected for each of the sentence elements by inputting the sentence elements and the prosody information to a text-to-speech (TTS) module; and merges the speech for the sentence elements obtained in parallel to output speech for the whole text sentence.
Type: Grant
Filed: February 24, 2022
Date of Patent: November 26, 2024
Assignee: SAMSUNG ELECTRONICS CO., LTD.
Inventors: Jonghoon Jeong, Hosang Sung, Doohwa Hong, Kyoungbo Min, Eunmi Oh, Kihyun Choo
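The segment / synthesize-in-parallel / merge pipeline can be sketched with a thread pool; the comma-based segmentation and the string-returning stand-in for the TTS module are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize(element, prosody):
    """Stand-in for the TTS module: returns fake audio for one element."""
    return f"<audio:{element}|{prosody}>"

def speak(sentence, prosody="neutral"):
    elements = sentence.split(", ")        # naive segmentation (assumption)
    with ThreadPoolExecutor() as pool:     # per-element synthesis in parallel
        clips = list(pool.map(lambda e: synthesize(e, prosody), elements))
    return "".join(clips)                  # merge in the original order

print(speak("open the door, turn on the light, play some music"))
```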
-
Patent number: 12149781
Abstract: Systems and methods for determining whether a first electronic device detects a media item that is to be output by a second electronic device are described herein. In some embodiments, an individual may request, using a first electronic device, that a media item be played on a second electronic device. The backend system may send first audio data representing a first response to the first electronic device, along with instructions to delay outputting the first response and to continue sending audio data of additional audio captured thereby. The backend system may also send second audio data representing a second response to the second electronic device along with the media item. Text data may be generated representing the captured audio, which may then be compared with text data representing the second response to determine whether or not they match.
Type: Grant
Filed: September 16, 2022
Date of Patent: November 19, 2024
Assignee: Amazon Technologies, Inc.
Inventor: Dennis Francis Cwik
-
Patent number: 12147646
Abstract: Disclosed herein are system, method, and computer program product embodiments for unifying graphical user interface (GUI) displays across different device types. In an embodiment, a unification system may convert a GUI view appearing on, for example, a desktop device into a GUI view on a mobile device. Both devices may be accessing the same application and/or may use a cloud computing platform to access the application. The unification system may aid in reproducing GUI modifications performed on one user device onto other user devices. In this manner, the unification system may maintain a consistent look and feel for a user across different computing device types.
Type: Grant
Filed: December 14, 2018
Date of Patent: November 19, 2024
Assignee: Salesforce, Inc.
Inventors: Eric Jacobson, Michael Gonzalez, Wayne Cho, Adheip Varadarajan, John Vollmer, Benjamin Snyder
-
Patent number: 12142257
Abstract: Systems and methods are provided for emotion-based text to speech. The systems and methods perform operations comprising: accessing a text string; storing a plurality of embeddings associated with a plurality of speakers, a first embedding for a first speaker being associated with a first emotion and a second embedding for a second speaker being associated with a second emotion; selecting the first speaker to speak one or more words of the text string; determining that the one or more words are associated with the second emotion; generating, based on the first embedding and the second embedding, a third embedding for the first speaker associated with the second emotion; and applying the third embedding and the text string to a vocoder to generate an audio stream comprising the one or more words spoken by the first speaker with the second emotion.
Type: Grant
Filed: February 8, 2022
Date of Patent: November 12, 2024
Assignee: SNAP INC.
Inventors: Liron Harazi, Jacob Assa, Alan Bekker
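One plausible reading of deriving the third embedding from the first two is simple vector arithmetic in the embedding space; the additive emotion offset below is an assumption for illustration, not the disclosed construction:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {("alice", "neutral"): rng.standard_normal(16),
       ("bob", "neutral"): rng.standard_normal(16),
       ("bob", "angry"): rng.standard_normal(16)}

# transfer bob's "angry" offset onto alice (assumed additive structure)
angry_offset = emb[("bob", "angry")] - emb[("bob", "neutral")]
emb[("alice", "angry")] = emb[("alice", "neutral")] + angry_offset
# the new embedding plus the text string would then be fed to the vocoder
```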
-
Patent number: 12142256
Abstract: An information processing device according to embodiments includes a communication unit configured to receive audio data of content and text data corresponding to the audio data, an audio data reproduction unit configured to reproduce the audio data, a text data reproduction unit configured to reproduce the text data by audio synthesis, and a controller that controls the reproduction of the audio data or the text data. The controller causes the text data reproduction unit to reproduce the text data when the audio data reproduction unit is unable to reproduce the audio data.
Type: Grant
Filed: October 20, 2023
Date of Patent: November 12, 2024
Assignee: TOYOTA JIDOSHA KABUSHIKI KAISHA
Inventor: Jun Tsukamoto
-
Patent number: 12136259
Abstract: A method for training a neural network, including: determining a neural network; training the neural network at a first learning rate according to a first optimization mode, where the first learning rate is updated each time the neural network is trained; mapping the first learning rate of the first optimization mode to a second learning rate of a second optimization mode in the same vector space; determining that the second learning rate satisfies a preset update condition; and continuing to train the neural network at the second learning rate according to the second optimization mode.
Type: Grant
Filed: August 20, 2020
Date of Patent: November 5, 2024
Assignee: BIGO TECHNOLOGY PTE. LTD.
Inventors: Wei Xiang, Chao Pei
-
Patent number: 12100386
Abstract: A speech processing system for generating translated speech, the system comprising: an input for receiving a first speech signal comprising a second language; an output for outputting a second speech signal comprising a first language; and a processor configured to: generate a first text signal from a segment of the first speech signal, the first text signal comprising the second language; generate a second text signal from the first text signal, the second text signal comprising the first language; extract a plurality of first feature vectors from the segment of the first speech signal, wherein the first feature vectors comprise information relating to audio data corresponding to the segment of the first speech signal; generate a speaker vector using a first trained algorithm taking one or more of the first feature vectors as input, wherein the speaker vector represents a set of features corresponding to a speaker; generate a second speech signal segment using a second trained algorithm taking information re…
Type: Grant
Filed: March 13, 2019
Date of Patent: September 24, 2024
Assignee: Papercup Technologies Limited
Inventor: Jiameng Gao
-
Patent number: 12096068
Abstract: A method and a device for enabling spatio-temporal navigation of content. In response to a request, from a client, for content comprising a first content of a first type, the device transmits to the client: the first content; a second content of a second type generated from the first content; synchronization metadata associating each word of one of the contents with a time marker into the other content; and a script, for execution at the client, configured to re-establish at least one of the contents and offer navigation of said content, depending on the type of content re-established, by using at least one of the first content, the second content, and the synchronization metadata.
Type: Grant
Filed: December 13, 2019
Date of Patent: September 17, 2024
Assignee: ORANGE
Inventors: Fabrice Boudin, Frederic Herledan
-
Patent number: 12087270
Abstract: Techniques are described for generating customized synthetic voices personalized to a user, based on user-provided feedback. A system may determine embedding data representing a user-provided description of a desired synthetic voice and profile data associated with the user, and generate synthetic voice embedding data using the embedding data of a profile determined to be similar to the current user's. Based on user-provided feedback on a customized synthetic voice (generated using the voice characteristics corresponding to the synthetic voice embedding data and presented to the user) and on that embedding data, the system may generate new synthetic voice embedding data corresponding to a new customized synthetic voice. The system may assign the customized synthetic voice to the user, such that a subsequent user is not presented with the same customized synthetic voice.
Type: Grant
Filed: September 29, 2022
Date of Patent: September 10, 2024
Assignee: Amazon Technologies, Inc.
Inventors: Sebastian Dariusz Cygert, Daniel Korzekwa, Kamil Pokora, Piotr Tadeusz Bilinski, Kayoko Yanagisawa, Abdelhamid Ezzerg, Thomas Edward Merritt, Raghu Ram Sreepada Srinivas, Nikhil Sharma
-
Patent number: 12087273
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
Type: Grant
Filed: January 30, 2023
Date of Patent: September 10, 2024
Assignee: Google LLC
Inventors: Yu Zhang, Ron J. Weiss, Byungha Chun, Yonghui Wu, Zhifeng Chen, Russell John Wyatt Skerry-Ryan, Ye Jia, Andrew M. Rosenberg, Bhuvana Ramabhadran
-
Patent number: 12087272
Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
Type: Grant
Filed: December 13, 2019
Date of Patent: September 10, 2024
Assignee: Google LLC
Inventors: Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Yu Zhang
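The auxiliary phone-label loss can be sketched in PyTorch as a linear mapping head whose cross-entropy is added to the feature loss; the dimensions, the MSE feature loss, and the random stand-in tensors are assumptions:

```python
import torch
import torch.nn as nn

class PhoneLabelMapper(nn.Module):
    """Maps predicted acoustic features to phone-label logits."""
    def __init__(self, feat_dim, num_phones):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_phones)

    def forward(self, feats):
        return self.proj(feats)

feat_dim, num_phones, steps = 80, 40, 120
mapper = PhoneLabelMapper(feat_dim, num_phones)

pred_feats = torch.randn(steps, feat_dim, requires_grad=True)  # stand-in TTS output
ref_feats = torch.randn(steps, feat_dim)                       # reference audio features
ref_phones = torch.randint(0, num_phones, (steps,))            # aligned reference labels

feat_loss = nn.functional.mse_loss(pred_feats, ref_feats)
phone_loss = nn.functional.cross_entropy(mapper(pred_feats), ref_phones)
(feat_loss + phone_loss).backward()  # update the TTS model on the combined loss
```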
-
Patent number: 12076850
Abstract: A monitor is installed in an eye of a robot, and an eye image is displayed on the monitor. The robot extracts a feature quantity of a user's eye from a filmed image of the user, and that feature quantity is reflected in the eye image. Example feature quantities are the size of the pupillary region and pupil image, and the form of the eyelid image. A blinking frequency or the like may also be reflected as a feature quantity. Familiarity with respect to each user is set, and which user's feature quantity is to be reflected may be determined in accordance with the familiarity.
Type: Grant
Filed: April 7, 2023
Date of Patent: September 3, 2024
Assignee: GROOVE X, INC.
Inventor: Kaname Hayashi
-
Patent number: 12080270
Abstract: An apparatus for synthesizing speech according to an embodiment is a computing apparatus that includes one or more processors and a memory storing one or more programs executed by the one or more processors. The apparatus includes a pre-processing module that marks a preset classification symbol on each input unit text, and a speech synthesis module that receives each unit text marked with the classification symbol and synthesizes speech uttering the unit text based on it.
Type: Grant
Filed: December 22, 2020
Date of Patent: September 3, 2024
Assignee: DEEPBRAIN AI INC.
Inventors: Gyeongsu Chae, Dalhyun Kim
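The pre-processing step reduces to prepending a marker to each unit text; a tiny sketch with an assumed `[SEP]` symbol:

```python
def preprocess(unit_texts, symbol="[SEP]"):
    """Mark each input unit text with a preset classification symbol."""
    return [f"{symbol} {u.strip()}" for u in unit_texts]

print(preprocess(["Hello there.", "How are you?"]))
# ['[SEP] Hello there.', '[SEP] How are you?']
```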
-
Patent number: 12080272
Abstract: A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230), and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable, and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.
Type: Grant
Filed: December 10, 2019
Date of Patent: September 3, 2024
Assignee: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
-
Patent number: 12080089
Abstract: A computer-implemented method, a computer system, and a computer program product enhance machine translation of a document. The method includes capturing an image of the document, which contains a plurality of characters arranged in a character layout. The method also includes classifying the image by document type based on the character layout, and determining a strategy for an intelligent character recognition (ICR) algorithm based on the character layout of the image. Lastly, the method includes generating a translated document by applying the ICR algorithm to the plurality of characters in the image using the strategy. The translated document includes a plurality of translated characters arranged in the same character layout.
Type: Grant
Filed: December 8, 2021
Date of Patent: September 3, 2024
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Barton Wayne Emanuel, Nadiya Kochura, Su Liu, Tetsuya Shimada
-
Patent number: 12073838
Abstract: A speech-processing system may provide access to multiple virtual assistants via one or more voice-controlled devices. Each assistant may leverage the language processing and language generation features of the speech-processing system while handling different commands and/or providing access to different back-end applications. Each assistant may be associated with its own voice and/or speech style, and thus be perceived as having a particular "personality." Different assistants may be available for use with a particular voice-controlled device based on time, location, the particular user, etc. In some situations, language processing may be improved by leveraging data such as intent data, grammars, lexicons, and entities associated with the assistants available for use with the particular voice-controlled device.
Type: Grant
Filed: December 7, 2020
Date of Patent: August 27, 2024
Assignee: Amazon Technologies, Inc.
Inventors: Naveen Bobbili, David Henry, Mark Vincent Mattione, Richard Du, Jyoti Chhabra
-
Patent number: 12067130
Abstract: The disclosed exemplary embodiments include computer-implemented systems, devices, apparatuses, and processes that maintain data confidentiality in communications involving voice-enabled devices in a distributed computing environment using homomorphic encryption. By way of example, an apparatus may receive encrypted command data from a computing system, decrypt the encrypted command data using a homomorphic private key, and perform operations that associate the decrypted command data with a request for an element of data. Using a public cryptographic key associated with a device, the apparatus may generate an encrypted response that includes the requested data element and transmit the encrypted response to the device. The device may decrypt the encrypted response using a private cryptographic key and perform operations that present first audio content representative of the requested data element through an acoustic interface.
Type: Grant
Filed: November 12, 2021
Date of Patent: August 20, 2024
Assignee: The Toronto-Dominion Bank
Inventors: Alexey Shpurov, Milos Dunjic, Brian Andrew Lam
-
Patent number: 12062357
Abstract: A method of registering an attribute in a speech synthesis model, an apparatus for registering an attribute in a speech synthesis model, an electronic device, and a medium are provided, which relate to the field of artificial intelligence technology, including deep learning and intelligent speech technology. The method includes: acquiring a plurality of data associated with an attribute to be registered; and registering the attribute in the speech synthesis model by using the plurality of data associated with the attribute, wherein the speech synthesis model is trained in advance by using training data in a training data set.
Type: Grant
Filed: November 16, 2021
Date of Patent: August 13, 2024
Assignee: Beijing Baidu Netcom Science Technology Co., Ltd.
Inventors: Wenfu Wang, Xilei Wang, Tao Sun, Han Yuan, Zhengkun Gao, Lei Jia
-
Patent number: 12057106
Abstract: A method of authorizing content for use, e.g., in association with a conversational bot. The method begins by configuring a conversational bot using a machine learning model trained to classify utterances into topics. Utterances that are not recognized by the machine learning model (e.g., according to some configurable threshold) are then identified. Using a clustering algorithm, one or more of the identified utterances are then processed into a grouping. Information identifying a topic associated with the grouping is then received and, in response, the machine learning model is updated to include the topic.
Type: Grant
Filed: March 15, 2023
Date of Patent: August 6, 2024
Assignee: Drift.com, Inc.
Inventors: Maria C. Moya, Natalie Duerr, Jeffrey D. Orkin, Jane S. Taschman, Carolina M. Caprile, Christopher M. Ward