Image To Speech Patents (Class 704/260)
-
Patent number: 12238451
Abstract: Embodiments are disclosed for predicting, using neural networks, editing operations for application to a video sequence based on processing conversational messages by a video editing system. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving an input including a video sequence and text sentences, the text sentences describing a modification to the video sequence; mapping, by a first neural network, content of the text sentences describing the modification to the video sequence to a candidate editing operation; processing, by a second neural network, the video sequence to predict parameter values for the candidate editing operation; and generating a modified video sequence by applying the candidate editing operation with the predicted parameter values to the video sequence.
Type: Grant
Filed: November 14, 2022
Date of Patent: February 25, 2025
Assignee: Adobe Inc.
Inventors: Uttaran Bhattacharya, Gang Wu, Viswanathan Swaminathan, Stefano Petrangeli
-
Patent number: 12229516
Abstract: A data processing system receives a natural language communication including an ordered sequence of a plurality of word spellings of a natural human language. A processor of the data processing system parses the plurality of word spellings of the natural language communication utilizing constraint-based parsing to identify a plurality of satisfied parsing constraints. The processor performs semantic analysis based on the plurality of satisfied parsing constraints. Performing semantic analysis includes obtaining at least mid-level comprehension of the natural language communication by identifying in the natural language communication, utilizing constraints, at least one of the following: a clausal structure within the natural language communication, a sentence structure of a sentence in the natural language communication, an implied topic of the natural language communication, and a classical linguistic role in the natural language communication.
Type: Grant
Filed: June 19, 2023
Date of Patent: February 18, 2025
Inventor: Thomas A. Visel
-
Patent number: 12225284
Abstract: Systems, methods, devices and non-transitory, computer-readable storage mediums are disclosed for a wearable multimedia device and cloud computing platform with an application ecosystem for processing multimedia data captured by the wearable multimedia device. In an embodiment, a method comprises: receiving, by one or more processors of a cloud computing platform, context data from a wearable multimedia device, the wearable multimedia device including at least one data capture device for capturing the context data; creating a data processing pipeline with one or more applications based on one or more characteristics of the context data and a user request; processing the context data through the data processing pipeline; and sending output of the data processing pipeline to the wearable multimedia device or other device for presentation of the output.
Type: Grant
Filed: March 27, 2023
Date of Patent: February 11, 2025
Assignee: Humane, Inc.
Inventors: Imran A. Chaudhri, Bethany Bongiorno, Shahzad Chaudhri
-
Patent number: 12217755
Abstract: A voice conversion device is provided with a linguistic information extraction unit that extracts linguistic information corresponding to utterance content from a conversion source voice signal, an appearance feature extraction unit that extracts appearance features expressing features related to the look of a person's face from a captured image of the person, and a converted voice generation unit that generates a converted voice on a basis of the linguistic information and the appearance features.
Type: Grant
Filed: September 4, 2020
Date of Patent: February 4, 2025
Assignee: Nippon Telegraph and Telephone Corporation
Inventors: Hirokazu Kameoka, Ko Tanaka, Yasunori Oishi, Takuhiro Kaneko, Aaron Valero Puche
-
Patent number: 12217002
Abstract: Apparatuses, systems, and techniques to parse textual data using parallel computing devices. In at least one embodiment, text is parsed by a plurality of parallel processing units using a finite state machine and logical stack to convert the text to a tree data structure. Data is extracted from the tree by the plurality of parallel processing units and stored.
Type: Grant
Filed: May 11, 2022
Date of Patent: February 4, 2025
Assignee: NVIDIA Corporation
Inventors: Elias Stehle, Gregory Michael Kimball
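As a rough illustration of the FSM-plus-logical-stack idea, the sketch below converts nested bracket text into a tree using an explicit stack. It is sequential, whereas the patent targets parallel processing units, and all names (`Node`, `parse_to_tree`, the bracket grammar) are hypothetical, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    value: str
    children: list = field(default_factory=list)

def parse_to_tree(text: str) -> Node:
    """Parse nested bracket notation like 'a(b,c(d))' into a tree
    using an explicit stack of currently-open nodes."""
    root = Node("<root>")
    stack = [root]  # logical stack: top is the node being filled
    token = ""
    for ch in text:
        if ch == "(":
            node = Node(token)            # token names the new subtree
            stack[-1].children.append(node)
            stack.append(node)            # descend into it
            token = ""
        elif ch == ",":
            if token:
                stack[-1].children.append(Node(token))
            token = ""
        elif ch == ")":
            if token:
                stack[-1].children.append(Node(token))
            token = ""
            stack.pop()                   # return to the parent node
        else:
            token += ch
    if token:
        root.children.append(Node(token))
    return root
```

A parallel version would have each processing unit run the finite state machine over a chunk of the input and then reconcile the stack states at chunk boundaries.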
-
Patent number: 12197862
Abstract: Disclosed is a method for identifying a word corresponding to a target word in text information, which is performed by one or more processors of a computing device. The method may include: determining a target word; determining a threshold for an edit distance associated with the target word; determining a word of which the edit distance from the target word among words included in text information is equal to or less than the threshold; and identifying the word corresponding to the target word based on the determined word.
Type: Grant
Filed: November 3, 2022
Date of Patent: January 14, 2025
Assignee: ActionPower Corp.
Inventor: Seongmin Park
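The edit-distance matching described above can be sketched with the standard Levenshtein dynamic program; the function names and the way the threshold is applied are illustrative assumptions, not details from the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

def find_matches(target: str, words, threshold: int):
    """Return the words whose edit distance from the target word
    is equal to or less than the threshold."""
    return [w for w in words if edit_distance(target, w) <= threshold]
```

For example, with a threshold of 2, `find_matches("color", ["colour", "colt", "tint"], 2)` keeps `"colour"` (one insertion) and `"colt"` (one substitution plus one deletion) but rejects `"tint"`.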
-
Patent number: 12197812
Abstract: Systems, methods, and devices may generate speech files that reflect emotion of text-based content. An example process includes selecting a first text from a first source of text content and selecting a second text from a second source of text content. The first text and the second text are aggregated into an aggregated text, and the aggregated text includes a first emotion associated with content of the first text. The aggregated text also includes a second emotion associated with content of the second text. The aggregated text is converted into a speech stored in an audio file. The speech replicates human expression of the first emotion and of the second emotion.
Type: Grant
Filed: April 13, 2023
Date of Patent: January 14, 2025
Assignee: DISH Technologies L.L.C.
Inventor: John C. Calef, III
-
Patent number: 12190018
Abstract: A system and method for generating, triggering, and playing a sequence of audio files with cues for delivering a presentation, for a presenter using a personal audio device coupled to a computing device. The system comprises a computing device that is coupled to a presentation data analysis server through a network. The method includes (i) generating a sequence of audio files with cues for delivering a presentation, (ii) triggering playing of an audio file from the sequence of audio files, and (iii) playing the sequence of audio files one by one, on the computing device, using the personal audio device coupled to the computing device, to enable the presenter to recall and speak the content based on the sequence of audio files.
Type: Grant
Filed: January 4, 2021
Date of Patent: January 7, 2025
Inventor: Arjun Karthik Bala
-
Patent number: 12190872
Abstract: An electronic apparatus includes: a memory storing one or more instructions; and a processor connected to the memory and configured to control the electronic apparatus, wherein the processor is configured, by executing the one or more instructions, to: identify a first intention word and a first target word from first speech; acquire second speech received after the first speech based on at least one of the identified first intention word or the identified first target word not matching a word stored in the memory; acquire a similarity between the first speech and the second speech; and acquire response information based on the first speech and the second speech based on the similarity being a threshold value or more.
Type: Grant
Filed: April 5, 2022
Date of Patent: January 7, 2025
Assignee: Samsung Electronics Co., Ltd.
Inventors: Ali Ahmad Adel Al Massaeed, Abdelrahman Mustafa Mahmoud Yaseen, Shrouq Walid Jamil Sabah
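One plausible way to realize the similarity-versus-threshold test described above is cosine similarity over utterance embeddings. The abstract does not specify the similarity metric or the threshold value, so both are assumptions here:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_restatement(first_emb, second_emb, threshold=0.8):
    """Treat the second utterance as a restatement of the first when
    their embeddings are at least `threshold`-similar, in which case a
    response would be built from both utterances together."""
    return cosine_similarity(first_emb, second_emb) >= threshold
```

With identical embeddings the similarity is 1.0 (clearly over any reasonable threshold); with orthogonal embeddings it is 0.0 and the second utterance would be handled independently.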
-
Patent number: 12183340
Abstract: Systems and methods of intent identification for customized dialogue support in virtual environments are provided. Dialogue intent models stored in memory may each specify one or more intents, each associated with a dialogue filter. Input data may be received from a user device of a user. Such input data may be captured during an interactive session of an interactive title that provides a virtual environment to the user device. The input data may be analyzed based on the intent models in response to a detected dialogue trigger and may be determined to correspond to one of the stored intents. The dialogue filter associated with the determined intent may be applied to a plurality of available dialogue outputs associated with the detected dialogue trigger. A customized dialogue output may be generated in accordance with a filtered subset of the available dialogue outputs.
Type: Grant
Filed: July 21, 2022
Date of Patent: December 31, 2024
Assignees: SONY INTERACTIVE ENTERTAINMENT LLC, SONY INTERACTIVE ENTERTAINMENT INC.
Inventors: Benaisha Patel, Alessandra Luizello, Olga Rudi
-
Patent number: 12183320
Abstract: A method for generating synthetic speech for text through a user interface is provided. The method may include receiving one or more sentences, determining a speech style characteristic for the received one or more sentences, and outputting a synthetic speech for the one or more sentences that reflects the determined speech style characteristic. The one or more sentences and the determined speech style characteristic may be inputted to an artificial neural network text-to-speech synthesis model, and the synthetic speech may be generated based on the speech data outputted from the artificial neural network text-to-speech synthesis model.
Type: Grant
Filed: January 20, 2021
Date of Patent: December 31, 2024
Assignee: NEOSAPIENCE, INC.
Inventors: Taesu Kim, Younggun Lee
-
Patent number: 12175394
Abstract: The present invention is directed to interactive training, and in particular, to methods and systems for computerized interactive skill training. An example embodiment provides a method and system for providing skill training using a computerized system. The computerized system receives a selection of a first training subject. Several related training components can be invoked, such as reading, watching, performing, and/or reviewing components. In addition, a scored challenge session is provided, wherein a training challenge is provided to a user via a terminal, optionally in video form.
Type: Grant
Filed: April 12, 2023
Date of Patent: December 24, 2024
Assignee: Breakthrough PerformanceTech, LLC
Inventors: Martin L. Cohen, Edward G. Brown
-
Patent number: 12165647
Abstract: A computer-implemented method is disclosed. A search query of a text transcription is received. The search query includes a word or words having a specified spelling. A sequence of search phonemes corresponding to the specified spelling is generated. A sequence of transcript phonemes corresponding to the text transcription is generated from the text transcription. A search alignment in which the sequence of search phonemes is aligned to a transcript phoneme fragment is generated. Based at least on the search alignment having a quality score exceeding a quality score threshold, the transcript phoneme fragment and an associated portion of the text transcription is determined to result from an utterance of the specified spelling in an audio session corresponding to the text transcription. A search result indicating that the transcript phoneme fragment and the associated portion of the text transcription is determined to have resulted from the utterance is output.
Type: Grant
Filed: May 27, 2022
Date of Patent: December 10, 2024
Assignee: Microsoft Technology Licensing, LLC
Inventor: Yuchen Li
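A minimal sketch of the phoneme-alignment-with-quality-threshold idea: slide the search phoneme sequence over the transcript phonemes, score each window, and accept the best window only if its score clears the threshold. The scoring function here (`difflib` similarity ratio) and the threshold value are stand-in assumptions; the patent does not disclose its actual scoring:

```python
from difflib import SequenceMatcher

def best_alignment(search_phonemes, transcript_phonemes):
    """Score every window of the transcript against the search phonemes
    and return the best-scoring (start, end) span with its score."""
    n = len(search_phonemes)
    best_score, best_span = 0.0, (0, 0)
    for start in range(max(1, len(transcript_phonemes) - n + 1)):
        window = transcript_phonemes[start:start + n]
        score = SequenceMatcher(None, search_phonemes, window).ratio()
        if score > best_score:
            best_score, best_span = score, (start, start + n)
    return best_span, best_score

def phoneme_search(search_phonemes, transcript_phonemes, quality_threshold=0.75):
    """Return the matching transcript span, or None if no alignment
    clears the quality-score threshold."""
    span, score = best_alignment(search_phonemes, transcript_phonemes)
    return span if score >= quality_threshold else None
```

Searching for the phonemes of "phoneme" inside a longer phoneme transcript returns the exact span when present and `None` when nothing similar occurs.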
-
Patent number: 12154563
Abstract: An electronic apparatus, based on a text sentence being input, obtains prosody information of the text sentence, segments the text sentence into a plurality of sentence elements, obtains a speech in which prosody information is reflected to each of the plurality of sentence elements in parallel by inputting the plurality of sentence elements and the prosody information of the text sentence to a text to speech (TTS) module, and merges the speech for the plurality of sentence elements that are obtained in parallel to output speech for the text sentence.
Type: Grant
Filed: February 24, 2022
Date of Patent: November 26, 2024
Assignee: SAMSUNG ELECTRONICS CO., LTD.
Inventors: Jonghoon Jeong, Hosang Sung, Doohwa Hong, Kyoungbo Min, Eunmi Oh, Kihyun Choo
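The segment / synthesize-in-parallel / merge pipeline can be sketched as below. `synthesize_element` is a hypothetical stand-in for the TTS module call (a real one would return waveform frames conditioned on the prosody), and splitting on commas is an illustrative segmentation rule:

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize_element(element: str, prosody: dict) -> list:
    """Stand-in for the TTS module: returns fake 'audio' chunks tagged
    with the sentence-level prosody they were conditioned on."""
    return [f"{element}@{prosody['pitch']}"]

def synthesize_sentence(sentence: str, prosody: dict) -> list:
    # 1. Segment the sentence into elements (illustrative rule).
    elements = [e.strip() for e in sentence.split(",") if e.strip()]
    # 2. Synthesize every element in parallel, each conditioned on the
    #    prosody information of the whole sentence.
    with ThreadPoolExecutor() as pool:
        chunks = list(pool.map(lambda e: synthesize_element(e, prosody), elements))
    # 3. Merge the per-element audio back into one stream; pool.map
    #    preserves input order, so the merge is a simple concatenation.
    audio = []
    for chunk in chunks:
        audio.extend(chunk)
    return audio
```

Conditioning each element on sentence-level prosody is what lets the parallel chunks sound coherent once merged, rather than like independently spoken fragments.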
-
Patent number: 12147646
Abstract: Disclosed herein are system, method, and computer program product embodiments for unifying graphical user interface (GUI) displays across different device types. In an embodiment, a unification system may convert various GUI views appearing on, for example, a desktop device into a GUI view on a mobile device. Both devices may be accessing the same application and/or may use a cloud computing platform to access the application. The unification system may aid in reproducing GUI modifications performed on one user device on other user devices. In this manner, the unification system may maintain a consistent look and feel for a user across different computing device types.
Type: Grant
Filed: December 14, 2018
Date of Patent: November 19, 2024
Assignee: Salesforce, Inc.
Inventors: Eric Jacobson, Michael Gonzalez, Wayne Cho, Adheip Varadarajan, John Vollmer, Benjamin Snyder
-
Patent number: 12149781
Abstract: Systems and methods for determining whether a first electronic device detects a media item that is to be output by a second electronic device are described herein. In some embodiments, an individual may request, using a first electronic device, that a media item be played on a second electronic device. The backend system may send first audio data representing a first response to the first electronic device, along with instructions to delay outputting the first response, as well as to continue sending audio data of additional audio captured thereby. The backend system may also send second audio data representing a second response to the second electronic device along with the media item. Text data may be generated representing the captured audio, which may then be compared with text data representing the second response to determine whether or not they match.
Type: Grant
Filed: September 16, 2022
Date of Patent: November 19, 2024
Assignee: Amazon Technologies, Inc.
Inventor: Dennis Francis Cwik
-
Patent number: 12142257
Abstract: Systems and methods are provided for providing emotion-based text to speech. The systems and methods perform operations comprising: accessing a text string; storing a plurality of embeddings associated with a plurality of speakers, a first embedding for a first speaker being associated with a first emotion and a second embedding for a second speaker of the plurality of speakers being associated with a second emotion; selecting the first speaker to speak one or more words of the text string; determining that the one or more words are associated with the second emotion; generating, based on the first embedding and the second embedding, a third embedding for the first speaker associated with the second emotion; and applying the third embedding and the text string to a vocoder to generate an audio stream comprising the one or more words being spoken by the first speaker with the second emotion.
Type: Grant
Filed: February 8, 2022
Date of Patent: November 12, 2024
Assignee: SNAP INC.
Inventors: Liron Harazi, Jacob Assa, Alan Bekker
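One simple way to realize "generate a third embedding based on the first and second embeddings" is linear interpolation in embedding space, pulling the first speaker's embedding toward the second speaker's emotion-bearing embedding. The combination function and the `alpha` knob are assumptions; the abstract does not specify how the embeddings are combined:

```python
def transfer_emotion(speaker_emb, emotion_emb, alpha=0.5):
    """Blend the first speaker's embedding toward a second,
    emotion-bearing embedding. alpha=0 keeps the original speaker
    embedding unchanged; alpha=1 replaces it entirely."""
    return [(1 - alpha) * s + alpha * e
            for s, e in zip(speaker_emb, emotion_emb)]
```

The resulting vector would then be fed, together with the text, to the vocoder in place of the first speaker's original embedding. More sophisticated schemes might instead add an isolated "emotion direction" (the difference between a speaker's emotional and neutral embeddings) to better preserve speaker identity.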
-
Patent number: 12142256
Abstract: An information processing device according to embodiments includes a communication unit configured to receive audio data of content and text data corresponding to the audio data, an audio data reproduction unit configured to perform reproduction of the audio data, a text data reproduction unit configured to perform the reproduction by audio synthesis of the text data, and a controller that controls the reproduction of the audio data or the text data. The controller causes the text data reproduction unit to perform the reproduction of the text data when the audio data reproduction unit is unable to perform the reproduction of the audio data.
Type: Grant
Filed: October 20, 2023
Date of Patent: November 12, 2024
Assignee: TOYOTA JIDOSHA KABUSHIKI KAISHA
Inventor: Jun Tsukamoto
-
Patent number: 12136259
Abstract: A method for training a neural network, including: determining a neural network; training the neural network at a first learning rate according to a first optimization mode, where the first learning rate is updated each time the neural network is trained; mapping the first learning rate of the first optimization mode to a second learning rate of a second optimization mode in the same vector space; determining the second learning rate satisfies a preset update condition; and continuing to train the neural network at the second learning rate according to the second optimization mode.
Type: Grant
Filed: August 20, 2020
Date of Patent: November 5, 2024
Assignee: BIGO TECHNOLOGY PTE. LTD.
Inventors: Wei Xiang, Chao Pei
-
Patent number: 12100386
Abstract: A speech processing system for generating translated speech, the system comprising: an input for receiving a first speech signal comprising a second language; an output for outputting a second speech signal comprising a first language; and a processor configured to: generate a first text signal from a segment of the first speech signal, the first text signal comprising the second language; generate a second text signal from the first text signal, the second text signal comprising the first language; extract a plurality of first feature vectors from the segment of the first speech signal, wherein the first feature vectors comprise information relating to audio data corresponding to the segment of the first speech signal; generate a speaker vector using a first trained algorithm taking one or more of the first feature vectors as input, wherein the speaker vector represents a set of features corresponding to a speaker; generate a second speech signal segment using a second trained algorithm taking information re
Type: Grant
Filed: March 13, 2019
Date of Patent: September 24, 2024
Assignee: Papercup Technologies Limited
Inventor: Jiameng Gao
-
Patent number: 12096068
Abstract: A method and a device for enabling a spatio-temporal navigation of content. In response to a request, from a client, for a content comprising a first content of a first type, the device transmits to the client: the first content; a second content of a second type generated from the first content; synchronization metadata associating each word of one of the contents with a time marker to the other content; and a script, for execution at the client, configured to re-establish at least one of the contents and offer a navigation of said content depending on the type of content re-established by using at least one among: the first content, the second content, and the synchronization metadata.
Type: Grant
Filed: December 13, 2019
Date of Patent: September 17, 2024
Assignee: ORANGE
Inventors: Fabrice Boudin, Frederic Herledan
-
Patent number: 12087270
Abstract: Techniques for generating customized synthetic voices personalized to a user, based on user-provided feedback, are described. A system may determine embedding data representing a user-provided description of a desired synthetic voice and profile data associated with the user, and generate synthetic voice embedding data using synthetic voice embedding data corresponding to a profile associated with a user determined to be similar to the current user. Based on user-provided feedback with respect to a customized synthetic voice, generated using synthetic voice characteristics corresponding to the synthetic voice embedding data and presented to the user, and the synthetic voice embedding data, the system may generate new synthetic voice embedding data corresponding to a new customized synthetic voice. The system may be configured to assign the customized synthetic voice to the user, such that a subsequent user may not be presented with the same customized synthetic voice.
Type: Grant
Filed: September 29, 2022
Date of Patent: September 10, 2024
Assignee: Amazon Technologies, Inc.
Inventors: Sebastian Dariusz Cygert, Daniel Korzekwa, Kamil Pokora, Piotr Tadeusz Bilinski, Kayoko Yanagisawa, Abdelhamid Ezzerg, Thomas Edward Merritt, Raghu Ram Sreepada Srinivas, Nikhil Sharma
-
Patent number: 12087273
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
Type: Grant
Filed: January 30, 2023
Date of Patent: September 10, 2024
Assignee: Google LLC
Inventors: Yu Zhang, Ron J. Weiss, Byungha Chun, Yonghui Wu, Zhifeng Chen, Russell John Wyatt Skerry-Ryan, Ye Jia, Andrew M. Rosenberg, Bhuvana Ramabhadran
-
Patent number: 12087272
Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
Type: Grant
Filed: December 13, 2019
Date of Patent: September 10, 2024
Assignee: Google LLC
Inventors: Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Yu Zhang
-
Patent number: 12080089
Abstract: A computer-implemented method, a computer system and a computer program product enhance machine translation of a document. The method includes capturing an image of the document. The document includes a plurality of characters that are arranged in a character layout. The method also includes classifying the image by a document type based on the character layout. The method further includes determining a strategy for an intelligent character recognition (ICR) algorithm with the image based on the character layout of the image. Lastly, the method includes generating a translated document by applying the intelligent character recognition (ICR) algorithm to the plurality of characters in the image using the strategy. The translated document includes a plurality of translated characters that are arranged in the character layout.
Type: Grant
Filed: December 8, 2021
Date of Patent: September 3, 2024
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Barton Wayne Emanuel, Nadiya Kochura, Su Liu, Tetsuya Shimada
-
Patent number: 12080270
Abstract: An apparatus for synthesizing speech according to an embodiment is a computing apparatus that includes one or more processors and a memory storing one or more programs executed by the one or more processors. The apparatus for synthesizing speech includes a pre-processing module that marks a preset classification symbol on each input unit text; and a speech synthesis module that receives each unit text marked with the classification symbol and synthesizes speech uttering the unit text based on the input unit text.
Type: Grant
Filed: December 22, 2020
Date of Patent: September 3, 2024
Assignee: DEEPBRAIN AI INC.
Inventors: Gyeongsu Chae, Dalhyun Kim
-
Patent number: 12080272
Abstract: A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230) and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable, and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.
Type: Grant
Filed: December 10, 2019
Date of Patent: September 3, 2024
Assignee: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
-
Patent number: 12076850
Abstract: A monitor is installed in an eye of a robot, and an eye image is displayed on the monitor. The robot extracts a feature quantity of a user's eye from a captured image of the user, and the feature quantity is reflected in the eye image. For example, a feature quantity may be the size of a pupillary region or pupil image, or the form of an eyelid image; a blinking frequency or the like may also be reflected as a feature quantity. A familiarity is set for each user, and which user's feature quantity is reflected may be determined in accordance with the familiarity.
Type: Grant
Filed: April 7, 2023
Date of Patent: September 3, 2024
Assignee: GROOVE X, INC.
Inventor: Kaname Hayashi
-
Patent number: 12073838
Abstract: A speech-processing system may provide access to multiple virtual assistants via one or more voice-controlled devices. Each assistant may leverage language processing and language generation features of the speech-processing system, while handling different commands and/or providing access to different backend applications. Each assistant may be associated with its own voice and/or speech style, and thus be perceived as having a particular "personality." Different assistants may be available for use with a particular voice-controlled device based on time, location, the particular user, etc. In some situations, language processing may be improved by leveraging data such as intent data, grammars, lexicons, entities, etc., associated with assistants available for use with the particular voice-controlled device.
Type: Grant
Filed: December 7, 2020
Date of Patent: August 27, 2024
Assignee: Amazon Technologies, Inc.
Inventors: Naveen Bobbili, David Henry, Mark Vincent Mattione, Richard Du, Jyoti Chhabra
-
Patent number: 12067130
Abstract: The disclosed exemplary embodiments include computer-implemented systems, devices, apparatuses, and processes that maintain data confidentiality in communications involving voice-enabled devices in a distributed computing environment using homomorphic encryption. By way of example, an apparatus may receive encrypted command data from a computing system, decrypt the encrypted command data using a homomorphic private key, and perform operations that associate the decrypted command data with a request for an element of data. Using a public cryptographic key associated with a device, the apparatus generates an encrypted response that includes the requested data element, and transmits the encrypted response to the device. The device may decrypt the encrypted response using a private cryptographic key and perform operations that present first audio content representative of the requested data element through an acoustic interface.
Type: Grant
Filed: November 12, 2021
Date of Patent: August 20, 2024
Assignee: The Toronto-Dominion Bank
Inventors: Alexey Shpurov, Milos Dunjic, Brian Andrew Lam
-
Patent number: 12062357
Abstract: A method of registering an attribute in a speech synthesis model, an apparatus of registering an attribute in a speech synthesis model, an electronic device, and a medium are provided, which relate to the field of artificial intelligence technology, such as deep learning and intelligent speech technology. The method includes: acquiring a plurality of data associated with an attribute to be registered; and registering the attribute in the speech synthesis model by using the plurality of data associated with the attribute, wherein the speech synthesis model is trained in advance using training data in a training data set.
Type: Grant
Filed: November 16, 2021
Date of Patent: August 13, 2024
Assignee: Beijing Baidu Netcom Science Technology Co., Ltd.
Inventors: Wenfu Wang, Xilei Wang, Tao Sun, Han Yuan, Zhengkun Gao, Lei Jia
-
Patent number: 12057106
Abstract: A method of authorizing content for use, e.g., in association with a conversational bot. The method begins by configuring a conversational bot using a machine learning model trained to classify utterances into topics. Utterances that are not recognized by the machine learning model (e.g., according to some configurable threshold) are then identified. Using a clustering algorithm, one or more of the identified utterances are then processed into a grouping. Information identifying a topic associated with the grouping is then received and, in response, the machine learning model is updated to include the topic.
Type: Grant
Filed: March 15, 2023
Date of Patent: August 6, 2024
Assignee: Drift.com, Inc.
Inventors: Maria C. Moya, Natalie Duerr, Jeffrey D. Orkin, Jane S. Taschman, Carolina M. Caprile, Christopher M. Ward
-
Patent number: 12056854
Abstract: Embodiments of the present invention provide end-to-end frame time synchronization designed to improve smoothness for displaying images of 3D applications, such as PC gaming applications. Traditionally, an application that renders 3D graphics functions based on the assumption that the average render time will be used as the animation time for a given frame. When this condition is not met, and the render time for a frame does not match the average render time of prior frames, the frames are not captured or displayed at a consistent rate. This invention enables feedback to be provided to the rendering application for adjusting the animation times used to produce new frames, and a post-render queue is used to store completed frames for mitigating stutter and hitches. Flip control is used to sync the display of a rendered frame with the animation time used to generate the frame, thereby producing a smooth, consistent image.
Type: Grant
Filed: March 28, 2022
Date of Patent: August 6, 2024
Assignee: NVIDIA CORPORATION
Inventors: Thomas Albert Petersen, Ankan Banerjee, Shishir Goyal, Sau Yan Keith Li, Lars Nordskog, Rouslan Dimitrov
-
Patent number: 12046236
Abstract: Training data can be received, which can include pairs of speech and meaning representation associated with the speech as ground truth data. The meaning representation includes at least semantic entities associated with the speech, where the spoken order of the semantic entities is unknown. The semantic entities of the meaning representation in the training data can be reordered into spoken order of the associated speech using an alignment technique. A spoken language understanding machine learning model can be trained using the pairs of speech and meaning representation having the reordered semantic entities. The meaning representation, e.g., semantic entities, in the received training data can be perturbed to create random order sequence variations of the semantic entities associated with speech. Perturbed meaning representation with associated speech can augment the training data.
Type: Grant
Filed: August 27, 2021
Date of Patent: July 23, 2024
Assignee: International Business Machines Corporation
Inventors: Hong-Kwang Kuo, Zoltan Tueske, Samuel Thomas, Brian E. D. Kingsbury, George Andrei Saon
-
Patent number: 12039969
Abstract: A speech processing system for generating translated speech, the system comprising: an input for receiving a first speech signal comprising a second language; an output for outputting a second speech signal comprising a first language; and a processor configured to: generate a first text signal from a segment of the first speech signal, the first text signal comprising the second language; generate a second text signal from the first text signal, the second text signal comprising the first language; extract a plurality of first feature vectors from the segment of the first speech signal, wherein the first feature vectors comprise information relating to audio data corresponding to the segment of the first speech signal; generate a speaker vector using a first trained algorithm taking one or more of the first feature vectors as input, wherein the speaker vector represents a set of features corresponding to a speaker; generate a second speech signal segment using a second trained algorithm taking information re…
Type: Grant
Filed: October 28, 2021
Date of Patent: July 16, 2024
Assignee: PAPERCUP TECHNOLOGIES LIMITED
Inventor: Jiameng Gao
-
Patent number: 12026456
Abstract: Systems and methods for using optical character recognition (OCR) with voice recognition commands are provided. Some embodiments include receiving a user interface that includes a text field, capturing an image of at least a portion of the user interface, and performing OCR on the image to identify the text field and a word in the user interface. Some embodiments include mapping a coordinate of the word in the text field, receiving a voice command that includes the word, and navigating, by the computing device, a cursor to the coordinate to execute the voice command.
Type: Grant
Filed: August 7, 2017
Date of Patent: July 2, 2024
Assignee: DOLBEY & COMPANY, INC.
Inventor: Curtis A. Weeks
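The word-to-coordinate mapping this abstract describes can be sketched as a small lookup: each OCR result carries a bounding box, each word maps to its box center, and a voice command naming the word resolves to a cursor target. All names and the command format below are hypothetical:

```python
def build_word_map(ocr_results):
    """Map each OCR-recognized word to the center of its bounding box
    so a voice command naming the word can move the cursor there."""
    word_map = {}
    for word, (x, y, w, h) in ocr_results:
        word_map[word.lower()] = (x + w // 2, y + h // 2)
    return word_map

def navigate(word_map, command):
    # e.g. "click username" -> cursor coordinate for "username"
    target = command.split()[-1].lower()
    return word_map.get(target)

ocr = [("Username", (100, 200, 80, 20)), ("Password", (100, 240, 80, 20))]
wm = build_word_map(ocr)
print(navigate(wm, "click username"))  # (140, 210)
```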
-
Patent number: 12027165
Abstract: A non-transitory computer readable medium stores computer executable instructions which, when executed by at least one processor, cause the at least one processor to acquire a speech signal of speech of a user; perform a signal processing on the speech signal to acquire at least one feature of the speech of the user; and control display of information, related to each of one or more first candidate converters having a feature corresponding to the at least one feature, to present the one or more first candidate converters for selection by the user.
Type: Grant
Filed: July 9, 2021
Date of Patent: July 2, 2024
Assignee: GREE, INC.
Inventor: Akihiko Shirai
-
Patent number: 12014720
Abstract: This application relates to a speech synthesis method and apparatus, a model training method and apparatus, and a computer device. The method includes: obtaining to-be-processed linguistic data; encoding the linguistic data, to obtain encoded linguistic data; obtaining an embedded vector for speech feature conversion, the embedded vector being generated according to a residual between synthesized reference speech data and reference speech data that correspond to the same reference linguistic data; and decoding the encoded linguistic data according to the embedded vector, to obtain target synthesized speech data on which the speech feature conversion is performed. The solution provided in this application can prevent quality of a synthesized speech from being affected by a semantic feature in a mel-frequency cepstrum.
Type: Grant
Filed: August 21, 2020
Date of Patent: June 18, 2024
Assignee: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED
Inventors: Xixin Wu, Mu Wang, Shiyin Kang, Dan Su, Dong Yu
-
Patent number: 12010399
Abstract: Methods, systems, and computer-readable media for generating videos with characters indicating regions of images are provided. For example, an image containing a first region may be received. At least one characteristic of a character may be obtained. A script containing a first segment of the script may be received. The first segment of the script may be related to the first region of the image. The at least one characteristic of a character and the script may be used to generate a video of the character presenting the script and at least part of the image, where the character visually indicates the first region of the image while presenting the first segment of the script.
Type: Grant
Filed: January 17, 2023
Date of Patent: June 11, 2024
Inventors: Ben Avi Ingel, Ron Zass
-
Patent number: 11997344
Abstract: Systems and methods are described herein for generating alternate audio for a media stream. The media system receives media that is requested by the user. The media comprises a video and audio. The audio includes words spoken in a first language. The media system stores the received media in a buffer as it is received. The media system separates the audio from the buffered media and determines an emotional state expressed by spoken words of the first language. The media system translates the words spoken in the first language into words spoken in a second language. Using the translated words of the second language, the media system synthesizes speech having the emotional state previously determined. The media system then retrieves the video of the received media from the buffer and synchronizes the synthesized speech with the video to generate the media content in a second language.
Type: Grant
Filed: October 25, 2021
Date of Patent: May 28, 2024
Assignee: Rovi Guides, Inc.
Inventors: Vijay Kumar, Rajendran Pichaimurthy, Madhusudhan Seetharam
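The pipeline order matters here: emotion is detected on the *source*-language words before translation, and the detected state then conditions synthesis in the target language. A minimal structural sketch, where every callable is a toy stand-in for a real model component:

```python
def generate_alternate_audio(source_words, translate, detect_emotion, synthesize):
    """Pipeline sketch: detect the emotional state of the source-language
    words, translate them, then synthesize target-language speech carrying
    the same emotional state. All callables are hypothetical stand-ins."""
    emotion = detect_emotion(source_words)          # on the original language
    translated = [translate(w) for w in source_words]
    return synthesize(translated, emotion)          # condition synthesis on emotion

# Toy stand-ins for the model components:
lexicon = {"hola": "hello", "amigo": "friend"}
out = generate_alternate_audio(
    ["hola", "amigo"],
    translate=lambda w: lexicon[w],
    detect_emotion=lambda words: "joyful",
    synthesize=lambda words, emotion: f"[{emotion}] " + " ".join(words),
)
print(out)  # [joyful] hello friend
```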
-
Patent number: 11996083
Abstract: A computer-implemented method is provided of using a machine learning model for disentanglement of prosody in spoken natural language. The method includes encoding, by a computing device, the spoken natural language to produce content code. The method further includes resampling, by the computing device without text transcriptions, the content code to obscure the prosody by applying an unsupervised technique to the machine learning model to generate prosody-obscured content code. The method additionally includes decoding, by the computing device, the prosody-obscured content code to synthesize speech indirectly based upon the content code.
Type: Grant
Filed: June 3, 2021
Date of Patent: May 28, 2024
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Jinjun Xiong, Chuang Gan, David Cox
-
Patent number: 11990118
Abstract: During text-to-speech processing, a speech model creates output audio data, including speech, that corresponds to input text data that includes a representation of the speech. A spectrogram estimator estimates a frequency spectrogram of the speech; the corresponding frequency-spectrogram data is used to condition the speech model. A plurality of acoustic features corresponding to different segments of the input text data, such as phonemes, syllable-level features, and/or word-level features, may be separately encoded into context vectors; the spectrogram estimator uses these separate context vectors to create the frequency spectrogram.
Type: Grant
Filed: June 6, 2023
Date of Patent: May 21, 2024
Assignee: Amazon Technologies, Inc.
Inventors: Jaime Lorenzo Trueba, Thomas Renaud Drugman, Viacheslav Klimkov, Srikanth Ronanki, Thomas Edward Merritt, Andrew Paul Breen, Roberto Barra-Chicote
-
Patent number: 11981261
Abstract: A vehicle projection control device includes: a vehicle information acquisition unit that acquires vehicle information including a vehicle speed of a vehicle; an identification information acquisition unit that acquires identification information identifying a tunnel through which the vehicle travels; a virtual vehicle video generation unit that generates a virtual moving body video of a virtual moving body that moves ahead of the vehicle in the same direction as the vehicle, the virtual moving body video being for projection by a projection unit of a head-up display device; and a projection control unit that controls projection of the virtual moving body video, such that a virtual image of the generated virtual moving body video is visually recognized ahead of the vehicle with use of the projection unit. The projection control unit controls the projection of the virtual moving body video based on the acquired identification information.
Type: Grant
Filed: March 6, 2020
Date of Patent: May 14, 2024
Assignee: JVCKENWOOD Corporation
Inventors: Kenichi Matsuo, Makoto Kurihara
-
Patent number: 11977575
Abstract: Techniques for recommending media are described. A character preference function comprising a plurality of preference coefficients is accessed. A first character model comprises a first set of attribute values for the plurality of attributes of a first character. A second character model comprises a second set of attribute values for the plurality of attributes of a second character of the plurality of characters. The first and second characters are associated with a first and second salience value, respectively. A first character rating is calculated using the plurality of preference coefficients and the first set of attribute values. A second character rating of the second character is calculated using the plurality of preference coefficients with the second set of attribute values. A media rating is calculated based on the first and second salience values and the first and second character ratings. A media is recommended based on the media rating.
Type: Grant
Filed: May 12, 2022
Date of Patent: May 7, 2024
Assignee: The Nielsen Company (US), LLC
Inventors: Rachel Payne, Meghana Bhatt, Natasha Mohanty
-
Patent number: 11971926
Abstract: Disclosed computer-based systems and methods for analyzing a plurality of audio files corresponding to text-based news stories and received from a plurality of audio file creators are configured to (i) compare quality and/or accuracy metrics of individual audio files against corresponding quality and/or accuracy thresholds, and (ii) based on the comparison: (a) accept audio files meeting the quality and/or accuracy thresholds for distribution to a plurality of subscribers for playback, (b) reject audio files failing to meet one or more certain quality and/or accuracy thresholds, (c) remediate audio files failing to meet certain quality thresholds, and (d) designate for human review, audio files failing to meet one or more certain quality and/or accuracy thresholds by a predetermined margin.
Type: Grant
Filed: August 17, 2020
Date of Patent: April 30, 2024
Assignee: Gracenote Digital Ventures, LLC
Inventors: Gregory P. Defouw, Venkatarama Anilkumar Panguluri
-
Patent number: 11960852
Abstract: A direct speech-to-speech translation (S2ST) model includes an encoder configured to receive an input speech representation that corresponds to an utterance spoken by a source speaker in a first language and encode the input speech representation into a hidden feature representation. The S2ST model also includes an attention module configured to generate a context vector that attends to the hidden feature representation encoded by the encoder. The S2ST model also includes a decoder configured to receive the context vector generated by the attention module and predict a phoneme representation that corresponds to a translation of the utterance in a second, different language. The S2ST model also includes a synthesizer configured to receive the context vector and the phoneme representation and generate a translated synthesized speech representation that corresponds to a translation of the utterance spoken in the second, different language.
Type: Grant
Filed: December 15, 2021
Date of Patent: April 16, 2024
Assignee: Google LLC
Inventors: Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz
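The abstract describes a four-component composition: encoder, attention module, phoneme decoder, and synthesizer, with the synthesizer consuming both the context vector and the predicted phonemes. A structural sketch of that data flow, with every component a hypothetical toy stand-in rather than a trained module:

```python
def s2st_forward(encode, attend, decode_phonemes, synthesize, speech_input):
    """Composition of the four S2ST components: the synthesizer takes
    both the attention context and the decoded phoneme representation.
    All callables are hypothetical stand-ins for trained modules."""
    hidden = encode(speech_input)            # speech -> hidden feature representation
    context = attend(hidden)                 # attention over the hidden representation
    phonemes = decode_phonemes(context)      # phonemes in the target language
    return synthesize(context, phonemes)     # translated synthesized speech

# Toy stand-ins to show the plumbing:
result = s2st_forward(
    encode=lambda x: [v * 2 for v in x],
    attend=lambda h: sum(h),
    decode_phonemes=lambda c: ["HH", "OW"],
    synthesize=lambda c, p: f"audio({c}, {'-'.join(p)})",
    speech_input=[1, 2, 3],
)
print(result)  # audio(12, HH-OW)
```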
-
Patent number: 11948559
Abstract: Various embodiments include methods and devices for implementing automatic grammar augmentation for improving voice command recognition accuracy in systems with a small footprint acoustic model. Alternative expressions that may capture acoustic model decoding variations may be added to a grammar set. An acoustic model-specific statistical pronunciation dictionary may be derived by running the acoustic model through a large general speech dataset and constructing a command-specific candidate set containing potential grammar expressions. Greedy based and cross-entropy-method (CEM) based algorithms may be utilized to search the candidate set for augmentations with improved recognition accuracy.
Type: Grant
Filed: March 21, 2022
Date of Patent: April 2, 2024
Assignee: QUALCOMM Incorporated
Inventors: Yang Yang, Anusha Lalitha, Jin Won Lee, Christopher Lott
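The greedy variant of the candidate search can be sketched simply: repeatedly add whichever candidate expression most improves measured recognition accuracy, stopping when nothing helps. The function and the toy accuracy measure below are illustrative assumptions, not the patent's algorithm as specified:

```python
def greedy_augment(base_grammar, candidates, accuracy_fn, max_additions=2):
    """Greedily add candidate grammar expressions that most improve
    recognition accuracy (sketch; accuracy_fn stands in for evaluating
    a grammar set against held-out acoustic-model decodings)."""
    grammar = list(base_grammar)
    best = accuracy_fn(grammar)
    for _ in range(max_additions):
        gains = [(accuracy_fn(grammar + [c]), c)
                 for c in candidates if c not in grammar]
        if not gains:
            break
        trial_best, pick = max(gains)
        if trial_best <= best:
            break  # no candidate improves accuracy any further
        grammar.append(pick)
        best = trial_best
    return grammar, best

# Toy accuracy: fraction of observed decoding variants the grammar covers.
target = {"turn on", "turn-on", "turnon"}
acc_fn = lambda g: len(set(g) & target) / len(target)
grammar, acc = greedy_augment(["turn on"], ["turnon", "turn-on", "tern on"], acc_fn)
print(grammar, acc)  # ['turn on', 'turnon', 'turn-on'] 1.0
```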
-
Patent number: 11922194
Abstract: A method of operating a computing device in support of improved accessibility includes displaying a user interface to an application on a display screen of the computing device, wherein the computing device includes an accessibility assistant that reads an audible description of an element of the user interface; initiating, on the computing device, a virtual assistant that conducts an audible conversation between a user and the virtual assistant through at least a microphone and a speaker associated with the computing device, wherein the virtual assistant is not integrated with an operating system of the computing device; inhibiting an ability of the accessibility assistant to read the audible description of the element of the user interface; and upon transition of the virtual assistant from an active state, enabling the ability of the accessibility assistant.
Type: Grant
Filed: May 19, 2022
Date of Patent: March 5, 2024
Assignee: Microsoft Technology Licensing, LLC
Inventors: Jaclyn Carley Knapp, Lisa Stifelman, André Roberto Lima Tapajós, Jin Xu, Steven DiCarlo, Kaichun Wu, Yuhua Guan
-
Patent number: 11915714
Abstract: Methods for modifying audio data include operations for accessing audio data having a first prosody, receiving a target prosody differing from the first prosody, and computing acoustic features representing samples. Computing respective acoustic features for a sample includes computing a pitch feature as a quantized pitch value of the sample by assigning a pitch value, of the target prosody or the audio data, to at least one of a set of pitch bins having equal widths in cents. Computing the respective acoustic features further includes computing a periodicity feature from the audio data. The respective acoustic features for the sample include the pitch feature, the periodicity feature, and other acoustic features. A neural vocoder is applied to the acoustic features to pitch-shift and time-stretch the audio data from the first prosody toward the target prosody.
Type: Grant
Filed: December 21, 2021
Date of Patent: February 27, 2024
Assignees: Adobe Inc., Northwestern University
Inventors: Maxwell Morrison, Juan Pablo Caceres Chomali, Zeyu Jin, Nicholas Bryan, Bryan A. Pardo
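The pitch-bin step is concrete enough to illustrate: a pitch in Hz is converted to cents above a reference frequency (1200 cents per octave), then assigned to one of a set of equal-width bins. The reference frequency, bin width, and bin count below are illustrative assumptions, not values from the patent:

```python
import math

def quantize_pitch_hz(pitch_hz, fmin_hz=50.0, bin_width_cents=25.0, num_bins=256):
    """Assign a pitch value to one of a set of pitch bins having equal
    widths in cents above fmin_hz (sketch; parameters are illustrative)."""
    cents = 1200.0 * math.log2(pitch_hz / fmin_hz)   # distance above fmin in cents
    bin_index = int(cents // bin_width_cents)
    return max(0, min(num_bins - 1, bin_index))       # clamp to the bin range

# One octave above 50 Hz is 1200 cents, i.e. bin 1200 / 25 = 48:
print(quantize_pitch_hz(100.0))  # 48
print(quantize_pitch_hz(50.0))   # 0
```

Equal widths in cents (rather than in Hz) make the bins perceptually uniform, since pitch perception is roughly logarithmic in frequency.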
-
Patent number: 11888797
Abstract: Emoji-first messaging where text messaging is automatically converted to emojis by an emoji-first application so that only emojis are communicated from one client device to another client device. Each client device has a library of emojis that are mapped to words, which libraries are customizable and unique to the users of the client devices, such that the users can communicate secretly in code. Upon receipt of a string of emojis, a user can select the emoji string to convert to text if desired, for a predetermined period of time.
Type: Grant
Filed: April 20, 2021
Date of Patent: January 30, 2024
Assignee: Snap Inc.
Inventors: Karl Bayer, Prerna Chikersal, Shree K. Nayar, Brian Anthony Smith
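The per-user word-to-emoji library amounts to a bidirectional mapping: the sender encodes words into an emoji-only string, and a recipient holding the same library can decode it back on demand. A minimal sketch, with a hypothetical class name and toy library:

```python
class EmojiCodec:
    """Per-user word<->emoji library sketch: outgoing text becomes an
    emoji-only string; a recipient with the same library can decode it."""

    def __init__(self, word_to_emoji):
        self.word_to_emoji = word_to_emoji
        # Invert the library for decoding received emoji strings.
        self.emoji_to_word = {e: w for w, e in word_to_emoji.items()}

    def encode(self, text):
        return "".join(self.word_to_emoji.get(w.lower(), "?") for w in text.split())

    def decode(self, emojis):
        return " ".join(self.emoji_to_word.get(e, "?") for e in emojis)

codec = EmojiCodec({"pizza": "🍕", "tonight": "🌙"})
msg = codec.encode("Pizza tonight")
print(msg)                # 🍕🌙
print(codec.decode(msg))  # pizza tonight
```

Because each user's library is customizable, two users sharing a non-default mapping effectively exchange messages in a private code, as the abstract notes.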