Patents by Inventor Boris Ginsburg

Boris Ginsburg has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20250140236
    Abstract: Systems and methods provide for text normalization or inverse text normalization using a hybrid language system that combines rule-based processing with neural or learned processing. For example, a hybrid rule-based and neural approach identifies semiotic tokens within a textual input and generates a set of potential plain-text conversions of the semiotic tokens. The plain-text conversions are weighted and evaluated by a trained language model that rescores the plain-text conversion based on context to identify a highest scoring plain-text conversion for further processing within a language system pipeline.
    Type: Application
    Filed: December 30, 2024
    Publication date: May 1, 2025
    Inventors: Evelina Bakhturina, Yang Zhang, Boris Ginsburg
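The abstract above outlines a two-stage pipeline: rules propose candidate verbalizations of a semiotic token, and a language model rescores them in context. Below is a minimal sketch of that flow; the rule table, the `lm_rescore` heuristic, and the toy lexicon are hypothetical placeholders rather than the patented implementation, and a real system would rescore with a trained language model.

```python
# Hybrid rule-based + neural text normalization, illustrative sketch only.
import re

NUMBER_WORDS = {"1": "one", "2": "two", "5": "five", "10": "ten"}  # toy lexicon

def candidates(token: str) -> list[str]:
    """Rule-based grammar: propose plain-text conversions for a semiotic token."""
    m = re.match(r"^\$(\d+)$", token)
    if m and m.group(1) in NUMBER_WORDS:
        n = NUMBER_WORDS[m.group(1)]
        return [f"{n} dollars", f"{n} dollar"]       # competing verbalizations
    m = re.match(r"^(\d+)%$", token)
    if m and m.group(1) in NUMBER_WORDS:
        return [f"{NUMBER_WORDS[m.group(1)]} percent"]
    return [token]                                    # no rule matched: pass through

def lm_rescore(left: str, candidate: str, right: str) -> float:
    """Hypothetical rescorer; a real system would use a trained LM's score in context."""
    return float(len(candidate))                      # placeholder heuristic only

def normalize(left: str, token: str, right: str) -> str:
    return max(candidates(token), key=lambda c: lm_rescore(left, c, right))

print(normalize("it costs", "$5", "today"))           # -> "five dollars"
```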
  • Publication number: 20250095652
    Abstract: Disclosed are apparatuses, systems, and techniques that implement training and deployment of speech-augmented language models for efficient capturing and processing of speech inputs. The techniques include processing, using a speech model, an audio input in a first language to generate a first portion of an input into a language model (LM). A second portion of the input into the LM represents a text context associated with the audio input. The techniques further include receiving, from the LM, an output that includes a speech-to-text conversion of the audio input.
    Type: Application
    Filed: January 9, 2024
    Publication date: March 20, 2025
    Inventors: Zhehuai Chen, He Huang, Oleksii Hrinchuk, Andrei Andrusenko, Venkata Naga Krishna Chaitanya Puvvada, Subhankar Ghosh, Jing Yao Li, Jagadeesh Balam, Boris Ginsburg
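A minimal sketch of how the two input portions described above might be composed: a speech model encodes audio into embedding frames (the first portion), text-context token embeddings form the second portion, and the two are concatenated before being passed to the LM. The `speech_encoder` and `embed_text` functions and all dimensions are invented stand-ins, not the patented models.

```python
import numpy as np

D_MODEL = 8

def speech_encoder(audio: np.ndarray) -> np.ndarray:
    """Stand-in speech model: maps raw samples to (frames, D_MODEL)."""
    n_frames = max(1, len(audio) // 160)
    return np.random.randn(n_frames, D_MODEL)

def embed_text(tokens: list[int]) -> np.ndarray:
    """Stand-in embedding table lookup: (tokens, D_MODEL)."""
    table = np.random.randn(1000, D_MODEL)
    return table[tokens]

audio = np.random.randn(16000)                 # 1 s of 16 kHz audio
context_tokens = [17, 42, 256]                 # e.g. a text prompt / context
lm_input = np.concatenate([speech_encoder(audio), embed_text(context_tokens)], axis=0)
print(lm_input.shape)                          # (speech_frames + context_len, D_MODEL)
```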
  • Publication number: 20250078827
    Abstract: One or more embodiments include: receiving a first frame of acoustic input and one or more prior textual tokens associated with a prior frame of the acoustic input, wherein the prior textual token represents one or more spoken words included in the acoustic input; generating a multi-dimensional embedding associated with the prior textual token, wherein each dimension of the embedding represents a different characteristic of the prior textual token, and at least one dimension of the embedding represents pronunciation information associated with the prior textual token; and generating a textual token associated with the first frame based at least on an encoded representation of the first frame and the multi-dimensional embedding associated with the prior textual token.
    Type: Application
    Filed: January 25, 2024
    Publication date: March 6, 2025
    Inventors: Hainan XU, Boris GINSBURG, Zhehuai CHEN, Fei JIA
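A small sketch of the embedding scheme described above: the prior textual token is embedded along multiple dimensions, at least one of which carries pronunciation information, and that embedding is combined with the encoded acoustic frame to predict the next token. The lookup tables, dimensions, and linear prediction head below are hypothetical placeholders.

```python
import numpy as np

VOCAB, PHONES, D_TOK, D_PRON, D_FRAME = 100, 40, 8, 4, 16
token_table = np.random.randn(VOCAB, D_TOK)      # identity characteristics
phone_table = np.random.randn(PHONES, D_PRON)    # pronunciation characteristics
head = np.random.randn(D_FRAME + D_TOK + D_PRON, VOCAB)

def embed_prior(token_id: int, phoneme_ids: list[int]) -> np.ndarray:
    """Multi-dimensional embedding: token identity plus a pronunciation component."""
    pron = phone_table[phoneme_ids].mean(axis=0)
    return np.concatenate([token_table[token_id], pron])

def predict_next(frame_encoding: np.ndarray, prior_id: int, phonemes: list[int]) -> int:
    features = np.concatenate([frame_encoding, embed_prior(prior_id, phonemes)])
    return int(np.argmax(features @ head))

frame = np.random.randn(D_FRAME)                 # encoded acoustic frame
print(predict_next(frame, prior_id=7, phonemes=[3, 12, 25]))
```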
  • Patent number: 12230245
    Abstract: Systems and methods provide for text normalization or inverse text normalization using a hybrid language system that combines rule-based processing with neural or learned processing. For example, a hybrid rule-based and neural approach identifies semiotic tokens within a textual input and generates a set of potential plain-text conversions of the semiotic tokens. The plain-text conversions are weighted and evaluated by a trained language model that rescores the plain-text conversion based on context to identify a highest scoring plain-text conversion for further processing within a language system pipeline.
    Type: Grant
    Filed: August 31, 2022
    Date of Patent: February 18, 2025
    Assignee: Nvidia Corporation
    Inventors: Evelina Bakhturina, Yang Zhang, Boris Ginsburg
  • Publication number: 20250016517
    Abstract: Approaches presented herein provide for identification of sound from a sound source relative to an array of microphones of a potentially unknown configuration using, in part, differences in the audio signals received by the microphones. In at least one embodiment, audio signals are captured using an array of microphones and audio features are extracted from those signals. The audio features can be processed using a first neural network to generate a feature vector representing a spatial location of an audio source with respect to the plurality of microphones, where the spatial location is inferred based on audio differences and independent of an availability of information indicating a physical configuration of the plurality of microphones. The feature vector can be provided to a task-specific model to perform at least one audio-related task based in part on the spatial location.
    Type: Application
    Filed: July 5, 2023
    Publication date: January 9, 2025
    Inventors: Ante Jukic, Jagadeesh Balam, Boris Ginsburg
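A rough sketch of the pipeline in the abstract above, under invented shapes: inter-channel features are extracted without knowing the microphone geometry, a first network (here a random linear map) produces a spatial feature vector, and that vector feeds a task-specific head. Nothing below is the patented model; it only illustrates the data flow.

```python
import numpy as np

def interchannel_features(signals: np.ndarray) -> np.ndarray:
    """Pairwise correlation between channels; no geometry information needed."""
    n = signals.shape[0]
    feats = []
    for i in range(n):
        for j in range(i + 1, n):
            feats.append(np.correlate(signals[i], signals[j], mode="valid")[0])
    return np.asarray(feats)

rng = np.random.default_rng(0)
mics = rng.standard_normal((4, 1024))                      # 4 channels, unknown layout
feats = interchannel_features(mics)

spatial_net = rng.standard_normal((feats.shape[0], 16))    # stand-in "first NN"
spatial_vec = np.tanh(feats @ spatial_net)                 # spatial feature vector

task_head = rng.standard_normal((16, 3))                   # stand-in task-specific model
print((spatial_vec @ task_head).shape)                     # task-specific output
```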
  • Publication number: 20240265913
    Abstract: Systems and methods provide for a machine learning system to train a machine learning model to output a penalty-free emission when processing an auditory input. For example, as the system generates paths through a probability lattice, one or more paths may include a penalty-free emission that skips at least one frame associated with the probability lattice, but that does not add a cost to a final path cost. The use of the penalty-free emissions may be represented through one or more graphical representations used for training in order to develop loss functions for models. One or more of these frameworks may be incorporated into automatic speech recognition pipelines to improve training while also reducing coding requirements to simplify debugging operations.
    Type: Application
    Filed: July 20, 2023
    Publication date: August 8, 2024
    Inventors: Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg
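As a loose toy illustration of the penalty-free emission idea, the dynamic program below sums over lattice paths in which one extra transition advances a frame at zero cost, alongside ordinary label emissions that add their negative log-probability. The probabilities are random placeholders, and this is not the patented loss function.

```python
import numpy as np

T, U = 5, 3                                    # frames, target labels
rng = np.random.default_rng(0)
log_p_label = np.log(rng.uniform(0.1, 1.0, size=(T, U)))   # log-prob of emitting label u at frame t

NEG_INF = -np.inf
alpha = np.full((T + 1, U + 1), NEG_INF)
alpha[0, 0] = 0.0
for t in range(T + 1):
    for u in range(U + 1):
        if alpha[t, u] == NEG_INF:
            continue
        if t < T and u < U:                    # emit label u: adds its log-probability
            alpha[t + 1, u + 1] = np.logaddexp(alpha[t + 1, u + 1],
                                               alpha[t, u] + log_p_label[t, u])
        if t < T:                              # penalty-free skip: advance a frame at zero cost
            alpha[t + 1, u] = np.logaddexp(alpha[t + 1, u], alpha[t, u])
print(alpha[T, U])                             # combined score over all lattice paths
```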
  • Publication number: 20240265912
    Abstract: Systems and methods provide for a machine learning system to train a machine learning model to output a penalty-free emission when processing an auditory input. For example, as the system generates paths through a probability lattice, one or more paths may include a penalty-free emission that skips at least one frame associated with the probability lattice, but that does not add a cost to a final path cost. The use of the penalty-free emissions may be represented through one or more graphical representations used for training in order to develop loss functions for models. One or more of these frameworks may be incorporated into automatic speech recognition pipelines to improve training while also reducing coding requirements to simplify debugging operations.
    Type: Application
    Filed: July 20, 2023
    Publication date: August 8, 2024
    Inventors: Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg
  • Publication number: 20240233714
    Abstract: In various examples, first textual data may be applied to a first MLM to generate an intermediate audio representation (e.g., a frequency-domain representation), the intermediate audio representation and a second MLM may be used to generate output data indicating second textual data, and parameters of the second MLM may be updated using the output data and ground truth data associated with the first textual data. The first MLM may include a trained Text-To-Speech (TTS) model and the second MLM may include an Automatic Speech Recognition (ASR) model. A generator from a generative adversarial network may be used to enhance an initial intermediate audio representation generated using the first MLM and the enhanced intermediate audio representation may be provided to the second MLM. The generator may include generator blocks that receive the initial intermediate audio representation to sequentially generate the enhanced intermediate audio representation.
    Type: Application
    Filed: September 15, 2023
    Publication date: July 11, 2024
    Inventors: Vladimir Bataev, Roman Korostik, Evgenii Shabalin, Vitaly Sergeyevich Lavrukhin, Boris Ginsburg
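A minimal training-loop sketch of the arrangement described above, using stand-in linear modules: a frozen TTS model maps text to an intermediate representation, a frozen GAN-style generator enhances it, the ASR model transcribes it, and only the ASR parameters are updated against the original text. All modules and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

D_SPEC, VOCAB, T_FRAMES = 80, 32, 50

tts = nn.Linear(VOCAB, D_SPEC)             # stand-in trained TTS (frozen)
enhancer = nn.Linear(D_SPEC, D_SPEC)       # stand-in GAN generator block (frozen)
asr = nn.Linear(D_SPEC, VOCAB)             # stand-in ASR model (being trained)
for p in list(tts.parameters()) + list(enhancer.parameters()):
    p.requires_grad_(False)

optimizer = torch.optim.Adam(asr.parameters(), lr=1e-3)
text = torch.randint(0, VOCAB, (T_FRAMES,))                  # "first textual data"
onehot = nn.functional.one_hot(text, VOCAB).float()

spec = enhancer(tts(onehot))                                 # enhanced intermediate representation
logits = asr(spec)                                           # predictions of "second textual data"
loss = nn.functional.cross_entropy(logits, text)             # ground truth = original text
loss.backward()
optimizer.step()                                             # only ASR parameters move
print(float(loss))
```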
  • Publication number: 20240221763
    Abstract: Approaches presented herein provide for insertion of watermarks into synthesized content, such as audio content that may include synthesized speech to appear to be spoken by a digital avatar in a 3D virtual environment. A Text-to-Speech (TTS) generator, such as a trained neural network, can be used to produce synthetic speech audio, which can have an audio watermark inserted therein. This watermark can be detected by a process of a collaborative content generation platform, for example, and an indication can be provided that the content contains synthesized speech. The presence of the audio watermark will generally not be detectable by the human ear during presentation. To make it difficult to remove or modify the watermark, the watermark can be generated using a key or other unique piece of data known only to authorized entities.
    Type: Application
    Filed: December 29, 2022
    Publication date: July 4, 2024
    Inventor: Boris Ginsburg
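A simplified, hypothetical illustration of keyed audio watermarking in the spirit of the abstract above (not the patented scheme): a pseudorandom sequence seeded by a secret key is added to synthesized speech at low amplitude, and the same key is needed to correlate against the audio and detect the mark.

```python
import numpy as np

def keyed_sequence(key: int, length: int) -> np.ndarray:
    """Pseudorandom +/-1 sequence derived from a secret key."""
    return np.random.default_rng(key).choice([-1.0, 1.0], size=length)

def embed(audio: np.ndarray, key: int, strength: float = 1e-2) -> np.ndarray:
    return audio + strength * keyed_sequence(key, len(audio))

def detect(audio: np.ndarray, key: int, threshold: float = 5e-3) -> bool:
    corr = np.dot(audio, keyed_sequence(key, len(audio))) / len(audio)
    return corr > threshold

speech = np.random.default_rng(1).standard_normal(16000) * 0.1   # stand-in TTS output
marked = embed(speech, key=1234)
print(detect(marked, key=1234), detect(marked, key=9999))        # typically: True False
```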
  • Publication number: 20240161728
    Abstract: Disclosed are apparatuses, systems, and techniques that may use machine learning for generating artificial speech. The techniques include obtaining a synthetic embedding using learned embeddings associated with different speakers. At least one learned embedding may be generated using a multi-stage training of a machine learning model (MLM) with progressively increasing quality of training speech utterances. The techniques may further include using the MLM and the synthetic embedding to generate synthetic audio data.
    Type: Application
    Filed: November 10, 2022
    Publication date: May 16, 2024
    Inventors: Subhankar Ghosh, Boris Ginsburg
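A short sketch of the synthetic-embedding idea: learned embeddings of existing speakers (random placeholders here) are mixed into a new embedding that conditions a stand-in TTS model, yielding a voice not present in training. The mixing weights and the `tts` function are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
learned = {"speaker_a": rng.standard_normal(64), "speaker_b": rng.standard_normal(64)}

def synthetic_embedding(weights: dict[str, float]) -> np.ndarray:
    """Weighted mix of learned speaker embeddings."""
    return sum(w * learned[name] for name, w in weights.items())

def tts(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Stand-in conditioned TTS: returns placeholder audio samples."""
    seed = abs(int(speaker_embedding.sum() * 1000)) % (2**32)
    return np.random.default_rng(seed).standard_normal(len(text) * 100)

new_voice = synthetic_embedding({"speaker_a": 0.7, "speaker_b": 0.3})
audio = tts("hello world", new_voice)
print(audio.shape)
```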
  • Publication number: 20240153531
    Abstract: Disclosed are apparatuses, systems, and techniques that may use machine learning for implementing speaker diarization. The techniques include obtaining speaker embeddings for various reference times of a speech and for various differently-sized time intervals, and identifying a plurality of clusters, each cluster associated with a different speaker of the speech. The techniques further include computing, using the speaker embeddings, a set of embedding weights for various differently-sized time intervals, and identifying, using the computed set of the embedding weights, one or more speakers speaking at a respective reference time.
    Type: Application
    Filed: November 3, 2022
    Publication date: May 9, 2024
    Inventors: Taejin Park, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
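A condensed sketch of multi-scale diarization as summarized above: speaker embeddings are extracted at several window sizes around each reference time, fused with per-scale weights, and clustered so each reference time maps to a speaker. The embedding extractor, weights, and the tiny k-means loop are placeholders, not the patented method.

```python
import numpy as np

rng = np.random.default_rng(0)
SCALES = [0.5, 1.0, 1.5]                     # window sizes in seconds
REF_TIMES = np.arange(0.0, 10.0, 0.25)
D = 16

def embed(center: float, window: float) -> np.ndarray:
    """Stand-in speaker-embedding extractor for one window."""
    return rng.standard_normal(D) + (center > 5.0) * 3.0    # fake two-speaker structure

scale_weights = np.array([0.2, 0.3, 0.5])                   # learned in a real system
fused = np.stack([
    sum(w * embed(t, s) for w, s in zip(scale_weights, SCALES))
    for t in REF_TIMES
])

# Tiny 2-means clustering over fused embeddings: one cluster per speaker.
centroids = fused[rng.choice(len(fused), 2, replace=False)]
for _ in range(10):
    labels = np.argmin(((fused[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    centroids = np.stack([
        fused[labels == k].mean(0) if np.any(labels == k) else centroids[k]
        for k in (0, 1)
    ])
print(list(zip(REF_TIMES[:4], labels[:4])))                 # speaker label per reference time
```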
  • Publication number: 20240135920
    Abstract: In various examples, first textual data may be applied to a first MLM to generate an intermediate audio representation (e.g., a frequency-domain representation), the intermediate audio representation and a second MLM may be used to generate output data indicating second textual data, and parameters of the second MLM may be updated using the output data and ground truth data associated with the first textual data. The first MLM may include a trained Text-To-Speech (TTS) model and the second MLM may include an Automatic Speech Recognition (ASR) model. A generator from a generative adversarial network may be used to enhance an initial intermediate audio representation generated using the first MLM and the enhanced intermediate audio representation may be provided to the second MLM. The generator may include generator blocks that receive the initial intermediate audio representation to sequentially generate the enhanced intermediate audio representation.
    Type: Application
    Filed: September 14, 2023
    Publication date: April 25, 2024
    Inventors: Vladimir Bataev, Roman Korostik, Evgenii Shabalin, Vitaly Sergeyevich Lavrukhin, Boris Ginsburg
  • Publication number: 20240127788
    Abstract: In various examples, one or more text-to-speech machine learning models may be customized or adapted to accommodate new or additional speakers or speaker voices without requiring a full re-training of the models. For example, a base model may be trained on a set of one or more speakers and, after training or deployment, the model may be adapted to support one or more other speakers. To do this, one or more additional layers (e.g., adapter layers) may be added to the model, and the model may be re-trained or updated—e.g., by freezing parameters of the base model while updating parameters of the adapter layers—to generate an adapted model that can support the one or more original speakers of the base model in addition to the one or more additional speakers corresponding to the adapter layers.
    Type: Application
    Filed: October 13, 2022
    Publication date: April 18, 2024
    Inventors: Cheng-Ping HSIEH, Subhankar GHOSH, Boris GINSBURG
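A minimal sketch of the adapter pattern described above, using stand-in modules: the base TTS layer is frozen, a small residual adapter is added, and only the adapter parameters receive gradient updates when adapting to a new speaker. The layer sizes and placeholder loss are assumptions.

```python
import torch
import torch.nn as nn

class AdaptedBlock(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.base = nn.Linear(d, d)                 # pretrained base layer (will be frozen)
        self.adapter = nn.Sequential(nn.Linear(d, 8), nn.ReLU(), nn.Linear(8, d))
    def forward(self, x):
        return self.base(x) + self.adapter(x)       # residual adapter

model = AdaptedBlock(d=32)
for p in model.base.parameters():
    p.requires_grad_(False)                          # freeze the base TTS parameters

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

x = torch.randn(4, 32)                               # stand-in features for the new speaker
loss = (model(x) - torch.randn(4, 32)).pow(2).mean() # placeholder reconstruction loss
loss.backward()
optimizer.step()
print(sum(p.numel() for p in trainable), "adapter parameters updated")
```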
  • Publication number: 20240119927
    Abstract: Disclosed are apparatuses, systems, and techniques that may use machine learning for implementing speaker recognition, verification, and/or diarization. The techniques include applying a neural network (NN) to speech data to obtain a speaker embedding representative of an association between the speech data and a speaker that produced the speech. The speech data includes a plurality of frames and a plurality of channels representative of spectral content of the speech data. The NN has one or more blocks of neurons that include a first branch performing convolutions of the speech data across the plurality of channels and across the plurality of frames and a second branch performing convolutions of the speech data across the plurality of channels. Obtained speaker embeddings may be used for various tasks of speaker identification, verification, and/or diarization.
    Type: Application
    Filed: October 7, 2022
    Publication date: April 11, 2024
    Inventors: Nithin Rao Koluguri, Taejin Park, Boris Ginsburg
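A rough sketch, with invented sizes, of the two-branch block described above: one branch convolves the spectral input across channels and across frames (a kernel spanning time), the other across channels only (a pointwise kernel), their outputs are combined, and pooling over frames yields a speaker embedding. This is an illustrative stand-in, not the patented architecture.

```python
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.time_branch = nn.Conv1d(channels, channels, kernel_size=5, padding=2)  # channels + frames
        self.channel_branch = nn.Conv1d(channels, channels, kernel_size=1)          # channels only
    def forward(self, x):                        # x: (batch, channels, frames)
        return torch.relu(self.time_branch(x) + self.channel_branch(x))

features = torch.randn(2, 80, 200)               # batch of 80-channel spectral inputs, 200 frames
block = TwoBranchBlock(80)
embedding = block(features).mean(dim=-1)         # pool over frames -> speaker embedding
print(embedding.shape)                           # (2, 80)
```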
  • Publication number: 20240112021
    Abstract: Systems and methods provide for a machine learning system to train a machine learning model to output a multi-frame blank symbol when processing an auditory input. For example, as the system generates paths through a probability lattice, one or more paths include a multi-frame blank that skips at least one frame associated with the probability lattice. The inclusion of the multi-frame blank symbol may increase a total number of potential paths through the probability lattice, and may allow the machine learning model to more quickly and accurately process audio frames, while disregarding audio frames of less value. In deployment, when an output of the machine learning model indicates a multi-frame blank symbol or token, one or more frames of the auditory input may be omitted from processing.
    Type: Application
    Filed: October 4, 2022
    Publication date: April 4, 2024
    Inventors: Hainan Xu, Boris Ginsburg
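A toy greedy-decoding sketch of the multi-frame blank idea: when the model's top output is a blank token associated with k frames, the decoder advances k frames and skips their processing. The vocabulary, blank durations, and the random stand-in "model" are placeholders.

```python
import numpy as np

VOCAB = ["a", "b", "c", "<blank:1>", "<blank:2>", "<blank:4>"]
BLANK_FRAMES = {"<blank:1>": 1, "<blank:2>": 2, "<blank:4>": 4}

rng = np.random.default_rng(0)
def model_step(frame: np.ndarray) -> str:
    """Stand-in acoustic model: returns the most likely symbol for a frame."""
    return VOCAB[int(rng.integers(len(VOCAB)))]

frames = [rng.standard_normal(80) for _ in range(20)]
t, transcript = 0, []
while t < len(frames):
    symbol = model_step(frames[t])
    if symbol in BLANK_FRAMES:
        t += BLANK_FRAMES[symbol]                 # multi-frame blank: skip ahead
    else:
        transcript.append(symbol)
        t += 1
print("".join(transcript))
```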
  • Publication number: 20240071366
    Abstract: Systems and methods provide for text normalization or inverse text normalization using a hybrid language system that combines rule-based processing with neural or learned processing. For example, a hybrid rule-based and neural approach identifies semiotic tokens within a textual input and generates a set of potential plain-text conversions of the semiotic tokens. The plain-text conversions are weighted and evaluated by a trained language model that rescores the plain-text conversion based on context to identify a highest scoring plain-text conversion for further processing within a language system pipeline.
    Type: Application
    Filed: August 31, 2022
    Publication date: February 29, 2024
    Inventors: Evelina Bakhturina, Yang Zhang, Boris Ginsburg
  • Publication number: 20210232366
    Abstract: A method, computer readable medium, and system are disclosed for rounding floating point values. Dynamic directional rounding is a rounding technique for floating point operations. A floating point operation (addition, subtraction, multiplication, etc.) is performed on an operand to compute a floating point result. A sign (positive or negative) of the operand is identified. In one embodiment, the sign determines a direction in which the floating point result is rounded (towards negative or positive infinity). When used for updating parameters of a neural network during backpropagation, dynamic directional rounding ensures that rounding is performed in the direction of the gradient.
    Type: Application
    Filed: February 1, 2021
    Publication date: July 29, 2021
    Inventors: Alex Fit-Florea, Boris Ginsburg, Pooya Davoodi, Amir Gholaminejad
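A simplified numeric illustration of dynamic directional rounding as described above: after an operation, the result is rounded toward positive or negative infinity according to the sign of the operand (e.g. the gradient update), so rounding never moves against the update direction. The fixed quantization step below stands in for a low-precision float grid.

```python
import math

STEP = 1.0 / 256.0                            # stand-in for low-precision spacing

def round_directional(value: float, operand_sign: float) -> float:
    """Round toward -inf for negative operands, toward +inf for positive ones."""
    quanta = value / STEP
    rounded = math.floor(quanta) if operand_sign < 0 else math.ceil(quanta)
    return rounded * STEP

weight, grad, lr = 0.5, 0.0009, 1.0
update = -lr * grad                           # operand applied to the weight
new_weight = round_directional(weight + update, operand_sign=update)
print(new_weight)                             # 0.49609375, rounded in the gradient direction
# round-to-nearest would give 128/256 = 0.5, silently cancelling the small update
```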
  • Patent number: 10908878
    Abstract: A method, computer readable medium, and system are disclosed for rounding floating point values. Dynamic directional rounding is a rounding technique for floating point operations. A floating point operation (addition, subtraction, multiplication, etc.) is performed on an operand to compute a floating point result. A sign (positive or negative) of the operand is identified. In one embodiment, the sign determines a direction in which the floating point result is rounded (towards negative or positive infinity). When used for updating parameters of a neural network during backpropagation, dynamic directional rounding ensures that rounding is performed in the direction of the gradient.
    Type: Grant
    Filed: November 26, 2018
    Date of Patent: February 2, 2021
    Assignee: NVIDIA Corporation
    Inventors: Alex Fit-Florea, Boris Ginsburg, Pooya Davoodi, Amir Gholaminejad
  • Publication number: 20200167125
    Abstract: A method, computer readable medium, and system are disclosed for rounding floating point values. Dynamic directional rounding is a rounding technique for floating point operations. A floating point operation (addition, subtraction, multiplication, etc.) is performed on an operand to compute a floating point result. A sign (positive or negative) of the operand is identified. In one embodiment, the sign determines a direction in which the floating point result is rounded (towards negative or positive infinity). When used for updating parameters of a neural network during backpropagation, dynamic directional rounding ensures that rounding is performed in the direction of the gradient.
    Type: Application
    Filed: November 26, 2018
    Publication date: May 28, 2020
    Inventors: Alex Fit-Florea, Boris Ginsburg, Pooya Davoodi, Amir Gholaminejad
  • Publication number: 20170372202
    Abstract: Aspects of the present invention are directed to computer-implemented techniques for improving the training of artificial neural networks using a reduced precision (e.g., float16) data format. Embodiments of the present invention rescale tensor values prior to performing matrix operations (such as matrix multiplication or matrix addition) to prevent overflow and underflow. To preserve accuracy throughout the performance of the matrix operations, the scale factors are defined using a novel data format to represent tensors, wherein a matrix is represented by the tuple X, where X=(a, v[.]), wherein a is a float scale factor and v[.] are scaled values stored in the float16 format. The value of any element X[i] according to this data format would be equal to a*v[i].
    Type: Application
    Filed: June 15, 2017
    Publication date: December 28, 2017
    Inventors: Boris GINSBURG, Sergei NIKOLAEV, Ahmad KISWANI, Hao WU, Amir GHOLAMINEJAD, Slawomir KIERAT, Michael HOUSTON, Alex FIT-FLOREA
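A minimal sketch of the scaled-tensor format spelled out above: a tensor X is stored as a tuple (a, v), where a is a float scale factor and v holds scaled values in float16, so X[i] = a*v[i]. Rescaling keeps the float16 values near one, so a matrix product that would overflow in raw float16 stays finite, with the scale factors recombined outside the low-precision data. The class below is an illustrative assumption, not the patented method.

```python
import numpy as np

class ScaledTensor:
    def __init__(self, values: np.ndarray):
        self.a = float(np.max(np.abs(values))) or 1.0   # float scale factor
        self.v = (values / self.a).astype(np.float16)   # |v| <= 1, safe in float16

    def matmul(self, other: "ScaledTensor") -> np.ndarray:
        prod = self.v @ other.v                          # float16 product, no overflow
        return (self.a * other.a) * prod.astype(np.float32)

rng = np.random.default_rng(0)
big_x = rng.standard_normal((4, 8)) * 1e4
big_y = rng.standard_normal((8, 4)) * 1e4
naive = big_x.astype(np.float16) @ big_y.astype(np.float16)     # products exceed float16 range
scaled = ScaledTensor(big_x).matmul(ScaledTensor(big_y))
print(np.isinf(naive).any(), np.isinf(scaled).any())             # True False
```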