Patents by Inventor Boris Ginsburg

Boris Ginsburg has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20250140236
    Abstract: Systems and methods provide for text normalization or inverse text normalization using a hybrid language system that combines rule-based processing with neural or learned processing. For example, a hybrid rule-based and neural approach identifies semiotic tokens within a textual input and generates a set of potential plain-text conversions of the semiotic tokens. The plain-text conversions are weighted and evaluated by a trained language model that rescores the plain-text conversion based on context to identify a highest scoring plain-text conversion for further processing within a language system pipeline.
    Type: Application
    Filed: December 30, 2024
    Publication date: May 1, 2025
    Inventors: Evelina Bakhturina, Yang Zhang, Boris Ginsburg
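The abstract above outlines a two-stage pipeline: rules propose candidate verbalizations of a semiotic token, and a language model rescores them in context. Below is a minimal sketch of that flow; the rule table, the `lm_rescore` heuristic, and the toy lexicon are hypothetical placeholders rather than the patented implementation, and a real system would rescore with a trained language model.

```python
# Hybrid rule-based + neural text normalization, illustrative sketch only.
import re

NUMBER_WORDS = {"1": "one", "2": "two", "5": "five", "10": "ten"}  # toy lexicon

def candidates(token: str) -> list[str]:
    """Rule-based grammar: propose plain-text conversions for a semiotic token."""
    m = re.match(r"^\$(\d+)$", token)
    if m and m.group(1) in NUMBER_WORDS:
        n = NUMBER_WORDS[m.group(1)]
        return [f"{n} dollars", f"{n} dollar"]       # competing verbalizations
    m = re.match(r"^(\d+)%$", token)
    if m and m.group(1) in NUMBER_WORDS:
        return [f"{NUMBER_WORDS[m.group(1)]} percent"]
    return [token]                                    # no rule matched: pass through

def lm_rescore(left: str, candidate: str, right: str) -> float:
    """Hypothetical rescorer; a real system would use a trained LM's score in context."""
    return float(len(candidate))                      # placeholder heuristic only

def normalize(left: str, token: str, right: str) -> str:
    return max(candidates(token), key=lambda c: lm_rescore(left, c, right))

print(normalize("it costs", "$5", "today"))           # -> "five dollars"
```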
  • Publication number: 20250095652
    Abstract: Disclosed are apparatuses, systems, and techniques that implement training and deployment of speech-augmented language models for efficient capturing and processing of speech inputs. The techniques include processing, using a speech model, an audio input in a first language to generate a first portion of an input into a language model (LM). A second portion of the input into the LM represents a text context associated with the audio input. The techniques further include receiving, from the LM, an output that includes a speech-to-text conversion of the audio input.
    Type: Application
    Filed: January 9, 2024
    Publication date: March 20, 2025
    Inventors: Zhehuai Chen, He Huang, Oleksii Hrinchuk, Andrei Andrusenko, Venkata Naga Krishna Chaitanya Puvvada, Subhankar Ghosh, Jing Yao Li, Jagadeesh Balam, Boris Ginsburg
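A minimal sketch of how the two input portions described above might be composed: a speech model encodes audio into embedding frames (the first portion), text-context token embeddings form the second portion, and the two are concatenated before being passed to the LM. The `speech_encoder` and `embed_text` functions and all dimensions are invented stand-ins, not the patented models.

```python
import numpy as np

D_MODEL = 8

def speech_encoder(audio: np.ndarray) -> np.ndarray:
    """Stand-in speech model: maps raw samples to (frames, D_MODEL)."""
    n_frames = max(1, len(audio) // 160)
    return np.random.randn(n_frames, D_MODEL)

def embed_text(tokens: list[int]) -> np.ndarray:
    """Stand-in embedding table lookup: (tokens, D_MODEL)."""
    table = np.random.randn(1000, D_MODEL)
    return table[tokens]

audio = np.random.randn(16000)                 # 1 s of 16 kHz audio
context_tokens = [17, 42, 256]                 # e.g. a text prompt / context
lm_input = np.concatenate([speech_encoder(audio), embed_text(context_tokens)], axis=0)
print(lm_input.shape)                          # (speech_frames + context_len, D_MODEL)
```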
  • Publication number: 20250078827
    Abstract: One or more embodiments include: receiving a first frame of acoustic input and one or more prior textual tokens associated with a prior frame of the acoustic input, wherein the prior textual token represents one or more spoken words included in the acoustic input; generating a multi-dimensional embedding associated with the prior textual token, wherein each dimension of the embedding represents a different characteristic of the prior textual token, and at least one dimension of the embedding represents pronunciation information associated with the prior textual token; and generating a textual token associated with the first frame based at least on an encoded representation of the first frame and the multi-dimensional embedding associated with the prior textual token.
    Type: Application
    Filed: January 25, 2024
    Publication date: March 6, 2025
    Inventors: Hainan XU, Boris GINSBURG, Zhehuai CHEN, Fei JIA
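A small sketch of the embedding scheme described above: the prior textual token is embedded along multiple dimensions, at least one of which carries pronunciation information, and that embedding is combined with the encoded acoustic frame to predict the next token. The lookup tables, dimensions, and linear prediction head below are hypothetical placeholders.

```python
import numpy as np

VOCAB, PHONES, D_TOK, D_PRON, D_FRAME = 100, 40, 8, 4, 16
token_table = np.random.randn(VOCAB, D_TOK)      # identity characteristics
phone_table = np.random.randn(PHONES, D_PRON)    # pronunciation characteristics
head = np.random.randn(D_FRAME + D_TOK + D_PRON, VOCAB)

def embed_prior(token_id: int, phoneme_ids: list[int]) -> np.ndarray:
    """Multi-dimensional embedding: token identity plus a pronunciation component."""
    pron = phone_table[phoneme_ids].mean(axis=0)
    return np.concatenate([token_table[token_id], pron])

def predict_next(frame_encoding: np.ndarray, prior_id: int, phonemes: list[int]) -> int:
    features = np.concatenate([frame_encoding, embed_prior(prior_id, phonemes)])
    return int(np.argmax(features @ head))

frame = np.random.randn(D_FRAME)                 # encoded acoustic frame
print(predict_next(frame, prior_id=7, phonemes=[3, 12, 25]))
```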
  • Patent number: 12230245
    Abstract: Systems and methods provide for text normalization or inverse text normalization using a hybrid language system that combines rule-based processing with neural or learned processing. For example, a hybrid rule-based and neural approach identifies semiotic tokens within a textual input and generates a set of potential plain-text conversions of the semiotic tokens. The plain-text conversions are weighted and evaluated by a trained language model that rescores the plain-text conversion based on context to identify a highest scoring plain-text conversion for further processing within a language system pipeline.
    Type: Grant
    Filed: August 31, 2022
    Date of Patent: February 18, 2025
    Assignee: Nvidia Corporation
    Inventors: Evelina Bakhturina, Yang Zhang, Boris Ginsburg
  • Publication number: 20250016517
    Abstract: Approaches presented herein provide for identification of sound from a sound source relative to an array of microphones of a potentially unknown configuration using, in part, differences in the audio signals received by the microphones. In at least one embodiment, audio signals are captured using an array of microphones and audio features are extracted from those signals. The audio features can be processed using a first neural network to generate a feature vector representing a spatial location of an audio source with respect to the plurality of microphones, where the spatial location is inferred based on audio differences and independent of an availability of information indicating a physical configuration of the plurality of microphones. The feature vector can be provided to a task-specific model to perform at least one audio-related task based in part on the spatial location.
    Type: Application
    Filed: July 5, 2023
    Publication date: January 9, 2025
    Inventors: Ante Jukic, Jagadeesh Balam, Boris Ginsburg
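A rough sketch of the pipeline in the abstract above, under invented shapes: inter-channel features are extracted without knowing the microphone geometry, a first network (here a random linear map) produces a spatial feature vector, and that vector feeds a task-specific head. Nothing below is the patented model; it only illustrates the data flow.

```python
import numpy as np

def interchannel_features(signals: np.ndarray) -> np.ndarray:
    """Pairwise correlation between channels; no geometry information needed."""
    n = signals.shape[0]
    feats = []
    for i in range(n):
        for j in range(i + 1, n):
            feats.append(np.correlate(signals[i], signals[j], mode="valid")[0])
    return np.asarray(feats)

rng = np.random.default_rng(0)
mics = rng.standard_normal((4, 1024))                      # 4 channels, unknown layout
feats = interchannel_features(mics)

spatial_net = rng.standard_normal((feats.shape[0], 16))    # stand-in "first NN"
spatial_vec = np.tanh(feats @ spatial_net)                 # spatial feature vector

task_head = rng.standard_normal((16, 3))                   # stand-in task-specific model
print((spatial_vec @ task_head).shape)                     # task-specific output
```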
  • Publication number: 20240265913
    Abstract: Systems and methods provide for a machine learning system to train a machine learning model to output a penalty-free emission when processing an auditory input. For example, as the system generates paths through a probability lattice, one or more paths may include a penalty-free emission that skips at least one frame associated with the probability lattice, but that does not add a cost to a final path cost. The use of the penalty-free emissions may be represented through one or more graphical representations used for training in order to develop loss functions for models. One or more of these frameworks may be incorporated into automatic speech recognition pipelines to improve training while also reducing coding requirements to simplify debugging operations.
    Type: Application
    Filed: July 20, 2023
    Publication date: August 8, 2024
    Inventors: Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg
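As a loose toy illustration of the penalty-free emission idea, the dynamic program below sums over lattice paths in which one extra transition advances a frame at zero cost, alongside ordinary label emissions that add their negative log-probability. The probabilities are random placeholders, and this is not the patented loss function.

```python
import numpy as np

T, U = 5, 3                                    # frames, target labels
rng = np.random.default_rng(0)
log_p_label = np.log(rng.uniform(0.1, 1.0, size=(T, U)))   # log-prob of emitting label u at frame t

NEG_INF = -np.inf
alpha = np.full((T + 1, U + 1), NEG_INF)
alpha[0, 0] = 0.0
for t in range(T + 1):
    for u in range(U + 1):
        if alpha[t, u] == NEG_INF:
            continue
        if t < T and u < U:                    # emit label u: adds its log-probability
            alpha[t + 1, u + 1] = np.logaddexp(alpha[t + 1, u + 1],
                                               alpha[t, u] + log_p_label[t, u])
        if t < T:                              # penalty-free skip: advance a frame at zero cost
            alpha[t + 1, u] = np.logaddexp(alpha[t + 1, u], alpha[t, u])
print(alpha[T, U])                             # combined score over all lattice paths
```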
  • Publication number: 20240265912
    Abstract: Systems and methods provide for a machine learning system to train a machine learning model to output a penalty-free emission when processing an auditory input. For example, as the system generates paths through a probability lattice, one or more paths may include a penalty-free emission that skips at least one frame associated with the probability lattice, but that does not add a cost to a final path cost. The use of the penalty-free emissions may be represented through one or more graphical representations used for training in order to develop loss functions for models. One or more of these frameworks may be incorporated into automatic speech recognition pipelines to improve training while also reducing coding requirements to simplify debugging operations.
    Type: Application
    Filed: July 20, 2023
    Publication date: August 8, 2024
    Inventors: Aleksandr Laptev, Vladimir Bataev, Igor Gitman, Boris Ginsburg
  • Publication number: 20240233714
    Abstract: In various examples, first textual data may be applied to a first MLM to generate an intermediate audio representation (e.g., a frequency-domain representation), the intermediate audio representation and a second MLM may be used to generate output data indicating second textual data, and parameters of the second MLM may be updated using the output data and ground truth data associated with the first textual data. The first MLM may include a trained Text-To-Speech (TTS) model and the second MLM may include an Automatic Speech Recognition (ASR) model. A generator from a generative adversarial network may be used to enhance an initial intermediate audio representation generated using the first MLM and the enhanced intermediate audio representation may be provided to the second MLM. The generator may include generator blocks that receive the initial intermediate audio representation to sequentially generate the enhanced intermediate audio representation.
    Type: Application
    Filed: September 15, 2023
    Publication date: July 11, 2024
    Inventors: Vladimir Bataev, Roman Korostik, Evgenii Shabalin, Vitaly Sergeyevich Lavrukhin, Boris Ginsburg
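A minimal training-loop sketch of the arrangement described above, using stand-in linear modules: a frozen TTS model maps text to an intermediate representation, a frozen GAN-style generator enhances it, the ASR model transcribes it, and only the ASR parameters are updated against the original text. All modules and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

D_SPEC, VOCAB, T_FRAMES = 80, 32, 50

tts = nn.Linear(VOCAB, D_SPEC)             # stand-in trained TTS (frozen)
enhancer = nn.Linear(D_SPEC, D_SPEC)       # stand-in GAN generator block (frozen)
asr = nn.Linear(D_SPEC, VOCAB)             # stand-in ASR model (being trained)
for p in list(tts.parameters()) + list(enhancer.parameters()):
    p.requires_grad_(False)

optimizer = torch.optim.Adam(asr.parameters(), lr=1e-3)
text = torch.randint(0, VOCAB, (T_FRAMES,))                  # "first textual data"
onehot = nn.functional.one_hot(text, VOCAB).float()

spec = enhancer(tts(onehot))                                 # enhanced intermediate representation
logits = asr(spec)                                           # predictions of "second textual data"
loss = nn.functional.cross_entropy(logits, text)             # ground truth = original text
loss.backward()
optimizer.step()                                             # only ASR parameters move
print(float(loss))
```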
  • Publication number: 20240221763
    Abstract: Approaches presented herein provide for insertion of watermarks into synthesized content, such as audio content that may include synthesized speech to appear to be spoken by a digital avatar in a 3D virtual environment. A Text-to-Speech (TTS) generator, such as a trained neural network, can be used to produce synthetic speech audio, which can have an audio watermark inserted therein. This watermark can be detected by a process of a collaborative content generation platform, for example, and an indication can be provided that the content contains synthesized speech. The presence of the audio watermark will generally not be detectable by the human ear during presentation. To make it difficult to remove or modify the watermark, the watermark can be generated using a key or other unique piece of data known only to authorized entities.
    Type: Application
    Filed: December 29, 2022
    Publication date: July 4, 2024
    Inventor: Boris Ginsburg
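A simplified, hypothetical illustration of keyed audio watermarking in the spirit of the abstract above (not the patented scheme): a pseudorandom sequence seeded by a secret key is added to synthesized speech at low amplitude, and the same key is needed to correlate against the audio and detect the mark.

```python
import numpy as np

def keyed_sequence(key: int, length: int) -> np.ndarray:
    """Pseudorandom +/-1 sequence derived from a secret key."""
    return np.random.default_rng(key).choice([-1.0, 1.0], size=length)

def embed(audio: np.ndarray, key: int, strength: float = 1e-2) -> np.ndarray:
    return audio + strength * keyed_sequence(key, len(audio))

def detect(audio: np.ndarray, key: int, threshold: float = 5e-3) -> bool:
    corr = np.dot(audio, keyed_sequence(key, len(audio))) / len(audio)
    return corr > threshold

speech = np.random.default_rng(1).standard_normal(16000) * 0.1   # stand-in TTS output
marked = embed(speech, key=1234)
print(detect(marked, key=1234), detect(marked, key=9999))        # typically: True False
```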
  • Publication number: 20240161728
    Abstract: Disclosed are apparatuses, systems, and techniques that may use machine learning for generating artificial speech. The techniques include obtaining a synthetic embedding using learned embeddings associated with different speakers. At least one learned embedding may be generated using a multi-stage training of a machine learning model (MLM) with progressively increasing quality of training speech utterances. The techniques may further include using the MLM and the synthetic embedding to generate synthetic audio data.
    Type: Application
    Filed: November 10, 2022
    Publication date: May 16, 2024
    Inventors: Subhankar Ghosh, Boris Ginsburg
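A short sketch of the synthetic-embedding idea: learned embeddings of existing speakers (random placeholders here) are mixed into a new embedding that conditions a stand-in TTS model, yielding a voice not present in training. The mixing weights and the `tts` function are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
learned = {"speaker_a": rng.standard_normal(64), "speaker_b": rng.standard_normal(64)}

def synthetic_embedding(weights: dict[str, float]) -> np.ndarray:
    """Weighted mix of learned speaker embeddings."""
    return sum(w * learned[name] for name, w in weights.items())

def tts(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Stand-in conditioned TTS: returns placeholder audio samples."""
    seed = abs(int(speaker_embedding.sum() * 1000)) % (2**32)
    return np.random.default_rng(seed).standard_normal(len(text) * 100)

new_voice = synthetic_embedding({"speaker_a": 0.7, "speaker_b": 0.3})
audio = tts("hello world", new_voice)
print(audio.shape)
```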
  • Publication number: 20240153531
    Abstract: Disclosed are apparatuses, systems, and techniques that may use machine learning for implementing speaker diarization. The techniques include obtaining speaker embeddings for various reference times of a speech and for various differently-sized time intervals, and identifying a plurality of clusters, each cluster associated with a different speaker of the speech. The techniques further include computing, using the speaker embeddings, a set of embedding weights for various differently-sized time intervals, and identifying, using the computed set of the embedding weights, one or more speakers speaking at a respective reference time.
    Type: Application
    Filed: November 3, 2022
    Publication date: May 9, 2024
    Inventors: Taejin Park, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
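A condensed sketch of multi-scale diarization as summarized above: speaker embeddings are extracted at several window sizes around each reference time, fused with per-scale weights, and clustered so each reference time maps to a speaker. The embedding extractor, weights, and the tiny k-means loop are placeholders, not the patented method.

```python
import numpy as np

rng = np.random.default_rng(0)
SCALES = [0.5, 1.0, 1.5]                     # window sizes in seconds
REF_TIMES = np.arange(0.0, 10.0, 0.25)
D = 16

def embed(center: float, window: float) -> np.ndarray:
    """Stand-in speaker-embedding extractor for one window."""
    return rng.standard_normal(D) + (center > 5.0) * 3.0    # fake two-speaker structure

scale_weights = np.array([0.2, 0.3, 0.5])                   # learned in a real system
fused = np.stack([
    sum(w * embed(t, s) for w, s in zip(scale_weights, SCALES))
    for t in REF_TIMES
])

# Tiny 2-means clustering over fused embeddings: one cluster per speaker.
centroids = fused[rng.choice(len(fused), 2, replace=False)]
for _ in range(10):
    labels = np.argmin(((fused[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    centroids = np.stack([
        fused[labels == k].mean(0) if np.any(labels == k) else centroids[k]
        for k in (0, 1)
    ])
print(list(zip(REF_TIMES[:4], labels[:4])))                 # speaker label per reference time
```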
  • Publication number: 20240135920
    Abstract: In various examples, first textual data may be applied to a first MLM to generate an intermediate audio representation (e.g., a frequency-domain representation), the intermediate audio representation and a second MLM may be used to generate output data indicating second textual data, and parameters of the second MLM may be updated using the output data and ground truth data associated with the first textual data. The first MLM may include a trained Text-To-Speech (TTS) model and the second MLM may include an Automatic Speech Recognition (ASR) model. A generator from a generative adversarial network may be used to enhance an initial intermediate audio representation generated using the first MLM and the enhanced intermediate audio representation may be provided to the second MLM. The generator may include generator blocks that receive the initial intermediate audio representation to sequentially generate the enhanced intermediate audio representation.
    Type: Application
    Filed: September 14, 2023
    Publication date: April 25, 2024
    Inventors: Vladimir Bataev, Roman Korostik, Evgenii Shabalin, Vitaly Sergeyevich Lavrukhin, Boris Ginsburg
  • Publication number: 20240127788
    Abstract: In various examples, one or more text-to-speech machine learning models may be customized or adapted to accommodate new or additional speakers or speaker voices without requiring a full re-training of the models. For example, a base model may be trained on a set of one or more speakers and, after training or deployment, the model may be adapted to support one or more other speakers. To do this, one or more additional layers (e.g., adapter layers) may be added to the model, and the model may be re-trained or updated—e.g., by freezing parameters of the base model while updating parameters of the adapter layers—to generate an adapted model that can support the one or more original speakers of the base model in addition to the one or more additional speakers corresponding to the adapter layers.
    Type: Application
    Filed: October 13, 2022
    Publication date: April 18, 2024
    Inventors: Cheng-Ping HSIEH, Subhankar GHOSH, Boris GINSBURG
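A minimal sketch of the adapter pattern described above, using stand-in modules: the base TTS layer is frozen, a small residual adapter is added, and only the adapter parameters receive gradient updates when adapting to a new speaker. The layer sizes and placeholder loss are assumptions.

```python
import torch
import torch.nn as nn

class AdaptedBlock(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.base = nn.Linear(d, d)                 # pretrained base layer (will be frozen)
        self.adapter = nn.Sequential(nn.Linear(d, 8), nn.ReLU(), nn.Linear(8, d))
    def forward(self, x):
        return self.base(x) + self.adapter(x)       # residual adapter

model = AdaptedBlock(d=32)
for p in model.base.parameters():
    p.requires_grad_(False)                          # freeze the base TTS parameters

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

x = torch.randn(4, 32)                               # stand-in features for the new speaker
loss = (model(x) - torch.randn(4, 32)).pow(2).mean() # placeholder reconstruction loss
loss.backward()
optimizer.step()
print(sum(p.numel() for p in trainable), "adapter parameters updated")
```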
  • Publication number: 20240119927
    Abstract: Disclosed are apparatuses, systems, and techniques that may use machine learning for implementing speaker recognition, verification, and/or diarization. The techniques include applying a neural network (NN) to speech data to obtain a speaker embedding representative of an association between the speech data and a speaker that produced the speech. The speech data includes a plurality of frames and a plurality of channels representative of spectral content of the speech data. The NN has one or more blocks of neurons that include a first branch performing convolutions of the speech data across the plurality of channels and across the plurality of frames and a second branch performing convolutions of the speech data across the plurality of channels. Obtained speaker embeddings may be used for various tasks of speaker identification, verification, and/or diarization.
    Type: Application
    Filed: October 7, 2022
    Publication date: April 11, 2024
    Inventors: Nithin Rao Koluguri, Taejin Park, Boris Ginsburg
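A rough sketch, with invented sizes, of the two-branch block described above: one branch convolves the spectral input across channels and across frames (a kernel spanning time), the other across channels only (a pointwise kernel), their outputs are combined, and pooling over frames yields a speaker embedding. This is an illustrative stand-in, not the patented architecture.

```python
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.time_branch = nn.Conv1d(channels, channels, kernel_size=5, padding=2)  # channels + frames
        self.channel_branch = nn.Conv1d(channels, channels, kernel_size=1)          # channels only
    def forward(self, x):                        # x: (batch, channels, frames)
        return torch.relu(self.time_branch(x) + self.channel_branch(x))

features = torch.randn(2, 80, 200)               # batch of 80-channel spectral inputs, 200 frames
block = TwoBranchBlock(80)
embedding = block(features).mean(dim=-1)         # pool over frames -> speaker embedding
print(embedding.shape)                           # (2, 80)
```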
  • Publication number: 20240112021
    Abstract: Systems and methods provide for a machine learning system to train a machine learning model to output a multi-frame blank symbol when processing an auditory input. For example, as the system generates paths through a probability lattice, one or more paths include a multi-frame blank that skips at least one frame associated with the probability lattice. The inclusion of the multi-frame blank symbol may increase a total number of potential paths through the probability lattice, and may allow the machine learning model to more quickly and accurately process audio frames, while disregarding audio frames of less value. In deployment, when an output of the machine learning model indicates a multi-frame blank symbol or token, one or more frames of the auditory input may be omitted from processing.
    Type: Application
    Filed: October 4, 2022
    Publication date: April 4, 2024
    Inventors: Hainan Xu, Boris Ginsburg
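A toy greedy-decoding sketch of the multi-frame blank idea: when the model's top output is a blank token associated with k frames, the decoder advances k frames and skips their processing. The vocabulary, blank durations, and the random stand-in "model" are placeholders.

```python
import numpy as np

VOCAB = ["a", "b", "c", "<blank:1>", "<blank:2>", "<blank:4>"]
BLANK_FRAMES = {"<blank:1>": 1, "<blank:2>": 2, "<blank:4>": 4}

rng = np.random.default_rng(0)
def model_step(frame: np.ndarray) -> str:
    """Stand-in acoustic model: returns the most likely symbol for a frame."""
    return VOCAB[int(rng.integers(len(VOCAB)))]

frames = [rng.standard_normal(80) for _ in range(20)]
t, transcript = 0, []
while t < len(frames):
    symbol = model_step(frames[t])
    if symbol in BLANK_FRAMES:
        t += BLANK_FRAMES[symbol]                 # multi-frame blank: skip ahead
    else:
        transcript.append(symbol)
        t += 1
print("".join(transcript))
```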
  • Publication number: 20240071366
    Abstract: Systems and methods provide for text normalization or inverse text normalization using a hybrid language system that combines rule-based processing with neural or learned processing. For example, a hybrid rule-based and neural approach identifies semiotic tokens within a textual input and generates a set of potential plain-text conversions of the semiotic tokens. The plain-text conversions are weighted and evaluated by a trained language model that rescores the plain-text conversion based on context to identify a highest scoring plain-text conversion for further processing within a language system pipeline.
    Type: Application
    Filed: August 31, 2022
    Publication date: February 29, 2024
    Inventors: Evelina Bakhturina, Yang Zhang, Boris Ginsburg
  • Publication number: 20210232366
    Abstract: A method, computer readable medium, and system are disclosed for rounding floating point values. Dynamic directional rounding is a rounding technique for floating point operations. A floating point operation (addition, subtraction, multiplication, etc.) is performed on an operand to compute a floating point result. A sign (positive or negative) of the operand is identified. In one embodiment, the sign determines a direction in which the floating point result is rounded (towards negative or positive infinity). When used for updating parameters of a neural network during backpropagation, dynamic directional rounding ensures that rounding is performed in the direction of the gradient.
    Type: Application
    Filed: February 1, 2021
    Publication date: July 29, 2021
    Inventors: Alex Fit-Florea, Boris Ginsburg, Pooya Davoodi, Amir Gholaminejad
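A simplified numeric illustration of dynamic directional rounding as described above: after an operation, the result is rounded toward positive or negative infinity according to the sign of the operand (e.g. the gradient update), so rounding never moves against the update direction. The fixed quantization step below stands in for a low-precision float grid.

```python
import math

STEP = 1.0 / 256.0                            # stand-in for low-precision spacing

def round_directional(value: float, operand_sign: float) -> float:
    """Round toward -inf for negative operands, toward +inf for positive ones."""
    quanta = value / STEP
    rounded = math.floor(quanta) if operand_sign < 0 else math.ceil(quanta)
    return rounded * STEP

weight, grad, lr = 0.5, 0.0009, 1.0
update = -lr * grad                           # operand applied to the weight
new_weight = round_directional(weight + update, operand_sign=update)
print(new_weight)                             # 0.49609375, rounded in the gradient direction
# round-to-nearest would give 128/256 = 0.5, silently cancelling the small update
```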
  • Patent number: 10908878
    Abstract: A method, computer readable medium, and system are disclosed for rounding floating point values. Dynamic directional rounding is a rounding technique for floating point operations. A floating point operation (addition, subtraction, multiplication, etc.) is performed on an operand to compute a floating point result. A sign (positive or negative) of the operand is identified. In one embodiment, the sign determines a direction in which the floating point result is rounded (towards negative or positive infinity). When used for updating parameters of a neural network during backpropagation, dynamic directional rounding ensures that rounding is performed in the direction of the gradient.
    Type: Grant
    Filed: November 26, 2018
    Date of Patent: February 2, 2021
    Assignee: NVIDIA Corporation
    Inventors: Alex Fit-Florea, Boris Ginsburg, Pooya Davoodi, Amir Gholaminejad
  • Publication number: 20200167125
    Abstract: A method, computer readable medium, and system are disclosed for rounding floating point values. Dynamic directional rounding is a rounding technique for floating point operations. A floating point operation (addition, subtraction, multiplication, etc.) is performed on an operand to compute a floating point result. A sign (positive or negative) of the operand is identified. In one embodiment, the sign determines a direction in which the floating point result is rounded (towards negative or positive infinity). When used for updating parameters of a neural network during backpropagation, dynamic directional rounding ensures that rounding is performed in the direction of the gradient.
    Type: Application
    Filed: November 26, 2018
    Publication date: May 28, 2020
    Inventors: Alex Fit-Florea, Boris Ginsburg, Pooya Davoodi, Amir Gholaminejad
  • Publication number: 20170372202
    Abstract: Aspects of the present invention are directed to computer-implemented techniques for improving the training of artificial neural networks using a reduced precision (e.g., float16) data format. Embodiments of the present invention rescale tensor values prior to performing matrix operations (such as matrix multiplication or matrix addition) to prevent overflow and underflow. To preserve accuracy throughout the performance of the matrix operations, the scale factors are defined using a novel data format to represent tensors, wherein a matrix is represented by the tuple X, where X=(a, v[.]), wherein a is a float scale factor and v[.] are scaled values stored in the float16 format. The value of any element X[i] according to this data format would be equal to a*v[i].
    Type: Application
    Filed: June 15, 2017
    Publication date: December 28, 2017
    Inventors: Boris GINSBURG, Sergei NIKOLAEV, Ahmad KISWANI, Hao WU, Amir GHOLAMINEJAD, Slawomir KIERAT, Michael HOUSTON, Alex FIT-FLOREA
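A minimal sketch of the scaled-tensor format spelled out above: a tensor X is stored as a tuple (a, v), where a is a float scale factor and v holds scaled values in float16, so X[i] = a*v[i]. Rescaling keeps the float16 values near one, so a matrix product that would overflow in raw float16 stays finite, with the scale factors recombined outside the low-precision data. The class below is an illustrative assumption, not the patented method.

```python
import numpy as np

class ScaledTensor:
    def __init__(self, values: np.ndarray):
        self.a = float(np.max(np.abs(values))) or 1.0   # float scale factor
        self.v = (values / self.a).astype(np.float16)   # |v| <= 1, safe in float16

    def matmul(self, other: "ScaledTensor") -> np.ndarray:
        prod = self.v @ other.v                          # float16 product, no overflow
        return (self.a * other.a) * prod.astype(np.float32)

rng = np.random.default_rng(0)
big_x = rng.standard_normal((4, 8)) * 1e4
big_y = rng.standard_normal((8, 4)) * 1e4
naive = big_x.astype(np.float16) @ big_y.astype(np.float16)     # products exceed float16 range
scaled = ScaledTensor(big_x).matmul(ScaledTensor(big_y))
print(np.isinf(naive).any(), np.isinf(scaled).any())             # True False
```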