Patents by Inventor Steve Haihang Dai

Steve Haihang Dai has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 12045307
    Abstract: Today neural networks are used to enable autonomous vehicles and improve the quality of speech recognition, real-time language translation, and online search optimizations. However, operation of the neural networks for these applications consumes energy. Quantization of parameters used by the neural networks reduces the amount of memory needed to store the parameters while also reducing the power consumed during operation of the neural network. Matrix operations performed by the neural networks require many multiplication calculations, so reducing the number of bits that are multiplied reduces the energy that is consumed. Quantizing smaller sets of the parameters using a shared scale factor improves accuracy compared with quantizing larger sets of the parameters. Accuracy of the calculations may be maintained by quantizing and scaling the parameters using fine-grained per-vector scale factors. A vector includes one or more elements within a single dimension of a multi-dimensional matrix.
    Type: Grant
    Filed: October 30, 2020
    Date of Patent: July 23, 2024
    Assignee: NVIDIA Corporation
    Inventors: Brucek Kurdo Khailany, Steve Haihang Dai, Rangharajan Venkatesan, Haoxing Ren
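The per-vector scaling described in the abstract above can be illustrated with a minimal sketch. The function names, the 16-element vector size, and the 4-bit signed format below are illustrative assumptions, not details drawn from the patent.

```python
import numpy as np

def quantize_per_vector(matrix, vector_size=16, bits=4):
    """Quantize each row in chunks ("vectors") of vector_size elements,
    using a separate scale factor per vector (illustrative sketch only)."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    rows, cols = matrix.shape
    assert cols % vector_size == 0
    # Split each row into contiguous vectors along the last dimension.
    vectors = matrix.reshape(rows, cols // vector_size, vector_size)
    # One scale factor per vector: map the largest magnitude to qmax.
    scales = np.abs(vectors).max(axis=-1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)     # avoid divide-by-zero
    q = np.clip(np.round(vectors / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_per_vector(q, scales, shape):
    """Recover an approximation of the original matrix."""
    return (q * scales).reshape(shape)

if __name__ == "__main__":
    w = np.random.randn(4, 64).astype(np.float32)
    q, s = quantize_per_vector(w)
    w_hat = dequantize_per_vector(q, s, w.shape)
    print("mean abs error:", np.abs(w - w_hat).mean())
```

Because each short vector receives its own scale factor, an outlier in one vector does not coarsen the quantization of the rest of the row.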
  • Publication number: 20230237308
    Abstract: Quantizing tensors and vectors processed within a neural network reduces power consumption and may accelerate processing. Quantization reduces the number of bits used to represent a value, and decreasing the number of bits can decrease the accuracy of computations that use the value. Ideally, quantization is performed without reducing accuracy. Quantization-aware training (QAT) is performed by dynamically quantizing tensors (weights and activations) using optimal clipping scalars: “optimal” in that the mean squared error (MSE) of the quantized operation is minimized, with the clipping scalars defining the degree or amount of quantization for the various tensors of the operation. Conventional techniques that quantize tensors during training suffer from high amounts of noise (error). Other techniques compute the clipping scalars offline through a brute-force search to provide high accuracy.
    Type: Application
    Filed: July 26, 2022
    Publication date: July 27, 2023
    Inventors: Charbel Sakr, Steve Haihang Dai, Brucek Kurdo Khailany, William James Dally, Rangharajan Venkatesan, Brian Matthew Zimmer
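To make the notion of an MSE-optimal clipping scalar concrete, here is a minimal sketch that sweeps candidate clipping scalars and keeps the one with the lowest quantization MSE. The publication describes computing such scalars dynamically during training rather than by offline search; the sweep below only illustrates the objective being minimized, and all names and parameter choices are assumptions.

```python
import numpy as np

def clipped_quant_mse(x, clip, bits=4):
    """Mean squared error of uniformly quantizing x after clipping to [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1
    step = clip / levels
    xq = np.clip(x, -clip, clip)
    xq = np.round(xq / step) * step
    return np.mean((x - xq) ** 2)

def mse_optimal_clip(x, bits=4, num_candidates=200):
    """Pick the clipping scalar that minimizes quantization MSE
    (simple sweep shown only to illustrate the objective)."""
    candidates = np.linspace(1e-3, np.abs(x).max(), num_candidates)
    errors = [clipped_quant_mse(x, c, bits) for c in candidates]
    return candidates[int(np.argmin(errors))]

if __name__ == "__main__":
    activations = np.random.randn(10000)
    clip = mse_optimal_clip(activations, bits=4)
    print("MSE-optimal clipping scalar:", clip)
    print("MSE at that clip:", clipped_quant_mse(activations, clip, bits=4))
```

The clipping scalar trades clipping error against rounding error: a larger clip preserves outliers but widens the quantization step for everything else.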
  • Publication number: 20220261650
    Abstract: An end-to-end low-precision training system based on a multi-base logarithmic number system (LNS) and a multiplicative weight update algorithm. The multi-base logarithmic number system is applied to update the weights of the neural network, with different bases of the multi-base LNS used for the calculation of weight updates, the calculation of feed-forward signals, and the calculation of feedback signals. The LNS provides a high dynamic range and computational energy efficiency, making it advantageous for on-board training in energy-constrained edge devices.
    Type: Application
    Filed: June 11, 2021
    Publication date: August 18, 2022
    Applicant: NVIDIA Corp.
    Inventors: Jiawei Zhao, Steve Haihang Dai, Rangharajan Venkatesan, Ming-Yu Liu, William James Dally, Anima Anandkumar
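A minimal sketch of the two ideas named in the abstract above: storing values as integer exponents of a base 2^(1/gamma) (a multi-base logarithmic number system), and updating weights multiplicatively, which becomes an addition in the log domain. The specific update rule, gamma, and exponent width below are illustrative assumptions and are not taken from the publication.

```python
import numpy as np

def lns_quantize(x, gamma=4, exp_bits=6):
    """Quantize values to a logarithmic number system with base 2**(1/gamma).
    Each value is stored as a sign and an integer exponent e, so that
    x ~= sign * 2**(e / gamma). gamma and exp_bits are illustrative choices."""
    sign = np.sign(x)
    mag = np.maximum(np.abs(x), 1e-12)               # avoid log(0)
    e = np.round(gamma * np.log2(mag))
    e_min = -(2 ** (exp_bits - 1))                   # clamp exponent range
    e_max = 2 ** (exp_bits - 1) - 1
    e = np.clip(e, e_min, e_max)
    return sign, e

def lns_to_float(sign, e, gamma=4):
    """Convert the (sign, exponent) LNS representation back to floating point."""
    return sign * 2.0 ** (e / gamma)

def multiplicative_update(w, grad, lr=0.01):
    """A generic multiplicative weight update: the weight magnitude is scaled
    by a factor, which is an addition to the exponent in the log domain."""
    return w * np.exp(-lr * np.sign(w) * grad)

if __name__ == "__main__":
    w = np.random.randn(5).astype(np.float32)
    g = np.random.randn(5).astype(np.float32)
    w_new = multiplicative_update(w, g)
    sign, e = lns_quantize(w_new)
    print("updated weights   :", w_new)
    print("LNS reconstruction:", lns_to_float(sign, e))
```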
  • Publication number: 20220067513
    Abstract: Solutions for improving the efficiency of Softmax computation for deep learning inference in transformers and other neural networks. The solutions utilize a reduced-precision implementation of various operations in Softmax, replacing e^x with 2^x to reduce the instruction overhead associated with computing e^x, and replacing floating-point max computation with integer max computation. Further described is a scalable implementation that decomposes Softmax into UnNormalized Softmax and Normalization operations.
    Type: Application
    Filed: December 4, 2020
    Publication date: March 3, 2022
    Applicant: NVIDIA Corp.
    Inventors: Jacob Robert Stevens, Rangharajan Venkatesan, Steve Haihang Dai, Brucek Khailany
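A minimal sketch of the base-2 Softmax decomposition named in the abstract above, assuming ordinary NumPy floating point throughout; the reduced-precision formats and the integer max computation described in the publication are omitted.

```python
import numpy as np

def softmax_base2(logits):
    """Softmax variant using 2**x instead of e**x, split into an
    "unnormalized" pass and a separate normalization pass
    (illustrative sketch only)."""
    # Subtract the max for numerical stability; hardware could perform this
    # comparison on an integer view of the values, but we keep it simple here.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    # Unnormalized Softmax: 2**x is cheap in hardware (a shift for integer x).
    unnormalized = np.exp2(shifted)
    # Separate Normalization step.
    return unnormalized / np.sum(unnormalized, axis=-1, keepdims=True)

if __name__ == "__main__":
    x = np.array([[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]])
    print(softmax_base2(x))
    # Note: replacing e**x with 2**x rescales the logits by ln(2); a network
    # can absorb this constant factor into the preceding layer's weights.
```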
  • Publication number: 20220067512
    Abstract: Today neural networks are used to enable autonomous vehicles and improve the quality of speech recognition, real-time language translation, and online search optimizations. However, operation of the neural networks for these applications consumes energy. Quantization of parameters used by the neural networks reduces the amount of memory needed to store the parameters while also reducing the power consumed during operation of the neural network. Matrix operations performed by the neural networks require many multiplication calculations, so reducing the number of bits that are multiplied reduces the energy that is consumed. Quantizing smaller sets of the parameters using a shared scale factor improves accuracy compared with quantizing larger sets of the parameters. Accuracy of the calculations may be maintained by quantizing and scaling the parameters using fine-grained per-vector scale factors. A vector includes one or more elements within a single dimension of a multi-dimensional matrix.
    Type: Application
    Filed: October 30, 2020
    Publication date: March 3, 2022
    Inventors: Brucek Kurdo Khailany, Steve Haihang Dai, Rangharajan Venkatesan, Haoxing Ren
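This entry shares its abstract with patent 12045307 above. As a complementary sketch, the following shows where per-vector scale factors sit in a dot product: the multiplications happen on narrow integers, and each vector's scale factors are applied once per partial sum rather than once per multiply. Names and parameter choices are illustrative assumptions.

```python
import numpy as np

def quantize_vector(v, bits=4):
    """Quantize one vector with its own scale factor (sketch only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(v).max() / qmax, 1e-12)
    return np.clip(np.round(v / scale), -qmax, qmax).astype(np.int32), scale

def per_vector_dot(a, b, vector_size=16, bits=4):
    """Dot product where each vector_size chunk of a and b is quantized with
    its own scale; integer partial sums are rescaled before accumulation."""
    assert a.size == b.size and a.size % vector_size == 0
    total = 0.0
    for i in range(0, a.size, vector_size):
        qa, sa = quantize_vector(a[i:i + vector_size], bits)
        qb, sb = quantize_vector(b[i:i + vector_size], bits)
        # Narrow integer multiplies dominate the energy cost; the per-vector
        # scale factors are applied once per partial sum.
        total += float(np.dot(qa, qb)) * sa * sb
    return total

if __name__ == "__main__":
    a = np.random.randn(64)
    b = np.random.randn(64)
    print("full precision :", float(np.dot(a, b)))
    print("per-vector 4bit:", per_vector_dot(a, b))
```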
  • Publication number: 20220067530
    Abstract: Today neural networks are used to enable autonomous vehicles and improve the quality of speech recognition, real-time language translation, and online search optimizations. However, operation of the neural networks for these applications consumes energy. Quantization of parameters used by the neural networks reduces the amount of memory needed to store the parameters while also reducing the power consumed during operation of the neural network. Matrix operations performed by the neural networks require many multiplication calculations, so reducing the number of bits that are multiplied reduces the energy that is consumed. Quantizing smaller sets of the parameters using a shared scale factor improves accuracy compared with quantizing larger sets of the parameters. Accuracy of the calculations may be maintained by quantizing and scaling the parameters using fine-grained per-vector scale factors. A vector includes one or more elements within a single dimension of a multi-dimensional matrix.
    Type: Application
    Filed: October 30, 2020
    Publication date: March 3, 2022
    Inventors: Brucek Kurdo Khailany, Steve Haihang Dai, Rangharajan Venkatesan, Haoxing Ren
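This entry also shares its abstract with patent 12045307 above. The sketch below illustrates the abstract's claim that sharing a scale factor across a smaller group of parameters improves accuracy: with a few outliers in the tensor, the measured quantization error is lower for 16-element groups than for a single scale shared by the whole tensor. The group sizes and test data are illustrative assumptions.

```python
import numpy as np

def quant_error(x, group_size, bits=4):
    """Mean squared quantization error when one scale factor is shared by
    each group of group_size consecutive elements (sketch only)."""
    qmax = 2 ** (bits - 1) - 1
    groups = x.reshape(-1, group_size)
    scales = np.maximum(np.abs(groups).max(axis=1, keepdims=True) / qmax, 1e-12)
    xq = np.clip(np.round(groups / scales), -qmax, qmax) * scales
    return float(np.mean((groups - xq) ** 2))

if __name__ == "__main__":
    x = np.random.randn(4096)
    # A few large outliers make one shared scale for the whole tensor coarse.
    x[::512] *= 20
    print("per-tensor (group=4096):", quant_error(x, 4096))
    print("per-vector (group=16)  :", quant_error(x, 16))
```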