Patents by Inventor William James Dally

William James Dally has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11977766
    Abstract: A hierarchical network enables access for a stacked memory system including one or more memory dies that each include multiple memory tiles. The processor die includes multiple processing tiles that are stacked with the one or more memory dies. The memory tiles that are vertically aligned with a processing tile are directly coupled to the processing tile and comprise the local memory block for the processing tile. The hierarchical network provides access paths for each processing tile to access the processing tile's local memory block, the local memory block coupled to a different processing tile within the same processing die, memory tiles in a different die stack, and memory tiles in a different device. The ratio of memory bandwidth (byte) to floating-point operation (B:F) may improve 50× for accessing the local memory block compared with conventional memory. Additionally, the energy consumed to transfer each bit may be reduced by 10×.
    Type: Grant
    Filed: February 28, 2022
    Date of Patent: May 7, 2024
    Assignee: NVIDIA Corporation
    Inventors: William James Dally, Carl Thomas Gray, Stephen W. Keckler, James Michael O'Connor
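A rough routing sketch may help illustrate the four access paths named in the abstract above. This is a minimal Python illustration under assumed names and topology (tile coordinates, path labels, and counts are all invented for the example), not the patented design:

```python
# Hypothetical routing for a stacked processor/memory hierarchy.
# Tile coordinates and path names are assumptions for illustration.
def route(requester, target):
    """Pick the access path for a request from one processing tile.

    Both arguments are (device, stack, tile) coordinates; the target
    names the memory tile holding the requested data.
    """
    if requester == target:
        return "local memory block"       # vertically aligned tiles
    if requester[:2] == target[:2]:
        return "intra-die network"        # another tile's block, same stack
    if requester[0] == target[0]:
        return "inter-stack network"      # different die stack, same device
    return "inter-device network"         # different device

assert route((0, 0, 3), (0, 0, 3)) == "local memory block"
assert route((0, 0, 3), (0, 0, 7)) == "intra-die network"
assert route((0, 0, 3), (0, 2, 7)) == "inter-stack network"
assert route((0, 0, 3), (1, 0, 3)) == "inter-device network"
```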
  • Publication number: 20240112007
    Abstract: Neural networks, in many cases, include convolution layers that are configured to perform many convolution operations that require multiplication and addition operations. Compared with performing multiplication on integer, fixed-point, or floating-point format values, performing multiplication on logarithmic format values is straightforward and energy efficient as the exponents are simply added. However, performing addition on logarithmic format values is more complex. Conventionally, addition is performed by converting the logarithmic format values to integers, computing the sum, and then converting the sum back into the logarithmic format. Instead, logarithmic format values may be added by decomposing the exponents into separate quotient and remainder components, sorting the quotient components based on the remainder components, summing the sorted quotient components to produce partial sums, and multiplying the partial sums by the remainder components to produce a sum.
    Type: Application
    Filed: December 12, 2023
    Publication date: April 4, 2024
    Inventors: William James Dally, Rangharajan Venkatesan, Brucek Kurdo Khailany
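The quotient/remainder trick in this abstract is compact enough to sketch directly. The Python below assumes values encoded as 2**(E/N) for an integer exponent E with an assumed modulus N = 4; hardware would replace the powers of two with shifts and a small table of constant remainder factors:

```python
import math

N = 4  # assumed: values are encoded as 2**(E/N) for an integer exponent E

def log_domain_sum(exponents):
    """Sum values given only their log-domain integer exponents.

    Per the abstract: split each exponent into quotient q and remainder
    r (E = q*N + r), accumulate 2**q into the bucket for r, then scale
    each partial sum by its constant remainder factor 2**(r/N).
    """
    partials = [0.0] * N
    for E in exponents:
        q, r = divmod(E, N)        # quotient/remainder decomposition
        partials[r] += 2.0 ** q    # integer-exponent add (a shift in hardware)
    return sum(p * 2.0 ** (r / N) for r, p in enumerate(partials))

exponents = [5, 7, 2, 9, 4]
assert math.isclose(log_domain_sum(exponents),
                    sum(2 ** (E / N) for E in exponents))
```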
  • Patent number: 11886980
    Abstract: Neural networks, in many cases, include convolution layers that are configured to perform many convolution operations that require multiplication and addition operations. Compared with performing multiplication on integer, fixed-point, or floating-point format values, performing multiplication on logarithmic format values is straightforward and energy efficient as the exponents are simply added. However, performing addition on logarithmic format values is more complex. Conventionally, addition is performed by converting the logarithmic format values to integers, computing the sum, and then converting the sum back into the logarithmic format. Instead, logarithmic format values may be added by decomposing the exponents into separate quotient and remainder components, sorting the quotient components based on the remainder components, summing the sorted quotient components to produce partial sums, and multiplying the partial sums by the remainder components to produce a sum.
    Type: Grant
    Filed: August 23, 2019
    Date of Patent: January 30, 2024
    Assignee: NVIDIA Corporation
    Inventors: William James Dally, Rangharajan Venkatesan, Brucek Kurdo Khailany
  • Publication number: 20230385232
    Abstract: A mapping may be made between an array of physical processors and an array of functional logical processors. Also, a mapping may be made between logical memory channels (associated with the logical processors) and functional physical memory channels (associated with the physical processors). These mappings may be stored within one or more tables, which may then be used to bypass faulty processors and memory channels when implementing memory accesses, while optimizing locality (e.g., by minimizing the distance between memory channels and processors).
    Type: Application
    Filed: July 27, 2023
    Publication date: November 30, 2023
    Inventor: William James Dally
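A table-based remap of the kind described here fits in a few lines. The sketch below is hypothetical (array sizes and the fault pattern are invented for the example) and ignores the locality optimization a real mapper would also perform:

```python
# Which physical units are functional (assumed fault pattern).
processor_ok = [True, False, True, True]   # physical processor 1 is faulty
channel_ok   = [True, True, False, True]   # physical channel 2 is faulty

# Build logical->physical tables once, skipping faulty units. A real
# mapper would also order entries to keep channels near their processors.
proc_table = [p for p, ok in enumerate(processor_ok) if ok]
chan_table = [c for c, ok in enumerate(channel_ok) if ok]

def translate(logical_proc, logical_chan):
    """Map a logical (processor, channel) pair onto functional hardware."""
    return proc_table[logical_proc], chan_table[logical_chan]

# Logical processor 1 lands on physical processor 2, bypassing the fault.
assert translate(1, 2) == (2, 3)
```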
  • Patent number: 11809989
    Abstract: When a signal glitches, logic receiving the signal may change in response, thereby charging and/or discharging nodes within the logic and dissipating power. Providing a glitch-free signal may reduce the number of times the nodes are charged and/or discharged, thereby reducing the power dissipation. A technique for eliminating glitches in a signal is to insert a storage element that samples the signal after it is done changing to produce a glitch-free output signal. The storage element is enabled by a “ready” signal having a delay that matches the delay of circuitry generating the signal. The technique prevents the output signal from changing until the final value of the signal is achieved. The output signal changes only once, typically reducing the number of times nodes in the logic receiving the signal are charged and/or discharged so that power dissipation is also reduced.
    Type: Grant
    Filed: July 2, 2020
    Date of Patent: November 7, 2023
    Assignee: NVIDIA Corporation
    Inventor: William James Dally
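The effect of the matched-delay enable can be shown with a toy trace. The sketch below is an illustration only; the timing values are assumptions chosen to mirror the abstract:

```python
# The raw signal toggles while the upstream logic settles; the storage
# element samples it only when a "ready" pulse, delayed to match the
# logic, fires. Downstream logic then sees a single clean transition.
raw_signal  = [0, 1, 0, 1, 1, 1, 1, 1]   # glitches before settling to 1
logic_delay = 4                          # assumed settling delay

output, trace = 0, []
for t, value in enumerate(raw_signal):
    if t == logic_delay:                 # "ready" matched to the logic delay
        output = value                   # storage element samples once
    trace.append(output)

# One 0->1 transition instead of three, so fewer nodes charge/discharge.
assert trace == [0, 0, 0, 0, 1, 1, 1, 1]
```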
  • Publication number: 20230315651
    Abstract: Embodiments of the present disclosure relate to application partitioning for locality in a stacked memory system. In an embodiment, one or more memory dies are stacked on the processor die. The processor die includes multiple processing tiles and each memory die includes multiple memory tiles. Vertically aligned memory tiles are directly coupled to and comprise the local memory block for a corresponding processing tile. An application program that operates on dense multi-dimensional arrays (matrices) may partition the dense arrays into sub-arrays associated with program tiles. Each program tile is executed by a processing tile using the processing tile's local memory block to process the associated sub-array. Data associated with each sub-array is stored in a local memory block and the processing tile corresponding to the local memory block executes the program tile to process the sub-array data.
    Type: Application
    Filed: March 30, 2022
    Publication date: October 5, 2023
    Inventors: William James Dally, Carl Thomas Gray, Stephen W. Keckler, James Michael O'Connor
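The partitioning this abstract describes can be sketched with NumPy. The grid size and the per-tile reduction below are assumptions for the example, not the patented mapping:

```python
import numpy as np

GRID = 2   # assumed: a 2x2 array of processing tiles

def partition(matrix):
    """Split a dense matrix into GRID x GRID sub-arrays keyed by tile index."""
    rows = np.array_split(matrix, GRID, axis=0)
    return {(i, j): block
            for i, row in enumerate(rows)
            for j, block in enumerate(np.array_split(row, GRID, axis=1))}

A = np.arange(16).reshape(4, 4)
tiles = partition(A)

# Each processing tile works on its sub-array out of its local memory
# block; here each "program tile" just sums its sub-array.
partials = {tile: block.sum() for tile, block in tiles.items()}
assert sum(partials.values()) == A.sum()
```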
  • Patent number: 11769040
    Abstract: A distributed deep neural net (DNN) utilizing a distributed, tile-based architecture implemented on a semiconductor package. The package includes multiple chips, each with a central processing element, a global memory buffer, and multiple additional processing elements. Each processing element includes a weight buffer, an activation buffer, and multiply-accumulate units to combine, in parallel, the weight values and the activation values.
    Type: Grant
    Filed: July 19, 2019
    Date of Patent: September 26, 2023
    Assignee: NVIDIA Corp.
    Inventors: Yakun Shao, Rangharajan Venkatesan, Nan Jiang, Brian Matthew Zimmer, Jason Clemons, Nathaniel Pinckney, Matthew R. Fojtik, William James Dally, Joel S. Emer, Stephen W. Keckler, Brucek Khailany
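A single processing element's dataflow reduces to a batched multiply-accumulate. The sketch below is a minimal NumPy illustration with assumed buffer shapes, showing weights held stationary in the weight buffer while activations stream through:

```python
import numpy as np

LANES = 8   # assumed number of parallel multiply-accumulate lanes

rng = np.random.default_rng(0)
weight_buffer = rng.standard_normal((LANES, 4))   # stationary weights
activation_buffer = rng.standard_normal(4)        # streamed activations

# Every lane combines its weight row with the shared activations in
# parallel; each dot product is a chain of multiply-accumulates.
partial_sums = weight_buffer @ activation_buffer
assert partial_sums.shape == (LANES,)
```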
  • Publication number: 20230297269
    Abstract: A hierarchical network enables access for a stacked memory system including one or more memory dies that each include multiple memory tiles. The processor die includes multiple processing tiles that are stacked with the one or more memory dies. The memory tiles that are vertically aligned with a processing tile are directly coupled to the processing tile and comprise the local memory block for the processing tile. The hierarchical network provides access paths for each processing tile to access the processing tile's local memory block, the local memory block coupled to a different processing tile within the same processing die, memory tiles in a different die stack, and memory tiles in a different device. The ratio of memory bandwidth (byte) to floating-point operation (B:F) may improve 50× for accessing the local memory block compared with conventional memory. Additionally, the energy consumed to transfer each bit may be reduced by 10×.
    Type: Application
    Filed: February 28, 2022
    Publication date: September 21, 2023
    Inventors: William James Dally, Carl Thomas Gray, Stephen W. Keckler, James Michael O'Connor
  • Publication number: 20230297499
    Abstract: A mapper within a single-level memory system may facilitate memory localization to reduce the energy and latency of memory accesses. The mapper may translate a memory request received from a processor for implementation at a data storage entity, where the translation identifies the data storage entity and the starting location within that entity where the data associated with the memory request resides. The data storage entity may be co-located with the processor that sent the request, enabling memory localization and significantly improving memory performance by reducing the energy of data accesses and increasing data bandwidth.
    Type: Application
    Filed: January 21, 2022
    Publication date: September 21, 2023
    Inventors: William James Dally, Stephen William Keckler, Carl Thomas Gray, James Michael O'Connor
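The translation step this abstract describes amounts to a table lookup plus an offset. The sketch below is hypothetical (the block size, entity count, and interleaved table are invented for the example); a real mapper would populate the table so data lands on the entity co-located with its processor:

```python
NUM_ENTITIES = 8      # assumed number of data storage entities
BLOCK_BYTES  = 4096   # assumed mapping granularity

def translate(address, mapping_table):
    """Return (storage entity, starting offset) for a memory request."""
    block, offset = divmod(address, BLOCK_BYTES)
    return mapping_table[block], offset

# Simple interleave for the demo; locality-aware placement would differ.
table = [b % NUM_ENTITIES for b in range(64)]
assert translate(0x1234, table) == (1, 0x234)
```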
  • Publication number: 20230275068
    Abstract: Embodiments of the present disclosure relate to memory stacked on processor for high bandwidth. Systems and methods are disclosed for providing a one-level memory for a processing system by stacking bulk memory on a processor die. In an embodiment, one or more memory dies are stacked on the processor die. The processor die includes multiple processing tiles, where each tile includes a processing unit, mapper, and tile network. Each memory die includes multiple memory tiles. The processing tile is coupled to each memory tile that is above or below the processing tile. The vertically aligned memory tiles comprise the local memory block for the processing tile. The ratio of memory bandwidth (byte) to floating-point operation (B:F) may improve 50× for accessing the local memory block compared with conventional memory. Additionally, the energy consumed to transfer each bit may be reduced by 10×.
    Type: Application
    Filed: February 28, 2022
    Publication date: August 31, 2023
    Inventors: William James Dally, Carl Thomas Gray, Stephen W. Keckler, James Michael O'Connor
  • Patent number: 11726757
    Abstract: The disclosure provides processors that are configured to perform dynamic programming according to an instruction, a method for configuring a processor for dynamic programming according to an instruction, and a method of computing a modified Smith-Waterman algorithm employing an instruction for configuring a parallel processing unit. In one example, the method for configuring includes: (1) receiving, by execution cores of the processor, an instruction that directs the execution cores to compute a set of recurrence equations employing a matrix, (2) configuring the execution cores, according to the set of recurrence equations, to compute states for elements of the matrix, and (3) storing the computed states for current elements of the matrix in registers of the execution cores, wherein the computed states are determined based on the set of recurrence equations and input data.
    Type: Grant
    Filed: March 6, 2020
    Date of Patent: August 15, 2023
    Assignee: NVIDIA Corporation
    Inventor: William James Dally
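For reference, the recurrence such an instruction accelerates is the classic Smith-Waterman fill shown below. This is the plain software form with assumed scoring constants, not the patented hardware mapping:

```python
MATCH, MISMATCH, GAP = 2, -1, -2   # assumed scoring values

def smith_waterman(a, b):
    """Fill the DP matrix H and return the best local-alignment score."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # match/mismatch
                          H[i - 1][j] + GAP,       # gap in b
                          H[i][j - 1] + GAP)       # gap in a
            best = max(best, H[i][j])
    return best

assert smith_waterman("AAA", "AAA") == 3 * MATCH
```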
  • Publication number: 20230237011
    Abstract: A mapping may be made between an array of physical processors and an array of functional logical processors. Also, a mapping may be made between logical memory channels (associated with the logical processors) and functional physical memory channels (associated with the physical processors). These mappings may be stored within one or more tables, which may then be used to bypass faulty processors and memory channels when implementing memory accesses, while optimizing locality (e.g., by minimizing the distance between memory channels and processors).
    Type: Application
    Filed: January 21, 2022
    Publication date: July 27, 2023
    Inventor: William James Dally
  • Publication number: 20230237308
    Abstract: Quantizing tensors and vectors processed within a neural network reduces power consumption and may accelerate processing. Quantization reduces the number of bits used to represent a value, and decreasing the number of bits can decrease the accuracy of computations that use the value. Ideally, quantization is performed without reducing accuracy. Quantization-aware training (QAT) is performed by dynamically quantizing tensors (weights and activations) using optimal clipping scalars. The scalars are “optimal” in that the mean squared error (MSE) of the quantized operation is minimized, and the clipping scalars define the degree or amount of quantization for the various tensors of the operation. Conventional techniques that quantize tensors during training suffer from high amounts of noise (error). Other techniques compute the clipping scalars offline through a brute-force search to provide high accuracy.
    Type: Application
    Filed: July 26, 2022
    Publication date: July 27, 2023
    Inventors: Charbel Sakr, Steve Haihang Dai, Brucek Kurdo Khailany, William James Dally, Rangharajan Venkatesan, Brian Matthew Zimmer
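The offline baseline the abstract contrasts against is easy to sketch: quantize with each candidate clipping scalar and keep the one with the smallest MSE. The bit width, candidate grid, and symmetric uniform quantizer below are assumptions for the example; the patent's contribution is computing such scalars dynamically during training rather than by this brute-force search:

```python
import numpy as np

BITS = 4   # assumed quantization bit width

def quantize(x, clip):
    """Uniform symmetric quantization onto 2**BITS levels in [-clip, clip]."""
    step = 2 * clip / (2 ** BITS - 1)
    return np.clip(np.round(x / step) * step, -clip, clip)

def best_clip(x, candidates):
    """Return the clipping scalar that minimizes quantization MSE."""
    return min(candidates, key=lambda c: np.mean((quantize(x, c) - x) ** 2))

rng = np.random.default_rng(0)
tensor = rng.standard_normal(10_000)
clip = best_clip(tensor, np.linspace(0.5, 4.0, 36))

# Clipping below the max trades clipping error for finer resolution.
assert clip < np.abs(tensor).max()
```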
  • Patent number: 11476852
    Abstract: When a signal glitches, logic receiving the signal may change in response, thereby charging and/or discharging nodes within the logic and dissipating power. Providing a glitch-free signal may reduce the number of times the nodes are charged and/or discharged, thereby reducing the power dissipation. A technique for eliminating glitches in a signal is to insert a storage element that samples the signal after it is done changing to produce a glitch-free output signal. The storage element is enabled by a “ready” signal having a delay that matches the delay of circuitry generating the signal. The technique prevents the output signal from changing until the final value of the signal is achieved. The output signal changes only once, typically reducing the number of times nodes in the logic receiving the signal are charged and/or discharged so that power dissipation is also reduced.
    Type: Grant
    Filed: May 19, 2021
    Date of Patent: October 18, 2022
    Assignee: NVIDIA Corporation
    Inventor: William James Dally
  • Publication number: 20220261650
    Abstract: An end-to-end low-precision training system based on a multi-base logarithmic number system (LNS) and a multiplicative weight update algorithm. The multi-base LNS is applied to update the weights of the neural network, with different bases used for calculating the weight updates, the feed-forward signals, and the feedback signals. The LNS offers a high dynamic range and computational energy efficiency, making it advantageous for on-board training in energy-constrained edge devices.
    Type: Application
    Filed: June 11, 2021
    Publication date: August 18, 2022
    Applicant: NVIDIA Corp.
    Inventors: Jiawei Zhao, Steve Haihang Dai, Rangharajan Venkatesan, Ming-Yu Liu, William James Dally, Anima Anandkumar
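One property worth making concrete: in a logarithmic number system, a multiplicative weight update collapses to an exponent addition. The base and the update below are assumptions for illustration, not the patented training algorithm:

```python
import math

LOG_BASE = 2 ** (1 / 8)   # an assumed fractional-power-of-two base

weight_exp = 16           # weight stored as its exponent: LOG_BASE ** 16
update_exp = -3           # multiplicative update factor: LOG_BASE ** -3

weight_exp += update_exp  # the whole multiplicative update is one add

assert math.isclose(LOG_BASE ** weight_exp,
                    (LOG_BASE ** 16) * (LOG_BASE ** -3))
```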
  • Publication number: 20220076110
    Abstract: A distributed deep neural net (DNN) utilizing a distributed, tile-based architecture includes multiple chips, each with a central processing element, a global memory buffer, and a plurality of additional processing elements. Each additional processing element includes a weight buffer, an activation buffer, and vector multiply-accumulate units to combine, in parallel, the weight values and the activation values using stationary data flows.
    Type: Application
    Filed: November 19, 2021
    Publication date: March 10, 2022
    Applicant: NVIDIA Corp.
    Inventors: Yakun Shao, Rangharajan Venkatesan, Miaorong Wang, Daniel Smith, William James Dally, Joel Emer, Stephen W. Keckler, Brucek Khailany
  • Patent number: 11270197
    Abstract: A distributed deep neural net (DNN) utilizing a distributed, tile-based architecture includes multiple chips, each with a central processing element, a global memory buffer, and a plurality of additional processing elements. Each additional processing element includes a weight buffer, an activation buffer, and vector multiply-accumulate units to combine, in parallel, the weight values and the activation values using stationary data flows.
    Type: Grant
    Filed: November 4, 2019
    Date of Patent: March 8, 2022
    Assignee: NVIDIA Corp.
    Inventors: Yakun Shao, Rangharajan Venkatesan, Miaorong Wang, Daniel Smith, William James Dally, Joel Emer, Stephen W. Keckler, Brucek Khailany
  • Publication number: 20220006457
    Abstract: When a signal glitches, logic receiving the signal may change in response, thereby charging and/or discharging nodes within the logic and dissipating power. Providing a glitch-free signal may reduce the number of times the nodes are charged and/or discharged, thereby reducing the power dissipation. A technique for eliminating glitches in a signal is to insert a storage element that samples the signal after it is done changing to produce a glitch-free output signal. The storage element is enabled by a “ready” signal having a delay that matches the delay of circuitry generating the signal. The technique prevents the output signal from changing until the final value of the signal is achieved. The output signal changes only once, typically reducing the number of times nodes in the logic receiving the signal are charged and/or discharged so that power dissipation is also reduced.
    Type: Application
    Filed: May 19, 2021
    Publication date: January 6, 2022
    Inventor: William James Dally
  • Publication number: 20220004864
    Abstract: When a signal glitches, logic receiving the signal may change in response, thereby charging and/or discharging nodes within the logic and dissipating power. Providing a glitch-free signal may reduce the number of times the nodes are charged and/or discharged, thereby reducing the power dissipation. A technique for eliminating glitches in a signal is to insert a storage element that samples the signal after it is done changing to produce a glitch-free output signal. The storage element is enabled by a “ready” signal having a delay that matches the delay of circuitry generating the signal. The technique prevents the output signal from changing until the final value of the signal is achieved. The output signal changes only once, typically reducing the number of times nodes in the logic receiving the signal are charged and/or discharged so that power dissipation is also reduced.
    Type: Application
    Filed: July 2, 2020
    Publication date: January 6, 2022
    Inventor: William James Dally
  • Patent number: 11070205
    Abstract: When a signal glitches, logic receiving the signal may change in response, thereby charging and/or discharging nodes within the logic and dissipating power. Providing a glitch-free signal may reduce the number of times the nodes are charged and/or discharged, thereby reducing the power dissipation. A technique for eliminating glitches in a signal is to insert a storage element that samples the signal after it is done changing to produce a glitch-free output signal. The storage element is enabled by a “ready” signal having a delay that matches the delay of circuitry generating the signal. The technique prevents the output signal from changing until the final value of the signal is achieved. The output signal changes only once, typically reducing the number of times nodes in the logic receiving the signal are charged and/or discharged so that power dissipation is also reduced.
    Type: Grant
    Filed: July 2, 2020
    Date of Patent: July 20, 2021
    Assignee: NVIDIA Corporation
    Inventor: William James Dally