Patents by Inventor Milind N. Nemlekar

Milind N. Nemlekar has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11960399
    Abstract: Methods, systems, and devices maintain state information in a shadow tag memory for a plurality of cachelines in each of a plurality of private caches, with each of the private caches being associated with a corresponding one of multiple processing cores. One or more cache probes are generated based on a write operation associated with one or more cachelines of the plurality of cachelines, such that each of the cache probes is associated with cachelines of a particular private cache of the multiple private caches, the particular private cache being associated with an indicated processing core. Transmission of the cache probes to the particular private cache is prevented until, responsive to a scope acquire operation from the indicated processing core, the cache probes are released for transmission to the respectively associated cachelines in the particular private cache.
    Type: Grant
    Filed: December 21, 2021
    Date of Patent: April 16, 2024
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Akhil Arunkumar, Tarun Nakra, Maxim V. Kazakov, Milind N. Nemlekar
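    A minimal Python sketch of the deferred-probe scheme described in the abstract above: probes produced by a write are queued per destination core and delivered only when that core performs a scope acquire. The class names, the two-core setup, and the VALID state label are illustrative assumptions, not details from the patent.

      from collections import defaultdict

      class ShadowTagDirectory:
          """Tracks private-cache contents in shadow tag memory and holds back
          invalidation probes until the target core performs a scope acquire."""

          def __init__(self, num_cores):
              # Shadow tag memory: one cacheline -> state map per core.
              self.shadow_tags = [dict() for _ in range(num_cores)]
              # Probes that have been generated but deliberately not sent.
              self.pending_probes = defaultdict(list)

          def track_fill(self, core, line):
              """Record in the shadow tags that `core` now caches `line`."""
              self.shadow_tags[core][line] = "VALID"

          def on_write(self, writer, line):
              """Generate probes for other cores holding `line`, but defer them."""
              for core, tags in enumerate(self.shadow_tags):
                  if core != writer and tags.get(line) == "VALID":
                      self.pending_probes[core].append(line)

          def on_scope_acquire(self, core, private_cache):
              """Release the queued probes; only now does `core` observe them."""
              for line in self.pending_probes.pop(core, []):
                  private_cache.discard(line)
                  self.shadow_tags[core].pop(line, None)

      # Core 1's stale copy of line 0x40 survives core 0's write until core 1
      # issues a scope acquire, at which point the deferred probe is delivered.
      directory = ShadowTagDirectory(num_cores=2)
      cache1 = {0x40}
      directory.track_fill(1, 0x40)
      directory.on_write(0, 0x40)
      assert 0x40 in cache1
      directory.on_scope_acquire(1, cache1)
      assert 0x40 not in cache1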
  • Publication number: 20240029336
    Abstract: Techniques for executing computing work by a plurality of chiplets are provided. The techniques include assigning workgroups of a kernel dispatch packet to the chiplets; by each chiplet, executing the workgroups assigned to that chiplet; for each chiplet, upon completion of all workgroups assigned to that chiplet for the kernel dispatch packet, notifying the other chiplets of such completion; and upon completion of all workgroups of the kernel dispatch packet, notifying a client of such completion and proceeding to a subsequent kernel dispatch packet.
    Type: Application
    Filed: October 3, 2023
    Publication date: January 25, 2024
    Applicant: Advanced Micro Devices, Inc.
    Inventors: Milind N. Nemlekar, Maxim V. Kazakov, Prerit Dak
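    A toy Python model of the dispatch flow in the abstract above, with threads standing in for chiplets and a barrier standing in for the all-to-all completion notification. The chiplet count and the round-robin workgroup assignment are assumptions made for illustration.

      import threading

      NUM_CHIPLETS = 4

      class KernelDispatch:
          """Each chiplet executes its assigned workgroups, then announces
          completion to its peers; the packet is done when all have reported."""

          def __init__(self, num_workgroups):
              # Round-robin assignment of the packet's workgroups to chiplets.
              self.assignment = {c: list(range(c, num_workgroups, NUM_CHIPLETS))
                                 for c in range(NUM_CHIPLETS)}
              self.all_done = threading.Barrier(NUM_CHIPLETS)

          def run_chiplet(self, chiplet, results):
              for wg in self.assignment[chiplet]:
                  results[wg] = wg * wg          # stand-in for real workgroup work
              self.all_done.wait()               # notify the other chiplets

      def execute(packets):
          for i, num_workgroups in enumerate(packets):
              dispatch, results = KernelDispatch(num_workgroups), {}
              threads = [threading.Thread(target=dispatch.run_chiplet,
                                          args=(c, results))
                         for c in range(NUM_CHIPLETS)]
              for t in threads:
                  t.start()
              for t in threads:
                  t.join()
              print(f"packet {i} complete")      # notify the client
              # The loop then proceeds to the subsequent kernel dispatch packet.

      execute(packets=[16, 8])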
  • Patent number: 11790590
    Abstract: Techniques for executing computing work by a plurality of chiplets are provided. The techniques include assigning workgroups of a kernel dispatch packet to the chiplets; by each chiplet, executing the workgroups assigned to that chiplet; for each chiplet, upon completion of all workgroups assigned to that chiplet for the kernel dispatch packet, notifying the other chiplets of such completion; and upon completion of all workgroups of the kernel dispatch packet, notifying a client of such completion and proceeding to a subsequent kernel dispatch packet.
    Type: Grant
    Filed: March 31, 2021
    Date of Patent: October 17, 2023
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Milind N. Nemlekar, Maxim V. Kazakov, Prerit Dak
  • Publication number: 20230195628
    Abstract: Methods, systems, and devices maintain state information in a shadow tag memory for a plurality of cachelines in each of a plurality of private caches, with each of the private caches being associated with a corresponding one of multiple processing cores. One or more cache probes are generated based on a write operation associated with one or more cachelines of the plurality of cachelines, such that each of the cache probes is associated with cachelines of a particular private cache of the multiple private caches, the particular private cache being associated with an indicated processing core. Transmission of the cache probes to the particular private cache is prevented until, responsive to a scope acquire operation from the indicated processing core, the cache probes are released for transmission to the respectively associated cachelines in the particular private cache.
    Type: Application
    Filed: December 21, 2021
    Publication date: June 22, 2023
    Inventors: Akhil Arunkumar, Tarun Nakra, Maxim V. Kazakov, Milind N. Nemlekar
  • Publication number: 20230195664
    Abstract: A method for software management of DMA transfer commands includes receiving a DMA transfer command instructing a data transfer by a first processor device. Based at least in part on a determination of runtime system resource availability, a device different from the first processor device is assigned to assist in transfer of at least a first portion of the data transfer. In some embodiments, the DMA transfer command instructs the first processor device to write a copy of data to a third processor device. Software analyzes network bus congestion at a shared communications bus and initiates DMA transfer via a multi-hop communications path to bypass the congested network bus.
    Type: Application
    Filed: December 22, 2021
    Publication date: June 22, 2023
    Inventors: Sean Keely, Joseph L. Greathouse, Hari Thangirala, Alan D. Smith, Milind N. Nemlekar
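    A hedged Python sketch of the software-side decisions the abstract above describes: splitting a transfer across an idle helper device and rerouting around a congested shared bus. The split heuristic, the 0.75 congestion threshold, and all device names are invented for illustration.

      def plan_dma(command, bus_load, idle_devices, threshold=0.75):
          """Return (plan, route): which device copies which byte range, and
          which path the transfer takes given current bus congestion."""
          src, dst, nbytes = command
          plan, offset = [], 0

          # Runtime resource check: enlist idle devices to assist the transfer.
          helpers = [d for d in idle_devices if d != src]
          chunk = nbytes // (len(helpers) + 1)
          for helper in helpers:
              plan.append((helper, offset, chunk))
              offset += chunk
          plan.append((src, offset, nbytes - offset))    # src moves the rest

          # Congestion check: bypass a busy direct bus via a multi-hop path.
          if bus_load.get((src, dst), 0.0) > threshold:
              route = [src, "intermediate_device", dst]  # hypothetical hop
          else:
              route = [src, dst]
          return plan, route

      plan, route = plan_dma(command=("gpu0", "gpu2", 1 << 20),
                             bus_load={("gpu0", "gpu2"): 0.9},
                             idle_devices=["gpu1"])
      print(plan)   # [('gpu1', 0, 524288), ('gpu0', 524288, 524288)]
      print(route)  # ['gpu0', 'intermediate_device', 'gpu2']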
  • Publication number: 20230132931
    Abstract: A method for hardware management of DMA transfer commands includes accessing, by a first DMA engine, a DMA transfer command and determining a first portion of a data transfer requested by the DMA transfer command. Transfer of a first portion of the data transfer by the first DMA engine is initiated based at least in part on the DMA transfer command. Similarly, a second portion of the data transfer by a second DMA engine is initiated based at least in part on the DMA transfer command. After transferring the first portion and the second portion of the data transfer, an indication is generated that signals completion of the data transfer requested by the DMA transfer command.
    Type: Application
    Filed: November 1, 2021
    Publication date: May 4, 2023
    Inventors: Joseph L. Greathouse, Sean Keely, Alan D. Smith, Anthony Asaro, Ling-Ling Wang, Milind N. Nemlekar, Hari Thangirala, Felix Kuehling
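    A minimal Python sketch of the hardware flow in the abstract above: two engines read the same command, independently derive their portions of the data transfer, and a single completion indication fires once both portions are done. Threads stand in for DMA engines; the even split is an illustrative choice.

      import threading

      def run_engine(engine_id, num_engines, command, dst, remaining, done):
          """Each engine derives its own slice of the transfer from the one
          shared DMA command, then copies that slice."""
          src, total = command
          base = total // num_engines
          start = engine_id * base
          length = base if engine_id < num_engines - 1 else total - start
          dst[start:start + length] = src[start:start + length]
          with remaining["lock"]:                # the last engine to finish
              remaining["count"] -= 1            # raises the single
              if remaining["count"] == 0:        # completion indication
                  done.set()

      src = bytearray(range(256)) * 4
      dst = bytearray(len(src))
      done = threading.Event()
      remaining = {"count": 2, "lock": threading.Lock()}

      engines = [threading.Thread(target=run_engine,
                                  args=(i, 2, (src, len(src)), dst,
                                        remaining, done))
                 for i in range(2)]
      for e in engines:
          e.start()
      done.wait()          # completion of the whole requested data transfer
      assert dst == src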
  • Patent number: 11604737
    Abstract: A processing device determines a scope indicating at least a portion of the processing system and target data from an atomic memory operation to be performed. Based on the scope, the processing device determines one or more hardware parameters for at least a portion of the processing system. The processing device then compares the hardware parameters to the scope and target data to determine one or more corrections. The processing device then provides the scope, target data, hardware parameters, and corrections to a plurality of hardware lookup tables. The hardware lookup tables are configured to receive the scope, target data, hardware parameters, and corrections as inputs and output values indicating one or more coherency actions and one or more orderings. The processing device then executes one or more of the indicated coherency actions and the atomic memory operation based on the indicated ordering.
    Type: Grant
    Filed: November 2, 2021
    Date of Patent: March 14, 2023
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Joseph L. Greathouse, Steven Tony Tye, Mark Fowler, Milind N. Nemlekar
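    A table-driven Python sketch of the flow in the abstract above: the scope and a derived correction index into lookup tables whose outputs are coherency actions and an ordering. The scope names, cache levels, actions, and orderings are invented placeholders, not the patent's actual tables.

      # Hardware parameters per scope (illustrative values only).
      HW_PARAMS = {"agent": {"shared_cache_level": 2},
                   "device": {"shared_cache_level": 3}}

      # Lookup table 1: (scope, correction) -> coherency actions to run first.
      ACTION_TABLE = {("agent", None): ["flush_L1"],
                      ("agent", "widen"): ["flush_L1", "flush_L2"],
                      ("device", None): ["flush_L1", "flush_L2"]}

      # Lookup table 2: scope -> memory ordering applied to the atomic itself.
      ORDER_TABLE = {"agent": "release", "device": "seq_cst"}

      def scoped_atomic(scope, target_is_remote, execute):
          params = HW_PARAMS[scope]
          # Correction: the target data lies beyond what the requested scope's
          # shared cache level can reach, so the effective scope is widened.
          correction = ("widen" if target_is_remote
                        and params["shared_cache_level"] < 3 else None)
          for action in ACTION_TABLE[(scope, correction)]:
              execute(action)                           # coherency actions...
          execute(f"atomic_rmw[{ORDER_TABLE[scope]}]")  # ...then the atomic

      scoped_atomic("agent", target_is_remote=True, execute=print)
      # flush_L1
      # flush_L2
      # atomic_rmw[release]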
  • Patent number: 11573765
    Abstract: A processing unit implements a convolutional neural network (CNN) by fusing at least a portion of a convolution phase of the CNN with at least a portion of a batch normalization phase. The processing unit convolves two input matrices representing inputs and weights of a portion of the CNN to generate an output matrix. The processing unit performs the convolution via a series of multiplication operations, with each multiplication operation generating a corresponding submatrix (or “tile”) of the output matrix at an output register of the processing unit. While an output submatrix is stored at the output register, the processing unit performs a reduction phase and an update phase of the batch normalization phase for the CNN. The processing unit thus fuses at least a portion of the batch normalization phase of the CNN with a portion of the convolution.
    Type: Grant
    Filed: December 13, 2018
    Date of Patent: February 7, 2023
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Milind N. Nemlekar, Prerit Dak
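    A NumPy sketch of the fusion idea in the abstract above: while each output tile is live (here, a local array standing in for the output register), the batch-normalization reduction accumulates its statistics from that tile, so the output matrix is never re-read for the reduction. The update phase runs as a final pass for simplicity; tile size and shapes are illustrative.

      import numpy as np

      def fused_conv_batchnorm(inputs, weights, tile=4, eps=1e-5):
          m, _ = inputs.shape
          _, n = weights.shape
          out = np.empty((m, n))
          col_sum, col_sq = np.zeros(n), np.zeros(n)   # reduction accumulators
          for i in range(0, m, tile):
              for j in range(0, n, tile):
                  # Convolution step: one submatrix ("tile") of the output.
                  t = inputs[i:i+tile, :] @ weights[:, j:j+tile]
                  # Fused reduction phase: statistics taken from the live tile.
                  col_sum[j:j+tile] += t.sum(axis=0)
                  col_sq[j:j+tile] += (t * t).sum(axis=0)
                  out[i:i+tile, j:j+tile] = t
          # Update phase: normalize using the accumulated statistics.
          mean = col_sum / m
          var = col_sq / m - mean ** 2
          return (out - mean) / np.sqrt(var + eps)

      rng = np.random.default_rng(0)
      y = fused_conv_batchnorm(rng.standard_normal((8, 16)),
                               rng.standard_normal((16, 8)))
      print(y.mean(axis=0).round(6))   # ~0 per column
      print(y.std(axis=0).round(3))    # ~1 per column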
  • Publication number: 20230004871
    Abstract: Methods, systems, and devices for pipeline fusion of a plurality of kernels. In some implementations, a first batch of a first kernel is executed on a first processing device to generate a first output of the first kernel based on an input. A first batch of a second kernel is executed on a second processing device to generate a first output of the second kernel based on the first output of the first kernel. A second batch of the first kernel is executed on the first processing device to generate a second output of the first kernel based on the input. The execution of the second batch of the first kernel overlaps at least partially in time with executing the first batch of the second kernel.
    Type: Application
    Filed: June 30, 2021
    Publication date: January 5, 2023
    Applicant: Advanced Micro Devices, Inc.
    Inventors: Swapnil P. Sakharshete, Maxim V. Kazakov, Milind N. Nemlekar, Samuel Lawrence Wasmundt
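    A small Python sketch of the overlap the abstract above describes: one worker runs batches of the first kernel while a second worker consumes each output with the second kernel, so batch i of the second kernel overlaps batch i+1 of the first. Threads and a one-slot queue stand in for the two processing devices and their handoff buffer.

      import queue
      import threading

      def pipeline_fuse(batches, kernel_a, kernel_b):
          handoff = queue.Queue(maxsize=1)   # buffer between the two devices
          results = []

          def device1():                     # runs the first kernel
              for batch in batches:
                  handoff.put(kernel_a(batch))
              handoff.put(None)              # end-of-stream marker

          def device2():                     # runs the second kernel
              while (x := handoff.get()) is not None:
                  results.append(kernel_b(x))

          t1 = threading.Thread(target=device1)
          t2 = threading.Thread(target=device2)
          t1.start(); t2.start(); t1.join(); t2.join()
          return results

      # kernel_a doubles, kernel_b increments; batches stream through both.
      print(pipeline_fuse([1, 2, 3], lambda x: 2 * x, lambda x: x + 1))  # [3, 5, 7]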
  • Publication number: 20220319089
    Abstract: Techniques for executing computing work by a plurality of chiplets are provided. The techniques include assigning workgroups of a kernel dispatch packet to the chiplets; by each chiplet, executing the workgroups assigned to that chiplet; for each chiplet, upon completion of all workgroups assigned to that chiplet for the kernel dispatch packet, notifying the other chiplets of such completion; and upon completion of all workgroups of the kernel dispatch packet, notifying a client of such completion and proceeding to a subsequent kernel dispatch packet.
    Type: Application
    Filed: March 31, 2021
    Publication date: October 6, 2022
    Applicant: Advanced Micro Devices, Inc.
    Inventors: Milind N. Nemlekar, Maxim V. Kazakov, Prerit Dak
  • Publication number: 20220309606
    Abstract: Techniques for managing register allocation are provided. The techniques include detecting a first request to allocate first registers for a first wavefront; first determining, based on allocation information, that allocating the first registers to the first wavefront would result in a condition in which a deadlock is possible; in response to the first determining, refraining from allocating the first registers to the first wavefront; detecting a second request to allocate second registers for a second wavefront; second determining, based on the allocation information, that allocating the second registers to the second wavefront would result in a condition in which deadlock is not possible; and in response to the second determining, allocating the second registers to the second wavefront.
    Type: Application
    Filed: March 26, 2021
    Publication date: September 29, 2022
    Applicant: Advanced Micro Devices, Inc.
    Inventors: Pramod Vasant Argade, Martin G. Sarov, Milind N. Nemlekar
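    A Python sketch of the allocation test in the abstract above, using a banker's-style safety check as a stand-in for the patent's allocation information: an incremental register request is granted only if, afterwards, at least one wavefront could still reach its declared peak need and therefore run to completion. Register counts and peak needs are illustrative.

      class RegisterAllocator:
          def __init__(self, total_regs):
              self.free = total_regs
              self.held = {}    # wavefront -> registers currently held
              self.peak = {}    # wavefront -> declared maximum need

          def launch(self, wf, peak_need):
              self.held[wf], self.peak[wf] = 0, peak_need

          def request(self, wf, n):
              """Grant `n` more registers to `wf` unless that could deadlock."""
              if n > self.free:
                  return False
              free_after = self.free - n
              held_after = dict(self.held, **{wf: self.held[wf] + n})
              # Deadlock is impossible only if some wavefront can still be
              # topped up to its peak with the registers that remain free.
              if not any(free_after >= self.peak[w] - held_after[w]
                         for w in held_after):
                  return False   # refrain from allocating
              self.free, self.held = free_after, held_after
              return True

      alloc = RegisterAllocator(total_regs=10)
      alloc.launch("wf0", peak_need=8)
      alloc.launch("wf1", peak_need=8)
      print(alloc.request("wf0", 4))  # True: wf0 could still reach its peak
      print(alloc.request("wf1", 4))  # False: neither wavefront could finish
      print(alloc.request("wf1", 2))  # True: wf0 can still be topped up to 8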
  • Publication number: 20220207411
    Abstract: A graphics processing unit (GPU) for clustering of machine learning (ML) functional components, including: a plurality of compute units; a plurality of ML clusters, wherein each of the ML clusters comprises at least one arithmetic logic unit (ALU), and wherein each of the ML clusters is associated with a respective subset of the compute units; and a plurality of memory modules each positioned on the GPU adjacent to a respective ML cluster of the plurality of ML clusters, wherein each ML cluster is configured to directly access one or more adjacent memory modules.
    Type: Application
    Filed: December 28, 2020
    Publication date: June 30, 2022
    Inventors: Maxim V. Kazakov, Milind N. Nemlekar, Swapnil Sakharshete, Vineet Goel
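    A toy Python floorplan of the layout the abstract above describes: each ML cluster serves a fixed subset of compute units and directly addresses only the memory modules placed adjacent to it. All counts are invented for illustration.

      NUM_CUS, NUM_CLUSTERS, MODULES_PER_CLUSTER = 32, 4, 2
      CUS_PER_CLUSTER = NUM_CUS // NUM_CLUSTERS

      # Each cluster: its compute-unit subset and its adjacent memory modules.
      layout = {cluster: {"compute_units":
                              list(range(cluster * CUS_PER_CLUSTER,
                                         (cluster + 1) * CUS_PER_CLUSTER)),
                          "memory_modules":
                              [cluster * MODULES_PER_CLUSTER + m
                               for m in range(MODULES_PER_CLUSTER)]}
                for cluster in range(NUM_CLUSTERS)}

      def cluster_for_cu(cu):
          """Route an ML operation from a compute unit to its cluster's ALUs
          and that cluster's directly accessible (adjacent) memory modules."""
          cluster = cu // CUS_PER_CLUSTER
          return cluster, layout[cluster]["memory_modules"]

      print(cluster_for_cu(13))  # (1, [2, 3]): CU 13 uses cluster 1's modules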
  • Publication number: 20220138002
    Abstract: A graphics processing unit (GPU) schedules recurrent matrix multiplication operations at different subsets of CUs of the GPU. The GPU includes a scheduler that receives sets of recurrent matrix multiplication operations, such as multiplication operations associated with a recurrent neural network (RNN). The multiple operations associated with, for example, an RNN layer are fused into a single kernel, which is scheduled by the scheduler such that one work group is assigned per compute unit, thus assigning different ones of the recurrent matrix multiplication operations to different subsets of the CUs of the GPU. In addition, via software synchronization of the different workgroups, the GPU pipelines the assigned matrix multiplication operations so that each subset of CUs provides corresponding multiplication results to a different subset, and so that each subset of CUs executes at least a portion of the multiplication operations concurrently.
    Type: Application
    Filed: October 12, 2021
    Publication date: May 5, 2022
    Inventor: Milind N. Nemlekar
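    A NumPy sketch of the pipelining in the abstract above: one thread (standing in for one subset of CUs) computes the input-side matrix multiplication for step t while another thread (a different CU subset) folds in the recurrent-side multiplication, with a queue providing the software synchronization between workgroups. Shapes and the tanh cell are illustrative.

      import queue
      import threading
      import numpy as np

      def rnn_pipeline(xs, W_in, W_rec, steps):
          q = queue.Queue(maxsize=1)     # software sync between CU subsets
          h = np.zeros((xs[0].shape[0], W_rec.shape[0]))
          hs = []

          def subset_a():                # one subset of compute units
              for t in range(steps):
                  q.put(xs[t] @ W_in)    # input-side matmul for step t

          def subset_b():                # a different subset of compute units
              nonlocal h
              for _ in range(steps):
                  h = np.tanh(q.get() + h @ W_rec)   # recurrent-side matmul
                  hs.append(h)

          ta = threading.Thread(target=subset_a)
          tb = threading.Thread(target=subset_b)
          ta.start(); tb.start(); ta.join(); tb.join()
          return hs

      rng = np.random.default_rng(1)
      out = rnn_pipeline([rng.standard_normal((2, 4)) for _ in range(3)],
                         rng.standard_normal((4, 8)),
                         rng.standard_normal((8, 8)), steps=3)
      print(len(out), out[-1].shape)     # 3 (2, 8)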
  • Publication number: 20210374607
    Abstract: A device is disclosed. The device includes a machine learning die including a memory and one or more machine learning accelerators; and a processing core die stacked with the machine learning die, the processing core die being configured to execute shader programs for controlling operations on the machine learning die, wherein the memory is configurable as either or both of a cache and a directly accessible memory.
    Type: Application
    Filed: December 21, 2020
    Publication date: December 2, 2021
    Applicant: Advanced Micro Devices, Inc.
    Inventors: Maxim V. Kazakov, Swapnil P. Sakharshete, Milind N. Nemlekar, Vineet Goel
  • Patent number: 11175946
    Abstract: A graphics processing unit (GPU) schedules recurrent matrix multiplication operations at different subsets of CUs of the GPU. The GPU includes a scheduler that receives sets of recurrent matrix multiplication operations, such as multiplication operations associated with a recurrent neural network (RNN). The multiple operations associated with, for example, an RNN layer are fused into a single kernel, which is scheduled by the scheduler such that one work group is assigned per compute unit, thus assigning different ones of the recurrent matrix multiplication operations to different subsets of the CUs of the GPU. In addition, via software synchronization of the different workgroups, the GPU pipelines the assigned matrix multiplication operations so that each subset of CUs provides corresponding multiplication results to a different subset, and so that each subset of CUs executes at least a portion of the multiplication operations concurrently.
    Type: Grant
    Filed: December 6, 2018
    Date of Patent: November 16, 2021
    Assignee: Advanced Micro Devices, Inc.
    Inventor: Milind N. Nemlekar
  • Publication number: 20210026686
    Abstract: Techniques for performing machine learning operations are provided. The techniques include configuring a first portion of a first chiplet as a cache; performing caching operations via the first portion; configuring at least a first sub-portion of the first portion of the chiplet as directly-accessible memory; and performing machine learning operations with the first sub-portion by a machine learning accelerator within the first chiplet.
    Type: Application
    Filed: July 20, 2020
    Publication date: January 28, 2021
    Applicant: Advanced Micro Devices, Inc.
    Inventors: Swapnil P. Sakharshete, Andrew S. Pomianowski, Maxim V. Kazakov, Vineet Goel, Milind N. Nemlekar, Skyler Jonathon Saleh
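    A small Python model of the reconfiguration the abstract above describes: memory banks start out backing a cache, and a sub-portion is switched to directly accessible scratch for the chiplet's ML accelerator. Bank counts, sizes, and method names are illustrative assumptions.

      class ChipletMemory:
          def __init__(self, num_banks, bank_bytes):
              self.banks = [bytearray(bank_bytes) for _ in range(num_banks)]
              self.mode = ["cache"] * num_banks

          def configure(self, bank, mode):
              """Repurpose one bank as 'cache' or directly accessible scratch."""
              assert mode in ("cache", "direct")
              self.mode[bank] = mode

          def cache_fill(self, bank, data):
              assert self.mode[bank] == "cache", "bank is not in cache mode"
              self.banks[bank][:len(data)] = data

          def direct_write(self, bank, offset, data):
              # The ML accelerator addresses the bank like plain scratchpad RAM.
              assert self.mode[bank] == "direct", "bank is not directly accessible"
              self.banks[bank][offset:offset + len(data)] = data

      mem = ChipletMemory(num_banks=4, bank_bytes=64)
      mem.cache_fill(0, b"cached line")      # first portion behaves as a cache
      mem.configure(1, "direct")             # sub-portion becomes scratch
      mem.direct_write(1, 0, b"ml weights")  # an ML operation uses it directly
      print(mem.mode)                        # ['cache', 'direct', 'cache', 'cache']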
  • Publication number: 20200192631
    Abstract: A processing unit implements a convolutional neural network (CNN) by fusing at least a portion of a convolution phase of the CNN with at least a portion of a batch normalization phase. The processing unit convolves two input matrices representing inputs and weights of a portion of the CNN to generate an output matrix. The processing unit performs the convolution via a series of multiplication operations, with each multiplication operation generating a corresponding submatrix (or “tile”) of the output matrix at an output register of the processing unit. While an output submatrix is stored at the output register, the processing unit performs a reduction phase and an update phase of the batch normalization phase for the CNN. The processing unit thus fuses at least a portion of the batch normalization phase of the CNN with a portion of the convolution.
    Type: Application
    Filed: December 13, 2018
    Publication date: June 18, 2020
    Inventors: Milind N. Nemlekar, Prerit Dak
  • Publication number: 20200183734
    Abstract: A graphics processing unit (GPU) schedules recurrent matrix multiplication operations at different subsets of CUs of the GPU. The GPU includes a scheduler that receives sets of recurrent matrix multiplication operations, such as multiplication operations associated with a recurrent neural network (RNN). The multiple operations associated with, for example, an RNN layer are fused into a single kernel, which is scheduled by the scheduler such that one work group is assigned per compute unit, thus assigning different ones of the recurrent matrix multiplication operations to different subsets of the CUs of the GPU. In addition, via software synchronization of the different workgroups, the GPU pipelines the assigned matrix multiplication operations so that each subset of CUs provides corresponding multiplication results to a different subset, and so that each subset of CUs executes at least a portion of the multiplication operations concurrently.
    Type: Application
    Filed: December 6, 2018
    Publication date: June 11, 2020
    Inventor: Milind N. Nemlekar