Patents by Inventor Khaled Hamidouche

Khaled Hamidouche has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Processing Element-Centric All-to-All Communication

Publication number: 20240220336

Abstract: In accordance with described techniques for PE-centric all-to-all communication, a distributed computing system includes processing elements, such as graphics processing units, distributed in clusters. An all-to-all communication procedure is performed by the processing elements that are each configured to generate data packets in parallel for all-to-all data communication between the clusters. The all-to-all communication procedure includes a first stage of intra-cluster parallel data communication between respective processing elements of each of the clusters; a second stage of inter-cluster data exchange for all-to-all data communication between the clusters; and a third stage of intra-cluster data distribution to the respective processing elements of each of the clusters.
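The three stages named in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the claimed method: the function name `three_stage_all_to_all` and the gateway rule (the local PE with index `d % P` collects traffic bound for destination cluster `d`) are hypothetical choices made for illustration.

```python
def three_stage_all_to_all(num_clusters, pes_per_cluster):
    """Simulate a hierarchical all-to-all; a PE is identified as (cluster, local_index)."""
    C, P = num_clusters, pes_per_cluster
    pes = [(c, p) for c in range(C) for p in range(P)]

    # Stage 1 (intra-cluster): each PE hands every message to the local
    # gateway PE responsible for the message's destination cluster.
    # Assumption: the gateway for destination cluster d is local index d % P.
    staged = {pe: [] for pe in pes}
    for src in pes:
        for dst in pes:
            gateway = (src[0], dst[0] % P)
            staged[gateway].append((src, dst))

    # Stage 2 (inter-cluster): each gateway ships its bundle to one PE in the
    # destination cluster. Assumption: the receiver is local index c % P,
    # where c is the sending cluster.
    landed = {pe: [] for pe in pes}
    for (c, _), messages in staged.items():
        for src, dst in messages:
            landed[(dst[0], c % P)].append((src, dst))

    # Stage 3 (intra-cluster): receivers distribute messages to final PEs.
    inbox = {pe: [] for pe in pes}
    for messages in landed.values():
        for src, dst in messages:
            inbox[dst].append(src)
    return inbox
```

With two clusters of three PEs, every PE ends up with exactly one message from each of the six PEs (including itself), while only gateway PEs take part in the inter-cluster step.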

Type: Application

Filed: December 28, 2022

Publication date: July 4, 2024

Applicant: Advanced Micro Devices, Inc.

Inventors: Kishore Punniyamurthy, Khaled Hamidouche, Brandon K Potter, Rohit Shahaji Zambre

DISTRIBUTED CACHING POLICY FOR LARGE-SCALE DEEP LEARNING TRAINING DATA PRE-PROCESSING

Publication number: 20240211399

Abstract: A distributed cache network used for machine learning is provided which comprises a network fabric having file systems which store data and a plurality of processing devices, each comprising cache memory and a processor configured to execute a training of a machine learning model and selectively cache portions of the data based on a frequency with which the data is accessed by the processor. Each processing device stores metadata identifying portions of data which are cached in the cache memory and other portions of the data which are cached in other processing devices of the network. When requested data is not cached in another processing device, the portion of requested data is accessed from a network file system via a client to server channel and is accessed from another processing device via a client to client channel when the requested data is cached in the other processing device.
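The lookup order described in this abstract (local cache, then a peer's cache over a client-to-client channel, then the network file system over a client-to-server channel) might look like the following. It is a sketch, not the patented design: the class `CacheNode`, the `hot_threshold` parameter, and the plain dictionaries standing in for the file system and metadata are all assumptions.

```python
from collections import Counter

class CacheNode:
    """One processing device with a local cache and metadata about peers."""

    def __init__(self, file_system, hot_threshold=2):
        self.file_system = file_system      # dict standing in for the network file system
        self.hot_threshold = hot_threshold  # hypothetical access-frequency cutoff
        self.cache = {}                     # locally cached portions
        self.directory = {}                 # metadata: key -> peer that caches it
        self.access_counts = Counter()
        self.peers = []

    def read(self, key):
        self.access_counts[key] += 1
        if key in self.cache:
            return self.cache[key], "local"
        peer = self.directory.get(key)
        if peer is not None and key in peer.cache:
            value, source = peer.cache[key], "peer"               # client-to-client
        else:
            value, source = self.file_system[key], "file_system"  # client-to-server
        # Cache selectively: only data accessed often enough is kept.
        if self.access_counts[key] >= self.hot_threshold:
            self.cache[key] = value
            for other in self.peers:
                other.directory[key] = self  # share metadata with peers
        return value, source
```

The second element of the returned tuple is only there to make the chosen path visible when experimenting.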

Type: Application

Filed: December 27, 2022

Publication date: June 27, 2024

Applicant: Advanced Micro Devices, Inc.

Inventors: Kishore Punniyamurthy, Khaled Hamidouche, Brandon Keith Potter

Network command coalescing on GPUs

Patent number: 11922207

Abstract: An approach is provided for coalescing network commands in a GPU that implements a SIMT architecture. Compatible next network operations from different threads are coalesced into a single network command packet. This reduces the number of network command packets generated and issued by threads, thereby increasing efficiency, and improving throughput. The approach is applicable to any number of threads and any thread organization methodology, such as wavefronts, warps, etc.
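As a rough illustration of coalescing, the sketch below merges the pending operations of several threads into one packet per group. The `NetOp` record and the compatibility rule (same opcode and same destination) are assumptions for this example; the abstract does not define what makes operations compatible.

```python
from collections import defaultdict
from typing import NamedTuple

class NetOp(NamedTuple):
    thread_id: int
    opcode: str
    dest: int
    payload: bytes

def coalesce(ops):
    """Merge compatible per-thread operations into single command packets."""
    groups = defaultdict(list)
    for op in ops:
        # Assumed compatibility rule: identical opcode and destination node.
        groups[(op.opcode, op.dest)].append(op)
    return [
        {
            "opcode": opcode,
            "dest": dest,
            "threads": [op.thread_id for op in members],
            "payload": b"".join(op.payload for op in members),
        }
        for (opcode, dest), members in groups.items()
    ]
```

Four threads issuing three puts to node 1 and one put to node 2 would produce two packets instead of four.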

Type: Grant

Filed: August 13, 2020

Date of Patent: March 5, 2024

Assignee: Advanced Micro Devices, Inc.

Inventors: Michael W. LeBeane, Khaled Hamidouche, Brandon K. Potter

Communication of Data for a Model Between Nodes in an Electronic Device

Publication number: 20240005126

Abstract: An electronic device includes one or more data producing nodes and a data consuming node. Each data producing node separately generates two or more portions of a respective block of data. Upon completing generating each portion of the two or more portions of the respective block of data, each data producing node communicates that portion of the respective block of data to the data consuming node. Upon receiving corresponding portions of the respective blocks of data from each of the one or more data producing nodes, the data consuming node performs operations for a model using the corresponding portions of the respective blocks of data.
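The portion-by-portion hand-off in this abstract can be sketched with generators: each producer yields a portion as soon as it is generated, and the consumer works on a portion index once it holds that portion from every producer. The names `produce` and `consume`, and the element-wise sum used as the "model operation", are placeholders chosen for illustration.

```python
def produce(block, num_portions):
    """Yield (index, portion) pairs one at a time, as each portion is finished."""
    step = len(block) // num_portions  # assumes the block divides evenly
    for i in range(num_portions):
        yield i, block[i * step:(i + 1) * step]

def consume(producers):
    """Process portion i as soon as every producer has delivered its portion i."""
    results = []
    for portions in zip(*producers):
        index = portions[0][0]
        chunks = [chunk for _, chunk in portions]
        # Placeholder model operation: element-wise sum across producers.
        results.append((index, [sum(values) for values in zip(*chunks)]))
    return results
```

Because `zip` pulls one item from each generator per step, the consumer starts on the first portions before any producer has generated its second portion.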

Type: Application

Filed: June 29, 2022

Publication date: January 4, 2024

Inventors: Kishore Punniyamurthy, Khaled Hamidouche, Brandon K. Potter, Rohit Shahaji Zambre

EFFICIENT MEMORY-SEMANTIC NETWORKING USING SCOPED MEMORY MODELS

Publication number: 20230289070

Abstract: A framework disclosed herein extends a relaxed, scoped memory model to a system that includes nodes across a commodity network and maintains coherency across the system. A new scope, cluster scope, is defined, that allows for memory accesses at scopes less than cluster scope to operate on locally cached versions of remote data from across the commodity network without having to issue expensive network operations. Cluster scope operations generate network commands that are used to synchronize memory across the commodity network.
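One loose way to picture the idea: accesses below cluster scope touch only a local copy of remote data, and only cluster-scope operations generate network traffic. The toy model below is an assumption-heavy sketch; the class `Node`, the method names, and the release/acquire behavior are illustrative and are not taken from the filing.

```python
class Node:
    """Toy node that caches remote memory and counts network operations."""

    def __init__(self, remote_memory):
        self.remote = remote_memory  # dict standing in for memory across the network
        self.cached = {}
        self.dirty = set()
        self.network_ops = 0

    def load(self, addr):
        # Narrower-scope load: served locally when a cached copy exists.
        if addr not in self.cached:
            self.network_ops += 1
            self.cached[addr] = self.remote[addr]
        return self.cached[addr]

    def store(self, addr, value):
        # Narrower-scope store: updates the local copy only.
        self.cached[addr] = value
        self.dirty.add(addr)

    def release_cluster(self):
        # Cluster-scope release: push dirty data across the network.
        for addr in sorted(self.dirty):
            self.remote[addr] = self.cached[addr]
            self.network_ops += 1
        self.dirty.clear()

    def acquire_cluster(self):
        # Cluster-scope acquire: drop clean copies so later loads refetch.
        self.cached = {a: v for a, v in self.cached.items() if a in self.dirty}
```

Repeated loads and stores between synchronization points cost no network operations in this model, which is the saving the abstract points to.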

Type: Application

Filed: May 19, 2023

Publication date: September 14, 2023

Applicant: Advanced Micro Devices, Inc.

Inventors: Michael W. LeBeane, Khaled Hamidouche, Hari S. Thangirala, Brandon Keith Potter

Efficient memory-semantic networking using scoped memory models

Patent number: 11714559

Abstract: A framework disclosed herein extends a relaxed, scoped memory model to a system that includes nodes across a commodity network and maintains coherency across the system. A new scope, cluster scope, is defined, that allows for memory accesses at scopes less than cluster scope to operate on locally cached versions of remote data from across the commodity network without having to issue expensive network operations. Cluster scope operations generate network commands that are used to synchronize memory across the commodity network.

Type: Grant

Filed: September 25, 2020

Date of Patent: August 1, 2023

Assignee: Advanced Micro Devices, Inc.

Inventors: Michael W. LeBeane, Khaled Hamidouche, Hari S. Thangirala, Brandon Keith Potter

GPU NETWORKING USING AN INTEGRATED COMMAND PROCESSOR

Publication number: 20230120934

Abstract: Systems, apparatuses, and methods for generating network messages on a parallel processor are disclosed. A system includes at least a parallel processor, a general purpose processor, and a network interface unit. The parallel processor includes at least a plurality of compute units, a command processor, and a cache. A thread within a kernel executing on a compute unit of the parallel processor generates a network message and stores the network message and a corresponding indication in the cache. In response to detecting the indication of the network message in the cache, the command processor processes and conveys the network message to the network interface unit without involving the general purpose processor.
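The hand-off in this abstract resembles a producer/consumer queue: a GPU thread writes a message and a ready indication, and the command processor forwards ready messages to the network interface with no CPU step in between. The sketch below models that flow in software; the function names and the use of a `deque` for the cache are assumptions.

```python
from collections import deque

def gpu_thread_send(cache, message):
    """A kernel thread stores a message together with its ready indication."""
    cache.append({"message": message, "ready": True})

def command_processor_drain(cache, nic_queue):
    """Forward every ready message to the NIC queue; no CPU code is involved."""
    forwarded = 0
    while cache and cache[0]["ready"]:
        nic_queue.append(cache.popleft()["message"])
        forwarded += 1
    return forwarded

# Usage: two messages written by threads are forwarded in order.
cache, nic_queue = deque(), []
gpu_thread_send(cache, "put A")
gpu_thread_send(cache, "put B")
forwarded_count = command_processor_drain(cache, nic_queue)
```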

Type: Application

Filed: December 20, 2022

Publication date: April 20, 2023

Inventors: Michael Wayne LeBeane, Khaled Hamidouche, Walter B. Benton

Optimized asynchronous training of neural networks using a distributed parameter server with eager updates

Patent number: 11630994

Abstract: A method of training a neural network includes, at a local computing node, receiving remote parameters from a set of one or more remote computing nodes, initiating execution of a forward pass in a local neural network in the local computing node to determine a final output based on the remote parameters, initiating execution of a backward pass in the local neural network to determine updated parameters for the local neural network, and prior to completion of the backward pass, transmitting a subset of the updated parameters to the set of remote computing nodes.
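The "eager" part of this abstract is that updated parameters leave the node before the backward pass finishes. A minimal sketch, assuming one scalar parameter per layer, plain gradient descent, and hypothetical `grad_fn` and `send` callbacks:

```python
def backward_with_eager_updates(params, grad_fn, lr, send):
    """Walk layers last to first, sending each update as soon as it exists."""
    log = []
    for layer in reversed(range(len(params))):
        params[layer] -= lr * grad_fn(layer)  # update this layer's parameter
        send(layer, params[layer])            # transmit before earlier layers finish
        log.append(f"sent layer {layer}")
    log.append("backward pass complete")
    return log
```

In the returned log, every "sent" entry precedes the completion entry, and the last layer's update is sent while the first layer's gradient has not yet been computed.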

Type: Grant

Filed: February 17, 2018

Date of Patent: April 18, 2023

Assignee: Advanced Micro Devices, Inc.

Inventors: Khaled Hamidouche, Michael W LeBeane, Walter B Benton, Michael L Chu

GPU networking using an integrated command processor

Patent number: 11544121

Abstract: Systems, apparatuses, and methods for generating network messages on a parallel processor are disclosed. A system includes at least a parallel processor, a general purpose processor, and a network interface unit. The parallel processor includes at least a plurality of compute units, a command processor, and a cache. A thread within a kernel executing on a compute unit of the parallel processor generates a network message and stores the network message and a corresponding indication in the cache. In response to detecting the indication of the network message in the cache, the command processor processes and conveys the network message to the network interface unit without involving the general purpose processor.

Type: Grant

Filed: November 16, 2017

Date of Patent: January 3, 2023

Assignee: Advanced Micro Devices, Inc.

Inventors: Michael Wayne LeBeane, Khaled Hamidouche, Walter B. Benton

EFFICIENT MEMORY-SEMANTIC NETWORKING USING SCOPED MEMORY MODELS

Publication number: 20220100391

Abstract: A framework disclosed herein extends a relaxed, scoped memory model to a system that includes nodes across a commodity network and maintains coherency across the system. A new scope, cluster scope, is defined, that allows for memory accesses at scopes less than cluster scope to operate on locally cached versions of remote data from across the commodity network without having to issue expensive network operations. Cluster scope operations generate network commands that are used to synchronize memory across the commodity network.

Type: Application

Filed: September 25, 2020

Publication date: March 31, 2022

Applicant: Advanced Micro Devices, Inc.

Inventors: Michael W. LeBeane, Khaled Hamidouche, Hari S. Thangirala, Brandon Keith Potter

NETWORK COMMAND COALESCING ON GPUs

Publication number: 20220050707

Abstract: An approach is provided for coalescing network commands in a GPU that implements a SIMT architecture. Compatible next network operations from different threads are coalesced into a single network command packet. This reduces the number of network command packets generated and issued by threads, thereby increasing efficiency, and improving throughput. The approach is applicable to any number of threads and any thread organization methodology, such as wavefronts, warps, etc.

Type: Application

Filed: August 13, 2020

Publication date: February 17, 2022

Inventors: Michael W. LeBeane, Khaled Hamidouche, Brandon K. Potter

SYSTEMS AND METHODS FOR REDUCING INSTRUCTION CODE MEMORY FOOTPRINT FOR MULTIPLE PROCESSES EXECUTED AT A COPROCESSOR

Publication number: 20210191641

Abstract: A processing system includes a first processor couplable to a first memory and a second memory. In response to a page migration trigger for a page in the first memory, the first processor is configured to, responsive to the page being a read-only page storing code for execution, initiate migration of the page to a code cache portion of a second memory associated with a second processor and shared by multiple processes executing at the second processor, and to configure each process of a set of processes executing at the second processor to access and execute the code from the code cache portion.
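The footprint saving comes from keeping one shared copy of a read-only code page rather than one copy per process. The sketch below shows that bookkeeping only; the dictionary layouts and the function name `migrate_if_code` are hypothetical.

```python
def migrate_if_code(page_id, page, code_cache, processes):
    """Move a read-only code page into a shared code cache and remap processes."""
    if not (page["read_only"] and page["is_code"]):
        return False  # other pages would follow the ordinary migration path
    # Store a single shared copy, however many processes use the page.
    code_cache.setdefault(page_id, page["data"])
    for proc in processes:
        proc["page_table"][page_id] = "code_cache"
    return True
```

After a successful call, every process in `processes` maps the page to the same cache entry, so the code occupies memory once.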

Type: Application

Filed: December 18, 2019

Publication date: June 24, 2021

Inventors: Khaled HAMIDOUCHE, Michael W. LEBEANE, Hari S. THANGIRALA

Optimized and scalable sparse triangular linear systems on networks of accelerators

Patent number: 10936697

Abstract: A method includes storing a first portion of a sparse triangular matrix in a local memory and launching a kernel for executing a set of workgroups. The first portion includes a plurality of row blocks, and each workgroup in the set of workgroups is associated with one of the plurality of row blocks. The method also includes, for each workgroup in the set of workgroups, solving the row block. The row block is solved by, for each row segment of a first subset of row segments in the row block, calculating a partial sum for the row segment based on one or more matrix elements in the row segment, and writing the partial sum to a remote memory of a first remote processing unit prior to terminating the kernel.
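For orientation, the per-row partial sum mentioned in this abstract is the core of forward substitution. The sketch below is a serial, single-device version for a sparse lower-triangular system, with each row stored as a `{column: value}` dictionary (an assumed layout). It does not model workgroups, row blocks, or remote memory writes.

```python
def solve_lower_triangular(rows, b):
    """Solve L x = b where rows[i] maps column index -> nonzero value of L."""
    x = [0.0] * len(b)
    for i, row in enumerate(rows):
        # Partial sum over the off-diagonal nonzeros of row i.
        partial = sum(value * x[j] for j, value in row.items() if j < i)
        x[i] = (b[i] - partial) / row[i]  # row[i] is the diagonal entry
    return x
```

In the distributed setting the abstract describes, a partial sum like `partial` is the quantity that would be written to another processing unit's memory.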

Type: Grant

Filed: July 24, 2018

Date of Patent: March 2, 2021

Assignee: Advanced Micro Devices, Inc.

Inventors: Khaled Hamidouche, Michael W. LeBeane, Nicholas P. Malaya, Joseph L. Greathouse

Network packet templating for GPU-initiated communication

Patent number: 10740163

Abstract: Systems, apparatuses, and methods for performing network packet templating for graphics processing unit (GPU)-initiated communication are disclosed. A central processing unit (CPU) creates a network packet according to a template and populates a first subset of fields of the network packet with static data. Next, the CPU stores the network packet in a memory. A GPU initiates execution of a kernel and detects a network communication request within the kernel and prior to the kernel completing execution. Responsive to this determination, the GPU populates a second subset of fields of the network packet with runtime data. Then, the GPU generates a notification that the network packet is ready to be processed. A network interface controller (NIC) processes the network packet using data retrieved from the first subset of fields and from the second subset of fields responsive to detecting the notification.
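The division of labor in this abstract (CPU fills static fields ahead of time, GPU fills runtime fields and signals readiness, NIC consumes the finished packet) can be sketched as three small functions. The field names and the `ready` flag are assumptions; the abstract does not list the actual packet fields.

```python
def cpu_create_template(src, dst, opcode):
    """CPU side: build the packet and populate the static fields."""
    return {"src": src, "dst": dst, "opcode": opcode,
            "address": None, "length": None, "ready": False}

def gpu_fill_and_notify(packet, address, length):
    """GPU side: populate runtime fields from within the kernel, then notify."""
    packet["address"] = address
    packet["length"] = length
    packet["ready"] = True

def nic_process(packet):
    """NIC side: act only after the notification, using both field subsets."""
    if not packet["ready"]:
        return None
    return (packet["src"], packet["dst"], packet["opcode"],
            packet["address"], packet["length"])
```

The GPU touches only two fields per message in this sketch, which is the point of preparing the rest on the CPU.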

Type: Grant

Filed: June 28, 2018

Date of Patent: August 11, 2020

Assignee: Advanced Micro Devices, Inc.

Inventors: Khaled Hamidouche, Michael Wayne LeBeane, Walter B. Benton

OPTIMIZED AND SCALABLE SPARSE TRIANGULAR LINEAR SYSTEMS ON NETWORKS OF ACCELERATORS

Publication number: 20200034405

Abstract: A method includes storing a first portion of a sparse triangular matrix in a local memory and launching a kernel for executing a set of workgroups. The first portion includes a plurality of row blocks, and each workgroup in the set of workgroups is associated with one of the plurality of row blocks. The method also includes, for each workgroup in the set of workgroups, solving the row block. The row block is solved by, for each row segment of a first subset of row segments in the row block, calculating a partial sum for the row segment based on one or more matrix elements in the row segment, and writing the partial sum to a remote memory of a first remote processing unit prior to terminating the kernel.

Type: Application

Filed: July 24, 2018

Publication date: January 30, 2020

Inventors: Khaled Hamidouche, Michael W. LeBeane, Nicholas P. Malaya, Joseph L. Greathouse

NETWORK-RELATED PERFORMANCE FOR GPUS

Publication number: 20200034195

Abstract: Techniques for improved networking performance in systems where a graphics processing unit or other highly parallel non-central-processing-unit (referred to as an accelerated processing device or “APD” herein) has the ability to directly issue commands to a networking device such as a network interface controller (“NIC”) are disclosed. According to a first technique, the latency associated with loading certain metadata into NIC hardware memory is reduced or eliminated by pre-fetching network command queue metadata into hardware network command queue metadata slots of the NIC, thereby reducing the latency associated with fetching that metadata at a later time. A second technique involves reducing latency by prioritizing work on an APD when it is known that certain network traffic is soon to arrive over the network via a NIC.

Type: Application

Filed: July 30, 2018

Publication date: January 30, 2020

Applicant: Advanced Micro Devices, Inc.

Inventors: Michael W. LeBeane, Khaled Hamidouche, Bradford M. Beckmann

NETWORK PACKET TEMPLATING FOR GPU-INITIATED COMMUNICATION

Publication number: 20200004610

Abstract: Systems, apparatuses, and methods for performing network packet templating for graphics processing unit (GPU)-initiated communication are disclosed. A central processing unit (CPU) creates a network packet according to a template and populates a first subset of fields of the network packet with static data. Next, the CPU stores the network packet in a memory. A GPU initiates execution of a kernel and detects a network communication request within the kernel and prior to the kernel completing execution. Responsive to this determination, the GPU populates a second subset of fields of the network packet with runtime data. Then, the GPU generates a notification that the network packet is ready to be processed. A network interface controller (NIC) processes the network packet using data retrieved from the first subset of fields and from the second subset of fields responsive to detecting the notification.

Type: Application

Filed: June 28, 2018

Publication date: January 2, 2020

Inventors: Khaled Hamidouche, Michael Wayne LeBeane, Walter B. Benton

OPTIMIZED ASYNCHRONOUS TRAINING OF NEURAL NETWORKS USING A DISTRIBUTED PARAMETER SERVER WITH EAGER UPDATES

Publication number: 20190258924

Abstract: A method of training a neural network includes, at a local computing node, receiving remote parameters from a set of one or more remote computing nodes, initiating execution of a forward pass in a local neural network in the local computing node to determine a final output based on the remote parameters, initiating execution of a backward pass in the local neural network to determine updated parameters for the local neural network, and prior to completion of the backward pass, transmitting a subset of the updated parameters to the set of remote computing nodes.

Type: Application

Filed: February 17, 2018

Publication date: August 22, 2019

Inventors: Khaled Hamidouche, Michael W LeBeane, Walter B Benton, Michael L Chu

GPU NETWORKING USING AN INTEGRATED COMMAND PROCESSOR

Publication number: 20190146857

Abstract: Systems, apparatuses, and methods for generating network messages on a parallel processor are disclosed. A system includes at least a parallel processor, a general purpose processor, and a network interface unit. The parallel processor includes at least a plurality of compute units, a command processor, and a cache. A thread within a kernel executing on a compute unit of the parallel processor generates a network message and stores the network message and a corresponding indication in the cache. In response to detecting the indication of the network message in the cache, the command processor processes and conveys the network message to the network interface unit without involving the general purpose processor.

Type: Application

Filed: November 16, 2017

Publication date: May 16, 2019

Inventors: Michael Wayne LeBeane, Khaled Hamidouche, Walter B. Benton