Patents by Inventor Vydhyanathan Kalyanasundharam

Vydhyanathan Kalyanasundharam has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 10608943
    Abstract: Systems, apparatuses, and methods for dynamic buffer management in multi-client token flow control routers are disclosed. A system includes one or more processing units, a memory, and a communication fabric with a plurality of routers coupled to the processing unit(s) and the memory. A router servicing multiple active clients allocates a first number of tokens to each active client. The first number of tokens is less than a second number of tokens needed to saturate the bandwidth of each client to the router. The router also allocates a third number of tokens, equal to the difference between the second number and the first number, to a free pool from which tokens are dynamically allocated to different clients. An advantage of this approach is a reduction in the amount of buffer space needed at the router. (A sketch of the token accounting follows this entry.)
    Type: Grant
    Filed: October 27, 2017
    Date of Patent: March 31, 2020
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Alan Dodson Smith, Chintan S. Patel, Eric Christopher Morton, Vydhyanathan Kalyanasundharam, Narendra Kamat
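
A minimal sketch of the token accounting described above, in Python. The class and method names are invented, and sizing the free pool as the per-client difference summed over all active clients is an assumption, not a detail confirmed by the abstract.

```python
class TokenRouter:
    """Toy model: guaranteed per-client tokens plus a shared free pool."""

    def __init__(self, clients, saturating, guaranteed):
        # The guaranteed (first) count is less than the saturating (second) count.
        assert guaranteed < saturating
        self.dedicated = {c: guaranteed for c in clients}
        # Free pool (third count) covers the difference, assumed here to be
        # pooled across all active clients.
        self.free_pool = (saturating - guaranteed) * len(clients)

    def acquire(self, client):
        """Return which pool supplied a token, or None if the client stalls."""
        if self.dedicated[client] > 0:
            self.dedicated[client] -= 1
            return "dedicated"
        if self.free_pool > 0:
            self.free_pool -= 1
            return "pool"
        return None

    def release(self, client, source):
        """Return a token to wherever it came from."""
        if source == "pool":
            self.free_pool += 1
        else:
            self.dedicated[client] += 1

r = TokenRouter(clients=["a", "b"], saturating=4, guaranteed=2)
print([r.acquire("a") for _ in range(5)])
# -> ['dedicated', 'dedicated', 'pool', 'pool', 'pool']
```

A busy client can burst past its guaranteed allocation by drawing on the pool, which is what lets the router buffer less than the full saturating allocation for every client at once.
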
  • Publication number: 20200099993
    Abstract: Systems, apparatuses, and methods for processing multi-cast messages are disclosed. A system includes one or more processing units, one or more memory controllers, and a communication fabric coupled to the processing unit(s) and the memory controller(s). The communication fabric includes a plurality of crossbars which connect various agents within the system. When a multi-cast message is received by a crossbar, the crossbar extracts a message type indicator and a recipient type indicator from the message. The crossbar uses the message type indicator to determine which set of masks to look up using the recipient type indicator. Then, the crossbar determines which one or more masks to extract from the selected set of masks based on the values of the recipient type indicator. The crossbar combines the one or more masks with a multi-cast route to create a port vector that determines on which ports to forward the multi-cast message. (A toy model of this decision follows this entry.)
    Type: Application
    Filed: September 21, 2018
    Publication date: March 26, 2020
    Inventors: Vydhyanathan Kalyanasundharam, Joe G. Cruz, Eric Christopher Morton, Alan Dodson Smith
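
A toy model of the forwarding decision, assuming bitmask tables; the table contents and field widths below are made up for illustration.

```python
# Message type selects a mask table, recipient-type bits select masks within
# it, and the OR of those masks ANDed with the multi-cast route gives the
# port vector.
MASK_TABLES = {
    "request":  {0b01: 0b0011, 0b10: 0b1100},   # recipient-type bit -> port mask
    "response": {0b01: 0b0101, 0b10: 0b1010},
}

def port_vector(msg_type, recipient_type, multicast_route):
    combined = 0
    for bit, mask in MASK_TABLES[msg_type].items():
        if recipient_type & bit:            # this recipient class is addressed
            combined |= mask
    return combined & multicast_route       # restrict to the multi-cast route

print(bin(port_vector("request", 0b11, 0b1110)))   # -> 0b1110
```
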
  • Patent number: 10601723
    Abstract: A computing system uses a memory for storing data, one or more clients for generating network traffic, and a communication fabric with network switches. The network switches use centralized storage structures rather than separate input and output storage structures. Each switch stores the metadata of received packets in a single, centralized collapsing queue in which a packet's age corresponds to its queue entry position. The payload data of the packets is stored in a separate memory, so the relatively large amount of data is not shifted during a packet's lifetime in the switch. The switches select sparse queue entries in the collapsing queue, deallocate the selected entries, and shift the remaining allocated entries toward a first end of the queue with a delay proportional to the switch radix. (A rough software analogue follows this entry.)
    Type: Grant
    Filed: April 12, 2018
    Date of Patent: March 24, 2020
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Alan Dodson Smith, Vydhyanathan Kalyanasundharam, Bryan P. Broussard, Greggory D. Donley, Chintan S. Patel
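
A rough software analogue of the collapsing queue; names are illustrative, and the pipelined, radix-dependent shift of the hardware is collapsed into a single list rebuild here.

```python
class CollapsingQueue:
    def __init__(self):
        self.meta = []        # small metadata records; index 0 is the oldest
        self.payloads = {}    # bulky payloads live apart and never shift

    def enqueue(self, pkt_id, meta, payload):
        self.meta.append((pkt_id, meta))      # queue position encodes age
        self.payloads[pkt_id] = payload

    def deallocate(self, selected_ids):
        # Remove possibly sparse entries and shift survivors toward the head.
        self.meta = [(i, m) for (i, m) in self.meta if i not in selected_ids]
        for i in selected_ids:
            self.payloads.pop(i, None)

q = CollapsingQueue()
for n in range(3):
    q.enqueue(n, meta=f"m{n}", payload=b"...")
q.deallocate({1})
print([i for i, _ in q.meta])   # -> [0, 2]: the gap has collapsed
```
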
  • Publication number: 20200081844
    Abstract: Systems, apparatuses, and methods for accelerating accesses to private regions in a region-based cache directory scheme are disclosed. A system includes multiple processing nodes, one or more memory devices, and one or more region-based cache directories to manage cache coherence among the nodes' cache subsystems. Region-based cache directories track coherence on a region basis rather than on a cache line basis, wherein a region includes multiple cache lines. The cache directory entries for regions that are accessed by only a single node are cached locally at that node, so updates to the reference count for these entries are made locally rather than being sent to the cache directory. When a second node accesses a first node's private region, the region becomes shared, and its entry is transferred from the first node back to the cache directory. (A sketch of this handoff follows this entry.)
    Type: Application
    Filed: September 12, 2018
    Publication date: March 12, 2020
    Inventors: Vydhyanathan Kalyanasundharam, Amit P. Apte, Ganesh Balakrishnan
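
A sketch of the private-to-shared handoff, with invented structures; the real directory presumably tracks more state than a bare reference count.

```python
class RegionDirectory:
    def __init__(self):
        self.shared = {}     # region -> entry held by the global directory
        self.private = {}    # region -> (owner node, entry cached at the node)

    def access(self, node, region):
        if region in self.shared:
            self.shared[region]["refs"] += 1
        elif region in self.private:
            owner, entry = self.private[region]
            if node == owner:
                entry["refs"] += 1            # local update, no fabric traffic
            else:
                entry["refs"] += 1            # second node: demote to shared
                self.shared[region] = entry
                del self.private[region]
        else:
            self.private[region] = (node, {"refs": 1})

d = RegionDirectory()
d.access(0, region=0x10); d.access(0, region=0x10)   # private to node 0
d.access(1, region=0x10)                             # transferred, now shared
print(sorted(d.shared), list(d.private))             # -> [16] []
```
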
  • Publication number: 20200073801
    Abstract: Systems, apparatuses, and methods for maintaining region-based cache directories split between node and memory are disclosed. A system with multiple processing nodes includes cache directories split between the nodes and memory to help manage cache coherency among the nodes' cache subsystems. To reduce the number of entries in the cache directories, the directories track coherency on a region basis rather than on a cache line basis, wherein a region includes multiple cache lines. Each processing node includes a node-based cache directory to track regions which have at least one cache line cached in any cache subsystem in the node; each entry carries a reference count field tracking the aggregate number of cache lines cached per region. The memory-based cache directory holds entries for regions which have an entry stored in any node-based cache directory of the system. (A rough model of the split follows this entry.)
    Type: Application
    Filed: August 31, 2018
    Publication date: March 5, 2020
    Inventors: Vydhyanathan Kalyanasundharam, Kevin M. Lepak, Amit P. Apte, Ganesh Balakrishnan
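
A rough model of the split, assuming the memory-side directory only needs to know which nodes hold a region; names and fields are illustrative.

```python
class MemoryDirectory:
    """Memory-side directory: one entry per region present in any node."""

    def __init__(self):
        self.sharers = {}                    # region -> set of node ids

    def register(self, region, node):
        self.sharers.setdefault(region, set()).add(node)

    def unregister(self, region, node):
        self.sharers[region].discard(node)
        if not self.sharers[region]:
            del self.sharers[region]

class NodeDirectory:
    """Node-side directory: per-region count of lines cached in this node."""

    def __init__(self, memory_dir, node_id):
        self.memory_dir, self.node_id = memory_dir, node_id
        self.refs = {}                       # region -> cached-line count

    def line_cached(self, region):
        if region not in self.refs:
            self.refs[region] = 0
            self.memory_dir.register(region, self.node_id)
        self.refs[region] += 1

    def line_evicted(self, region):
        self.refs[region] -= 1
        if self.refs[region] == 0:           # last line gone: reclaim entries
            del self.refs[region]
            self.memory_dir.unregister(region, self.node_id)

mem = MemoryDirectory()
node0 = NodeDirectory(mem, node_id=0)
node0.line_cached(4); node0.line_cached(4)
node0.line_evicted(4); node0.line_evicted(4)
print(mem.sharers)   # -> {}: both directories reclaimed the region's entries
```
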
  • Publication number: 20200065275
    Abstract: Systems, apparatuses, and methods for routing interrupts on a coherency probe network are disclosed. A computing system includes a plurality of processing nodes, a coherency probe network, and one or more control units. The coherency probe network carries coherency probe messages between coherent agents. Interrupts detected by a control unit are converted into messages compatible with coherency probe messages and routed to their target destinations via the coherency probe network. Interrupts are generated with a first encoding, while coherency probe messages carry a second encoding. Cache subsystems determine whether a message received via the coherency probe network is an interrupt message or a coherency probe message based on the encoding embedded in the received message. Interrupt messages are routed to the interrupt controller(s), while coherency probe messages are processed in accordance with a coherency probe action field embedded in the message. (A sketch of this demultiplexing follows this entry.)
    Type: Application
    Filed: August 24, 2018
    Publication date: February 27, 2020
    Inventors: Vydhyanathan Kalyanasundharam, Eric Christopher Morton, Bryan P. Broussard, Paul James Moyer, William Louie Walker
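
A small sketch of the receive-side demultiplexing; the encoding values and message fields are assumptions for illustration.

```python
ENC_PROBE, ENC_INTERRUPT = 0, 1       # invented encodings

class InterruptController:
    def deliver(self, vector):
        print(f"interrupt vector {vector} delivered")

class CacheSubsystem:
    def apply_probe(self, action, address):
        print(f"probe: {action} line {address:#x}")

def handle_message(msg, irq_ctrl, cache):
    # The encoding embedded in the message says which kind of traffic
    # arrived on the shared coherency probe network.
    if msg["encoding"] == ENC_INTERRUPT:
        irq_ctrl.deliver(msg["payload"])
    else:
        cache.apply_probe(msg["action"], msg["address"])

handle_message({"encoding": ENC_INTERRUPT, "payload": 32},
               InterruptController(), CacheSubsystem())
handle_message({"encoding": ENC_PROBE, "action": "invalidate",
                "address": 0x1000}, InterruptController(), CacheSubsystem())
```
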
  • Publication number: 20200059437
    Abstract: Systems, apparatuses, and methods for performing efficient data transfer in a computing system are disclosed. A computing system includes a fabric and multiple clients with fabric interfaces. A packet transmitter in the fabric interface includes multiple queues, each storing packets of a respective type, and multiple queue arbiters, each selecting a candidate packet from one of the queues. The packet transmitter also includes a buffer for a link packet, which provides data storage space for multiple candidate packets. The packet transmitter selects qualified candidate packets from the queues and inserts them into the link packet. A packing arbiter avoids data collisions at the receiver by accounting for mismatches between the rate at which candidate packets are inserted into the link packet and the rate at which storage space becomes available in the receiving queue at the receiver. (A sketch of this packing step follows this entry.)
    Type: Application
    Filed: August 20, 2018
    Publication date: February 20, 2020
    Inventors: Greggory D. Donley, Bryan P. Broussard, Vydhyanathan Kalyanasundharam
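
A sketch of the packing step, using per-type credits as a stand-in for the rate-mismatch tracking the abstract describes; names and structure are illustrative.

```python
from collections import deque

class PacketTransmitter:
    def __init__(self, types, link_slots, receiver_credits):
        self.queues = {t: deque() for t in types}   # one queue per packet type
        self.link_slots = link_slots                # capacity of a link packet
        self.credits = dict(receiver_credits)       # free receiver slots by type

    def pack_link_packet(self):
        link = []
        for t, q in self.queues.items():   # each queue's arbiter offers one
            if q and self.credits[t] > 0 and len(link) < self.link_slots:
                link.append(q.popleft())
                self.credits[t] -= 1       # never outrun the receiving queue
        return link

tx = PacketTransmitter(["req", "resp"], link_slots=4,
                       receiver_credits={"req": 1, "resp": 2})
tx.queues["req"].extend(["R0", "R1"])
tx.queues["resp"].append("S0")
print(tx.pack_link_packet())   # -> ['R0', 'S0']; R1 waits for a credit
```
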
  • Patent number: 10558591
    Abstract: Systems, apparatuses, and methods for implementing priority adjustment forwarding are disclosed. A system includes one or more processing units, a memory, and a communication fabric coupled to the processing unit(s) and the memory. The communication fabric includes a plurality of arbitration points. When a client determines that its bandwidth requirements are not being met, it generates and sends an in-band priority adjustment request to the nearest arbitration point. That arbitration point identifies any pending requests buffered there which meet the criteria specified by the in-band priority adjustment request, adjusts the priority of the identified requests, and then forwards the in-band priority adjustment request on the fabric to the next upstream arbitration point, which processes it in the same manner. (A simplified pass over this idea follows this entry.)
    Type: Grant
    Filed: October 9, 2017
    Date of Patent: February 11, 2020
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Alan Dodson Smith, Eric Christopher Morton, Vydhyanathan Kalyanasundharam, Joe G. Cruz
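
A simplified pass over the idea: each arbitration point boosts matching buffered requests, then the adjustment travels upstream. Matching on client id is an assumption here; the patent's criteria are more general.

```python
def apply_priority_adjustment(arb_points, client_id, new_priority):
    """arb_points is ordered nearest-first, as the in-band message travels."""
    for pending in arb_points:
        for req in pending:                   # requests buffered at this point
            if req["client"] == client_id:    # matches the request's criteria
                req["priority"] = max(req["priority"], new_priority)

fabric = [
    [{"client": 7, "priority": 1}, {"client": 3, "priority": 2}],  # nearest
    [{"client": 7, "priority": 0}],                                # upstream
]
apply_priority_adjustment(fabric, client_id=7, new_priority=5)
print(fabric)   # client 7's buffered requests now carry priority 5
```
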
  • Patent number: 10545875
    Abstract: Systems, apparatuses, and methods for implementing a tag accelerator cache are disclosed. A system includes at least a data cache and a control unit coupled to the data cache via a memory controller. The control unit includes a tag accelerator cache (TAC) for caching tag blocks fetched from the data cache. The data cache is organized such that multiple tags are retrieved in a single access, which hides the tag latency penalty for future accesses to neighboring tags and improves cache bandwidth. When a tag block is fetched from the data cache, it is cached in the TAC. Memory requests received by the control unit first look up the TAC before being forwarded to the data cache. Because applications exhibit spatial locality, the TAC can filter out a large percentage of tag accesses to the data cache, yielding latency and bandwidth savings. (A toy lookup path follows this entry.)
    Type: Grant
    Filed: December 27, 2017
    Date of Patent: January 28, 2020
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Vydhyanathan Kalyanasundharam, Kevin M. Lepak, Ganesh Balakrishnan, Ravindra N. Bhargava
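
A toy lookup path for the TAC; the block size and names are arbitrary, but the point, one tag-block fetch serving neighboring lookups, carries over.

```python
TAGS_PER_BLOCK = 8   # illustrative: tags fetched from the data cache in blocks

class TagAcceleratorCache:
    def __init__(self, fetch_tag_block):
        self.blocks = {}                         # block index -> list of tags
        self.fetch_tag_block = fetch_tag_block   # slow path to the data cache

    def lookup(self, set_index):
        block = set_index // TAGS_PER_BLOCK
        if block not in self.blocks:             # miss: pull the whole block
            self.blocks[block] = self.fetch_tag_block(block)
        return self.blocks[block][set_index % TAGS_PER_BLOCK]

tac = TagAcceleratorCache(lambda b: [f"tag{b}:{i}" for i in range(TAGS_PER_BLOCK)])
print(tac.lookup(3), tac.lookup(4))   # spatial locality: one fetch serves both
```
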
  • Patent number: 10540316
    Abstract: Systems, apparatuses, and methods for implementing a cancel and replay mechanism for ordered requests are disclosed. A system includes at least an ordering master, a memory controller, a coherent slave coupled to the memory controller, and an interconnect fabric coupled to the ordering master and the coherent slave. The ordering master generates a write request which is forwarded to the coherent slave on the path to memory. The coherent slave sends invalidating probes to all processing nodes and then, when all cached copies of the data targeted by the write request have been invalidated, sends an indication to the ordering master that the write request is globally visible. In response to receiving the globally visible indication, the ordering master starts a timer. If the timer expires before all older requests have become globally visible, the write request is cancelled and replayed to ensure forward progress in the fabric and avoid a potential deadlock scenario. (The timing rule is sketched after this entry.)
    Type: Grant
    Filed: December 28, 2017
    Date of Patent: January 21, 2020
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Vydhyanathan Kalyanasundharam, Eric Christopher Morton, Chen-Ping Yang, Amit P. Apte, Elizabeth M. Cooper
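
The timing rule, reduced to a loop; the clock and the "globally visible" flags are stand-ins for fabric state.

```python
def monitor_write(write, older_requests, timeout, clock):
    deadline = clock() + timeout           # timer starts on global visibility
    while any(not r["globally_visible"] for r in older_requests):
        if clock() >= deadline:
            write["state"] = "cancelled"   # the master will replay the write
            return "replay"
    write["state"] = "complete"
    return "done"

ticks = iter(range(100))
older = [{"globally_visible": False}]      # an older request that never lands
print(monitor_write({"state": "pending"}, older, timeout=5,
                    clock=lambda: next(ticks)))   # -> 'replay'
```
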
  • Patent number: 10503648
    Abstract: Systems, apparatuses, and methods for accelerating cache to cache data transfers are disclosed. A system includes at least a plurality of processing nodes and prediction units, an interconnect fabric, and a memory. A first prediction unit is configured to receive memory requests generated by a first processing node as the requests traverse the interconnect fabric on the path to memory. When the first prediction unit receives a memory request, it generates a prediction of whether the data targeted by the request is cached by another processing node. Responsive to predicting that the targeted data is cached by a second processing node, the prediction unit causes a speculative probe to be sent to that node. The speculative probe accelerates the retrieval of the data from the second processing node if the prediction is correct. (One plausible predictor is sketched after this entry.)
    Type: Grant
    Filed: December 12, 2017
    Date of Patent: December 10, 2019
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Vydhyanathan Kalyanasundharam, Amit P. Apte, Ganesh Balakrishnan, Ann Ling, Ravindra N. Bhargava
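
One plausible predictor, sketched; the last-toucher heuristic and the 4 KB region granularity are guesses, not the patented policy.

```python
class ProbePredictor:
    def __init__(self, send_speculative_probe):
        self.last_toucher = {}               # address region -> node id
        self.send = send_speculative_probe

    def observe_request(self, requester, addr):
        region = addr >> 12                  # assumed 4 KB tracking granularity
        prior = self.last_toucher.get(region)
        if prior is not None and prior != requester:
            self.send(prior, addr)           # start cache-to-cache retrieval early
        self.last_toucher[region] = requester

pred = ProbePredictor(lambda node, a: print(f"speculative probe -> node {node}"))
pred.observe_request(0, 0x4000)   # node 0 touches the region first
pred.observe_request(1, 0x4008)   # node 1's request triggers a probe to node 0
```
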
  • Patent number: 10491524
    Abstract: A system for implementing load balancing schemes includes one or more processing units, a memory, and a communication fabric with a plurality of switches coupled to the processing unit(s) and the memory. A switch of the fabric determines a first number of streams on a first input port that are targeting a first output port. The switch also determines a second number of requestors, from all input ports, that are targeting the first output port. Then, the switch calculates a throttle factor for the first input port by dividing the first number of streams by the second number of requestors. The switch applies the throttle factor to regulate bandwidth on the first input port for requestors targeting the first output port, and it likewise calculates and applies throttle factors for the other input ports. (The ratio is worked through after this entry.)
    Type: Grant
    Filed: November 7, 2017
    Date of Patent: November 26, 2019
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Alan Dodson Smith, Chintan S. Patel, Eric Christopher Morton, Vydhyanathan Kalyanasundharam, Narendra Kamat
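
The throttle computation itself is a one-line ratio; the numbers below are hypothetical.

```python
def throttle_factor(streams_on_input_port, requestors_for_output):
    # Fraction of the contended output's bandwidth this input port should use.
    return streams_on_input_port / requestors_for_output

# 2 streams on this input port among 8 requestors targeting the output:
print(throttle_factor(2, 8))   # -> 0.25
```
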
  • Patent number: 10467138
    Abstract: A processing system includes a first socket, a second socket, and an interface between the first socket and the second socket. A first memory is associated with the first socket and a second memory is associated with the second socket. The processing system also includes a controller for the first memory. The controller receives a first request for a memory transaction with the second memory and performs the transaction along a path that includes the interface and bypasses at least one cache associated with the second memory.
    Type: Grant
    Filed: December 28, 2015
    Date of Patent: November 5, 2019
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Paul Blinzer, Ali Ibrahim, Benjamin T. Sander, Vydhyanathan Kalyanasundharam
  • Publication number: 20190319891
    Abstract: A computing system uses a memory for storing data, one or more clients for generating network traffic, and a communication fabric with network switches. The network switches use centralized storage structures rather than separate input and output storage structures. Each switch stores the metadata of received packets in a single, centralized collapsing queue in which a packet's age corresponds to its queue entry position. The payload data of the packets is stored in a separate memory, so the relatively large amount of data is not shifted during a packet's lifetime in the switch. The switches select sparse queue entries in the collapsing queue, deallocate the selected entries, and shift the remaining allocated entries toward a first end of the queue with a delay proportional to the switch radix.
    Type: Application
    Filed: April 12, 2018
    Publication date: October 17, 2019
    Inventors: Alan Dodson Smith, Vydhyanathan Kalyanasundharam, Bryan P. Broussard, Greggory D. Donley, Chintan S. Patel
  • Patent number: 10366008
    Abstract: A data processing system includes a processor and a cache controller coupled to the processor and adapted to be coupled to a memory. The cache controller uses the memory to form a pseudo direct mapped cache having a plurality of groups of pages. The memory forms a first number of selected pages, including a first page for storing a plurality of sets of tags and a plurality of remaining pages for storing data. Each set of the plurality of sets of tags stores tags for respective entries in a corresponding one of the plurality of remaining pages. (The page geometry is sketched after this entry.)
    Type: Grant
    Filed: December 12, 2016
    Date of Patent: July 30, 2019
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Ganesh Balakrishnan, Vydhyanathan Kalyanasundharam, Kevin M. Lepak
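
A sketch of the page geometry, with invented sizes: within each group, page 0 holds the sets of tags and the remaining pages hold data, one tag set per data page.

```python
PAGES_PER_GROUP = 8        # illustrative: 1 tag page + 7 data pages per group
ENTRIES_PER_PAGE = 64

def locate(entry_index):
    """Map a cache entry to its data page and its tag's slot in page 0."""
    group, offset = divmod(entry_index, (PAGES_PER_GROUP - 1) * ENTRIES_PER_PAGE)
    data_page, slot = divmod(offset, ENTRIES_PER_PAGE)
    base = group * PAGES_PER_GROUP
    return {"tag_page": base,               # page 0 of the group stores tags
            "tag_set": data_page,           # one set of tags per data page
            "data_page": base + 1 + data_page,
            "slot": slot}

print(locate(70))   # -> group 0: tag set 1, data page 2, slot 6
```
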
  • Publication number: 20190205280
    Abstract: Systems, apparatuses, and methods for implementing a cancel and replay mechanism for ordered requests are disclosed. A system includes at least an ordering master, a memory controller, a coherent slave coupled to the memory controller, and an interconnect fabric coupled to the ordering master and the coherent slave. The ordering master generates a write request which is forwarded to the coherent slave on the path to memory. The coherent slave sends invalidating probes to all processing nodes and then sends an indication that the write request is globally visible to the ordering master when all cached copies of the data targeted by the write request have been invalidated. In response to receiving the globally visible indication, the ordering master starts a timer. If the timer expires before all older requests have become globally visible, then the write request is cancelled and replayed to ensure forward progress in the fabric and avoid a potential deadlock scenario.
    Type: Application
    Filed: December 28, 2017
    Publication date: July 4, 2019
    Inventors: Vydhyanathan Kalyanasundharam, Eric Christopher Morton, Chen-Ping Yang, Amit P. Apte, Elizabeth M. Cooper
  • Publication number: 20190196974
    Abstract: Systems, apparatuses, and methods for implementing a tag accelerator cache are disclosed. A system includes at least a data cache and a control unit coupled to the data cache via a memory controller. The control unit includes a tag accelerator cache (TAC) for caching tag blocks fetched from the data cache. The data cache is organized such that multiple tags are retrieved in a single access, which hides the tag latency penalty for future accesses to neighboring tags and improves cache bandwidth. When a tag block is fetched from the data cache, it is cached in the TAC. Memory requests received by the control unit first look up the TAC before being forwarded to the data cache. Because applications exhibit spatial locality, the TAC can filter out a large percentage of tag accesses to the data cache, yielding latency and bandwidth savings.
    Type: Application
    Filed: December 27, 2017
    Publication date: June 27, 2019
    Inventors: Vydhyanathan Kalyanasundharam, Kevin M. Lepak, Ganesh Balakrishnan, Ravindra N. Bhargava
  • Publication number: 20190196574
    Abstract: Systems, apparatuses, and methods for performing efficient power management for a multi-node computing system are disclosed. A computing system including multiple nodes utilizes a non-uniform memory access (NUMA) architecture. A first node receives a broadcast probe from a second node and spoofs a miss response for a powered-down third node, which prevents the third node from waking up to respond to the broadcast probe. Prior to powering down, the third node flushed its probe filter and caches and updated system memory with its dirty cache lines. The computing system includes a master node that stores the interrupt priorities of the multiple cores in the computing system for arbitrated interrupts, and the cores store indications of the fixed interrupt identifiers for each core in the system. Arbitrated and fixed interrupts are thus handled by cores with point-to-point unicast messages rather than broadcast messages. (The spoofing step is sketched after this entry.)
    Type: Application
    Filed: December 21, 2017
    Publication date: June 27, 2019
    Inventors: Benjamin Tsien, Bryan P. Broussard, Vydhyanathan Kalyanasundharam
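
The spoofing step, reduced to a sketch; it is safe only because the sleeping node flushed everything before powering down, which the code takes as a given.

```python
class LocalCaches:
    def probe(self, address):
        return "miss"                # stand-in for a real cache lookup

def answer_broadcast_probe(addr, local, flushed_sleeping_nodes):
    answers = {"self": local.probe(addr)}
    for node in flushed_sleeping_nodes:
        answers[node] = "miss"       # spoofed: a flushed node can't hold the line
    return answers

print(answer_broadcast_probe(0x80, LocalCaches(), flushed_sleeping_nodes=[2, 3]))
```
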
  • Publication number: 20190199617
    Abstract: A system for automatically discovering fabric topology includes one or more processing units, one or more memory devices, a security processor, and a communication fabric of unknown topology coupled to the processing unit(s), memory device(s), and security processor. The security processor queries each component of the fabric to retrieve various attributes associated with the component and uses the retrieved attributes to build a network graph of the fabric topology. The security processor generates routing tables from the network graph and programs them into the fabric components, which then use the routing tables to determine how to route incoming packets. (A sketch of this pipeline follows this entry.)
    Type: Application
    Filed: December 21, 2017
    Publication date: June 27, 2019
    Inventors: Vydhyanathan Kalyanasundharam, Eric Christopher Morton, Alan Dodson Smith, Joe G. Cruz
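
A sketch of the discovery-to-routing pipeline, with BFS shortest paths standing in for whatever route computation the security processor actually uses.

```python
from collections import deque

def discover_and_route(components):
    """components: id -> neighbor ids, as returned by querying each component."""
    tables = {}
    for src in components:
        parents, frontier = {src: src}, deque([src])
        while frontier:                      # BFS over the discovered graph
            here = frontier.popleft()
            for nbr in components[here]:
                if nbr not in parents:
                    parents[nbr] = here
                    frontier.append(nbr)
        next_hop = {}
        for dst in parents:
            hop = dst
            while hop != src and parents[hop] != src:
                hop = parents[hop]           # walk back to the first hop
            next_hop[dst] = hop
        tables[src] = next_hop               # programmed into component src
    return tables

print(discover_and_route({0: [1], 1: [0, 2], 2: [1]})[0])   # {0: 0, 1: 1, 2: 1}
```
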
  • Publication number: 20190188137
    Abstract: Systems, apparatuses, and methods for maintaining a region-based cache directory are disclosed. A system includes multiple processing nodes, with each processing node including a cache subsystem. The system also includes a cache directory to help manage cache coherency among the different cache subsystems of the system. In order to reduce the number of entries in the cache directory, the cache directory tracks coherency on a region basis rather than on a cache line basis, wherein a region includes multiple cache lines. Accordingly, the system includes a region-based cache directory to track regions which have at least one cache line cached in any cache subsystem in the system. The cache directory includes a reference count in each entry to track the aggregate number of cache lines that are cached per region. If a reference count of a given entry goes to zero, the cache directory reclaims the given entry.
    Type: Application
    Filed: December 18, 2017
    Publication date: June 20, 2019
    Inventors: Vydhyanathan Kalyanasundharam, Kevin M. Lepak, Amit P. Apte, Ganesh Balakrishnan, Eric Christopher Morton, Elizabeth M. Cooper, Ravindra N. Bhargava