Patents by Inventor Sunil K. Shukla
Sunil K. Shukla has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11941111
Abstract: Indices of non-zero weights may be stored in an index register file included within each of a plurality of processor elements in a systolic array. Non-zero weights may be stored in a register file associated with the index register file. Input values (e.g., dense input values) corresponding to a single block in a data structure may be sent to the plurality of processor elements. Those of the input values corresponding to the indices of non-zero weights in the index register file may be selected for performing a multiply-accumulate (“MAC”) operation based on sending the plurality of input values to one or more of the plurality of processor elements. The indices of the plurality of non-zero weights are stored in an index data stick. The values of the plurality of non-zero weights are stored in a value data stick.
Type: Grant
Filed: July 31, 2021
Date of Patent: March 26, 2024
Assignee: International Business Machines Corporation
Inventors: Sanchari Sen, Swagath Venkataramani, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, Sunil K. Shukla
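The index-based selection described in the abstract above can be illustrated in software. The following is a minimal sketch, not the patented hardware: the function name `sparse_mac`, the list-based "sticks", and the sample values are all assumptions made for illustration.

```python
def sparse_mac(index_stick, value_stick, dense_inputs):
    """Multiply-accumulate over only the inputs that line up with non-zero weights.

    index_stick  -- positions of the non-zero weights within one block (hypothetical layout)
    value_stick  -- the non-zero weight values, in the same order as index_stick
    dense_inputs -- the dense input values for that block
    """
    if len(index_stick) != len(value_stick):
        raise ValueError("index and value sticks must have the same length")
    acc = 0
    for idx, weight in zip(index_stick, value_stick):
        # Select the input at the stored index; inputs at zero-weight positions are never read.
        acc += weight * dense_inputs[idx]
    return acc


# A block of 8 weights with three non-zeros at positions 1, 4 and 6 (illustrative data).
index_stick = [1, 4, 6]
value_stick = [2, -1, 3]
dense_inputs = [5, 7, 0, 9, 4, 1, 2, 8]
result = sparse_mac(index_stick, value_stick, dense_inputs)
```

The result matches a dense dot product against the full weight vector `[0, 2, 0, 0, -1, 0, 3, 0]`, but only three multiplications are performed.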
-
Patent number: 11831467
Abstract: Embodiments for providing enhanced multicast data transfer for ring topology based artificial intelligence systems are disclosed. Multicast data is sent to a plurality of disjointed cores in a multicast group according to a first multicast mode, a second multicast mode, or a third multicast mode, where the first multicast mode sends a first half of the multicast data on a first multicast ring and a second half on a second multicast ring, the second multicast mode sends the multicast data on either the first multicast ring or the second multicast ring, and the third multicast mode replicates the multicast data and sends the multicast data to both the first multicast ring and the second multicast ring.
Type: Grant
Filed: May 13, 2022
Date of Patent: November 28, 2023
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Shubham Jain, Swagath Venkataramani, Vijayalakshmi Srinivasan, Sunil K Shukla, Martin A Lutz
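The three multicast modes in the abstract above differ only in how the payload is divided between the two rings. The sketch below models that split in software; the mode names (`"split"`, `"single"`, `"replicate"`), the `ring` parameter, and the function name are hypothetical labels, not terms from the patent.

```python
def route_multicast(data, mode, ring="first"):
    """Return (first_ring_payload, second_ring_payload) for one multicast transfer.

    mode "split"     -- first half on the first ring, second half on the second ring
    mode "single"    -- whole payload on one ring, chosen by the `ring` argument
    mode "replicate" -- whole payload copied onto both rings
    """
    if mode == "split":
        half = len(data) // 2
        return data[:half], data[half:]
    if mode == "single":
        if ring == "first":
            return list(data), []
        if ring == "second":
            return [], list(data)
        raise ValueError(f"unknown ring: {ring!r}")
    if mode == "replicate":
        return list(data), list(data)
    raise ValueError(f"unknown mode: {mode!r}")


payload = [10, 20, 30, 40]
split_result = route_multicast(payload, "split")
single_result = route_multicast(payload, "single", ring="second")
replicated_result = route_multicast(payload, "replicate")
```

In this model, the split mode halves the traffic carried by each ring, while the replicate mode places the full payload on both.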
-
Publication number: 20230370304
Abstract: Embodiments for providing enhanced multicast data transfer for ring topology based artificial intelligence systems are disclosed. Multicast data is sent to a plurality of disjointed cores in a multicast group according to a first multicast mode, a second multicast mode, or a third multicast mode, where the first multicast mode sends a first half of the multicast data on a first multicast ring and a second half on a second multicast ring, the second multicast mode sends the multicast data on either the first multicast ring or the second multicast ring, and the third multicast mode replicates the multicast data and sends the multicast data to both the first multicast ring and the second multicast ring.
Type: Application
Filed: May 13, 2022
Publication date: November 16, 2023
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Shubham JAIN, Swagath VENKATARAMANI, Vijayalakshmi SRINIVASAN, Sunil K. SHUKLA, Martin A. LUTZ
-
Publication number: 20230344667
Abstract: Embodiments for providing single-producer-multiple consumers synchronization and multicast data transfer by a processor are disclosed. Multicast data transfer is synchronized based on an identification tag and a request from each one of a plurality of recipients for the multicast data. The multicast data is transferred to each of the plurality of recipients based on the identification tag, the request from each one of the plurality of recipients, and a list of the plurality of recipients.
Type: Application
Filed: April 22, 2022
Publication date: October 26, 2023
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Vijayalakshmi SRINIVASAN, Scot RIDER, Swagath VENKATARAMANI, Kailash GOPALAKRISHNAN, Sunil K. SHUKLA, Brian William CURRAN, Martin A. LUTZ
-
Publication number: 20230267003
Abstract: Processing input data for transmittal to a data consumer such as an artificial intelligence engine is performed by arranging the input data into a uniform structure made up of sticks of data combined to form pages of sticks. A stick is any well-sized set of input data elements whereby the size of the stick is fixed. A masking pattern is established for sticks of data having certain ranges of invalid data for consumption of partial sticks while maintaining validity of the input data being transferred. The mask pattern is derived based on set-active-mask-and-value (SAMV) instructions. The derived mask pattern is carried forward for subsequent load instructions to the data consumer.
Type: Application
Filed: February 23, 2022
Publication date: August 24, 2023
Inventors: Cedric Lichtenau, Vijayalakshmi Srinivasan, Sunil K Shukla, Swagath Venkataramani, Kailash Gopalakrishnan, Holger Horbach, Razvan Peter Figuli, Wei Wang, YULONG LI, Martin A Lutz
-
Patent number: 11669489
Abstract: A systolic array can be configured to skip distributed operands that have zero-values, resulting in improved resource efficiency. A skip module is introduced to receive operands from memory, identify whether they have a zero value or not, and, if they are nonzero, generate an operand vector including an index before sending the operand vector to a processing element.
Type: Grant
Filed: September 30, 2021
Date of Patent: June 6, 2023
Assignee: International Business Machines Corporation
Inventors: Swagath Venkataramani, Sanchari Sen, Vijayalakshmi Srinivasan, Ankur Agrawal, Sunil K Shukla, Bruce Fleischer, Kailash Gopalakrishnan
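The zero-skipping step in the abstract above can be pictured as a filter that tags each surviving operand with its original position. This is a minimal software sketch under that reading; the function name `skip_zeros` and the `(index, value)` tuple layout for the operand vector are assumptions, not details taken from the patent.

```python
def skip_zeros(operands):
    """Drop zero-valued operands and pair each non-zero operand with its original index.

    The index lets a downstream processing element match the operand
    with the correct counterpart even though the zeros were removed.
    """
    return [(index, value) for index, value in enumerate(operands) if value != 0]


operands_from_memory = [0, 3, 0, 0, 5]
operand_vectors = skip_zeros(operands_from_memory)
```

Here only two of the five operands are forwarded, so a processing element would spend no cycles multiplying by zero.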
-
Publication number: 20230109301
Abstract: A systolic array can be configured to skip distributed operands that have zero-values, resulting in improved resource efficiency. A skip module is introduced to receive operands from memory, identify whether they have a zero value or not, and, if they are nonzero, generate an operand vector including an index before sending the operand vector to a processing element.
Type: Application
Filed: September 30, 2021
Publication date: April 6, 2023
Inventors: Swagath Venkataramani, Sanchari Sen, Vijayalakshmi Srinivasan, Ankur Agrawal, Sunil K Shukla, Bruce Fleischer, Kailash Gopalakrishnan
-
Publication number: 20230030287
Abstract: Indices of non-zero weights may be stored in an index register file included within each of a plurality of processor elements in a systolic array. Non-zero weights may be stored in a register file associated with the index register file. Input values (e.g., dense input values) corresponding to a single block in a data structure may be sent to the plurality of processor elements. Those of the input values corresponding to the indices of non-zero weights in the index register file may be selected for performing a multiply-accumulate (“MAC”) operation based on sending the plurality of input values to one or more of the plurality of processor elements. The indices of the plurality of non-zero weights are stored in an index data stick. The values of the plurality of non-zero weights are stored in a value data stick.
Type: Application
Filed: July 31, 2021
Publication date: February 2, 2023
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Sanchari SEN, Swagath VENKATARAMANI, Vijayalakshmi SRINIVASAN, Kailash GOPALAKRISHNAN, Sunil K. SHUKLA
-
Publication number: 20220405555
Abstract: A combined function specified by an instruction is performed. The combined function includes a plurality of operations performed as part of one invocation of the combined function. The performing the combined function includes performing a convolution using a first tensor and a second tensor to obtain one or more intermediate results, in which the second tensor includes an adjusted weight tensor created using a plurality of multipliers. Values of a bias tensor are added to the one or more intermediate results to obtain one or more combined function results for the combined function.
Type: Application
Filed: June 17, 2021
Publication date: December 22, 2022
Inventors: Cedric Lichtenau, Kailash Gopalakrishnan, Vijayalakshmi Srinivasan, Sunil K. Shukla, Swagath Venkataramani
-
Publication number: 20220405348
Abstract: A tensor of a first select dimension is reformatted to provide one or more sub-tensors of a second select dimension. The reformatting includes determining a number of sub-tensors to be used to represent the tensor. The reformatting further includes creating the number of sub-tensors, in which a sub-tensor is to start on a boundary of a memory unit. Data of the tensor is rearranged to fit within the number of sub-tensors.
Type: Application
Filed: June 17, 2021
Publication date: December 22, 2022
Inventors: Cedric Lichtenau, Kailash Gopalakrishnan, Vijayalakshmi Srinivasan, Anthony Saporito, Sunil K. Shukla, Swagath Venkataramani
-
Publication number: 20220405556
Abstract: A combined function specified by an instruction is performed. The combined function includes a plurality of operations performed as part of one invocation of the combined function. The performing the combined function includes performing a matrix multiplication of a first tensor and a second tensor to obtain one or more intermediate results. The second tensor includes an adjusted weight tensor created using a multiplier. Values of a bias tensor are added to the one or more intermediate results to obtain one or more results for the combined function. The one or more results are at least a part of an output tensor.
Type: Application
Filed: June 17, 2021
Publication date: December 22, 2022
Inventors: Cedric Lichtenau, Kailash Gopalakrishnan, Vijayalakshmi Srinivasan, Sunil K. Shukla, Swagath Venkataramani
-
Patent number: 11223703
Abstract: Various embodiments are provided for implementing instruction initialization in a dataflow architecture in a computing environment. A data packet may be transmitted from a selected node to one or more of a plurality of nodes using one or more existing data paths in an initialization network. A determination operation is performed to determine whether one or more of a plurality of nodes is a target node intended for the data packet. Those of the plurality of nodes determined to be a target node initialize one or more components of the target node using the data packet. The data packet may be forwarded by each of the one or more of a plurality of nodes to a subsequent node in the initialization network.
Type: Grant
Filed: March 19, 2019
Date of Patent: January 11, 2022
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Brian Curran, Bruce Fleischer, Kailash Gopalakrishnan, Sunil K Shukla
-
Patent number: 11138010
Abstract: Embodiments of the present invention include a computer system that manages execution of one or more programs with one or more loops, where each loop has a loop level. Embodiments that manage loops that can skip execution and the number of loops changing during execution are also disclosed. A loop level register (LLEV) stores the loop level for a currently executing loop. A Loop-Back Program Counter Register (LBPR) has a table of one or more Loop-Back Registers. Each Loop-Back Register stores the loop level for a LBPR respective loop and a loop back PC location for the LBPR respective loop. A Program Counter points back to the PC location for each iteration of the loop. A Loop Current Count Register table (LCCR) tracks a number of iterations remaining to be executed for the loop. A loop management process causes one of the CPUs to execute all the one or more instructions of an iteration of the currently executing program loop.
Type: Grant
Filed: October 1, 2020
Date of Patent: October 5, 2021
Assignee: International Business Machines Corporation
Inventors: Chia-Yu Chen, Jungwook Choi, Brian William Curran, Bruce Fleischer, Kailash Gopalakrishnan, Jinwook Oh, Sunil K Shukla, Vijayalakshmi Srinivasan
-
Patent number: 10838868
Abstract: Embodiments for implementing a communicating memory between a plurality of computing components are provided. In one embodiment, an apparatus comprises a plurality of memory components residing on a processing chip, the plurality of memory components interconnected between a plurality of processing elements of at least one processing core of the processing chip and at least one external memory component external to the processing chip. The apparatus further comprises a plurality of load agents and a plurality of store agents on the processing chip, each interfacing with the plurality of memory components. Each of the plurality of load agents and the plurality of store agents execute an independent program specifying a destination of data transacted between the plurality of memory components, the at least one external memory component, and the plurality of processing elements.
Type: Grant
Filed: March 7, 2019
Date of Patent: November 17, 2020
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Chia-Yu Chen, Jungwook Choi, Brian Curran, Bruce Fleischer, Kailash Gopalakrishan, Jinwook Oh, Sunil K Shukla, Vijayalakshmi Srinivasan, Swagath Venkataramani
-
Publication number: 20200304598
Abstract: Various embodiments are provided for implementing instruction initialization in a dataflow architecture in a computing environment. A data packet may be transmitted from a selected node to one or more of a plurality of nodes using one or more existing data paths in an initialization network. A determination operation is performed to determine whether one or more of a plurality of nodes is a target node intended for the data packet. Those of the plurality of nodes determined to be a target node initialize one or more components of the target node using the data packet. The data packet may be forwarded by each of the one or more of a plurality of nodes to a subsequent node in the initialization network.
Type: Application
Filed: March 19, 2019
Publication date: September 24, 2020
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Brian CURRAN, Bruce FLEISCHER, Kailash GOPALAKRISHNAN, Sunil K SHUKLA
-
Publication number: 20200285579
Abstract: Embodiments for implementing a communicating memory between a plurality of computing components are provided. In one embodiment, an apparatus comprises a plurality of memory components residing on a processing chip, the plurality of memory components interconnected between a plurality of processing elements of at least one processing core of the processing chip and at least one external memory component external to the processing chip. The apparatus further comprises a plurality of load agents and a plurality of store agents on the processing chip, each interfacing with the plurality of memory components. Each of the plurality of load agents and the plurality of store agents execute an independent program specifying a destination of data transacted between the plurality of memory components, the at least one external memory component, and the plurality of processing elements.
Type: Application
Filed: March 7, 2019
Publication date: September 10, 2020
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Chia-Yu CHEN, Jungwook CHOI, Brian CURRAN, Bruce FLEISCHER, Kailash GOPALAKRISHAN, Jinwook OH, Sunil K. SHUKLA, Vijayalakshmi SRINIVASAN, Swagath VENKATARAMANI
-
Patent number: 10528356
Abstract: An apparatus and method for supporting simultaneous multiple iterations (SMI) and iteration level commits (ILC) in a coarse grained reconfigurable architecture (CGRA). The apparatus includes: Hardware structures that connect all of multiple processing engines (PEs) to a load-store unit (LSU) configured to keep track of which compiled program code iterations have completed, which ones are in flight and which are yet to begin, and a control unit including hardware structures that are used to maintain synchronization and initiate and terminate loops within the PEs. The PEs, LSU and control unit are configured to commit instructions, and save and restore context at loop iteration boundaries. In doing so, the apparatus tracks and buffers state of in-flight iterations, and detects conditions that prevent an iteration from completing.
Type: Grant
Filed: November 4, 2015
Date of Patent: January 7, 2020
Assignee: International Business Machines Corporation
Inventors: Chia-yu Chen, Kailash Gopalakrishnan, Jinwook Oh, Lee M. Saltzman, Sunil K. Shukla, Vijayalakshmi Srinivasan
-
Patent number: 10216626
Abstract: Embodiments of the invention provide a method and system for dynamic memory management implemented in hardware. In an embodiment, the method comprises storing objects in a plurality of heaps, and operating a hardware garbage collector to free heap space. The hardware garbage collector traverses the heaps and marks selected objects, uses the marks to identify a plurality of the objects, and frees the identified objects. In an embodiment, the method comprises storing objects in a heap, each of at least some of the objects including a multitude of pointers; and operating a hardware garbage collector to free heap space. The hardware garbage collector traverses the heap, using the pointers of some of the objects to identify others of the objects; processes the objects to mark selected objects; and uses the marks to identify a group of the objects, and frees the identified objects.
Type: Grant
Filed: March 2, 2017
Date of Patent: February 26, 2019
Assignee: International Business Machines Corporation
Inventors: David F. Bacon, Perry S. Cheng, Sunil K. Shukla
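The traverse, mark, and free sequence in the abstract above resembles a conventional mark-and-sweep collector. The sketch below is a software analogue only, assuming that marks denote objects reachable from a root set (the abstract does not define the roots or what a mark means); the dictionary heap model and the name `collect` are hypothetical.

```python
def collect(heap, roots):
    """Mark every object reachable from `roots`, then free the unmarked ones.

    heap  -- dict mapping an object id to the list of object ids it points to
    roots -- iterable of object ids assumed to be live (an assumption of this sketch)
    Returns the set of freed object ids; `heap` is updated in place.
    """
    marked = set()
    pending = list(roots)
    while pending:
        obj = pending.pop()
        if obj in marked or obj not in heap:
            continue
        marked.add(obj)
        # Follow the object's pointers to discover further objects.
        pending.extend(heap[obj])
    freed = set(heap) - marked
    for obj in freed:
        del heap[obj]
    return freed


# "c" and "d" point at each other but are unreachable from the root "a".
heap = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"]}
freed = collect(heap, roots=["a"])
```

Because the sweep relies on marks instead of reference counts, the unreachable cycle between `"c"` and `"d"` is reclaimed.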
-
Patent number: 10120685
Abstract: An apparatus and method for supporting simultaneous multiple iterations (SMI) in a coarse grained reconfigurable architecture (CGRA). In support of SMI, the apparatus includes: Hardware structures that connect all of multiple processing engines (PEs) to a load-store unit (LSU) configured to keep track of which compiled program code iterations have completed, which ones are in flight and which are yet to begin, and a control unit including hardware structures that are used to maintain synchronization and initiate and terminate loops within the PEs. SMI permits execution of the next instruction within any iteration (in flight). If instructions from multiple iterations are ready for execution (and are pre-decoded), then the hardware selects the lowest iteration number ready for execution. If in a particular clock cycle, a loop iteration with a lower iteration number is stalled (i.e.
Type: Grant
Filed: November 4, 2015
Date of Patent: November 6, 2018
Assignee: International Business Machines Corporation
Inventors: Chia-yu Chen, Kailash Gopalakrishnan, Jinwook Oh, Sunil K. Shukla, Vijayalakshmi Srinivasan
-
Publication number: 20170177474
Abstract: Embodiments of the invention provide a method and system for dynamic memory management implemented in hardware. In an embodiment, the method comprises storing objects in a plurality of heaps, and operating a hardware garbage collector to free heap space. The hardware garbage collector traverses the heaps and marks selected objects, uses the marks to identify a plurality of the objects, and frees the identified objects. In an embodiment, the method comprises storing objects in a heap, each of at least some of the objects including a multitude of pointers; and operating a hardware garbage collector to free heap space. The hardware garbage collector traverses the heap, using the pointers of some of the objects to identify others of the objects; processes the objects to mark selected objects; and uses the marks to identify a group of the objects, and frees the identified objects.
Type: Application
Filed: March 2, 2017
Publication date: June 22, 2017
Inventors: David F. Bacon, Perry S. Cheng, Sunil K. Shukla