Patents by Inventor John Gunnels

John Gunnels has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Method and structure for producing high performance linear algebra routines using register block data format routines

Patent number: 8316072

Abstract: A method (and structure) of executing a matrix operation, includes, for a matrix A, separating the matrix A into blocks, each block having a size p-by-q. The blocks of size p-by-q are then stored in a cache or memory in at least one of the two following ways. The elements in at least one of the blocks is stored in a format in which elements of the block occupy a location different from an original location in the block, and/or the blocks of size p-by-q are stored in a format in which at least one block occupies a position different relative to its original position in the matrix A.

Type: Grant

Filed: August 21, 2008

Date of Patent: November 20, 2012

Assignee: International Business Machines Corporation

Inventors: Fred Gehrung Gustavson, John A. Gunnels, James C. Sexton
Optimized Scalar Promotion with Load and Splat SIMD Instructions

Publication number: 20120290816

Abstract: Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.

Type: Application

Filed: July 23, 2012

Publication date: November 15, 2012

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
Optimized scalar promotion with load and splat SIMD instructions

Patent number: 8255884

Abstract: Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.

Type: Grant

Filed: June 6, 2008

Date of Patent: August 28, 2012

Assignee: International Business Machines Corporation

Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
Reducing bandwidth requirements for matrix multiplication

Patent number: 8250130

Abstract: A block matrix multiplication mechanism is provided for reversing the visitation order of blocks at corner turns when performing a block matrix multiplication operation in a data processing system. The mechanism increases block size and divides each block into sub-blocks. By reversing the visitation order, the mechanism eliminates a sub-block load at the corner turns. The mechanism performs sub-block matrix multiplication for each sub-block in a given block, and then repeats operation for a next block until all blocks are computed. The mechanism may determine block size and sub-block size to optimize load balancing and memory bandwidth. Therefore, the mechanism reduces maximum throughput and increases performance. In addition, the mechanism also reduces the number of multi-buffered local store buffers.

Type: Grant

Filed: May 30, 2008

Date of Patent: August 21, 2012

Assignee: International Business Machines Corporation

Inventors: Daniel A. Brokenshire, John A. Gunnels, Michael D. Kistler
Optimized Corner Turns for Local Storage and Bandwidth Reduction

Publication number: 20120203816

Abstract: A block matrix multiplication mechanism is provided for reversing the visitation order of blocks at corner turns when performing a block matrix multiplication operation in a data processing system. By reversing the visitation order, the mechanism eliminates a block load at the corner turns. In accordance with the illustrative embodiment, a corner return is referred to as a “bounce” corner turn and results in a serpentine patterned processing order of the matrix blocks. The mechanism allows the data processing system to perform a block matrix multiplication operation with a maximum of three block transfers per time step. Therefore, the mechanism reduces maximum throughput and increases performance. In addition, the mechanism also reduces the number of multi-buffered local store buffers.

Type: Application

Filed: April 20, 2012

Publication date: August 9, 2012

Applicant: International Business Machines Corporation

Inventors: Daniel A. Brokenshire, John A. Gunnels, Michael D. Kistler
Method and structure for producing high performance linear algebra routines using a hybrid full-packed storage format

Patent number: 8229990

Abstract: A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method to at least one of reduce a memory space requirement and to increase a processing efficiency in a computerized method of linear algebra processing. A hybrid full-packed data structure is generated for processing data of a triangular matrix by one or more dense linear algebra (DLA) matrix subroutines designed to process matrix data in a full format, as modified to process matrix data using said hybrid full-packed data structure into a hybrid full-packed data structure, as follows. A portion of the triangular matrix data is determined that would comprise a square portion having a dimension approximately one half a dimension of the triangular matrix data.

Type: Grant

Filed: February 25, 2008

Date of Patent: July 24, 2012

Assignee: International Business Machines Corporation

Inventors: Fred Gehrung Gustavson, John A. Gunnels
Method and structure for producing high performance linear algebra routines using streaming

Patent number: 8200726

Abstract: A method (and structure) for executing a linear algebra subroutine on a computer having a cache, includes streaming data for matrices involved in processing the linear algebra subroutine such that data is processed using data for a first matrix stored in the cache as a matrix format and data from a second matrix and a third matrix is stored in a memory device at a higher level than the cache, the streaming providing data from the higher level as the streaming data is required for the processing.

Type: Grant

Filed: January 5, 2009

Date of Patent: June 12, 2012

Assignee: International Business Machines Corporation

Inventors: Fred Gehrung Gustavson, John A. Gunnels
Matrix Multiplication Operations Using Pair-Wise Load and Splat Operations

Publication number: 20120011348

Abstract: Mechanisms for performing a matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A pair-wise load and splat operation is performed to load a pair of scalar values of a second vector operand and replicate the pair of scalar values within a second target vector register. An operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored. This operation may be repeated for a second pair of scalar values of the second vector operand.

Type: Application

Filed: July 12, 2010

Publication date: January 12, 2012

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels, Valentina Salapura
Shared Prefetching to Reduce Execution Skew in Multi-Threaded Systems

Publication number: 20110276786

Abstract: Mechanisms are provided for optimizing code to perform prefetching of data into a shared memory of a computing device that is shared by a plurality of threads that execute on the computing device. A memory stream of a portion of code that is shared by the plurality of threads is identified. A set of prefetch instructions is distributed across the plurality of threads. Prefetch instructions are inserted into the instruction sequences of the plurality of threads such that each instruction sequence has a separate sub-portion of the set of prefetch instructions, thereby generating optimized code. Executable code is generated based on the optimized code and stored in a storage device. The executable code, when executed, performs the prefetches associated with the distributed set of prefetch instructions in a shared manner across the plurality of threads.

Type: Application

Filed: May 4, 2010

Publication date: November 10, 2011

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Alexandre E. Eichenberger, John A. Gunnels
Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids

Patent number: 8055878

Abstract: A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.

Type: Grant

Filed: February 8, 2005

Date of Patent: November 8, 2011

Assignee: International Business Machines Corporation

Inventors: Siddhartha Chatterjee, John A. Gunnels
Performance evaluation of algorithmic tasks and dynamic parameterization on multi-core processing systems

Patent number: 8037215

Abstract: Apparatus for evaluating the performance of DMA-based algorithmic tasks on a target multi-core processing system includes a memory and at least one processor coupled to the memory. The processor is operative: to input a template for a specified task, the template including DMA-related parameters specifying DMA operations and computational operations to be performed; to evaluate performance for the specified task by running a benchmark on the target multi-core processing system, the benchmark being operative to generate data access patterns using DMA operations and invoking prescribed computation routines as specified by the input template; and to provide results of the benchmark indicative of a measure of performance of the specified task corresponding to the target multi-core processing system.

Type: Grant

Filed: May 30, 2008

Date of Patent: October 11, 2011

Assignee: International Business Machines Corporation

Inventors: John A. Gunnels, Shakti Kapoor, Ravi Kothari, Yogish Sabharwal, James C. Sexton
MULTI-PETASCALE HIGHLY EFFICIENT PARALLEL SUPERCOMPUTER

Publication number: 20110219208

Abstract: A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC).

Type: Application

Filed: January 10, 2011

Publication date: September 8, 2011

Applicant: International Business Machines Corporation

Inventors: Sameh Asaad, Ralph E. Bellofatto, Michael A. Blocksome, Matthias A. Blumrich, Peter Boyle, Jose R. Brunheroto, Dong Chen, Chen-Yong Cher, George L. Chiu, Norman Christ, Paul W. Coteus, Kristan D. Davis, Gabor J. Dozsa, Alexandre E. Eichenberger, Noel A. Eisley, Matthew R. Ellavsky, Kahn C. Evans, Bruce M. Fleischer, Thomas W. Fox, Alan Gara, Mark E. Giampapa, Thomas M. Gooding, Michael K. Gschwind, John A. Gunnels, Shawn A. Hall, Rudolf A. Haring, Philip Heidelberger, Todd A. Inglett, Brant L. Knudson, Gerard V. Kopcsay, Sameer Kumar, Amith R. Mamidala, James A. Marcella, Mark G. Megerian, Douglas R. Miller, Samuel J. Miller, Adam J. Muff, Michael B. Mundy, John K. O'Brien, Kathryn M. O'Brien, Martin Ohmacht, Jeffrey J. Parker, Ruth J. Poole, Joseph D. Ratterman, Valentina Salapura, David L. Satterfield, Robert M. Senger, Brian Smith, Burkhard Steinmacher-Burow, William M. Stockdell, Craig B. Stunkel, Krishnan Sugavanam, Yutaka Sugawara, Todd E. Takken, Barry M. Trager, James L. Van Oosten, Charles D. Wait, Robert E. Walkup, Alfred T. Watson, Robert W. Wisniewski, Peng Wu
METHOD AND STRUCTURE OF USING SIMD VECTOR ARCHITECTURES TO IMPLEMENT MATRIX MULTIPLICATION

Publication number: 20110055517

Abstract: A structure (and method) including a plurality of coprocessing units and a controller that selectively loads data for processing on the plurality of coprocessing units, using a compound loading instruction. The compound loading instruction includes a plurality of low-level software instructions that preliminarily processes input data in a manner predetermined to simulate an effect of a single hardware loading instruction that would provide optimal loading of complex matrix data by loading input data in accordance with the effect of multiplying i·i=?1.

Type: Application

Filed: August 26, 2009

Publication date: March 3, 2011

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Alexandre E. Eichenberger, Michael Karl Gschwind, John A. Gunnels, Fred Gehrung Gustavson, Brett Olsson
Matrix Multiplication Operations with Data Pre-Conditioning in a High Performance Computing Architecture

Publication number: 20110040821

Abstract: Mechanisms for performing matrix multiplication operations with data pre-conditioning in a high performance computing architecture are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A load and splat operation is performed to load an element of a second vector operand and replicating the element to each of a plurality of elements of a second target vector register. A multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product of the matrix multiplication operation is accumulated with other partial products of the matrix multiplication operation.

Type: Application

Filed: August 17, 2009

Publication date: February 17, 2011

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
Complex Matrix Multiplication Operations with Data Pre-Conditioning in a High Performance Computing Architecture

Publication number: 20110040822

Abstract: Mechanisms for performing a complex matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the complex matrix multiplication operation to a first target vector register. The first vector operand comprises a real and imaginary part of a first complex vector value. A complex load and splat operation is performed to load a second complex vector value of a second vector operand and replicate the second complex vector value within a second target vector register. The second complex vector value has a real and imaginary part. A cross multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the complex matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored in a result vector register.

Type: Application

Filed: August 17, 2009

Publication date: February 17, 2011

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
System and method for detecting a faulty object in a system

Patent number: 7853820

Abstract: A method (and system) for detecting at least one faulty object in a system including a plurality of objects in communication with each other in an n-dimensional architecture, includes probing a first plane of objects in the n-dimensional architecture and probing at least one other plane of objects in the n-dimensional architecture which would result in identifying a faulty object in the system.

Type: Grant

Filed: October 22, 2008

Date of Patent: December 14, 2010

Assignee: International Business Machines Corporation

Inventors: John A. Gunnels, Fred Gehrung Gustavson, Robert Daniel Engle
Method and structure for fast in-place transformation of standard full and packed matrix data formats

Patent number: 7844630

Abstract: A computerized method provides for an in-place transformation of matrix A data including a New Data Structure (NDS) format and a transformation T having a compact representation. The NDS represents data of the matrix A in a format other than a row major format or a column major format, such that the data for the matrix A is stored as contiguous sub matrices of size MB by NB in an order predetermined to provide the data for a matrix processing. The transformation T is applied to the MB by NB blocks, using an in-place transformation processing, thereby replacing data of the block A1 with the contents of T(A1).

Type: Grant

Filed: February 19, 2008

Date of Patent: November 30, 2010

Assignee: International Business Machines Corporation

Inventors: Fred Gehrung Gustavson, John A. Gunnels, James C. Sexton
Performance evaluation of algorithmic tasks and dynamic parameterization on multi-core processing systems

Patent number: 7793011

Abstract: A method for evaluating performance of DMA-based algorithmic tasks on a target multi-core processing system includes the steps of: inputting a template for a specified task, the template including DMA-related parameters specifying DMA operations and computational operations to be performed; evaluating performance for the specified task by running a benchmark on the target multi-core processing system, the benchmark being operative to generate data access patterns using DMA operations and invoking prescribed computation routines as specified by the input template; and providing results of the benchmark indicative of a measure of performance of the specified task corresponding to the target multi-core processing system.

Type: Grant

Filed: May 29, 2008

Date of Patent: September 7, 2010

Assignee: International Business Machines Corporation

Inventors: John A. Gunnels, Shakti Kapoor, Ravi Kothari, Yogish Sabharwal, James C. Sexton
Optimized Scalar Promotion with Load and Splat SIMD Instructions

Publication number: 20090307656

Abstract: Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.

Type: Application

Filed: June 6, 2008

Publication date: December 10, 2009

Applicant: International Business Machines Corporation

Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
Reducing Bandwidth Requirements for Matrix Multiplication

Publication number: 20090300091

Abstract: A block matrix multiplication mechanism is provided for reversing the visitation order of blocks at corner turns when performing a block matrix multiplication operation in a data processing system. The mechanism increases block size and divides each block into sub-blocks. By reversing the visitation order, the mechanism eliminates a sub-block load at the corner turns. The mechanism per forms sub-block matrix multiplication for each sub-block in a given block, and then repeats operation for a next block until all blocks are computed. The mechanism may determine block size and sub-block size to optimize load balancing and memory bandwidth. Therefore, the mechanism reduces maximum throughput and increases performance. In addition, the mechanism also reduces the number of multi-buffered local store buffers.

Type: Application

Filed: May 30, 2008

Publication date: December 3, 2009

Applicant: International Business Machines Corporation

Inventors: Daniel A. Brokenshire, John A. Gunnels, Michael D. Kistler

prev 1 2 3 4 5 6 7 next