Patents by Inventor John Gunnels

John Gunnels has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 8316072
    Abstract: A method (and structure) of executing a matrix operation, includes, for a matrix A, separating the matrix A into blocks, each block having a size p-by-q. The blocks of size p-by-q are then stored in a cache or memory in at least one of the two following ways. The elements in at least one of the blocks is stored in a format in which elements of the block occupy a location different from an original location in the block, and/or the blocks of size p-by-q are stored in a format in which at least one block occupies a position different relative to its original position in the matrix A.
    Type: Grant
    Filed: August 21, 2008
    Date of Patent: November 20, 2012
    Assignee: International Business Machines Corporation
    Inventors: Fred Gehrung Gustavson, John A. Gunnels, James C. Sexton
  • Publication number: 20120290816
    Abstract: Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.
    Type: Application
    Filed: July 23, 2012
    Publication date: November 15, 2012
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
  • Patent number: 8255884
    Abstract: Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.
    Type: Grant
    Filed: June 6, 2008
    Date of Patent: August 28, 2012
    Assignee: International Business Machines Corporation
    Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
  • Patent number: 8250130
    Abstract: A block matrix multiplication mechanism is provided for reversing the visitation order of blocks at corner turns when performing a block matrix multiplication operation in a data processing system. The mechanism increases block size and divides each block into sub-blocks. By reversing the visitation order, the mechanism eliminates a sub-block load at the corner turns. The mechanism performs sub-block matrix multiplication for each sub-block in a given block, and then repeats operation for a next block until all blocks are computed. The mechanism may determine block size and sub-block size to optimize load balancing and memory bandwidth. Therefore, the mechanism reduces maximum throughput and increases performance. In addition, the mechanism also reduces the number of multi-buffered local store buffers.
    Type: Grant
    Filed: May 30, 2008
    Date of Patent: August 21, 2012
    Assignee: International Business Machines Corporation
    Inventors: Daniel A. Brokenshire, John A. Gunnels, Michael D. Kistler
  • Publication number: 20120203816
    Abstract: A block matrix multiplication mechanism is provided for reversing the visitation order of blocks at corner turns when performing a block matrix multiplication operation in a data processing system. By reversing the visitation order, the mechanism eliminates a block load at the corner turns. In accordance with the illustrative embodiment, a corner return is referred to as a “bounce” corner turn and results in a serpentine patterned processing order of the matrix blocks. The mechanism allows the data processing system to perform a block matrix multiplication operation with a maximum of three block transfers per time step. Therefore, the mechanism reduces maximum throughput and increases performance. In addition, the mechanism also reduces the number of multi-buffered local store buffers.
    Type: Application
    Filed: April 20, 2012
    Publication date: August 9, 2012
    Applicant: International Business Machines Corporation
    Inventors: Daniel A. Brokenshire, John A. Gunnels, Michael D. Kistler
  • Patent number: 8229990
    Abstract: A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method to at least one of reduce a memory space requirement and to increase a processing efficiency in a computerized method of linear algebra processing. A hybrid full-packed data structure is generated for processing data of a triangular matrix by one or more dense linear algebra (DLA) matrix subroutines designed to process matrix data in a full format, as modified to process matrix data using said hybrid full-packed data structure into a hybrid full-packed data structure, as follows. A portion of the triangular matrix data is determined that would comprise a square portion having a dimension approximately one half a dimension of the triangular matrix data.
    Type: Grant
    Filed: February 25, 2008
    Date of Patent: July 24, 2012
    Assignee: International Business Machines Corporation
    Inventors: Fred Gehrung Gustavson, John A. Gunnels
  • Patent number: 8200726
    Abstract: A method (and structure) for executing a linear algebra subroutine on a computer having a cache, includes streaming data for matrices involved in processing the linear algebra subroutine such that data is processed using data for a first matrix stored in the cache as a matrix format and data from a second matrix and a third matrix is stored in a memory device at a higher level than the cache, the streaming providing data from the higher level as the streaming data is required for the processing.
    Type: Grant
    Filed: January 5, 2009
    Date of Patent: June 12, 2012
    Assignee: International Business Machines Corporation
    Inventors: Fred Gehrung Gustavson, John A. Gunnels
  • Publication number: 20120011348
    Abstract: Mechanisms for performing a matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A pair-wise load and splat operation is performed to load a pair of scalar values of a second vector operand and replicate the pair of scalar values within a second target vector register. An operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored. This operation may be repeated for a second pair of scalar values of the second vector operand.
    Type: Application
    Filed: July 12, 2010
    Publication date: January 12, 2012
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels, Valentina Salapura
  • Publication number: 20110276786
    Abstract: Mechanisms are provided for optimizing code to perform prefetching of data into a shared memory of a computing device that is shared by a plurality of threads that execute on the computing device. A memory stream of a portion of code that is shared by the plurality of threads is identified. A set of prefetch instructions is distributed across the plurality of threads. Prefetch instructions are inserted into the instruction sequences of the plurality of threads such that each instruction sequence has a separate sub-portion of the set of prefetch instructions, thereby generating optimized code. Executable code is generated based on the optimized code and stored in a storage device. The executable code, when executed, performs the prefetches associated with the distributed set of prefetch instructions in a shared manner across the plurality of threads.
    Type: Application
    Filed: May 4, 2010
    Publication date: November 10, 2011
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Alexandre E. Eichenberger, John A. Gunnels
  • Patent number: 8055878
    Abstract: A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.
    Type: Grant
    Filed: February 8, 2005
    Date of Patent: November 8, 2011
    Assignee: International Business Machines Corporation
    Inventors: Siddhartha Chatterjee, John A. Gunnels
  • Patent number: 8037215
    Abstract: Apparatus for evaluating the performance of DMA-based algorithmic tasks on a target multi-core processing system includes a memory and at least one processor coupled to the memory. The processor is operative: to input a template for a specified task, the template including DMA-related parameters specifying DMA operations and computational operations to be performed; to evaluate performance for the specified task by running a benchmark on the target multi-core processing system, the benchmark being operative to generate data access patterns using DMA operations and invoking prescribed computation routines as specified by the input template; and to provide results of the benchmark indicative of a measure of performance of the specified task corresponding to the target multi-core processing system.
    Type: Grant
    Filed: May 30, 2008
    Date of Patent: October 11, 2011
    Assignee: International Business Machines Corporation
    Inventors: John A. Gunnels, Shakti Kapoor, Ravi Kothari, Yogish Sabharwal, James C. Sexton
  • Publication number: 20110219208
    Abstract: A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC).
    Type: Application
    Filed: January 10, 2011
    Publication date: September 8, 2011
    Applicant: International Business Machines Corporation
    Inventors: Sameh Asaad, Ralph E. Bellofatto, Michael A. Blocksome, Matthias A. Blumrich, Peter Boyle, Jose R. Brunheroto, Dong Chen, Chen-Yong Cher, George L. Chiu, Norman Christ, Paul W. Coteus, Kristan D. Davis, Gabor J. Dozsa, Alexandre E. Eichenberger, Noel A. Eisley, Matthew R. Ellavsky, Kahn C. Evans, Bruce M. Fleischer, Thomas W. Fox, Alan Gara, Mark E. Giampapa, Thomas M. Gooding, Michael K. Gschwind, John A. Gunnels, Shawn A. Hall, Rudolf A. Haring, Philip Heidelberger, Todd A. Inglett, Brant L. Knudson, Gerard V. Kopcsay, Sameer Kumar, Amith R. Mamidala, James A. Marcella, Mark G. Megerian, Douglas R. Miller, Samuel J. Miller, Adam J. Muff, Michael B. Mundy, John K. O'Brien, Kathryn M. O'Brien, Martin Ohmacht, Jeffrey J. Parker, Ruth J. Poole, Joseph D. Ratterman, Valentina Salapura, David L. Satterfield, Robert M. Senger, Brian Smith, Burkhard Steinmacher-Burow, William M. Stockdell, Craig B. Stunkel, Krishnan Sugavanam, Yutaka Sugawara, Todd E. Takken, Barry M. Trager, James L. Van Oosten, Charles D. Wait, Robert E. Walkup, Alfred T. Watson, Robert W. Wisniewski, Peng Wu
  • Publication number: 20110055517
    Abstract: A structure (and method) including a plurality of coprocessing units and a controller that selectively loads data for processing on the plurality of coprocessing units, using a compound loading instruction. The compound loading instruction includes a plurality of low-level software instructions that preliminarily processes input data in a manner predetermined to simulate an effect of a single hardware loading instruction that would provide optimal loading of complex matrix data by loading input data in accordance with the effect of multiplying i·i=?1.
    Type: Application
    Filed: August 26, 2009
    Publication date: March 3, 2011
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Alexandre E. Eichenberger, Michael Karl Gschwind, John A. Gunnels, Fred Gehrung Gustavson, Brett Olsson
  • Publication number: 20110040821
    Abstract: Mechanisms for performing matrix multiplication operations with data pre-conditioning in a high performance computing architecture are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A load and splat operation is performed to load an element of a second vector operand and replicating the element to each of a plurality of elements of a second target vector register. A multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product of the matrix multiplication operation is accumulated with other partial products of the matrix multiplication operation.
    Type: Application
    Filed: August 17, 2009
    Publication date: February 17, 2011
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
  • Publication number: 20110040822
    Abstract: Mechanisms for performing a complex matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the complex matrix multiplication operation to a first target vector register. The first vector operand comprises a real and imaginary part of a first complex vector value. A complex load and splat operation is performed to load a second complex vector value of a second vector operand and replicate the second complex vector value within a second target vector register. The second complex vector value has a real and imaginary part. A cross multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the complex matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored in a result vector register.
    Type: Application
    Filed: August 17, 2009
    Publication date: February 17, 2011
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
  • Patent number: 7853820
    Abstract: A method (and system) for detecting at least one faulty object in a system including a plurality of objects in communication with each other in an n-dimensional architecture, includes probing a first plane of objects in the n-dimensional architecture and probing at least one other plane of objects in the n-dimensional architecture which would result in identifying a faulty object in the system.
    Type: Grant
    Filed: October 22, 2008
    Date of Patent: December 14, 2010
    Assignee: International Business Machines Corporation
    Inventors: John A. Gunnels, Fred Gehrung Gustavson, Robert Daniel Engle
  • Patent number: 7844630
    Abstract: A computerized method provides for an in-place transformation of matrix A data including a New Data Structure (NDS) format and a transformation T having a compact representation. The NDS represents data of the matrix A in a format other than a row major format or a column major format, such that the data for the matrix A is stored as contiguous sub matrices of size MB by NB in an order predetermined to provide the data for a matrix processing. The transformation T is applied to the MB by NB blocks, using an in-place transformation processing, thereby replacing data of the block A1 with the contents of T(A1).
    Type: Grant
    Filed: February 19, 2008
    Date of Patent: November 30, 2010
    Assignee: International Business Machines Corporation
    Inventors: Fred Gehrung Gustavson, John A. Gunnels, James C. Sexton
  • Patent number: 7793011
    Abstract: A method for evaluating performance of DMA-based algorithmic tasks on a target multi-core processing system includes the steps of: inputting a template for a specified task, the template including DMA-related parameters specifying DMA operations and computational operations to be performed; evaluating performance for the specified task by running a benchmark on the target multi-core processing system, the benchmark being operative to generate data access patterns using DMA operations and invoking prescribed computation routines as specified by the input template; and providing results of the benchmark indicative of a measure of performance of the specified task corresponding to the target multi-core processing system.
    Type: Grant
    Filed: May 29, 2008
    Date of Patent: September 7, 2010
    Assignee: International Business Machines Corporation
    Inventors: John A. Gunnels, Shakti Kapoor, Ravi Kothari, Yogish Sabharwal, James C. Sexton
  • Publication number: 20090307656
    Abstract: Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.
    Type: Application
    Filed: June 6, 2008
    Publication date: December 10, 2009
    Applicant: International Business Machines Corporation
    Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
  • Publication number: 20090300091
    Abstract: A block matrix multiplication mechanism is provided for reversing the visitation order of blocks at corner turns when performing a block matrix multiplication operation in a data processing system. The mechanism increases block size and divides each block into sub-blocks. By reversing the visitation order, the mechanism eliminates a sub-block load at the corner turns. The mechanism per forms sub-block matrix multiplication for each sub-block in a given block, and then repeats operation for a next block until all blocks are computed. The mechanism may determine block size and sub-block size to optimize load balancing and memory bandwidth. Therefore, the mechanism reduces maximum throughput and increases performance. In addition, the mechanism also reduces the number of multi-buffered local store buffers.
    Type: Application
    Filed: May 30, 2008
    Publication date: December 3, 2009
    Applicant: International Business Machines Corporation
    Inventors: Daniel A. Brokenshire, John A. Gunnels, Michael D. Kistler