Patents by Inventor John Gunnels
John Gunnels has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 8316072
Abstract: A method (and structure) of executing a matrix operation, includes, for a matrix A, separating the matrix A into blocks, each block having a size p-by-q. The blocks of size p-by-q are then stored in a cache or memory in at least one of the two following ways. The elements in at least one of the blocks are stored in a format in which elements of the block occupy a location different from an original location in the block, and/or the blocks of size p-by-q are stored in a format in which at least one block occupies a position different relative to its original position in the matrix A.
Type: Grant
Filed: August 21, 2008
Date of Patent: November 20, 2012
Assignee: International Business Machines Corporation
Inventors: Fred Gehrung Gustavson, John A. Gunnels, James C. Sexton
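The idea of storing p-by-q blocks contiguously can be illustrated with a minimal sketch. This is not the patented method; the function name `to_blocked`, the list-of-lists input, and the requirement that p and q divide the matrix dimensions are assumptions made only for illustration.

```python
def to_blocked(a, p, q):
    """Copy a row-major matrix (list of lists) into a flat list, one p-by-q block at a time."""
    m, n = len(a), len(a[0])
    if m % p or n % q:
        raise ValueError("this sketch assumes p divides the rows and q divides the columns")
    out = []
    for bi in range(0, m, p):              # first row of the current block
        for bj in range(0, n, q):          # first column of the current block
            for i in range(bi, bi + p):    # rows inside the block
                out.extend(a[i][bj:bj + q])
    return out


# 4-by-4 matrix holding 0..15 in row-major order.
matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
blocked = to_blocked(matrix, 2, 2)
print(blocked)
```

After the copy, each 2-by-2 block occupies four consecutive slots, so an element such as `matrix[1][0]` moves from flat index 4 to index 2.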
-
Publication number: 20120290816
Abstract: Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.
Type: Application
Filed: July 23, 2012
Publication date: November 15, 2012
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
-
Patent number: 8255884
Abstract: Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.
Type: Grant
Filed: June 6, 2008
Date of Patent: August 28, 2012
Assignee: International Business Machines Corporation
Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
-
Patent number: 8250130
Abstract: A block matrix multiplication mechanism is provided for reversing the visitation order of blocks at corner turns when performing a block matrix multiplication operation in a data processing system. The mechanism increases block size and divides each block into sub-blocks. By reversing the visitation order, the mechanism eliminates a sub-block load at the corner turns. The mechanism performs sub-block matrix multiplication for each sub-block in a given block, and then repeats operation for a next block until all blocks are computed. The mechanism may determine block size and sub-block size to optimize load balancing and memory bandwidth. Therefore, the mechanism reduces maximum throughput and increases performance. In addition, the mechanism also reduces the number of multi-buffered local store buffers.
Type: Grant
Filed: May 30, 2008
Date of Patent: August 21, 2012
Assignee: International Business Machines Corporation
Inventors: Daniel A. Brokenshire, John A. Gunnels, Michael D. Kistler
-
Publication number: 20120203816
Abstract: A block matrix multiplication mechanism is provided for reversing the visitation order of blocks at corner turns when performing a block matrix multiplication operation in a data processing system. By reversing the visitation order, the mechanism eliminates a block load at the corner turns. In accordance with the illustrative embodiment, a corner return is referred to as a “bounce” corner turn and results in a serpentine patterned processing order of the matrix blocks. The mechanism allows the data processing system to perform a block matrix multiplication operation with a maximum of three block transfers per time step. Therefore, the mechanism reduces maximum throughput and increases performance. In addition, the mechanism also reduces the number of multi-buffered local store buffers.
Type: Application
Filed: April 20, 2012
Publication date: August 9, 2012
Applicant: International Business Machines Corporation
Inventors: Daniel A. Brokenshire, John A. Gunnels, Michael D. Kistler
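The effect of a serpentine ("bounce") visitation order can be sketched with a simplified counting model. The model below is an assumption made for illustration, not the mechanism claimed in the application: computing result block (i, j) is taken to need one operand panel indexed by i and one indexed by j, and a panel is reloaded only when its index changes from the previous step. The function names are hypothetical.

```python
def raster_order(block_rows, block_cols):
    """Visit every block row left to right."""
    return [(i, j) for i in range(block_rows) for j in range(block_cols)]


def serpentine_order(block_rows, block_cols):
    """Visit block rows alternately left to right and right to left."""
    order = []
    for i in range(block_rows):
        cols = range(block_cols) if i % 2 == 0 else range(block_cols - 1, -1, -1)
        order.extend((i, j) for j in cols)
    return order


def operand_loads(order):
    """Count panel loads: one whenever the row index or the column index changes."""
    loads = 0
    prev_i = prev_j = None
    for i, j in order:
        loads += (i != prev_i) + (j != prev_j)
        prev_i, prev_j = i, j
    return loads


raster_loads = operand_loads(raster_order(3, 3))
serpentine_loads = operand_loads(serpentine_order(3, 3))
print(raster_loads, serpentine_loads)
```

Under this model the raster order reloads both panels at every corner turn, while the serpentine order keeps the column panel and reloads only the row panel, saving one load per turn.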
-
Patent number: 8229990
Abstract: A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method to at least one of reduce a memory space requirement and to increase a processing efficiency in a computerized method of linear algebra processing. A hybrid full-packed data structure is generated for processing data of a triangular matrix by one or more dense linear algebra (DLA) matrix subroutines designed to process matrix data in a full format, as modified to process matrix data using said hybrid full-packed data structure into a hybrid full-packed data structure, as follows. A portion of the triangular matrix data is determined that would comprise a square portion having a dimension approximately one half a dimension of the triangular matrix data.
Type: Grant
Filed: February 25, 2008
Date of Patent: July 24, 2012
Assignee: International Business Machines Corporation
Inventors: Fred Gehrung Gustavson, John A. Gunnels
-
Patent number: 8200726
Abstract: A method (and structure) for executing a linear algebra subroutine on a computer having a cache, includes streaming data for matrices involved in processing the linear algebra subroutine such that data is processed using data for a first matrix stored in the cache as a matrix format and data from a second matrix and a third matrix is stored in a memory device at a higher level than the cache, the streaming providing data from the higher level as the streaming data is required for the processing.
Type: Grant
Filed: January 5, 2009
Date of Patent: June 12, 2012
Assignee: International Business Machines Corporation
Inventors: Fred Gehrung Gustavson, John A. Gunnels
-
Publication number: 20120011348
Abstract: Mechanisms for performing a matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A pair-wise load and splat operation is performed to load a pair of scalar values of a second vector operand and replicate the pair of scalar values within a second target vector register. An operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored. This operation may be repeated for a second pair of scalar values of the second vector operand.
Type: Application
Filed: July 12, 2010
Publication date: January 12, 2012
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels, Valentina Salapura
-
Publication number: 20110276786
Abstract: Mechanisms are provided for optimizing code to perform prefetching of data into a shared memory of a computing device that is shared by a plurality of threads that execute on the computing device. A memory stream of a portion of code that is shared by the plurality of threads is identified. A set of prefetch instructions is distributed across the plurality of threads. Prefetch instructions are inserted into the instruction sequences of the plurality of threads such that each instruction sequence has a separate sub-portion of the set of prefetch instructions, thereby generating optimized code. Executable code is generated based on the optimized code and stored in a storage device. The executable code, when executed, performs the prefetches associated with the distributed set of prefetch instructions in a shared manner across the plurality of threads.
Type: Application
Filed: May 4, 2010
Publication date: November 10, 2011
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Alexandre E. Eichenberger, John A. Gunnels
-
Patent number: 8055878
Abstract: A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.
Type: Grant
Filed: February 8, 2005
Date of Patent: November 8, 2011
Assignee: International Business Machines Corporation
Inventors: Siddhartha Chatterjee, John A. Gunnels
-
Patent number: 8037215
Abstract: Apparatus for evaluating the performance of DMA-based algorithmic tasks on a target multi-core processing system includes a memory and at least one processor coupled to the memory. The processor is operative: to input a template for a specified task, the template including DMA-related parameters specifying DMA operations and computational operations to be performed; to evaluate performance for the specified task by running a benchmark on the target multi-core processing system, the benchmark being operative to generate data access patterns using DMA operations and invoking prescribed computation routines as specified by the input template; and to provide results of the benchmark indicative of a measure of performance of the specified task corresponding to the target multi-core processing system.
Type: Grant
Filed: May 30, 2008
Date of Patent: October 11, 2011
Assignee: International Business Machines Corporation
Inventors: John A. Gunnels, Shakti Kapoor, Ravi Kothari, Yogish Sabharwal, James C. Sexton
-
Publication number: 20110219208
Abstract: A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC).
Type: Application
Filed: January 10, 2011
Publication date: September 8, 2011
Applicant: International Business Machines Corporation
Inventors: Sameh Asaad, Ralph E. Bellofatto, Michael A. Blocksome, Matthias A. Blumrich, Peter Boyle, Jose R. Brunheroto, Dong Chen, Chen-Yong Cher, George L. Chiu, Norman Christ, Paul W. Coteus, Kristan D. Davis, Gabor J. Dozsa, Alexandre E. Eichenberger, Noel A. Eisley, Matthew R. Ellavsky, Kahn C. Evans, Bruce M. Fleischer, Thomas W. Fox, Alan Gara, Mark E. Giampapa, Thomas M. Gooding, Michael K. Gschwind, John A. Gunnels, Shawn A. Hall, Rudolf A. Haring, Philip Heidelberger, Todd A. Inglett, Brant L. Knudson, Gerard V. Kopcsay, Sameer Kumar, Amith R. Mamidala, James A. Marcella, Mark G. Megerian, Douglas R. Miller, Samuel J. Miller, Adam J. Muff, Michael B. Mundy, John K. O'Brien, Kathryn M. O'Brien, Martin Ohmacht, Jeffrey J. Parker, Ruth J. Poole, Joseph D. Ratterman, Valentina Salapura, David L. Satterfield, Robert M. Senger, Brian Smith, Burkhard Steinmacher-Burow, William M. Stockdell, Craig B. Stunkel, Krishnan Sugavanam, Yutaka Sugawara, Todd E. Takken, Barry M. Trager, James L. Van Oosten, Charles D. Wait, Robert E. Walkup, Alfred T. Watson, Robert W. Wisniewski, Peng Wu
-
Publication number: 20110055517
Abstract: A structure (and method) including a plurality of coprocessing units and a controller that selectively loads data for processing on the plurality of coprocessing units, using a compound loading instruction. The compound loading instruction includes a plurality of low-level software instructions that preliminarily processes input data in a manner predetermined to simulate an effect of a single hardware loading instruction that would provide optimal loading of complex matrix data by loading input data in accordance with the effect of multiplying i·i=−1.
Type: Application
Filed: August 26, 2009
Publication date: March 3, 2011
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Alexandre E. Eichenberger, Michael Karl Gschwind, John A. Gunnels, Fred Gehrung Gustavson, Brett Olsson
-
Publication number: 20110040821
Abstract: Mechanisms for performing matrix multiplication operations with data pre-conditioning in a high performance computing architecture are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A load and splat operation is performed to load an element of a second vector operand and replicate the element to each of a plurality of elements of a second target vector register. A multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product of the matrix multiplication operation is accumulated with other partial products of the matrix multiplication operation.
Type: Application
Filed: August 17, 2009
Publication date: February 17, 2011
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
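The vector load, load-and-splat, and multiply-add sequence described in this abstract can be imitated in plain Python, with lists standing in for vector registers. This is only a sketch under stated assumptions: the four-element register width `VLEN`, the column-wise layout of the first operand, and all function names are hypothetical and are not taken from the application.

```python
VLEN = 4  # assumed vector register width, in elements


def splat(scalar, width=VLEN):
    """Replicate one scalar into every element of a simulated vector register."""
    return [scalar] * width


def multiply_add(acc, u, v):
    """Element-wise acc + u * v on simulated vector registers."""
    return [c + x * y for c, x, y in zip(acc, u, v)]


def matmul_splat(a_cols, b, n_cols):
    """Compute C = A * B one column of C at a time.

    a_cols[k] is column k of A (one vector load of VLEN elements);
    b[k][j] is a scalar of B that gets splatted before the multiply-add.
    """
    c_cols = []
    for j in range(n_cols):
        acc = [0] * VLEN                      # accumulator for column j of C
        for k, a_vec in enumerate(a_cols):    # "vector load" of column k of A
            acc = multiply_add(acc, a_vec, splat(b[k][j]))
        c_cols.append(acc)
    return c_cols


a_cols = [[1, 2, 3, 4], [5, 6, 7, 8]]   # A is 4-by-2, stored by columns
b = [[1, 0], [2, 1]]                    # B is 2-by-2
c_cols = matmul_splat(a_cols, b, 2)
print(c_cols)
```

Each pass of the inner loop adds one partial product to the accumulator, so column j of C is complete once every k has been visited.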
-
Publication number: 20110040822
Abstract: Mechanisms for performing a complex matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the complex matrix multiplication operation to a first target vector register. The first vector operand comprises a real and imaginary part of a first complex vector value. A complex load and splat operation is performed to load a second complex vector value of a second vector operand and replicate the second complex vector value within a second target vector register. The second complex vector value has a real and imaginary part. A cross multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the complex matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored in a result vector register.
Type: Application
Filed: August 17, 2009
Publication date: February 17, 2011
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
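The arithmetic behind a complex multiply-add on interleaved real and imaginary parts follows from (a + bi)(c + di) = (ac − bd) + (ad + bc)i. The sketch below shows only that arithmetic on a simulated register; the interleaved layout `[re, im, re, im, ...]` and the function name `complex_multiply_add` are assumptions for illustration and do not describe the instructions in the application.

```python
def complex_multiply_add(acc, x, s):
    """Return acc + x * s for interleaved complex data.

    acc and x hold complex values as [re, im, re, im, ...];
    s is a single (re, im) pair, as if splatted across a register.
    """
    s_re, s_im = s
    out = list(acc)
    for t in range(0, len(x), 2):
        x_re, x_im = x[t], x[t + 1]
        out[t] += x_re * s_re - x_im * s_im       # real part: ac - bd
        out[t + 1] += x_re * s_im + x_im * s_re   # imaginary part: ad + bc
    return out


x = [1, 2, 3, 4]             # the values 1+2i and 3+4i
s = (5, 6)                   # the value 5+6i
result = complex_multiply_add([0, 0, 0, 0], x, s)
print(result)
```

The result can be cross-checked against Python's built-in complex type, for example `complex(1, 2) * complex(5, 6)`.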
-
Patent number: 7853820
Abstract: A method (and system) for detecting at least one faulty object in a system including a plurality of objects in communication with each other in an n-dimensional architecture, includes probing a first plane of objects in the n-dimensional architecture and probing at least one other plane of objects in the n-dimensional architecture which would result in identifying a faulty object in the system.
Type: Grant
Filed: October 22, 2008
Date of Patent: December 14, 2010
Assignee: International Business Machines Corporation
Inventors: John A. Gunnels, Fred Gehrung Gustavson, Robert Daniel Engle
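One way to picture plane probing is to test each axis-aligned plane of a grid and intersect the planes whose probe fails. The sketch below assumes a single faulty object and a hypothetical `plane_is_healthy(axis, index)` callback that reports whether a plane passes; neither assumption comes from the patent, which does not specify them in this abstract.

```python
def locate_fault(shape, plane_is_healthy):
    """Locate a single faulty node in an n-dimensional grid by probing planes.

    shape: extent of the grid along each axis, e.g. (4, 4, 4).
    plane_is_healthy(axis, index): hypothetical probe that returns False when
    the plane of nodes with coordinate `index` along `axis` contains a fault.
    Returns the fault's coordinates, or None if this sketch cannot decide.
    """
    coords = []
    for axis, extent in enumerate(shape):
        failing = [k for k in range(extent) if not plane_is_healthy(axis, k)]
        if len(failing) != 1:
            return None          # no fault, or more faults than the sketch handles
        coords.append(failing[0])
    return tuple(coords)


# Simulated system: exactly one faulty node at a known position.
faulty_node = (2, 0, 3)


def probe(axis, index):
    """A plane is healthy unless it contains the simulated faulty node."""
    return faulty_node[axis] != index


located = locate_fault((4, 4, 4), probe)
print(located)
```

In this 4-by-4-by-4 example the sketch issues 12 plane probes rather than testing all 64 nodes individually.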
-
Patent number: 7844630
Abstract: A computerized method provides for an in-place transformation of matrix A data including a New Data Structure (NDS) format and a transformation T having a compact representation. The NDS represents data of the matrix A in a format other than a row major format or a column major format, such that the data for the matrix A is stored as contiguous sub matrices of size MB by NB in an order predetermined to provide the data for a matrix processing. The transformation T is applied to the MB by NB blocks, using an in-place transformation processing, thereby replacing data of the block A1 with the contents of T(A1).
Type: Grant
Filed: February 19, 2008
Date of Patent: November 30, 2010
Assignee: International Business Machines Corporation
Inventors: Fred Gehrung Gustavson, John A. Gunnels, James C. Sexton
-
Patent number: 7793011
Abstract: A method for evaluating performance of DMA-based algorithmic tasks on a target multi-core processing system includes the steps of: inputting a template for a specified task, the template including DMA-related parameters specifying DMA operations and computational operations to be performed; evaluating performance for the specified task by running a benchmark on the target multi-core processing system, the benchmark being operative to generate data access patterns using DMA operations and invoking prescribed computation routines as specified by the input template; and providing results of the benchmark indicative of a measure of performance of the specified task corresponding to the target multi-core processing system.
Type: Grant
Filed: May 29, 2008
Date of Patent: September 7, 2010
Assignee: International Business Machines Corporation
Inventors: John A. Gunnels, Shakti Kapoor, Ravi Kothari, Yogish Sabharwal, James C. Sexton
-
Publication number: 20090307656
Abstract: Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine.
Type: Application
Filed: June 6, 2008
Publication date: December 10, 2009
Applicant: International Business Machines Corporation
Inventors: Alexandre E. Eichenberger, Michael K. Gschwind, John A. Gunnels
-
Publication number: 20090300091
Abstract: A block matrix multiplication mechanism is provided for reversing the visitation order of blocks at corner turns when performing a block matrix multiplication operation in a data processing system. The mechanism increases block size and divides each block into sub-blocks. By reversing the visitation order, the mechanism eliminates a sub-block load at the corner turns. The mechanism performs sub-block matrix multiplication for each sub-block in a given block, and then repeats operation for a next block until all blocks are computed. The mechanism may determine block size and sub-block size to optimize load balancing and memory bandwidth. Therefore, the mechanism reduces maximum throughput and increases performance. In addition, the mechanism also reduces the number of multi-buffered local store buffers.
Type: Application
Filed: May 30, 2008
Publication date: December 3, 2009
Applicant: International Business Machines Corporation
Inventors: Daniel A. Brokenshire, John A. Gunnels, Michael D. Kistler