Patents by Inventor Matthias A. Blumrich

Matthias A. Blumrich has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 7305487
    Abstract: In a massively parallel computing system having a plurality of nodes configured in m multi-dimensions, each node including a computing device, a method for routing packets towards their destination nodes is provided which includes generating at least one of a 2m plurality of compact bit vectors containing information derived from downstream nodes. A multilevel arbitration process in which downstream information stored in the compact vectors, such as link status information and fullness of downstream buffers, is used to determine a preferred direction and virtual channel for packet transmission. Preferred direction ranges are encoded and virtual channels are selected by examining the plurality of compact bit vectors. This dynamic routing method eliminates the necessity of routing tables, thus enhancing scalability of the switch.
    Type: Grant
    Filed: February 25, 2002
    Date of Patent: December 4, 2007
    Assignee: International Business Machines Corporation
    Inventors: Matthias A. Blumrich, Dong Chen, Paul W. Coteus, Alan G. Gara, Mark E. Giampapa, Philip Heidelberger, Burkhard D. Steinmacher-Burow, Todd E. Takken, Pavlos M. Vranas
  • Publication number: 20070204112
    Abstract: A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed.
    Type: Application
    Filed: December 28, 2006
    Publication date: August 30, 2007
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Matthias Blumrich, Dong Chen, Paul Coteus, Alan Gara, Mark Giampapa, Philip Heidelberger, Dirk Hoenicke, Martin Ohmacht, Burkhard Steinmacher-Burow, Todd Takken, Pavlos Vranas
  • Publication number: 20070055825
    Abstract: A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
    Type: Application
    Filed: February 25, 2002
    Publication date: March 8, 2007
    Inventors: Matthias Blumrich, Dong Chan, Paul Coteus, Alan Gata, Mark Giampapa, Philip Heidelberger, Dirk Hoenicke, Martin Ohmacht
  • Patent number: 7174434
    Abstract: A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed.
    Type: Grant
    Filed: February 25, 2002
    Date of Patent: February 6, 2007
    Assignee: International Business Machines Corporation
    Inventors: Matthias A. Blumrich, Dong Chen, Paul W. Coteus, Alan G. Gara, Mark E. Giampapa, Philip Heidelberger, Dirk Hoenicke, Martin Ohmacht, Burkhard D. Steinmacher-Burow, Todd E. Takken, Pavlos M. Vranas
  • Publication number: 20060248370
    Abstract: The present invention concerns methods and apparatus for performing fault isolation in multiple node computing systems using commutative error detection values—for example, checksums—to identify and to isolate faulty nodes. In the present invention nodes forming the multiple node computing system are networked together and during program execution communicate with one another by transmitting information through the network. When information associated with a reproducible portion of a computer program is injected into the network by a node, a commutative error detection value is calculated and stored in commutative error detection apparatus associated with the node. At intervals, node fault detection apparatus associated with the multiple node computer system retrieve commutative error detection values saved in the commutative error detection apparatus associated with the node and stores them in memory.
    Type: Application
    Filed: April 14, 2005
    Publication date: November 2, 2006
    Inventors: Gheorghe Almasi, Matthias Blumrich, Dong Chen, Paul Coteus, Alan Gara, Mark Giampapa, Philip Heidelberger, Dirk Hoenicke, Sarabjeet Singh, Burkhard Steinmacher-Burow, Todd Takken, Pavlos Vranas
  • Publication number: 20060230239
    Abstract: A method and apparatus for detecting a cache wrap condition in a computing environment having a processor and a cache. A cache wrap condition is detected when the entire contents of a cache have been replaced, relative to a particular starting state. A set-associative cache is considered to have wrapped when all of the sets within the cache have been replaced. The starting point for cache wrap detection is the state of the cache sets at the time of the previous cache wrap. The method and apparatus is preferably implemented in a snoop filter having filter mechanisms that rely upon detecting the cache wrap condition. These snoop filter mechanisms requiring this information are operatively coupled with cache wrap detection logic adapted to detect the cache wrap event, and perform an indication step to the snoop filter mechanisms. In the various embodiments, cache wrap detection logic is implemented using registers and comparators, loadable counters, or a scoreboard data structure.
    Type: Application
    Filed: March 29, 2005
    Publication date: October 12, 2006
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Matthias Blumrich, Alan Gara, Mark Giampapa, Martin Ohmacht, Valentina Salapura
  • Publication number: 20060224837
    Abstract: A method and apparatus for supporting cache coherency in a multiprocessor computing environment having multiple processing units, each processing unit having one or more local cache memories associated and operatively connected therewith. The method comprises providing a snoop filter device associated with each processing unit, each snoop filter device having a plurality of dedicated input ports for receiving snoop requests from dedicated memory writing sources in the multiprocessor computing environment. Each of the memory writing sources is directly connected to the dedicated input ports of all other snoop filter devices associated with all other processing units in a point-to-point interconnect fashion.
    Type: Application
    Filed: March 29, 2005
    Publication date: October 5, 2006
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Matthias Blumrich, Dong Chen, Alan Gara, Mark Giampapa, Philip Heidelberger, Dirk Hoenicke, Martin Ohmacht, Valentina Salapura, Pavlos Vranas
  • Publication number: 20060224840
    Abstract: An apparatus for implementing snooping cache coherence that locally reduces the number of snoop requests presented to each cache in a multiprocessor system. A snoop filter device associated with a single processor includes one or more “scoreboard” data structures that make snoop determinations, i.e., for each snoop request from another processor, to determine if a request is to be forwarded to the processor or, discarded. At least one scoreboard is active, and at least one scoreboard is determined to be historic at any point in time. A snoop determination of the queue indicates that an entry may be in the cache, but does not indicate its actual residence status. In addition, the snoop filter block implementing scoreboard data structures is operatively coupled with a cache wrap detection logic means whereby, upon detection of a cache wrap condition, the content of the active scoreboard is copied into a historic scoreboard and the content of at least one active scoreboard is reset.
    Type: Application
    Filed: March 29, 2005
    Publication date: October 5, 2006
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Matthias Blumrich, Alan Gara, Thomas Puzak, Valentina Salapura
  • Publication number: 20060224839
    Abstract: A method and apparatus for implementing a snoop filter unit associated with a single processor in a multiprocessor system. The snoop filter unit has a plurality of ports, each port receiving snoop requests from exactly one dedicated source. Associated with each port is a snoop cache filter that processes each snoop cache request and records addresses of the most recent snoop requests for exactly one source. The snoop cache filter uses vector encoding to record the occurrence of snoop requests for a sequence of consecutive cache lines. All addresses of snoop requests are added to the snoop cache unless a received snoop cache request matches an entry present in the associated snoop cache, in which case the snoop request is discarded. Otherwise, the associated snoop cache request is enqueued for forwarding to the single processor.
    Type: Application
    Filed: March 29, 2005
    Publication date: October 5, 2006
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Matthias Blumrich, Alan Gara, Valentina Salapura
  • Publication number: 20060224838
    Abstract: A method and apparatus for supporting cache coherency in a multiprocessor computing environment having multiple processing units, each processing unit having one or more local cache memories associated and operatively connected therewith. The method comprises providing a snoop filter device associated with each processing unit, each snoop filter device having a plurality of dedicated input ports for receiving snoop requests from dedicated memory writing sources in the multiprocessor computing environment. Each snoop filter device includes a plurality of parallel operating port snoop filters in correspondence with the plurality of dedicated input ports, each port snoop filter implementing one or more parallel operating sub-filter elements that are adapted to concurrently filter snoop requests received from respective dedicated memory writing sources and forward a subset of those requests to its associated processing unit.
    Type: Application
    Filed: March 29, 2005
    Publication date: October 5, 2006
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Matthias Blumrich, Dong Chen, Alan Gara, Mark Giampapa, Philip Heidelberger, Dirk Hoenicke, Martin Ohmacht, Valentina Salapura, Pavlos Vranas
  • Publication number: 20060224835
    Abstract: A system and method for supporting cache coherency in a computing environment having multiple processing units, each unit having an associated cache memory system operatively coupled therewith. The system includes a plurality of interconnected snoop filter units, each snoop filter unit corresponding to and in communication with a respective processing unit, with each snoop filter unit comprising a plurality of devices for receiving asynchronous snoop requests from respective memory writing sources in the computing environment; and a point-to-point interconnect comprising communication links for directly connecting memory writing sources to corresponding receiving devices; and, a plurality of parallel operating filter devices coupled in one-to-one correspondence with each receiving device for processing snoop requests received thereat and one of forwarding requests or preventing forwarding of requests to its associated processing unit.
    Type: Application
    Filed: March 29, 2005
    Publication date: October 5, 2006
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Matthias Blumrich, Dong Chen, Alan Gara, Mark Giampapa, Philip Heidelberger, Dirk Hoenicke, Martin Ohmacht, Valentina Salapura, Pavlos Vranas
  • Publication number: 20060224836
    Abstract: A method and apparatus for supporting cache coherency in a multiprocessor computing environment having multiple processing units, each processing unit having a local cache memory associated therewith. A snoop filter device is associated with each processing unit and includes at least one snoop filter primitive implementing filtering method based on usage of stream registers sets and associated stream register comparison logic. From the plurality of stream registers sets, at least one stream register set is active, and at least one stream register set is labeled historic at any point in time. In addition, the snoop filter block is operatively coupled with cache wrap detection logic whereby the content of the active stream register set is switched into a historic stream register set upon the cache wrap condition detection, and the content of at least one active stream register set is reset.
    Type: Application
    Filed: March 29, 2005
    Publication date: October 5, 2006
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Matthias Blumrich, Alan Gara, Valentina Salapura
  • Publication number: 20050195808
    Abstract: Multidimensional switch data networks are disclosed, such as are used by a distributed-memory parallel computer, as applied for example to computations in the field of life sciences. A distributed memory parallel computing system comprises a number of parallel compute nodes and a message passing data network connecting the compute nodes together. The data network connecting the compute nodes comprises a multidimensional switch data network of compute nodes having N dimensions, and a number/array of compute nodes Ln in each of the N dimensions. Each compute node includes an N port routing element having a port for each of the N dimensions. Each compute node of an array of Ln compute nodes in each of the N dimensions connects through a port of its routing element to an Ln port crossbar switch having Ln ports. Several embodiments are disclosed of a 4 dimensional computing system having 65,536 compute nodes.
    Type: Application
    Filed: March 4, 2004
    Publication date: September 8, 2005
    Applicant: International Business Machines Corporation
    Inventors: Dong Chen, Alan Gara, Mark Giampapa, Philip Heidelberger, Dirk Hoenicke, Burkhard Steinmacher-Burow, Pavlos Vranas, Matthias Blumrich
  • Publication number: 20050081078
    Abstract: Disclosed are an error recovery method and system for use with a communication system having first and second nodes, each of said nodes having a receiver and a sender, the sender of the first node being connected to the receiver of the second node by a first cable, and the sender of the second node being connected to the receiver of the first node by a second cable. The method comprising the step of after one of the nodes detects an error, both of the nodes entering the same defined state. In particular, the receiver of the first node enters an error state, stays in the error state for a defined period of time T, and, after said defined period of time T, enters a wait state. Also, the sender of the first node sends to the receiver of the second node an error message for a defined period of time Te, and after the defined period of time Te, the sender of the first node enters an idle state.
    Type: Application
    Filed: September 30, 2003
    Publication date: April 14, 2005
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Matthias Blumrich, Dong Chen, Alan Gara, Philip Heidelberger, Dirk Hoenicke, Burkhard Steinmacher-Burow, Pavlos Vranas
  • Publication number: 20050002387
    Abstract: A one-bounce data network comprises a plurality of nodes interconnected to each other via communication links, the network including a plurality of interconnected switch devices, said switch devices interconnected such that a message is communicated between any two switches passes over a single link from a source switch to a destination switch; and, the source switch concurrently sends a message to an arbitrary bounce switch which then sends the message to the destination switch.
    Type: Application
    Filed: September 30, 2003
    Publication date: January 6, 2005
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Matthias A. Blumrich, Dong Chen, Alan G. Gara, Mark Giampapa, Philip Heidelberger, Burkhard Steinmacher-Burow, Pavlos Vranas
  • Publication number: 20040103218
    Abstract: A novel massively parallel supercomputer of hundreds of teraOPS-scale includes node architectures based upon System-On-a-Chip technology, i.e., each processing node comprises a single Application Specific Integrated Circuit (ASIC). Within each ASIC node is a plurality of processing elements each of which consists of a central processing unit (CPU) and plurality of floating point processors to enable optimal balance of computational performance, packaging density, low cost, and power and cooling requirements. The plurality of processors within a single node may be used individually or simultaneously to work on any combination of computation or communication as required by the particular algorithm being solved or executed at any point in time. The system-on-a-chip ASIC nodes are interconnected by multiple independent networks that optimally maximizes packet communications throughput and minimizes latency.
    Type: Application
    Filed: August 22, 2003
    Publication date: May 27, 2004
    Inventors: Matthias A Blumrich, Dong Chen, George L Chiu, Thomas M Cipolla, Paul W Coteus, Alan G Gara, Mark E Giampapa, Philip Heidelberg, Gerard V Kopcsay, Lawrence S Mok, Todd E Takken
  • Publication number: 20040081155
    Abstract: Class network routing is implemented in a network such as a computer network comprising a plurality of parallel compute processors at nodes thereof. Class network routing allows a compute processor to broadcast a message to a range (one or more) of other compute processors in the computer network, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class network routing pursuant to the invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a broadcast. Class network routing is also applied to dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function (multicast) capability. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions, which results in faster execution times.
    Type: Application
    Filed: August 22, 2003
    Publication date: April 29, 2004
    Inventors: Gyan V Bhanot, Matthias A. Blumrich, Dong Chen, Paul W. Coteus, Alan G. Gara, Mark E. Giampapa, Philip Heidelberger, Burkhard D. Steinmacher-Burow, Todd E. Takken, Pavlos M. Vranas
  • Publication number: 20040078493
    Abstract: A system and method for enabling high-speed, low-latency global tree communications among processing nodes interconnected according to a tree network structure. The global tree network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the tree via links to facilitate performance of low-latency global processing operations at nodes of the virtual tree and sub-tree structures. The global operations include one or more of: global broadcast operations downstream from a root node to leaf nodes of a virtual tree, global reduction operations upstream from leaf nodes to the root node in the virtual tree, and point-to-point message passing from any node to the root node in the virtual tree.
    Type: Application
    Filed: August 22, 2003
    Publication date: April 22, 2004
    Inventors: Matthias A Blumrich, Dong Chen, Paul W Coteus, Alan G Gara, Mark E Giampapa, Philip Heidelberger, Dirk Hoenicke, Burkhard D Steinmacher-Burow, Todd E Takken, Pavlos M Vranas
  • Publication number: 20040078482
    Abstract: In a massively parallel computing system having a plurality of nodes configured in m multi-dimensions, each node including a computing device, a method for routing packets towards their destination nodes is provided which includes generating at least one of a 2m plurality of compact bit vectors containing information derived from downstream nodes. A multilevel arbitration process in which downstream information stored in the compact vectors, such as link status information and fullness of downstream buffers, is used to determine a preferred direction and virtual channel for packet transmission. Preferred direction ranges are encoded and virtual channels are selected by examining the plurality of compact bit vectors. This dynamic routing method eliminates the necessity of routing tables, thus enhancing scalability of the switch.
    Type: Application
    Filed: August 22, 2003
    Publication date: April 22, 2004
    Inventors: Matthias A. Blumrich, Dong Chen, Paul W. Coteus, Alan G. Gara, Mark E. Giampapa, Philip Heidelberger, Burkhard D. Steinmacher-Burow, Todd E. Takken, Pavlos M. Vranas
  • Publication number: 20040073590
    Abstract: Methods and systems for performing arithmetic functions. In accordance with a first aspect of the invention, methods and apparatus are provided, working in conjunction of software algorithms and hardware implementation of class network routing, to achieve a very significant reduction in the time required for global arithmetic operation on the torus. Therefore, it leads to greater scalability of applications running on large parallel machines. The invention involves three steps in improving the efficiency and accuracy of global operations: (1) Ensuring, when necessary, that all the nodes do the global operation on the data in the same order and so obtain a unique answer, independent of roundoff error; (2) Using the topology of the torus to minimize the number of hops and the bidirectional capabilities of the network to reduce the number of time steps in the data transfer operation to an absolute minimum; and (3) Using class function routing to reduce latency in the data transfer.
    Type: Application
    Filed: August 22, 2003
    Publication date: April 15, 2004
    Inventors: Gyan Bhanot, Matthias A. Blumrich, Dong Chen, Alan G. Gara, Mark E. Giampapa, Philip Heidelberger, Burkhard D. Steinmacher-Burow, Pavlos M. Vranas