Patents by Inventor Simon C. Steely

Simon C. Steely has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20230350674
    Abstract: A processor of an aspect includes a decode unit to decode a matrix multiplication instruction. The matrix multiplication instruction is to indicate a first memory location of a first source matrix, is to indicate a second memory location of a second source matrix, and is to indicate a third memory location where a result matrix is to be stored. The processor also includes an execution unit coupled with the decode unit. The execution unit, in response to the matrix multiplication instruction, is to multiply a portion of the first and second source matrices prior to an interruption, and store a completion progress indicator in response to the interruption. The completion progress indicator to indicate an amount of progress in multiplying the first and second source matrices, and storing corresponding result data to the third memory location, that is to have been completed prior to the interruption.
    Type: Application
    Filed: July 10, 2023
    Publication date: November 2, 2023
    Inventors: Edward T. GROCHOWSKI, Asit K. MISHRA, Robert VALENTINE, Mark J. CHARNEY, Simon C. STEELY
  • Publication number: 20220107911
    Abstract: Systems, methods, and apparatuses relating to operations in a configurable spatial accelerator are described. In one embodiment, a configurable spatial accelerator includes a first processing element that includes a configuration register within the first processing element to store a configuration value that causes the first processing element to perform an operation according to the configuration value, a plurality of input queues, an input controller to control enqueue and dequeue of values into the plurality of input queues according to the configuration value, a plurality of output queues, and an output controller to control enqueue and dequeue of values into the plurality of output queues according to the configuration value.
    Type: Application
    Filed: December 14, 2021
    Publication date: April 7, 2022
    Inventors: Kermin E. FLEMING, Simon C. STEELY, Kent D. GLOSSOP, Mitchell DIAMOND, Benjamin KEEN, Dennis BRADFORD, Fabrizio Petrini, Barry TANNENBAUM, Yongzhi ZHANG
  • Patent number: 10915471
    Abstract: Systems, methods, and apparatuses relating to memory interface circuit allocation in a configurable spatial accelerator are described. In one embodiment, a configurable spatial accelerator (CSA) includes a plurality of processing elements; a plurality of request address file (RAF) circuits, and a circuit switched interconnect network between the plurality of processing elements and the RAF circuits. As a dataflow architecture, embodiments of CSA have a unique memory architecture where memory accesses are decoupled into an explicit request and response phase allowing pipelining through memory. Certain embodiments herein provide for an improved memory sub-system design via the improvements to allocation discussed herein.
    Type: Grant
    Filed: March 30, 2019
    Date of Patent: February 9, 2021
    Assignee: Intel Corporation
    Inventors: Kermin ChoFleming, Yu Bai, Simon C. Steely
  • Publication number: 20200310994
    Abstract: Systems, methods, and apparatuses relating to memory interface circuit allocation in a configurable spatial accelerator are described. In one embodiment, a configurable spatial accelerator (CSA) includes a plurality of processing elements; a plurality of request address file (RAF) circuits, and a circuit switched interconnect network between the plurality of processing elements and the RAF circuits. As a dataflow architecture, embodiments of CSA have a unique memory architecture where memory accesses are decoupled into an explicit request and response phase allowing pipelining through memory. Certain embodiments herein provide for an improved memory sub-system design via the improvements to allocation discussed herein.
    Type: Application
    Filed: March 30, 2019
    Publication date: October 1, 2020
    Inventors: Kermin ChoFleming, Yu Bai, Simon C. Steely
  • Patent number: 10515049
    Abstract: Methods and apparatuses relating to distributed memory hazard detection and error recovery are described. In one embodiment, a memory circuit includes a memory interface circuit to service memory requests from a spatial array of processing elements for data stored in a plurality of cache banks; and a hazard detection circuit in each of the plurality of cache banks, wherein a first hazard detection circuit for a speculative memory load request from the memory interface circuit, that is marked with a potential dynamic data dependency, to an address within a first cache bank of the first hazard detection circuit, is to mark the address for tracking of other memory requests to the address, store data from the address in speculative completion storage, and send the data from the speculative completion storage to the spatial array of processing elements when a memory dependency token is received for the speculative memory load request.
    Type: Grant
    Filed: July 1, 2017
    Date of Patent: December 24, 2019
    Assignee: intel corporation
    Inventors: Kermin E. Fleming, Simon C. Steely, Kent D. Glossop
  • Patent number: 10445098
    Abstract: Methods and apparatuses relating to privileged configuration in spatial arrays are described. In one embodiment, a processor includes processing elements; an interconnect network between the processing elements; and a configuration controller coupled to a first subset and a second, different subset of the plurality of processing elements, the first subset having an output coupled to an input of the second, different subset, wherein the configuration controller is to configure the interconnect network between the first subset and the second, different subset of the plurality of processing elements to not allow communication on the interconnect network between the first subset and the second, different subset when a privilege bit is set to a first value and to allow communication on the interconnect network between the first subset and the second, different subset of the plurality of processing elements when the privilege bit is set to a second value.
    Type: Grant
    Filed: September 30, 2017
    Date of Patent: October 15, 2019
    Assignee: Intel Corporation
    Inventors: Kermin E. Fleming, Simon C. Steely, Kent D. Glossop
  • Patent number: 10445250
    Abstract: Systems, methods, and apparatuses relating to a configurable spatial accelerator are described. In one embodiment, a processor includes a core with a decoder to decode an instruction into a decoded instruction and an execution unit to execute the decoded instruction to perform a first operation; a plurality of processing elements; and an interconnect network between the plurality of processing elements to receive an input of a dataflow graph comprising a plurality of nodes, wherein the dataflow graph is to be overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements, and the plurality of processing elements are to perform a second operation by a respective, incoming operand set arriving at each of the dataflow operators of the plurality of processing elements.
    Type: Grant
    Filed: December 30, 2017
    Date of Patent: October 15, 2019
    Assignee: Intel Corporation
    Inventors: Kermin E. Fleming, Kent D. Glossop, Simon C. Steely
  • Patent number: 10380063
    Abstract: Systems, methods, and apparatuses relating to a sequencer dataflow operator of a configurable spatial accelerator are described. In one embodiment, an interconnect network between a plurality of processing elements receives an input of a dataflow graph comprising a plurality of nodes forming a loop construct, wherein the dataflow graph is overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements and at least one dataflow operator controlled by a sequencer dataflow operator of the plurality of processing elements, and the plurality of processing elements is to perform an operation when an incoming operand set arrives at the plurality of processing elements and the sequencer dataflow operator generates control signals for the at least one dataflow operator in the plurality of processing elements.
    Type: Grant
    Filed: September 30, 2017
    Date of Patent: August 13, 2019
    Assignee: Intel Corporation
    Inventors: Jinjie Tang, Kermin E. Fleming, Simon C. Steely, Kent D. Glossop, Jim Sukha
  • Publication number: 20190205263
    Abstract: Systems, methods, and apparatuses relating to a configurable spatial accelerator are described. In one embodiment, a processor includes a core with a decoder to decode an instruction into a decoded instruction and an execution unit to execute the decoded instruction to perform a first operation; a plurality of processing elements; and an interconnect network between the plurality of processing elements to receive an input of a dataflow graph comprising a plurality of nodes, wherein the dataflow graph is to be overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements, and the plurality of processing elements are to perform a second operation by a respective, incoming operand set arriving at each of the dataflow operators of the plurality of processing elements.
    Type: Application
    Filed: December 30, 2017
    Publication date: July 4, 2019
    Inventors: Kermin E. Fleming, Kent D. Glossop, Simon C. Steely
  • Publication number: 20190102338
    Abstract: Systems, methods, and apparatuses relating to a sequencer dataflow operator of a configurable spatial accelerator are described. In one embodiment, an interconnect network between a plurality of processing elements receives an input of a dataflow graph comprising a plurality of nodes forming a loop construct, wherein the dataflow graph is overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements and at least one dataflow operator controlled by a sequencer dataflow operator of the plurality of processing elements, and the plurality of processing elements is to perform an operation when an incoming operand set arrives at the plurality of processing elements and the sequencer dataflow operator generates control signals for the at least one dataflow operator in the plurality of processing elements.
    Type: Application
    Filed: September 30, 2017
    Publication date: April 4, 2019
    Inventors: JINJIE TANG, KERMIN E. FLEMING, SIMON C. STEELY, KENT D. GLOSSOP, JIM SUKHA
  • Publication number: 20190102179
    Abstract: Methods and apparatuses relating to privileged configuration in spatial arrays are described. In one embodiment, a processor includes processing elements; an interconnect network between the processing elements; and a configuration controller coupled to a first subset and a second, different subset of the plurality of processing elements, the first subset having an output coupled to an input of the second, different subset, wherein the configuration controller is to configure the interconnect network between the first subset and the second, different subset of the plurality of processing elements to not allow communication on the interconnect network between the first subset and the second, different subset when a privilege bit is set to a first value and to allow communication on the interconnect network between the first subset and the second, different subset of the plurality of processing elements when the privilege bit is set to a second value.
    Type: Application
    Filed: September 30, 2017
    Publication date: April 4, 2019
    Inventors: KERMIN E. FLEMING, SIMON C. STEELY, KENT D. GLOSSOP
  • Publication number: 20190004994
    Abstract: Methods and apparatuses relating to pipelined runtime services in spatial arrays are described.
    Type: Application
    Filed: July 1, 2017
    Publication date: January 3, 2019
    Inventors: KERMIN FLEMING, SIMON C. STEELY, KENT D. GLOSSOP
  • Patent number: 9251073
    Abstract: A multi core processor implements a cash coherency protocol in which probe messages are address-ordered on a probe channel while responses are un-ordered on a response channel. When a first core generates a read of an address that misses in the first core's cache, a line fill is initiated. If a second core is writing the same address, the second core generates an update on the addressed ordered probe channel. The second core's update may arrive before or after the first core's line fill returns. If the update arrived before the fill returned, a mask is maintained to indicate which portions of the line were modified by the update so that the late arriving line fill only modifies portions of the line that were unaffected by the earlier-arriving update.
    Type: Grant
    Filed: December 31, 2012
    Date of Patent: February 2, 2016
    Assignee: Intel Corporation
    Inventors: Simon C. Steely, William C. Hasenplaugh
  • Publication number: 20080243268
    Abstract: According to one example embodiment of the inventive subject matter, there is provided a mechanism that controls which prefetchers are applied to execute an application in a computing system by turning them on and off. In one embodiment, this may be accomplished for example with a software control process that may run in the background. In another example embodiment, this may be accomplished using a hardware control machine, or a combination of hardware and software. The prefetchers are turned on and off in order to increase the performance of the computing system.
    Type: Application
    Filed: March 31, 2007
    Publication date: October 2, 2008
    Inventors: Meenakshi A. Kandaswamy, Simon C. Steely
  • Publication number: 20080235457
    Abstract: In one embodiment, the present invention includes a method for associating a first priority indicator with data stored in a first entry of a shared cache memory by a core to indicate a priority level of a first thread, and associating a second priority indicator with data stored in a second entry of the shared cache memory by a graphics engine to indicate a priority level of a second thread. Other embodiments are described and claimed.
    Type: Application
    Filed: March 21, 2007
    Publication date: September 25, 2008
    Inventors: William C. Hasenplaugh, Li Zhao, Ravishankar Iyer, Ramesh Illikkal, Srihari Makineni, Donald Newell, Aamer Jaleel, Simon C. Steely
  • Publication number: 20030076831
    Abstract: A technique efficiently combines data and ordered transactions in a multiprocessor system having a plurality of nodes interconnected by a hierarchical switch. The technique further enables an ordered channel of the system to make progress in the presence of a blocked interface within the hierarchical switch. Specifically, the technique combines ordered components and unordered data components into common packets that are transmitted over an ordered channel of the system in the event that ordered and unordered components are generated simultaneously. The technique further allows, in the event that a combined packet in the ordered channel is stalled due to a data buffer dependency, the packet to be decomposed into an ordered component and an unordered data component wherein the ordered component remains in the ordered channel and the unordered data component is reassigned to the unordered data channel.
    Type: Application
    Filed: March 21, 2001
    Publication date: April 24, 2003
    Inventors: Stephen R. Van Doren, Simon C. Steely, Madhumitra Sharma
  • Publication number: 20030037223
    Abstract: A method, for executing a load locked and a store conditional instruction in a processor, achieves an atomic read-write operation to a memory block. First the load locked instruction is executed to read a memory block, and the processor in response to executing the load locked instruction issues a read modify system command to read the block and to take ownership of the block by the processor, and also sets a lock flag for the address of the memory block, and writes a value of the memory block into a cache of the processor as a cache copy of the memory block. The lock flag, upon receipt of an invalidate message by the processor for the cache copy of the memory block, is reset if any invalidate messages for the memory block are received by the processor. The processor waits for a selected time interval before the processor surrenders ownership of the memory block upon receipt of an ownership request message, if any is received by the processor after execution of the load locked instruction.
    Type: Application
    Filed: August 20, 2001
    Publication date: February 20, 2003
    Inventors: Simon C. Steely, Stephen R. Van Doren, Madhumitra Sharma
  • Publication number: 20020194290
    Abstract: A multiple-processor system in which a commit message is returned to a source processor that requests a memory access operation so as to indicate the apparent completion of the operation includes a multiple-level switch unit linking nodes that contain the processors. The switch unit includes multiple input switches each of which receives messages from multiple nodes, and a set of output switches whose inputs are the outputs of the input switches and whose outputs are the inputs of the nodes. Each switch processes messages in the order in which they are received by the switch and each output switch follows the same rule as the other output switches.
    Type: Application
    Filed: April 26, 2001
    Publication date: December 19, 2002
    Inventors: Simon C. Steely, Madhumitra Sharma, Stephen R. Van Doren
  • Publication number: 20020152358
    Abstract: A performance enhancing change-to-dirty operation (CTD) is disclosed wherein contention among several processors trying to gain ownership of a block of data is obviated by arranging the CTD to always succeed. A method and a system are disclosed where a processor in a multiprocessor system having a copy of data gains assured ownership of data that the processor may then write. The method provides for the possibilities of conditions that may exist and provides a scenario that the requesting processor may have to wait for the ownership. Conditions are handled where the memory is the “owner” of the data and where other processor are requesting ownership, and where copies of the data exist at other processors. The method provides for messages to other processor having copies of the data informing them that the data is now invalid.
    Type: Application
    Filed: April 13, 2001
    Publication date: October 17, 2002
    Inventors: Simon C. Steely, Stephen R. Van Doren, Madhumitra Sharma
  • Publication number: 20020146022
    Abstract: A credit-based, flow control technique utilizes a plurality of counters to conserve resources of a switch fabric within a modular multiprocessor system while ensuring that transaction packets pending in virtual channel queues of the fabric efficiently progress through those resources. The multiprocessor system includes a plurality of nodes interconnected by the switch fabric that extends from a global input port of a node through a hierarchical switch to a global output port of the same or another node. The resources include shared buffers within the global ports and hierarchical switch. Each counter is associated with a virtual channel queue and the flow control technique uses the counters to essentially create the structure of the shared buffers.
    Type: Application
    Filed: April 9, 2001
    Publication date: October 10, 2002
    Inventors: Stephen R. Van Doren, Simon C. Steely, Madhumitra Sharma, Gregory E. Tierney