Patents by Inventor Simon C. Steely

Simon C. Steely has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20190004994
    Abstract: Methods and apparatuses relating to pipelined runtime services in spatial arrays are described.
    Type: Application
    Filed: July 1, 2017
    Publication date: January 3, 2019
    Inventors: Kermin Fleming, Simon C. Steely, Kent D. Glossop
  • Publication number: 20190007332
    Abstract: Systems, methods, and apparatuses relating to configurable network-based dataflow operator circuits are described. In one embodiment, a processor includes a spatial array of processing elements, and a packet switched communications network to route data within the spatial array between processing elements according to a dataflow graph to perform a first dataflow operation of the dataflow graph, wherein the packet switched communications network further comprises a plurality of network dataflow endpoint circuits to perform a second dataflow operation of the dataflow graph.
    Type: Application
    Filed: July 1, 2017
    Publication date: January 3, 2019
    Inventors: Kermin Fleming, Kent D. Glossop, Simon C. Steely, Jr.
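    Illustrative sketch: a minimal Python model of a network dataflow endpoint that performs a dataflow operation itself, here a simple "pick" merging two incoming channels onto one packet stream; the class, the queue names, and the "PE7" destination are invented for illustration and are not taken from the application.
      from collections import deque

      class NetworkDataflowEndpoint:
          """Endpoint on the packet-switched network that performs a 'pick' operation."""
          def __init__(self):
              self.channel_a = deque()
              self.channel_b = deque()
              self.control = deque()        # 0 -> take from channel A, 1 -> take from channel B
              self.out_packets = []

          def step(self):
              if not self.control:
                  return
              source = self.channel_a if self.control[0] == 0 else self.channel_b
              if source:                    # fire only when the selected operand is present
                  self.control.popleft()
                  self.out_packets.append({"dest": "PE7", "payload": source.popleft()})

      ep = NetworkDataflowEndpoint()
      ep.channel_a.extend([10, 20])
      ep.channel_b.append(99)
      ep.control.extend([0, 1, 0])
      for _ in range(3):
          ep.step()
      print(ep.out_packets)                 # payloads 10, 99, 20, routed as packets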
  • Publication number: 20190004945
    Abstract: Systems, methods, and apparatuses relating to a configurable spatial accelerator are described. In an embodiment, a processor includes a plurality of processing elements; and an interconnect network between the plurality of processing elements to receive an input of a dataflow graph comprising a plurality of nodes, wherein the dataflow graph is to be overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements, and the plurality of processing elements are to perform an atomic operation when an incoming operand set arrives at the plurality of processing elements.
    Type: Application
    Filed: July 1, 2017
    Publication date: January 3, 2019
    Inventors: Kermin Fleming, Kent D. Glossop, Simon C. Steely, Jr., Samantika S. Sury
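    Illustrative sketch: a minimal Python model of the firing rule described above, in which a dataflow operator performs its operation only once a complete operand set has arrived; the DataflowOperator class and the (a + b) * c example graph are invented for illustration and are not taken from the application.
      class DataflowOperator:
          def __init__(self, name, op, num_inputs):
              self.name = name
              self.op = op                                    # e.g. lambda a, b: a + b
              self.inputs = [[] for _ in range(num_inputs)]   # one buffer per input channel
              self.outputs = []                               # (consumer, port) pairs

          def receive(self, port, value):
              """Buffer an operand; fire once every input port holds a value."""
              self.inputs[port].append(value)
              if all(self.inputs):                            # complete operand set present
                  operands = [q.pop(0) for q in self.inputs]
                  result = self.op(*operands)
                  for consumer, consumer_port in self.outputs:
                      consumer.receive(consumer_port, result)
                  return result

      # Tiny dataflow graph computing (a + b) * c.
      add = DataflowOperator("add", lambda a, b: a + b, 2)
      mul = DataflowOperator("mul", lambda a, b: a * b, 2)
      add.outputs.append((mul, 0))

      add.receive(0, 3)          # nothing fires: operand set still incomplete
      add.receive(1, 4)          # add fires; 7 flows to mul's first input
      print(mul.receive(1, 5))   # mul fires -> 35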
  • Publication number: 20190005161
    Abstract: Systems, methods, and apparatuses relating to a configurable spatial accelerator are described. In one embodiment, a processor includes a plurality of processing elements; and an interconnect network between the plurality of processing elements to receive an input of a dataflow graph comprising a plurality of nodes, wherein the dataflow graph is to be overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements, and the plurality of processing elements is to perform an operation when an incoming operand set arrives at the plurality of processing elements. At least one of the plurality of processing elements includes a plurality of control inputs.
    Type: Application
    Filed: July 1, 2017
    Publication date: January 3, 2019
    Inventors: Kermin Fleming, Kent D. Glossop, Simon C. Steely, Jr., Ping Tak Peter Tang
  • Publication number: 20190004955
    Abstract: Systems, methods, and apparatuses relating to a configurable spatial accelerator are described. In one embodiment, a processor includes a plurality of processing elements; and an interconnect network between the plurality of processing elements to receive an input of a dataflow graph comprising a plurality of nodes, wherein the dataflow graph is to be overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements, and the plurality of processing elements is to perform an operation when an incoming operand set arrives at the plurality of processing elements. The processor also includes a streamer element to prefetch the incoming operand set from two or more levels of a memory system.
    Type: Application
    Filed: July 1, 2017
    Publication date: January 3, 2019
    Inventors: Michael C. Adler, Chiachen Chou, Neal C. Crago, Kermin Fleming, Kent D. Glossop, Aamer Jaleel, Pratik M. Marolia, Simon C. Steely, Jr., Samantika S. Sury
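    Illustrative sketch: a small Python model of the streamer element's role, prefetching an incoming operand set from an outer memory level into a near buffer ahead of consumption; the two dictionary "levels", the lookahead depth, and the function names are assumptions made only for illustration.
      DRAM = {addr: addr * 10 for addr in range(64)}    # outer memory level
      near_buffer = {}                                  # small level close to the processing elements

      def streamer_prefetch(upcoming_addrs, lookahead=4):
          """Pull the next few operand addresses into the near buffer before they are needed."""
          for addr in upcoming_addrs[:lookahead]:
              if addr not in near_buffer:
                  near_buffer[addr] = DRAM[addr]

      def consume(addr_stream):
          results = []
          for i, addr in enumerate(addr_stream):
              streamer_prefetch(addr_stream[i:])        # streamer runs ahead of the consumer
              results.append(near_buffer[addr])         # operand is already near when used
          return results

      print(consume([1, 2, 3, 4, 5]))                   # [10, 20, 30, 40, 50]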
  • Publication number: 20190004878
    Abstract: Systems, methods, and apparatuses relating to a configurable spatial accelerator are described. In one embodiment, a processor includes a plurality of processing elements; and an interconnect network between the plurality of processing elements to receive an input of two dataflow graphs each comprising a plurality of nodes, wherein a first dataflow graph and a second dataflow graph are to be overlaid into a first and second portion, respectively, of the interconnect network and a first and second subset, respectively, of the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements, and the first and second subsets of the plurality of processing elements are to perform a first and second operation, respectively, when incoming first and second, respectively, operand sets arrive at the plurality of processing elements.
    Type: Application
    Filed: July 1, 2017
    Publication date: January 3, 2019
    Inventors: Michael C. Adler, Kermin Fleming, Kent D. Glossop, Simon C. Steely, Jr.
  • Patent number: 10146690
    Abstract: In an embodiment, a processor includes a plurality of cores and synchronization logic. The synchronization logic includes circuitry to: receive a first memory request and a second memory request; determine whether the second memory request is in contention with the first memory request; and in response to a determination that the second memory request is in contention with the first memory request, process the second memory request using a non-blocking cache coherence protocol. Other embodiments are described and claimed.
    Type: Grant
    Filed: June 13, 2016
    Date of Patent: December 4, 2018
    Assignee: Intel Corporation
    Inventors: Samantika S. Sury, Robert G. Blankenship, Simon C. Steely, Jr.
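    Illustrative sketch: a toy Python model of the contention check described above, in which a second request to a line that already has an in-flight request is recorded and serviced later rather than blocking the requester; the dictionaries and core names are invented, and the real non-blocking cache coherence protocol is far more involved.
      in_flight = {}     # address -> core whose request is currently being serviced
      deferred = {}      # address -> cores whose contending requests were accepted without blocking

      def handle_request(core, addr):
          if addr in in_flight:                       # contention with the first request
              deferred.setdefault(addr, []).append(core)
              return "deferred (non-blocking)"
          in_flight[addr] = core
          return "issued"

      def complete_request(addr):
          """First request finishes; contending requesters are serviced in arrival order."""
          owner = in_flight.pop(addr)
          return owner, deferred.pop(addr, [])

      print(handle_request("core0", 0x80))            # issued
      print(handle_request("core1", 0x80))            # deferred (non-blocking)
      print(complete_request(0x80))                   # ('core0', ['core1'])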
  • Patent number: 10102124
    Abstract: A micro-architecture may provide hardware and software support for a high bandwidth write command. The micro-architecture may invoke a method to perform the high bandwidth write command. The method may comprise sending a write request from a requester to a record keeping structure. The write request may have a memory address of a memory that stores the requested data. The method may further comprise determining that copies of the requested data are present in a distributed cache system outside the memory, sending invalidation requests to the elements holding those copies, sending a notification to the requester to inform it of the presence of the copies, and sending a write response message after the latest value of the requested data and all invalidation acknowledgements have been received.
    Type: Grant
    Filed: December 28, 2011
    Date of Patent: October 16, 2018
    Assignee: Intel Corporation
    Inventors: Simon C. Steely, Jr., William C. Hasenplaugh, Joel S. Emer, Samantika Subramaniam
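    Illustrative sketch: the sequence of steps in the abstract above, modeled in Python with a dictionary standing in for the record keeping structure; the directory contents, the instantly returned acknowledgements, and the message strings are assumptions made for illustration, not the patented micro-architecture.
      directory = {0x40: {"owner": "core2", "sharers": {"core1", "core3"}}}   # record keeping structure

      def high_bandwidth_write(requester, addr, new_value):
          entry = directory.get(addr, {"owner": None, "sharers": set()})
          # Determine which elements hold copies of the requested data outside memory.
          copies = ({entry["owner"]} | entry["sharers"]) - {None, requester}
          acks_needed = len(copies)
          # Send invalidation requests to every element holding a copy (acks modeled as immediate).
          acks_received = {f"inval-ack:{c}" for c in copies}
          latest_value_received = True                 # the owner supplies the latest value
          # The write response is sent only after the latest value and all acks have arrived.
          if latest_value_received and len(acks_received) == acks_needed:
              directory[addr] = {"owner": requester, "sharers": set()}
              return f"write response to {requester}: value {new_value} committed, {acks_needed} copies invalidated"

      print(high_bandwidth_write("core0", 0x40, 123))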
  • Publication number: 20180188983
    Abstract: An integrated circuit includes a processor to execute instructions and to interact with memory, and acceleration hardware to execute a sub-program corresponding to the instructions. A set of input queues includes a store address queue to receive, from the acceleration hardware, a first address of the memory, the first address associated with a store operation, and a store data queue to receive, from the acceleration hardware, first data to be stored at the first address of the memory. The set of input queues also includes a completion queue to buffer response data for a load operation. A disambiguator circuit, coupled to the set of input queues and the memory, is to, responsive to determining that the load operation, which succeeds the store operation, has an address conflict with the first address, copy the first data from the store data queue into the completion queue for the load operation.
    Type: Application
    Filed: December 30, 2016
    Publication date: July 5, 2018
    Inventors: Kermin Elliott Fleming, Jr., Simon C. Steely, Jr., Kent D. Glossop
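    Illustrative sketch: a Python model of the disambiguator behavior described above, where a load whose address conflicts with a buffered store receives the store data directly in its completion queue; the deque-based queues and the example addresses are invented for illustration.
      from collections import deque

      store_addr_q = deque()     # store address queue
      store_data_q = deque()     # store data queue (same order as the addresses)
      completion_q = deque()     # buffered response data for loads

      memory = {0x10: 0}         # memory still holds the stale value

      def issue_store(addr, data):
          store_addr_q.append(addr)
          store_data_q.append(data)

      def issue_load(addr):
          # Disambiguation: does the load conflict with an older, still-buffered store?
          for i, st_addr in enumerate(store_addr_q):
              if st_addr == addr:
                  completion_q.append(store_data_q[i])   # copy store data into the completion queue
                  return
          completion_q.append(memory.get(addr, 0))       # no conflict: normal memory response

      issue_store(0x10, 42)        # store buffered, not yet written back
      issue_load(0x10)             # conflicting load is satisfied from the store data queue
      print(completion_q.pop())    # 42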
  • Publication number: 20180188997
    Abstract: An integrated circuit includes a memory interface, coupled to a memory to store data corresponding to instructions, and an operations queue to buffer memory operations corresponding to the instructions. The integrated circuit may include acceleration hardware to execute a sub-program corresponding to the instructions. A set of input queues may include an address queue to receive, from the acceleration hardware, an address of the memory associated with a second memory operation of the memory operations, and a dependency queue to receive, from the acceleration hardware, a dependency token associated with the address. The dependency token indicates a dependency on data generated by a first memory operation of the memory operations. A scheduler circuit may schedule issuance of the second memory operation to the memory in response to the dependency queue receiving the dependency token and the address queue receiving the address.
    Type: Application
    Filed: December 30, 2016
    Publication date: July 5, 2018
    Inventors: Kermin Elliott Fleming, Jr., Simon C. Steely, Jr., Kent D. Glossop
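    Illustrative sketch: a minimal Python model of the scheduling condition in the abstract above, issuing the second memory operation only once both its address and the dependency token from the first operation are present; the queue names and the token value are illustrative assumptions.
      from collections import deque

      address_q = deque()        # addresses arriving from the acceleration hardware
      dependency_q = deque()     # dependency tokens produced by earlier memory operations
      issued = []

      def try_schedule():
          """Issue only when the address AND the dependency token have both arrived."""
          if address_q and dependency_q:
              issued.append((address_q.popleft(), dependency_q.popleft()))

      address_q.append(0x200)
      try_schedule()                    # nothing issues: the dependency token is missing
      dependency_q.append("token-1")    # the first memory operation has completed
      try_schedule()                    # the second operation can now go to memory
      print(issued)                     # [(512, 'token-1')]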
  • Publication number: 20180189063
    Abstract: Systems, methods, and apparatuses relating to a configurable spatial accelerator are described. In one embodiment, a processor includes a core with a decoder to decode an instruction into a decoded instruction and an execution unit to execute the decoded instruction to perform a first operation; a plurality of processing elements; and an interconnect network between the plurality of processing elements to receive an input of a dataflow graph comprising a plurality of nodes, wherein the dataflow graph is to be overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements, and the plurality of processing elements are to perform a second operation by a respective, incoming operand set arriving at each of the dataflow operators of the plurality of processing elements.
    Type: Application
    Filed: December 30, 2016
    Publication date: July 5, 2018
    Inventors: Kermin Fleming, Kent D. Glossop, Simon C. Steely, Jr.
  • Publication number: 20180189231
    Abstract: Systems, methods, and apparatuses relating to a configurable spatial accelerator are described. In one embodiment, a processor includes a core with a decoder to decode an instruction into a decoded instruction and an execution unit to execute the decoded instruction to perform a first operation; a plurality of processing elements; and an interconnect network between the plurality of processing elements to receive an input of a dataflow graph comprising a plurality of nodes, wherein the dataflow graph is to be overlaid into the interconnect network and the plurality of processing elements with each node represented as a dataflow operator in the plurality of processing elements, and the plurality of processing elements is to perform a second operation when an incoming operand set arrives at the plurality of processing elements.
    Type: Application
    Filed: December 30, 2016
    Publication date: July 5, 2018
    Inventors: Kermin E. Fleming, Jr., Kent D. Glossop, Simon C. Steely, Jr., Jinjie Tang, Alan G. Gara
  • Publication number: 20180113838
    Abstract: Embodiments relate to a computational device including multiple processor tiles on a die that may have multiple switchable topologies. A topology of the computational device may include one or more virtual circuits. A virtual circuit may include multiple processor tiles. A processor tile of a virtual circuit of a topology may include a configuration vector to control a connection between the processor tile and a neighboring processor tile. A first topology of the computational device may correspond to a first phase of a computation of a program, and a second topology of the computational device may correspond to a second phase of the computation of the program. Other embodiments may be described and/or claimed.
    Type: Application
    Filed: June 29, 2017
    Publication date: April 26, 2018
    Inventors: William J. Butera, Simon C. Steely, Jr., Richard J. Dischler
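    Illustrative sketch: a Python model in which each processor tile's configuration vector is four flags enabling the link to its north/east/south/west neighbor, and loading a second set of vectors switches the die to a different topology for a different phase of the program; the grid coordinates and both example topologies are invented for illustration.
      PHASE_1 = {   # first phase: a straight horizontal pipeline of three tiles
          (0, 0): {"N": 0, "E": 1, "S": 0, "W": 0},
          (0, 1): {"N": 0, "E": 1, "S": 0, "W": 1},
          (0, 2): {"N": 0, "E": 0, "S": 0, "W": 1},
      }
      PHASE_2 = {   # second phase: the middle tile instead talks to the tile below it
          (0, 0): {"N": 0, "E": 1, "S": 0, "W": 0},
          (0, 1): {"N": 0, "E": 0, "S": 1, "W": 1},
          (1, 1): {"N": 1, "E": 0, "S": 0, "W": 0},
      }

      def links(config):
          """Return the tile-to-tile connections implied by the configuration vectors."""
          offsets = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
          pairs = set()
          for (r, c), vector in config.items():
              for direction, enabled in vector.items():
                  if enabled:
                      dr, dc = offsets[direction]
                      pairs.add(frozenset({(r, c), (r + dr, c + dc)}))
          return pairs

      print("phase 1 links:", sorted(tuple(sorted(p)) for p in links(PHASE_1)))
      print("phase 2 links:", sorted(tuple(sorted(p)) for p in links(PHASE_2)))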
  • Publication number: 20180095728
    Abstract: A floating point multiply-add unit has inputs coupled to receive a floating point multiplier data element, a floating point multiplicand data element, and a floating point addend data element. The multiply-add unit includes a mantissa multiplier to multiply a mantissa of the multiplier data element and a mantissa of the multiplicand data element to calculate a mantissa product. The mantissa multiplier includes a most significant bit portion to calculate most significant bits of the mantissa product, and a least significant bit portion to calculate least significant bits of the mantissa product. The mantissa multiplier has a plurality of different possible sizes of the least significant bit portion. Energy consumption reduction logic is to selectively reduce energy consumption of the least significant bit portion, but not the most significant bit portion, to cause the least significant bit portion not to calculate the least significant bits of the mantissa product.
    Type: Application
    Filed: October 1, 2016
    Publication date: April 5, 2018
    Applicant: Intel Corporation
    Inventors: William C. Hasenplaugh, Kermin E. Fleming, Jr., Tryggve Fossum, Simon C. Steely, Jr.
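    Illustrative sketch: a numerical Python illustration (not the hardware) of what the mantissa multiplier produces when its least significant bit portion is power-gated and the low-order product bits are simply not calculated; the mantissa width, the gated width, and the example operands are arbitrary choices made for illustration.
      MANT_BITS = 24            # single-precision-style mantissa width (hidden bit included)
      LSB_PORTION = 12          # low-order product bits the gated portion would have produced

      def mantissa_product(m1, m2, gate_lsb_portion=False):
          full = m1 * m2                               # 48-bit product of two 24-bit mantissas
          if gate_lsb_portion:
              return full & ~((1 << LSB_PORTION) - 1)  # low bits never computed, treated as zero
          return full

      a = 0b101100110011001100110011
      b = 0b110011001100110011001101
      exact = mantissa_product(a, b)
      gated = mantissa_product(a, b, gate_lsb_portion=True)
      print(hex(exact), hex(gated))
      print("error confined to the low bits:", 0 <= exact - gated < (1 << LSB_PORTION))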
  • Publication number: 20180095756
    Abstract: A processor of an aspect includes a plurality of packed data registers, and a decode unit to decode an instruction. The instruction is to indicate a packed data register of the plurality of packed data registers that is to store source packed memory address information. The source packed memory address information is to include a plurality of memory address information data elements. An execution unit is coupled with the decode unit and the plurality of packed data registers. The execution unit, in response to the instruction, is to load a plurality of data elements from a plurality of memory addresses that are each to correspond to a different one of the plurality of memory address information data elements, and store the plurality of loaded data elements in a destination storage location. The destination storage location does not include a register of the plurality of packed data registers.
    Type: Application
    Filed: September 30, 2016
    Publication date: April 5, 2018
    Applicant: Intel Corporation
    Inventors: William C. Hasenplaugh, Chris J. Newburn, Simon C. Steely, Jr., Samantika S. Sury
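    Illustrative sketch: a behavioral Python model of the instruction's effect as described above, with a list standing in for the packed register of address elements and a plain buffer standing in for the non-register destination; the addresses and data values are invented for illustration.
      memory = {0x100: 11, 0x108: 22, 0x110: 33, 0x118: 44}

      packed_addr_reg = [0x110, 0x100, 0x118, 0x108]   # source packed memory address information

      def packed_load_to_buffer(address_elements, destination):
          """Load one data element per address element into a destination that is not a packed register."""
          for i, addr in enumerate(address_elements):
              destination[i] = memory[addr]

      destination_buffer = [0] * 4                     # e.g. a dedicated buffer, not a packed data register
      packed_load_to_buffer(packed_addr_reg, destination_buffer)
      print(destination_buffer)                        # [33, 11, 44, 22]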
  • Patent number: 9934146
    Abstract: Methods and apparatuses to control cache line coherency are described. A processor may include a first core having a cache to store a cache line, a second core to send a request for the cache line from the first core, moving logic to cause a move of the cache line between the first core and a memory and to update a tag directory of the move, and cache line coherency logic to create a chain home in the tag directory from the request to cause the cache line to be sent from the tag directory to the second core. A method to control cache line coherency may include creating a chain home in a tag directory from a request for a cache line in a first processor core from a second processor core to cause the cache line to be sent from the tag directory to the second processor core.
    Type: Grant
    Filed: September 26, 2014
    Date of Patent: April 3, 2018
    Assignee: Intel Corporation
    Inventors: Simon C. Steely, Jr., Samantika S. Sury, William C. Hasenplaugh
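    Illustrative sketch: a toy Python model of a tag directory entry that keeps a "chain" of waiting cores, so a cache line that is being moved can be forwarded from the directory to the next requester as soon as it arrives; the entry layout and core names are assumptions made for illustration.
      tag_directory = {}     # line address -> {"location": core or "memory", "chain": [waiting cores]}

      def request_line(addr, requester):
          entry = tag_directory.setdefault(addr, {"location": "memory", "chain": []})
          entry["chain"].append(requester)          # create/extend the chain home for this line

      def line_arrives_at_directory(addr):
          """The moved line reaches the directory and is sent on to the next core in the chain."""
          entry = tag_directory[addr]
          if entry["chain"]:
              entry["location"] = entry["chain"].pop(0)
          return entry["location"]

      request_line(0x1000, "core1")                 # core1 asks while another core still holds the line
      request_line(0x1000, "core2")                 # core2 queues behind core1 in the chain
      print(line_arrives_at_directory(0x1000))      # core1
      print(line_arrives_at_directory(0x1000))      # core2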
  • Patent number: 9898408
    Abstract: An apparatus and method are described for a sharing aware snoop filter. For example, one embodiment of a processor comprises: a plurality of caches, each of the caches comprising a plurality of cache lines, at least some of which are to be shared by two or more of the caches; a snoop filter to monitor accesses to the plurality of cache lines shared by the two or more caches, the snoop filter comprising: a primary snoop filter comprising a first plurality of entries, each entry associated with one of the plurality of cache lines and comprising N unique identifiers to uniquely identify up to N of the plurality of caches currently storing the cache line; an auxiliary snoop filter comprising a second plurality of entries, each entry associated with one of the plurality of cache lines, wherein once a particular cache line has been shared by more than N caches, an entry for that cache line is allocated in the auxiliary snoop filter to uniquely identify one or more additional caches storing the cache line.
    Type: Grant
    Filed: April 1, 2016
    Date of Patent: February 20, 2018
    Assignee: Intel Corporation
    Inventors: Samantika S. Sury, Robert G. Blankenship, Simon C. Steely, Jr.
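    Illustrative sketch: a Python model of the primary/auxiliary split described above, where a primary entry tracks up to N sharers per cache line and an auxiliary entry is allocated only once the line is shared more widely; the value of N, the cache names, and the list-based entries are illustrative assumptions.
      N = 2                    # sharer identifiers a primary entry can hold

      primary = {}             # cache line -> up to N sharer identifiers
      auxiliary = {}           # cache line -> additional sharers beyond the first N

      def record_sharer(line, cache_id):
          sharers = primary.setdefault(line, [])
          if cache_id in sharers or cache_id in auxiliary.get(line, []):
              return
          if len(sharers) < N:
              sharers.append(cache_id)
          else:                # widely shared line: allocate/extend the auxiliary entry
              auxiliary.setdefault(line, []).append(cache_id)

      def caches_to_snoop(line):
          return primary.get(line, []) + auxiliary.get(line, [])

      for cache in ["L1-0", "L1-1", "L1-2", "L1-3"]:
          record_sharer(0x2000, cache)

      print(primary[0x2000])         # ['L1-0', 'L1-1']
      print(auxiliary[0x2000])       # ['L1-2', 'L1-3']
      print(caches_to_snoop(0x2000)) # every cache that must be snooped on a write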
  • Publication number: 20180004510
    Abstract: A processor of an aspect includes a decode unit to decode a matrix multiplication instruction. The matrix multiplication instruction is to indicate a first memory location of a first source matrix, is to indicate a second memory location of a second source matrix, and is to indicate a third memory location where a result matrix is to be stored. The processor also includes an execution unit coupled with the decode unit. The execution unit, in response to the matrix multiplication instruction, is to multiply a portion of the first and second source matrices prior to an interruption, and store a completion progress indicator in response to the interruption. The completion progress indicator is to indicate an amount of progress in multiplying the first and second source matrices, and storing corresponding result data to the third memory location, that is to have been completed prior to the interruption.
    Type: Application
    Filed: July 2, 2016
    Publication date: January 4, 2018
    Applicant: Intel Corporation
    Inventors: Edward T. Grochowski, Asit K. Mishra, Robert Valentine, Mark J. Charney, Simon C. Steely, Jr.
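    Illustrative sketch: a software analogue in Python of an interruptible matrix multiplication that records a completion progress indicator and resumes from it instead of starting over; the row-granular progress, the interrupt_after parameter, and the example matrices are assumptions made for illustration.
      def matmul_resumable(A, B, C, progress=0, interrupt_after=None):
          """Multiply A (m x k) by B (k x n) into C, row by row, starting at row 'progress'.
          Returns the new progress indicator (== number of rows of A when complete)."""
          m, k, n = len(A), len(B), len(B[0])
          for i in range(progress, m):
              if interrupt_after is not None and i >= interrupt_after:
                  return i                              # progress indicator stored on interruption
              for j in range(n):
                  C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
          return m

      A = [[1, 2], [3, 4], [5, 6]]
      B = [[7, 8], [9, 10]]
      C = [[0, 0] for _ in range(3)]

      progress = matmul_resumable(A, B, C, interrupt_after=1)   # interrupted after one row
      print("completion progress indicator:", progress)         # 1
      matmul_resumable(A, B, C, progress=progress)               # resume where it left off
      print(C)                                                   # [[25, 28], [57, 64], [89, 100]]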
  • Publication number: 20170357586
    Abstract: In an embodiment, a processor includes a plurality of cores and synchronization logic. The synchronization logic includes circuitry to: receive a first memory request and a second memory request; determine whether the second memory request is in contention with the first memory request; and in response to a determination that the second memory request is in contention with the first memory request, process the second memory request using a non-blocking cache coherence protocol. Other embodiments are described and claimed.
    Type: Application
    Filed: June 13, 2016
    Publication date: December 14, 2017
    Inventors: Samantika S. Sury, Robert G. Blankenship, Simon C. Steely, Jr.
  • Publication number: 20170351430
    Abstract: Systems, methods, and apparatuses are directed to requesting access to a memory address; storing an identification of the memory address in a data structure; receiving a first request for access to the memory address, the request comprising a reference to a second processor core; storing the reference to the second processor in the data structure; receiving a second request for access to the memory address, the second request comprising a reference to a third processor core; determining, based on the data structure, that the third processor core is different from the second processor core; and responding to the second request without buffering the second request.
    Type: Application
    Filed: June 1, 2016
    Publication date: December 7, 2017
    Inventors: Robert G. Blankenship, Simon C. Steely, Jr., Samantika S. Sury
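    Illustrative sketch: a loose Python reading of the flow in the abstract above, assuming the data structure records which processor core the previous request referenced so that a later request referencing a different core can be answered immediately rather than buffered; this interpretation, the dictionaries, and the core names are all assumptions made for illustration.
      tracked = {}      # memory address -> processor core referenced by the stored request
      buffered = []     # requests held back
      answered = []     # requests responded to without buffering

      def handle_request(addr, referenced_core):
          previous_core = tracked.get(addr)
          if previous_core is not None and referenced_core != previous_core:
              answered.append((addr, referenced_core))   # different core: respond, do not buffer
          else:
              buffered.append((addr, referenced_core))
          tracked[addr] = referenced_core

      handle_request(0x300, "core2")    # first request: reference to core2 is stored
      handle_request(0x300, "core3")    # core3 differs from core2: answered without buffering
      print(buffered)                   # [(768, 'core2')]
      print(answered)                   # [(768, 'core3')]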