Patents by Inventor Christopher J. Hughes
Christopher J. Hughes has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12287843
Abstract: Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix, and to execute the decoded instruction as per the opcode to generate NZ bitmasks for the first and second matrices, broadcast up to two NZ elements at a time from each row of the first matrix and each column of the second matrix to a processing engine (PE) grid, each PE to multiply and accumulate matching NZ elements of the first and second matrices with corresponding elements of the third matrix. Each PE is further to store an NZ element for use in subsequent multiplications.
Type: Grant
Filed: November 6, 2023
Date of Patent: April 29, 2025
Assignee: Intel Corporation
Inventors: Dan Baum, Chen Koren, Elmoustapha Ould-Ahmed-Vall, Michael Espig, Christopher J. Hughes, Raanan Sade, Robert Valentine, Mark J. Charney, Alexander F. Heinecke
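The multiply-accumulate semantics described in this abstract can be sketched in plain Python. This is an illustrative model only, not the patented hardware, and the name `sparse_mma` is invented for the sketch:

```python
def sparse_mma(a, b, c):
    """Multiply and accumulate matching non-zero (NZ) elements of A and B
    with corresponding elements of C. A: M x K, B: K x N, C: M x N."""
    m, k, n = len(a), len(b), len(b[0])
    out = [row[:] for row in c]
    # NZ bitmasks: hardware uses these to skip products that are zero anyway.
    a_nz = [[x != 0 for x in row] for row in a]
    b_nz = [[x != 0 for x in row] for row in b]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                if a_nz[i][p] and b_nz[p][j]:  # matching NZ pair
                    out[i][j] += a[i][p] * b[p][j]
    return out
```

Because zero products contribute nothing, the result equals C + A·B; the NZ bitmasks only let the hardware skip the wasted multiplications.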
-
Patent number: 12277419
Abstract: Systems, methods, and apparatuses relating to instructions to convert 16-bit floating-point formats are described. In one embodiment, a processor includes fetch circuitry to fetch a single instruction having fields to specify an opcode and locations of a source vector comprising N plurality of 16-bit half-precision floating-point elements, and a destination vector to store N plurality of 16-bit bfloat floating-point elements, the opcode to indicate execution circuitry is to convert each of the elements of the source vector from 16-bit half-precision floating-point format to 16-bit bfloat floating-point format and store each converted element into a corresponding location of the destination vector, decode circuitry to decode the fetched single instruction into a decoded single instruction, and the execution circuitry to respond to the decoded single instruction as specified by the opcode.
Type: Grant
Filed: December 24, 2020
Date of Patent: April 15, 2025
Assignee: Intel Corporation
Inventors: Alexander F. Heinecke, Robert Valentine, Mark J. Charney, Menachem Adelman, Christopher J. Hughes, Evangelos Georganas, Zeev Sperber, Amit Gradstein, Simon Rubanovich
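The format conversion can be modeled with stdlib bit manipulation. A minimal sketch, assuming Python floats stand in for the half-precision source elements (every FP16 value is exactly representable in FP32, so the conversion reduces to rounding the FP32 encoding to bfloat16), and assuming round-to-nearest-even on the discarded bits; the function names are invented:

```python
import struct

def to_bfloat16_bits(x):
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    # bfloat16 is the top 16 bits of float32; round-to-nearest-even
    # on the discarded lower half before truncating.
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)
    return (rounded >> 16) & 0xFFFF

def convert_vector(src):
    # Element-wise conversion of a source vector, one converted element
    # stored per corresponding destination location.
    return [to_bfloat16_bits(v) for v in src]
```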
-
Patent number: 12265826
Abstract: Disclosed embodiments relate to systems for performing instructions to quickly convert and use matrices (tiles) as one-dimensional vectors. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode, locations of a two-dimensional (2D) matrix and a one-dimensional (1D) vector, and a group of elements comprising one of a row, part of a row, multiple rows, a column, part of a column, multiple columns, and a rectangular sub-tile of the specified 2D matrix, and wherein the opcode is to indicate a move of the specified group between the 2D matrix and the 1D vector; decode circuitry to decode the fetched instruction; and execution circuitry, responsive to the decoded instruction, when the opcode specifies a move from 1D, to move contents of the specified 1D vector to the specified group of elements.
Type: Grant
Filed: December 28, 2023
Date of Patent: April 1, 2025
Assignee: Intel Corporation
Inventors: Bret Toll, Christopher J. Hughes, Dan Baum, Elmoustapha Ould-Ahmed-Vall, Raanan Sade, Robert Valentine, Mark J. Charney, Alexander F. Heinecke
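A plain-Python sketch of the move semantics, with hypothetical helper names; the "group" here is a rectangular sub-tile given by row and column ranges, one of the shapes the opcode can select:

```python
def move_1d_to_group(tile, vec, rows, cols):
    """Move contents of a 1D vector into the specified group of elements
    of a 2D matrix (tile), filling the group in row-major order."""
    it = iter(vec)
    for r in rows:
        for c in cols:
            tile[r][c] = next(it)
    return tile

def move_group_to_1d(tile, rows, cols):
    # Opposite direction: flatten the specified group into a 1D vector.
    return [tile[r][c] for r in rows for c in cols]
```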
-
Publication number: 20250103343
Abstract: Embodiments described herein provide an apparatus comprising a plurality of processing resources including a first processing resource and a second processing resource, a memory communicatively coupled to the first processing resource and the second processing resource, and a processor to receive data dependencies for one or more tasks comprising one or more producer tasks executing on the first processing resource and one or more consumer tasks executing on the second processing resource and move a data output from one or more producer tasks executing on the first processing resource to a cache memory communicatively coupled to the second processing resource. Other embodiments may be described and claimed.
Type: Application
Filed: November 21, 2024
Publication date: March 27, 2025
Applicant: INTEL CORPORATION
Inventors: Christopher J. HUGHES, Prasoonkumar SURTI, Guei-Yuan LUEH, Adam T. LAKE, Jill BOYCE, Subramaniam MAIYURAN, Lidong XU, James M. HOLLAND, Vasanth RANGANATHAN, Nikos KABURLASOS, Altug KOKER, Abhishek R. Appu
-
Publication number: 20250103337
Abstract: An apparatus and method for partitioned shuffling of data elements. A first partition is associated with a first number of source data elements corresponding to a first plurality of lanes having a first plurality of lane identifiers (IDs), and a second partition is associated with a second number of source data elements corresponding to a second plurality of lanes having a second plurality of lane IDs. A bounded offset vector is generated based on allowable ranges for a plurality of offset values associated with the source data elements. An index vector is generated by permuting the first and second plurality of lane IDs in accordance with the bounded offset vector.
Type: Application
Filed: September 27, 2023
Publication date: March 27, 2025
Inventors: Simon PENNYCOOK, Christopher J. HUGHES
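A sketch of the described shuffle in Python. Several details are assumptions made for illustration: offsets are bounded by reducing them modulo the partition size (so every shuffled lane stays inside its own partition), and partitions occupy contiguous lane ranges:

```python
def partitioned_shuffle(lane_ids, partition_sizes, offsets):
    """Permute lane IDs within each partition according to per-lane
    offsets, bounding each offset to the partition's allowable range."""
    out, base, k = [], 0, 0
    for size in partition_sizes:
        part = lane_ids[base:base + size]
        for i in range(size):
            bounded = offsets[k] % size          # bound offset to allowable range
            out.append(part[(i + bounded) % size])  # permuted lane ID
            k += 1
        base += size
    return out
```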
-
Publication number: 20250068424
Abstract: Methods, apparatus, and computer programs are disclosed for data/instruction access based on performance hints. In some embodiments, a method comprises decoding an instruction to access data or code by a core of a computer processor, the instruction to provide one or more hints on how the data or code is to be processed through a cache hierarchy of the computer processor based on the instruction, the one or more hints indicating which level of the cache hierarchy or which cache in a level of the cache hierarchy to load or store the data or code, a priority of the data or code in a cache, or how the data or code is to be shared among multiple cores of the computer processor. The method further comprises processing the data or code based on the one or more hints responsive to the decoded instruction.
Type: Application
Filed: November 8, 2024
Publication date: February 27, 2025
Inventors: Duane GALBI, Christopher J. HUGHES, Dan BAUM
-
Publication number: 20250068422
Abstract: Methods, apparatus, and computer programs are disclosed for context switching. In some embodiments, a method comprises dedicating a first subset of a plurality of vector registers to a first thread of a plurality of threads for thread execution; and responsive to a context switch from the first thread to a second thread, bypassing saving a state of the first subset of the plurality of vector registers and saving a state of a second subset of the plurality of vector registers, wherein the second subset of the plurality of vector registers is not dedicated to the first thread, and wherein the first and second subsets are mutually exclusive.
Type: Application
Filed: November 8, 2024
Publication date: February 27, 2025
Inventors: Duane GALBI, Christopher J. HUGHES, Dan BAUM, H. Peter ANVIN, Stijn EYERMAN
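The save-bypass idea can be sketched in a few lines of Python; register names and the dict-based model are invented for illustration:

```python
def save_on_context_switch(vector_regs, dedicated_to_thread):
    """On a context switch away from a thread, skip saving the vector
    registers dedicated to that thread; save only the non-dedicated
    (mutually exclusive) subset."""
    saved = {}
    for name, value in vector_regs.items():
        if name in dedicated_to_thread:
            continue            # dedicated registers: state stays in place
        saved[name] = value     # shared registers must be saved
    return saved
```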
-
Patent number: 12216579
Abstract: Disclosed embodiments relate to atomic memory operations. In one example, an apparatus includes multiple processor cores, a cache hierarchy, a local execution unit, a remote execution unit, and an adaptive remote atomic operation unit. The cache hierarchy includes a local cache at a first level and a shared cache at a second level. The local execution unit is to perform an atomic operation at the first level if the local cache is storing a cache line including data for the atomic operation. The remote execution unit is to perform the atomic operation at the second level. The adaptive remote atomic operation unit is to determine whether to perform the atomic operation at the first level or at the second level, and whether to copy the cache line from the shared cache to the local cache.
Type: Grant
Filed: December 25, 2020
Date of Patent: February 4, 2025
Assignee: Intel Corporation
Inventors: Carl J. Beckmann, Samantika S. Sury, Christopher J. Hughes, Lingxiang Xiang, Rahul Agrawal
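One way to sketch the adaptive decision in Python. The promotion counter and threshold are invented for illustration; the abstract leaves the exact adaptation policy open:

```python
COPY_THRESHOLD = 2  # hypothetical "hot line" threshold

def adaptive_remote_atomic(local_cache, shared_cache, hot_counts, addr, op):
    """Perform the atomic locally when the local cache holds the line,
    otherwise remotely at the shared cache; a simple per-line access
    counter decides whether to copy the line into the local cache."""
    if addr in local_cache:                       # execute at the first level
        local_cache[addr] = op(local_cache[addr])
        return 'local'
    shared_cache[addr] = op(shared_cache[addr])   # execute at the second level
    hot_counts[addr] = hot_counts.get(addr, 0) + 1
    if hot_counts[addr] >= COPY_THRESHOLD:        # adaptively promote the line
        local_cache[addr] = shared_cache[addr]
    return 'remote'
```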
-
Patent number: 12210446
Abstract: An embodiment of an integrated circuit may comprise circuitry communicatively coupled to two or more sub-non-uniform memory access clusters (SNCs) to allocate a specified memory space in the two or more SNCs in accordance with a SNC memory allocation policy indicated by a request to initialize the specified memory space. An embodiment of an apparatus may comprise decode circuitry to decode a single instruction, the single instruction to include a field for an opcode, and execution circuitry to execute the decoded instruction according to the opcode to provide an indicated SNC memory allocation policy (e.g., a SNC policy hint). Other embodiments are disclosed and claimed.
Type: Grant
Filed: June 21, 2021
Date of Patent: January 28, 2025
Assignee: Intel Corporation
Inventors: Zhe Wang, Lingxiang Xiang, Christopher J. Hughes
-
Patent number: 12190118
Abstract: Embodiments described herein provide an apparatus comprising a plurality of processing resources including a first processing resource and a second processing resource, a memory communicatively coupled to the first processing resource and the second processing resource, and a processor to receive data dependencies for one or more tasks comprising one or more producer tasks executing on the first processing resource and one or more consumer tasks executing on the second processing resource and move a data output from one or more producer tasks executing on the first processing resource to a cache memory communicatively coupled to the second processing resource. Other embodiments may be described and claimed.
Type: Grant
Filed: June 22, 2023
Date of Patent: January 7, 2025
Assignee: INTEL CORPORATION
Inventors: Christopher J. Hughes, Prasoonkumar Surti, Guei-Yuan Lueh, Adam T. Lake, Jill Boyce, Subramaniam Maiyuran, Lidong Xu, James M. Holland, Vasanth Ranganathan, Nikos Kaburlasos, Altug Koker, Abhishek R. Appu
-
Publication number: 20250004773
Abstract: An apparatus and method are described for prefetching data with hints. For example, one embodiment of a processor comprises: a plurality of cores to process instructions; a first core of the plurality of cores comprising: decoder circuitry to decode instructions indicating memory operations including load operations of a first type with shared data hints and load operations of a second type without shared data hints; execution circuitry to execute the instructions to perform the memory operations; data prefetch circuitry to store tracking data in a tracking data structure responsive to the memory operations, a portion of the tracking data associated with the first type of load operations; and the data prefetch circuitry to detect memory access patterns using the tracking data, the data prefetch circuitry to responsively issue one or more prefetch operations using shared data hints based, at least in part, on the portion of the tracking data associated with the first type of load operations.
Type: Application
Filed: June 30, 2023
Publication date: January 2, 2025
Inventors: Christopher J. HUGHES, Zhe WANG, Dan BAUM, Venkateswara Rao MADDURI, Chen DAN, Joseph NUZMAN
-
Publication number: 20250004764
Abstract: Techniques for providing operands of 512 bits or smaller are described. In some examples, a prefix of an instruction is utilized to define the operand (vector) length. For example, an instruction is to at least include fields for a prefix, an opcode, and operand addressing information, wherein the prefix and addressing information are to be used by decoder circuitry to determine support for a particular vector length for one or more operands of the instance of the single instruction, and the opcode is to indicate one or more operations to perform on the one or more operands.
Type: Application
Filed: July 1, 2023
Publication date: January 2, 2025
Inventors: Michael ESPIG, Menachem ADELMAN, Jonathan COMBS, Amit GRADSTEIN, Christopher J. HUGHES, Vivekananthan SANJEEPAN, Wing Shek WONG
-
Publication number: 20250004765
Abstract: Techniques are described for loading data with a hint related to data sharing with other cores. For example, one embodiment of an apparatus comprises: a plurality of cores to process instructions; a first core of the plurality of cores comprising: decoder circuitry to decode a single instruction, the single instruction having a first field for an opcode to indicate a load operation to read data from a memory, a second field to indicate a memory address for a location of the data in the memory, and a third field to store a value to indicate whether the data is expected to be shared between the first core and at least a second core of the plurality of cores; execution circuitry to execute the single instruction to read the data from the location in the memory; and cache controller circuitry to store the data in one or more caches in a state selected based on the value.
Type: Application
Filed: June 30, 2023
Publication date: January 2, 2025
Inventors: Christopher J. HUGHES, Zhe WANG, Dan BAUM, Venkateswara Rao MADDURI, Alexander HEINECKE, Evangelos GEORGANAS, Chen DAN, Joseph NUZMAN
-
Patent number: 12175246
Abstract: Disclosed embodiments relate to matrix compress/decompress instructions. In one example, a processor includes fetch circuitry to fetch a compress instruction having a format with fields to specify an opcode and locations of decompressed source and compressed destination matrices; decode circuitry to decode the fetched compress instruction; and execution circuitry, responsive to the decoded compress instruction, to: generate a compressed result according to a compress algorithm by compressing the specified decompressed source matrix by either packing non-zero-valued elements together and storing the matrix position of each non-zero-valued element in a header, or using fewer bits to represent one or more elements and using the header to identify matrix elements being represented by fewer bits; and store the compressed result to the specified compressed destination matrix.
Type: Grant
Filed: September 1, 2023
Date of Patent: December 24, 2024
Assignee: Intel Corporation
Inventors: Dan Baum, Michael Espig, James Guilford, Wajdi K. Feghali, Raanan Sade, Christopher J. Hughes, Robert Valentine, Bret Toll, Elmoustapha Ould-Ahmed-Vall, Mark J. Charney, Vinodh Gopal, Ronen Zohar, Alexander F. Heinecke
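The first compress variant (packing non-zero elements together and recording each element's position in a header) is essentially a coordinate-list sparse format. A sketch, with invented function names, of that variant and its inverse:

```python
def compress_matrix(mat):
    """Pack non-zero elements together; the header records the matrix
    position of each packed element."""
    header, packed = [], []
    for i, row in enumerate(mat):
        for j, v in enumerate(row):
            if v != 0:
                header.append((i, j))  # position of this non-zero element
                packed.append(v)
    return header, packed

def decompress_matrix(header, packed, shape):
    # Rebuild the dense matrix: zeros everywhere except recorded positions.
    rows, cols = shape
    mat = [[0] * cols for _ in range(rows)]
    for (i, j), v in zip(header, packed):
        mat[i][j] = v
    return mat
```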
-
Publication number: 20240362021
Abstract: Disclosed embodiments relate to atomic memory operations. In one example, a method of executing an instruction atomically and with weak order includes: fetching, by fetch circuitry, the instruction from code storage, the instruction including an opcode, a source identifier, and a destination identifier; decoding, by decode circuitry, the fetched instruction; selecting, by a scheduling circuit, an execution circuit among multiple circuits in a system; scheduling, by the scheduling circuit, execution of the decoded instruction out of order with respect to other instructions, with an order selected to optimize at least one of latency, throughput, power, and performance; and executing the decoded instruction, by the execution circuit, to: atomically read a datum from a location identified by the destination identifier, perform an operation on the datum as specified by the opcode, the operation to use a source operand identified by the source identifier, and write a result back to the location.
Type: Application
Filed: May 21, 2024
Publication date: October 31, 2024
Inventors: Doddaballapur N. Jayasimha, Jonas Svennebring, Samantika S. Sury, Christopher J. Hughes, Jong Soo Park, Lingxiang Xiang
-
Patent number: 12130740
Abstract: Embodiments of a processor architecture are disclosed. In an embodiment, a processor includes a decoder, an execution unit, a coherent cache, and an interconnect. The decoder is to decode an instruction to zero a cache line. The execution unit is to issue a write command to initiate a cache line sized write of zeros. The coherent cache is to receive the write command, to determine whether there is a hit in the coherent cache and whether a cache coherency protocol state of the hit cache line is a modified state or an exclusive state, to configure a cache line to indicate all zeros, and to issue the write command toward the interconnect. The interconnect is to, responsive to receipt of the write command, issue a snoop to each of a plurality of other coherent caches for which it must be determined if there is a hit.
Type: Grant
Filed: April 4, 2022
Date of Patent: October 29, 2024
Assignee: Intel Corporation
Inventors: Jason W. Brandt, Robert S. Chappell, Jesus Corbal, Edward T. Grochowski, Stephen H. Gunther, Buford M. Guy, Thomas R. Huff, Christopher J. Hughes, Elmoustapha Ould-Ahmed-Vall, Ronak Singhal, Seyed Yahya Sotoudeh, Bret L. Toll, Lihu Rappoport, David B. Papworth, James D. Allen
-
Publication number: 20240354107
Abstract: In one example, a processor includes: at least one core to execute instructions; and at least one cache memory coupled to the at least one core, the at least one cache memory to store data, at least some of the data being a copy of data stored in a memory. The at least one core is to determine whether to conditionally offload a sequence of instructions for execution on a compute circuit associated with the memory, based at least in part on whether one or more first data is present in the at least one cache memory, the one or more first data for use during execution of the sequence of instructions. Other embodiments are described and claimed.
Type: Application
Filed: June 26, 2024
Publication date: October 24, 2024
Inventors: Frank Hady, Christopher J. Hughes, Scott Peterson
-
Patent number: 12112167
Abstract: Embodiments for gathering and scattering matrix data by row are disclosed. In an embodiment, a processor includes a storage matrix, a decoder, and execution circuitry. The decoder is to decode an instruction having a format including an opcode field to specify an opcode and a first operand field to specify a set of irregularly spaced memory locations. The execution circuitry is to, in response to the decoded instruction, calculate a set of addresses corresponding to the set of irregularly spaced memory locations and transfer a set of rows of data between the storage and the set of irregularly spaced memory locations.
Type: Grant
Filed: June 27, 2020
Date of Patent: October 8, 2024
Assignee: Intel Corporation
Inventors: Christopher J. Hughes, Alexander F. Heinecke, Robert Valentine, Menachem Adelman, Evangelos Georganas, Mark J. Charney, Nikita A. Shustrov, Sara Baghsorkhi
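Row-granularity gather/scatter can be sketched over a flat memory model; the element-indexed addressing and function names here are assumptions for illustration:

```python
def gather_rows(memory, row_addrs, row_len):
    """Gather whole rows from irregularly spaced memory locations
    (element addresses) into a tile, one row per address."""
    return [memory[a:a + row_len] for a in row_addrs]

def scatter_rows(memory, row_addrs, tile):
    # Inverse direction: write each tile row back to its irregular location.
    for a, row in zip(row_addrs, tile):
        memory[a:a + len(row)] = row
    return memory
```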
-
Publication number: 20240329938
Abstract: Embodiments for a matrix transpose and multiply operation are disclosed. In an embodiment, a processor includes a decoder and execution circuitry. The decoder is to decode an instruction having a format including an opcode field to specify an opcode, a first destination operand field to specify a destination matrix location, a first source operand field to specify a first source matrix location, and a second source operand field to specify a second source matrix location. The execution circuitry is to, in response to the decoded instruction, transpose the first source matrix to generate a transposed first source matrix, perform a matrix multiplication using the transposed first source matrix and the second source matrix to generate a result, and store the result in a destination matrix location.
Type: Application
Filed: March 15, 2024
Publication date: October 3, 2024
Applicant: Intel Corporation
Inventors: Menachem Adelman, Robert Valentine, Barukh Ziv, Amit Gradstein, Simon Rubanovich, Zeev Sperber, Mark J. Charney, Christopher J. Hughes, Alexander F. Heinecke, Evangelos Georganas, Binh Pham
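The fused operation computes Aᵀ·B in one step. A reference sketch of the semantics (not the hardware datapath), with an invented function name:

```python
def transpose_multiply(a, b):
    """Transpose the first source matrix (K x M), then matrix-multiply
    the transposed matrix (M x K) with the second source (K x N)."""
    at = [list(col) for col in zip(*a)]  # transpose: K x M -> M x K
    m, k, n = len(at), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i][j] += at[i][p] * b[p][j]
    return out
```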
-
Patent number: 12106104
Abstract: A processor is provided that includes compression instructions to compress multiple adjacent data blocks of uncompressed read-only data stored in memory into one compressed read-only data block and store the compressed read-only data block in multiple adjacent blocks in the memory. During execution of an application to operate on the read-only data, one of the multiple adjacent blocks storing the compressed read-only block is read from memory, stored in a prefetch buffer, and decompressed in the memory controller. In response to a subsequent request during execution of the application for an adjacent data block in the compressed read-only data block, the uncompressed adjacent block is read directly from the prefetch buffer.
Type: Grant
Filed: December 23, 2020
Date of Patent: October 1, 2024
Assignee: Intel Corporation
Inventors: Zhe Wang, Alaa R. Alameldeen, Christopher J. Hughes
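The read path can be sketched with a stdlib compressor standing in for the memory controller's hardware codec; the class name, block/group sizes, and the unbounded buffer are assumptions for illustration:

```python
import zlib

class CompressedReadOnlyMemory:
    """Adjacent fixed-size blocks are compressed together as a group.
    Reading one block decompresses the whole group into a prefetch
    buffer; adjacent blocks are then served from the buffer without
    another memory access."""

    def __init__(self, data, block_size=4, group=2):
        self.block_size, self.group = block_size, group
        span = block_size * group
        self.groups = [zlib.compress(data[i:i + span])
                       for i in range(0, len(data), span)]
        self.prefetch = {}  # group index -> decompressed bytes

    def read_block(self, idx):
        g = idx // self.group
        if g not in self.prefetch:  # miss: decompress the whole group
            self.prefetch[g] = zlib.decompress(self.groups[g])
        off = (idx % self.group) * self.block_size
        return self.prefetch[g][off:off + self.block_size]
```

A real design would bound the prefetch buffer; an unbounded dict keeps the sketch short.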