Patents by Inventor Guei-Yuan Lueh
Guei-Yuan Lueh has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
- Publication number: 20250117360
  Abstract: A processing apparatus includes a processing resource including a general-purpose parallel processing engine and a matrix accelerator. The matrix accelerator includes first circuitry to receive a command to perform operations associated with an instruction, second circuitry to configure the matrix accelerator according to a physical depth of a systolic array within the matrix accelerator and a logical depth associated with the instruction, third circuitry to read operands for the instruction from a register file associated with the systolic array, fourth circuitry to perform operations for the instruction via one or more passes through one or more physical pipeline stages of the systolic array based on a configuration performed by the second circuitry, and fifth circuitry to write output of the operations to the register file associated with the systolic array.
  Type: Application
  Filed: October 30, 2024
  Publication date: April 10, 2025
  Applicant: Intel Corporation
  Inventors: Jorge Parra, Wei-yu Chen, Kaiyu Chen, Varghese George, Junjie Gu, Chandra Gurram, Guei-Yuan Lueh, Stephen Junkins, Subramaniam Maiyuran, Supratim Pal
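Purely to illustrate the physical-versus-logical depth configuration this abstract describes, here is a minimal Python sketch: an instruction whose logical depth exceeds the array's physical stage count is executed as multiple passes, with each pass result round-tripping through the accumulator. The physical depth of 8 and the multiply-accumulate step are assumptions for illustration, not the patented hardware.

```python
# Minimal sketch: mapping an instruction's logical depth onto a systolic
# array with a fixed physical depth by looping over multiple passes.
import math

PHYSICAL_DEPTH = 8  # hypothetical number of physical pipeline stages

def execute_systolic_instruction(a_vals, b_vals, acc, logical_depth):
    """Accumulate dot products over `logical_depth` element pairs using
    as many passes through the physical stages as needed."""
    passes = math.ceil(logical_depth / PHYSICAL_DEPTH)
    for p in range(passes):
        start = p * PHYSICAL_DEPTH
        stop = min(start + PHYSICAL_DEPTH, logical_depth)
        for k in range(start, stop):          # one physical stage per element pair
            acc += a_vals[k] * b_vals[k]
        # the pass output is written back and re-read as the next pass's
        # accumulator, mirroring the register-file round trip
    return acc

print(execute_systolic_instruction(list(range(16)), [1.0] * 16, 0.0, 16))  # 120.0
```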
- Publication number: 20250110733
  Abstract: An apparatus to facilitate conversion operations and special value use cases supporting 8-bit floating point format in a graphics architecture is disclosed. The apparatus includes a processor comprising a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is to cause the processor to perform a conversion operation corresponding to an 8-bit floating point format operand; a scheduler to schedule the decoded instruction and provide input data for an input operand of the conversion operation indicated by the decoded instruction; and conversion circuitry to execute the decoded instruction to perform the conversion operation to convert the input operand to an output operand in accordance with the 8-bit floating point format operand, the conversion circuitry comprising hardware circuitry to rescale, normalize, and convert the input operand to the output operand.
  Type: Application
  Filed: September 29, 2023
  Publication date: April 3, 2025
  Applicant: Intel Corporation
  Inventors: Jorge Eduardo Parra Osorio, Fangwen Fu, Guei-Yuan Lueh, Jiasheng Chen, Naveen K. Mellempudi, Kevin Hurd, Alexandre Hadj-Chaib, Elliot Taylor, Marius Cornea-Hasegan
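The rescale/normalize/convert flow in this abstract can be pictured with a minimal Python sketch. It assumes an IEEE-like E5M2 layout (1 sign, 5 exponent, 2 mantissa bits, bias 15), applies an optional rescale factor, rounds the mantissa to nearest-even, and saturates out-of-range values; subnormal, NaN, and infinity handling are omitted, and none of these choices are taken from the patent itself.

```python
# Simplified float -> E5M2 quantizer (illustrative only).
import math

MAX_E5M2 = 57344.0   # largest finite E5M2 value: 1.75 * 2**15

def quantize_e5m2(value, scale=1.0):
    """Quantize a Python float to the nearest E5M2-representable value."""
    v = value * scale                  # the 'rescale' step
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    frac, exp = math.frexp(abs(v))     # abs(v) == frac * 2**exp, 0.5 <= frac < 1
    mant, exp = 2.0 * frac, exp - 1    # normalize to 1.m * 2**exp
    mant = round(mant * 4) / 4         # keep 2 mantissa bits, round half-to-even
    if mant == 2.0:                    # rounding carried into the exponent
        mant, exp = 1.0, exp + 1
    if exp > 15:
        return sign * MAX_E5M2         # saturate instead of producing infinity
    if exp < -14:
        return 0.0                     # flush underflow (subnormals omitted)
    return sign * mant * 2.0 ** exp

print(quantize_e5m2(0.3))      # 0.3125  (nearest E5M2 value)
print(quantize_e5m2(1.0e6))    # 57344.0 (saturated)
```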
- Publication number: 20250110741
  Abstract: An apparatus to facilitate supporting 8-bit floating point format for parallel computing and stochastic rounding operations in a graphics architecture is disclosed. The apparatus includes a processor comprising: a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is a matrix instruction that is to operate on 8-bit floating point operands to perform a parallel dot product operation; a scheduler to schedule the decoded instruction and provide input data for the 8-bit floating point operands in accordance with an 8-bit floating point data format indicated by the decoded instruction; and circuitry to execute the decoded instruction to perform a 32-way dot-product using 8-bit wide dot-product layers, each 8-bit wide dot-product layer comprising one or more sets of interconnected multipliers, shifters, and adders, wherein each set of multipliers, shifters, and adders is to generate a dot product of the 8-bit floating point operands.
  Type: Application
  Filed: September 29, 2023
  Publication date: April 3, 2025
  Applicant: Intel Corporation
  Inventors: Jorge Eduardo Parra Osorio, Fangwen Fu, Guei-Yuan Lueh, Hong Jiang, Jiasheng Chen, Naveen K. Mellempudi, Kevin Hurd, Chunhui Mei, Alexandre Hadj-Chaib, Elliot Taylor, Shuai Mu
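As a rough illustration of the 32-way dot product over 8-bit floating point operands, the sketch below decodes operands assuming the common E4M3 layout (1 sign, 4 exponent, 3 mantissa bits, bias 7) and accumulates 32 lane products in ordinary Python floats; the layered multiplier/shifter/adder structure and NaN handling are not modeled.

```python
# Minimal sketch of a 32-way FP8 dot product with wide accumulation.

def decode_e4m3(byte):
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0:                                 # subnormal
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

def dot32_fp8(a_bytes, b_bytes, acc=0.0):
    assert len(a_bytes) == len(b_bytes) == 32
    for a, b in zip(a_bytes, b_bytes):
        acc += decode_e4m3(a) * decode_e4m3(b)   # one multiply-add per lane
    return acc

# 0x38 encodes 1.0 (exponent field 7, mantissa 0); 32 products of 1.0 give 32.0
print(dot32_fp8([0x38] * 32, [0x38] * 32))
```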
- Publication number: 20250103343
  Abstract: Embodiments described herein provide an apparatus comprising a plurality of processing resources including a first processing resource and a second processing resource, a memory communicatively coupled to the first processing resource and the second processing resource, and a processor to receive data dependencies for one or more tasks comprising one or more producer tasks executing on the first processing resource and one or more consumer tasks executing on the second processing resource and move a data output from one or more producer tasks executing on the first processing resource to a cache memory communicatively coupled to the second processing resource. Other embodiments may be described and claimed.
  Type: Application
  Filed: November 21, 2024
  Publication date: March 27, 2025
  Applicant: INTEL CORPORATION
  Inventors: Christopher J. HUGHES, Prasoonkumar SURTI, Guei-Yuan LUEH, Adam T. LAKE, Jill BOYCE, Subramaniam MAIYURAN, Lidong XU, James M. HOLLAND, Vasanth RANGANATHAN, Nikos KABURLASOS, Altug KOKER, Abhishek R. Appu
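A minimal sketch of the producer/consumer data movement described above: once a producer task finishes, its output is staged into the cache attached to the resource that will run the consumer. The Resource class and dictionary-based "cache" are illustrative stand-ins, not the claimed hardware.

```python
# Dependency-driven placement of producer output near the consumer's resource.

class Resource:
    def __init__(self, name):
        self.name = name
        self.cache = {}            # stands in for the resource-local cache

def run_tasks(tasks, deps, placement, memory):
    """tasks: ordered {name: fn}; deps: {consumer: producer};
    placement: {name: Resource} saying where each task runs."""
    for name, fn in tasks.items():
        producer = deps.get(name)
        inputs = placement[name].cache.get(producer) if producer else None
        output = fn(inputs)
        memory[name] = output
        for consumer, prod in deps.items():      # stage output near consumers
            if prod == name:
                placement[consumer].cache[name] = output

r0, r1 = Resource("r0"), Resource("r1")
tasks = {"produce": lambda _: [1, 2, 3], "consume": lambda xs: sum(xs)}
memory = {}
run_tasks(tasks, {"consume": "produce"}, {"produce": r0, "consume": r1}, memory)
print(memory["consume"], r1.cache)   # 6 {'produce': [1, 2, 3]}
```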
- Publication number: 20250095099
  Abstract: One embodiment provides a graphics processor comprising a system interface and circuitry coupled with the system interface. The circuitry includes an execution resource and a preemption status register. The execution resource is configured to execute an instruction. During execution of the instruction, the execution resource is to receive a request to preempt execution of a thread associated with the instruction and, based on a value stored in the preemption status register, execute at least one additional instruction after receipt of the request to preempt execution of the thread.
  Type: Application
  Filed: September 23, 2024
  Publication date: March 20, 2025
  Applicant: Intel Corporation
  Inventors: Altug Koker, Ingo Wald, David Puffer, Subramaniam M. Maiyuran, Prasoonkumar Surti, Balaji Vembu, Guei-Yuan Lueh, Murali Ramadoss, Abhishek R. Appu, Joydeep Ray
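To illustrate the preemption status register, the sketch below models it as a countdown of how many additional instructions a thread may retire after a preempt request arrives; the countdown semantics are an assumption for illustration, not the register's actual encoding.

```python
# Minimal sketch of preemption gated by a status-register allowance.

class ExecutionResource:
    def __init__(self, extra_after_preempt):
        self.preempt_status = extra_after_preempt  # hypothetical register value
        self.preempt_requested = False

    def run(self, instructions):
        executed = []
        budget = None
        for instr in instructions:
            if self.preempt_requested and budget is None:
                budget = self.preempt_status       # latch the allowance
            if budget is not None:
                if budget == 0:
                    break                          # hand the thread off
                budget -= 1
            executed.append(instr)
        return executed

eu = ExecutionResource(extra_after_preempt=2)
eu.preempt_requested = True                        # request arrives up front
print(eu.run(["mad", "mov", "send", "add"]))       # ['mad', 'mov']
```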
- Publication number: 20250068423
  Abstract: Described herein is a graphics processor comprising first circuitry configured to execute a decoded instruction and second circuitry configured to decode an instruction into the decoded instruction. The second circuitry is configured to determine a number of registers within a register file that are available to a thread of a processing resource and decode the instruction based on that number of registers.
  Type: Application
  Filed: August 22, 2023
  Publication date: February 27, 2025
  Applicant: Intel Corporation
  Inventors: Jorge Eduardo Parra Osorio, Jiasheng Chen, Supratim Pal, Vasanth Ranganathan, Guei-Yuan Lueh, James Valerio, Pradeep Golconda, Brent Schwartz, Fangwen Fu, Sabareesh Ganapathy, Peter Caday, Wei-Yu Chen, Po-Yu Chen, Timothy Bauer, Maxim Kazakov, Stanley Gambarin, Samir Pandya
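A minimal sketch of register-count-aware decoding, assuming hypothetical 128- and 256-register per-thread grants: the decoder looks up how many registers the issuing thread was given and validates each register operand against that budget.

```python
# Decoder that checks register operands against a per-thread register budget.

THREAD_REGISTER_COUNT = {"t0": 128, "t1": 256}   # hypothetical per-thread grants

def decode(thread_id, raw_operands):
    """raw_operands: register indices encoded in the instruction."""
    budget = THREAD_REGISTER_COUNT[thread_id]
    decoded = []
    for reg in raw_operands:
        if reg >= budget:
            raise ValueError(f"r{reg} exceeds the {budget}-register file of {thread_id}")
        decoded.append(f"r{reg}")
    return decoded

print(decode("t1", [0, 200]))   # ok under the 256-register grant
print(decode("t0", [0, 64]))    # ok under the 128-register grant
```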
- Publication number: 20250036361
  Abstract: Described herein is a graphics processor comprising a memory interface and a graphics processing cluster coupled with the memory interface. The graphics processing cluster includes a multi-lane parallel floating-point unit and a multi-lane parallel integer unit. The multi-lane parallel integer unit includes an integer pipeline including a plurality of parallel integer logic units configured to perform integer compute operations on a plurality of input data elements and a format conversion pipeline including a plurality of parallel format conversion units configured to convert a plurality of input data elements from a first one of a plurality of datatype formats to a second one of the plurality of datatype formats, the plurality of datatype formats including integer and floating-point formats.
  Type: Application
  Filed: July 25, 2023
  Publication date: January 30, 2025
  Applicant: Intel Corporation
  Inventors: Supratim Pal, Jiasheng Chen, Kevin Hurd, Jorge E. Parra Osorio, Christopher Spencer, Guei-Yuan Lueh, Pradeep K. Golconda, Fangwen Fu, Wei Xiong, Hongzheng Li, James Valerio, Mukundan Swaminathan, Nicholas Murphy, Shuai Mu, Clifford Gibson, Buqi Cheng
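The format conversion pipeline sitting alongside the integer pipeline can be pictured as lockstep per-lane casts; the sketch below uses an assumed 8-lane width and an int32/float32 pairing purely for illustration.

```python
# Per-lane datatype conversion across a fixed lane count.

LANES = 8

def convert_lanes(elements, src_fmt, dst_fmt):
    assert len(elements) == LANES
    casts = {("int32", "float32"): float, ("float32", "int32"): int}
    cast = casts[(src_fmt, dst_fmt)]
    return [cast(x) for x in elements]   # one conversion unit per lane

print(convert_lanes([1, 2, 3, 4, 5, 6, 7, 8], "int32", "float32"))
```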
- Publication number: 20250036412
  Abstract: Described herein is a graphics processor comprising a memory interface and a graphics processing cluster coupled with the memory interface. The graphics processing cluster includes a plurality of processing resources. A processing resource of the plurality of processing resources includes a source crossbar communicatively coupled with a register file, the source crossbar to reorder data elements of a source operand and a format conversion pipeline to convert a plurality of input data elements specified by the source operand from a first format of a plurality of datatype formats to a second format of the plurality of datatype formats, the plurality of datatype formats including integer and floating-point formats.
  Type: Application
  Filed: July 25, 2023
  Publication date: January 30, 2025
  Applicant: Intel Corporation
  Inventors: Supratim Pal, Jiasheng Chen, Christopher Spencer, Jorge E. Parra Osorio, Kevin Hurd, Guei-Yuan Lueh, Pradeep K. Golconda, Fangwen Fu, Wei Xiong, Hongzheng Li, James Valerio, Mukundan Swaminathan, Nicholas Murphy, Shuai Mu, Clifford Gibson, Buqi Cheng
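A minimal sketch of a source crossbar feeding a conversion pipeline: elements read for a source operand are first permuted by a swizzle pattern and then converted to the destination format. The swizzle encoding and the int-to-float pairing are hypothetical.

```python
# Crossbar reorder followed by per-element format conversion.

def crossbar_then_convert(reg_data, swizzle, dst_cast=float):
    reordered = [reg_data[i] for i in swizzle]   # source crossbar
    return [dst_cast(x) for x in reordered]      # format conversion pipeline

# reverse the lanes of an integer vector and convert to floating point
print(crossbar_then_convert([10, 20, 30, 40], swizzle=[3, 2, 1, 0]))
```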
- Publication number: 20250037347
  Abstract: Described herein is a graphics processor comprising an instruction cache and a plurality of processing elements coupled with the instruction cache. The plurality of processing elements include functional units configured to provide an integer pipeline to execute instructions to perform operations on integer data elements. The integer pipeline includes a first multiplier and a second multiplier, the first multiplier and the second multiplier configured to execute operations for a single instruction.
  Type: Application
  Filed: July 25, 2023
  Publication date: January 30, 2025
  Applicant: Intel Corporation
  Inventors: Jiasheng Chen, Supratim Pal, Kevin Hurd, Jorge E. Parra Osorio, Christopher Spencer, Takashi Nakagawa, Guei-Yuan Lueh, Pradeep K. Golconda, James Valerio, Mukundan Swaminathan, Nicholas Murphy, Clifford Gibson, Li-An Tang, Fangwen Fu, Kaiyu Chen, Buqi Cheng
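One way two multipliers can execute operations for a single instruction is to build a wide multiply out of narrower partial products; the sketch below composes a 32x32 multiply from four 16x16 products scheduled on two multipliers over two steps. The two-step schedule is an assumption for illustration, not the patented datapath.

```python
# 32x32 multiply assembled from two 16x16 multipliers plus shifts and adds.

MASK16 = 0xFFFF

def mul32_with_two_mul16(a, b):
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    # step 1: both 16x16 multipliers fire in parallel
    p0 = a_lo * b_lo                  # multiplier 0
    p1 = a_hi * b_lo                  # multiplier 1
    # step 2: both multipliers fire again on the remaining partial products
    p2 = a_lo * b_hi                  # multiplier 0
    p3 = a_hi * b_hi                  # multiplier 1
    return p0 + ((p1 + p2) << 16) + (p3 << 32)

assert mul32_with_two_mul16(0x12345678, 0x9ABCDEF0) == 0x12345678 * 0x9ABCDEF0
print(hex(mul32_with_two_mul16(0xFFFFFFFF, 0xFFFFFFFF)))
```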
- Patent number: 12210905
  Abstract: Provision of multiple register allocation sizes for threads is described. An example of a system includes one or more processors including a graphics processor, the graphics processor including at least a first local thread dispatcher (TDL) and multiple processing resources, each processing resource including a plurality of registers; and memory for storage of data for processing, wherein the one or more processors are to determine a register size for a first thread; identify one or more processing resources having sufficient register space for the first thread; select a processing resource of the one or more processing resources having sufficient register space to assign the first thread; select an available thread slot of the selected processing resource for the first thread; and allocate registers of the selected processing resource for the first thread.
  Type: Grant
  Filed: June 25, 2021
  Date of Patent: January 28, 2025
  Assignee: INTEL CORPORATION
  Inventors: Chandra Gurram, Wei-Yu Chen, Vikranth Vemulapalli, Subramaniam Maiyuran, Jorge Eduardo Parra Osorio, Shuai Mu, Guei-Yuan Lueh, Supratim Pal
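A minimal sketch of register-size-aware thread dispatch as outlined above: the dispatcher scans processing resources for sufficient free register space and an open thread slot, then reserves both. The register-file and slot counts are illustrative assumptions.

```python
# Dispatch that matches a thread's register size to a resource with room for it.

class ProcessingResource:
    def __init__(self, name, total_regs=512, thread_slots=8):
        self.name = name
        self.free_regs = total_regs
        self.free_slots = thread_slots

def dispatch(resources, regs_needed):
    for r in resources:                     # find sufficient register space
        if r.free_regs >= regs_needed and r.free_slots > 0:
            r.free_regs -= regs_needed      # allocate registers
            r.free_slots -= 1               # take a thread slot
            return r.name
    return None                             # stall until space frees up

eus = [ProcessingResource("eu0"), ProcessingResource("eu1")]
print(dispatch(eus, regs_needed=256))   # eu0
print(dispatch(eus, regs_needed=512))   # eu1 (eu0 has only 256 registers left)
```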
- Patent number: 12190118
  Abstract: Embodiments described herein provide an apparatus comprising a plurality of processing resources including a first processing resource and a second processing resource, a memory communicatively coupled to the first processing resource and the second processing resource, and a processor to receive data dependencies for one or more tasks comprising one or more producer tasks executing on the first processing resource and one or more consumer tasks executing on the second processing resource and move a data output from one or more producer tasks executing on the first processing resource to a cache memory communicatively coupled to the second processing resource. Other embodiments may be described and claimed.
  Type: Grant
  Filed: June 22, 2023
  Date of Patent: January 7, 2025
  Assignee: INTEL CORPORATION
  Inventors: Christopher J. Hughes, Prasoonkumar Surti, Guei-Yuan Lueh, Adam T. Lake, Jill Boyce, Subramaniam Maiyuran, Lidong Xu, James M. Holland, Vasanth Ranganathan, Nikos Kaburlasos, Altug Koker, Abhishek R. Appu
- Patent number: 12174783
  Abstract: A processing apparatus includes a processing resource including a general-purpose parallel processing engine and a matrix accelerator. The matrix accelerator includes first circuitry to receive a command to perform operations associated with an instruction, second circuitry to configure the matrix accelerator according to a physical depth of a systolic array within the matrix accelerator and a logical depth associated with the instruction, third circuitry to read operands for the instruction from a register file associated with the systolic array, fourth circuitry to perform operations for the instruction via one or more passes through one or more physical pipeline stages of the systolic array based on a configuration performed by the second circuitry, and fifth circuitry to write output of the operations to the register file associated with the systolic array.
  Type: Grant
  Filed: June 24, 2021
  Date of Patent: December 24, 2024
  Assignee: Intel Corporation
  Inventors: Jorge Parra, Wei-yu Chen, Kaiyu Chen, Varghese George, Junjie Gu, Chandra Gurram, Guei-Yuan Lueh, Stephen Junkins, Subramaniam Maiyuran, Supratim Pal
- Patent number: 12164430
  Abstract: An apparatus to facilitate data prefetching is disclosed. The apparatus includes a cache, one or more execution units (EUs) to execute program code, prefetch logic to maintain tracking information of memory instructions in the program code that trigger a cache miss, and compiler logic to receive the tracking information, insert one or more pre-fetch instructions in updated program code to prefetch data from a memory for execution of one or more of the memory instructions that triggered a cache miss, and download the updated program code for execution by the one or more EUs.
  Type: Grant
  Filed: September 20, 2023
  Date of Patent: December 10, 2024
  Assignee: INTEL CORPORATION
  Inventors: Vasileios Porpodas, Guei-Yuan Lueh, Subramaniam Maiyuran, Wei-Yu Chen
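The compiler/prefetch cooperation can be sketched as a profile-driven pass: the runtime reports which loads missed, and the pass re-emits the program with a prefetch hoisted a fixed distance ahead of each such load. Instruction spellings and the hoist distance are assumptions, not the patented mechanism.

```python
# Miss-driven prefetch insertion into an updated copy of the program.

PREFETCH_DISTANCE = 2   # hypothetical number of instructions to hoist by

def insert_prefetches(program, miss_tracking):
    """program: list of (opcode, operand); miss_tracking: indices of load
    instructions that triggered cache misses."""
    updated = list(program)
    for idx in sorted(miss_tracking, reverse=True):  # back-to-front keeps indices valid
        opcode, addr = program[idx]
        assert opcode == "load"
        insert_at = max(0, idx - PREFETCH_DISTANCE)
        updated.insert(insert_at, ("prefetch", addr))
    return updated

prog = [("add", None), ("mul", None), ("load", "A[i]"), ("store", "B[i]")]
for instr in insert_prefetches(prog, miss_tracking={2}):
    print(instr)
```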
- Publication number: 20240362180
  Abstract: Graphics processors and graphics processing units having dot product accumulate instructions for a hybrid floating point format are disclosed. In one embodiment, a graphics multiprocessor comprises an instruction unit to dispatch instructions and a processing resource coupled to the instruction unit. The processing resource is configured to receive a dot product accumulate instruction from the instruction unit and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
  Type: Application
  Filed: April 26, 2024
  Publication date: October 31, 2024
  Applicant: Intel Corporation
  Inventors: Subramaniam Maiyuran, Shubra Marwaha, Ashutosh Garg, Supratim Pal, Jorge Parra, Chandra Gurram, Varghese George, Darin Starkey, Guei-Yuan Lueh
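A minimal sketch of a BF16 dot-product-accumulate: bfloat16 is modeled as float32 with the low 16 bits truncated, and products are accumulated at wider precision. Truncation rather than round-to-nearest is a simplification, not the instruction's specified rounding.

```python
# BF16 dot product with wider accumulation (illustrative only).
import struct

def to_bf16(x):
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def dp_accumulate_bf16(a, b, acc=0.0):
    for ai, bi in zip(a, b):
        acc += to_bf16(ai) * to_bf16(bi)   # multiply in BF16, accumulate wider
    return acc

print(dp_accumulate_bf16([1.5, 2.25, 3.0], [4.0, 0.5, 2.0]))   # 13.125
```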
- Patent number: 12131402
  Abstract: One embodiment provides a graphics processor comprising a system interface and circuitry coupled with the system interface. The circuitry includes an execution resource and a preemption status register. The execution resource is configured to execute an instruction. During execution of the instruction, the execution resource is to receive a request to preempt execution of a thread associated with the instruction and, based on a value stored in the preemption status register, execute at least one additional instruction after receipt of the request to preempt execution of the thread.
  Type: Grant
  Filed: May 20, 2022
  Date of Patent: October 29, 2024
  Assignee: Intel Corporation
  Inventors: Altug Koker, Ingo Wald, David Puffer, Subramaniam M. Maiyuran, Prasoonkumar Surti, Balaji Vembu, Guei-Yuan Lueh, Murali Ramadoss, Abhishek R. Appu, Joydeep Ray
- Patent number: 12067641
  Abstract: One embodiment provides a parallel processor comprising a memory interface and a processing array coupled with the memory interface. The processing array includes multiple compute blocks, is configured to address memory accessed via the memory interface via a virtual address mapping, and includes circuitry to resolve a page fault for the virtual address mapping, wherein each of the multiple compute blocks is separately preemptable.
  Type: Grant
  Filed: May 20, 2022
  Date of Patent: August 20, 2024
  Assignee: Intel Corporation
  Inventors: Altug Koker, Ingo Wald, David Puffer, Subramaniam M. Maiyuran, Prasoonkumar Surti, Balaji Vembu, Guei-Yuan Lueh, Murali Ramadoss, Abhishek R. Appu, Joydeep Ray
- Publication number: 20240231621
  Abstract: Embodiments described herein provide a technique to enable access to entries in a surface state or sampler state using 64-bit virtual addresses. One embodiment provides a graphics core that includes memory access circuitry configured to facilitate access to memory by functional units of the graphics core. The memory access circuitry is configured to receive a message to access an entry in a surface state or a sampler state associated with a parallel processing operation. The message specifies a base address and an offset for a surface state entry or sampler state entry. The circuitry can add the base address and the offset to determine a 64-bit virtual address for the entry in the surface state or the sampler state and submit a memory access request to the memory to access the entry of the surface state or sampler state.
  Type: Application
  Filed: October 21, 2022
  Publication date: July 11, 2024
  Applicant: Intel Corporation
  Inventors: Joydeep Ray, Michael Apodaca, Yoav Harel, Guei-Yuan Lueh, John A. Wiegert
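The address formation described above reduces to a 64-bit add of a base and an offset carried in the access message; the tiny sketch below shows that arithmetic with hypothetical field names.

```python
# Forming a 64-bit virtual address for a surface- or sampler-state entry.

MASK64 = (1 << 64) - 1

def state_entry_address(base_address, offset):
    return (base_address + offset) & MASK64   # 64-bit virtual address

msg = {"state_base": 0x0000_7F00_0000_0000, "entry_offset": 3 * 64}
print(hex(state_entry_address(msg["state_base"], msg["entry_offset"])))
```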
- Publication number: 20240220254
  Abstract: Data multicast in compute core clusters is described. An example of an apparatus includes one or more processors including at least a first processor, the first processor including one or more clusters of cores and a memory, wherein each cluster of cores includes multiple cores, each core including one or more processing resources, shared memory, and broadcast circuitry; and wherein a first core in a first cluster of cores is to request a data element, determine whether any additional cores in the first cluster require the data element, and, upon determining that one or more additional cores in the first cluster require the data element, broadcast the data element to the one or more additional cores via interconnects between the broadcast circuitry of the cores of the first core cluster.
  Type: Application
  Filed: December 30, 2022
  Publication date: July 4, 2024
  Applicant: Intel Corporation
  Inventors: Chunhui Mei, Yongsheng Liu, John A. Wiegert, Vasanth Ranganathan, Ben J. Ashbaugh, Fangwen Fu, Hong Jiang, Guei-Yuan Lueh, James Valerio, Alan M. Curtis, Maxim Kazakov
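A minimal sketch of intra-cluster multicast: the requesting core checks which peer cores also want the element and pushes it to them over a broadcast path instead of letting each core fetch it separately. The cluster bookkeeping here is an illustrative stand-in for the broadcast circuitry.

```python
# Fetch once, then multicast to peer cores that also need the element.

class Core:
    def __init__(self, name, wanted):
        self.name = name
        self.wanted = set(wanted)   # addresses this core will need
        self.local = {}             # stands in for shared local memory

def fetch_and_multicast(cluster, requester, address, memory):
    data = memory[address]
    requester.local[address] = data
    peers = [c for c in cluster if c is not requester and address in c.wanted]
    for peer in peers:              # broadcast over the cluster interconnect
        peer.local[address] = data
    return peers

mem = {0x100: "tile-A"}
cores = [Core("c0", [0x100]), Core("c1", [0x100]), Core("c2", [])]
served = fetch_and_multicast(cores, cores[0], 0x100, mem)
print([c.name for c in served])     # ['c1']
```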
- Publication number: 20240220448
  Abstract: A scalable and configurable clustered systolic array is described. An example of an apparatus includes a cluster including multiple cores; and a cache memory coupled with the cluster, wherein each core includes multiple processing resources, a memory coupled with the plurality of processing resources, a systolic array coupled with the memory, and one or more interconnects with one or more other cores of the plurality of cores; and wherein the systolic arrays of the cores are configurable by the apparatus to form a logically combined systolic array for processing of an operation by a cooperative group of threads running on one or more of the plurality of cores in the cluster.
  Type: Application
  Filed: December 30, 2022
  Publication date: July 4, 2024
  Applicant: Intel Corporation
  Inventors: Chunhui Mei, Jiasheng Chen, Ben J. Ashbaugh, Fangwen Fu, Hong Jiang, Guei-Yuan Lueh, Rama S.B. Harihara, Maxim Kazakov
- Publication number: 20240220335
  Abstract: Synchronization for data multicast in compute core clusters is described. An example of an apparatus includes one or more processors including at least a graphics processing unit (GPU), the GPU including one or more clusters of cores and a memory, wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared local memory, and gateway circuitry, wherein the GPU is to initiate broadcast of a data element from a producer core to one or more consumer cores, and synchronize the broadcast of the data element utilizing the gateway circuitry of the producer core and the one or more consumer cores, and wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element.
  Type: Application
  Filed: December 30, 2022
  Publication date: July 4, 2024
  Applicant: Intel Corporation
  Inventors: Chunhui Mei, Yongsheng Liu, John A. Wiegert, Vasanth Ranganathan, Ben J. Ashbaugh, Fangwen Fu, Hong Jiang, Guei-Yuan Lueh, James Valerio, Alan M. Curtis, Maxim Kazakov
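The multi-core barrier can be pictured with ordinary threads: the producer publishes the element and then producer and consumers meet at a shared barrier so no consumer reads early. Using Python's threading.Barrier is purely illustrative of the gateway-based synchronization the abstract describes.

```python
# Barrier-synchronized broadcast from one producer to several consumers.
import threading

shared = {}
consumers = ["c1", "c2", "c3"]
barrier = threading.Barrier(len(consumers) + 1)   # producer + consumers
results = {}

def producer():
    shared["tile"] = [1, 2, 3, 4]   # broadcast payload lands in shared memory
    barrier.wait()                  # signal: data is visible

def consumer(name):
    barrier.wait()                  # wait until the broadcast completes
    results[name] = sum(shared["tile"])

threads = [threading.Thread(target=producer)]
threads += [threading.Thread(target=consumer, args=(n,)) for n in consumers]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                      # {'c1': 10, 'c2': 10, 'c3': 10}
```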