Patents by Inventor Guei-Yuan Lueh

Guei-Yuan Lueh has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20250117360
    Abstract: A processing apparatus includes a processing resource including a general-purpose parallel processing engine and a matrix accelerator. The matrix accelerator includes first circuitry to receive a command to perform operations associated with an instruction, second circuitry to configure the matrix accelerator according to a physical depth of a systolic array within the matrix accelerator and a logical depth associated with the instruction, third circuitry to read operands for the instruction from a register file associated with the systolic array, fourth circuitry to perform operations for the instruction via one or more passes through one or more physical pipeline stages of the systolic array based on a configuration performed by the second circuitry, and fifth circuitry to write output of the operations to the register file associated with the systolic array.
    Type: Application
    Filed: October 30, 2024
    Publication date: April 10, 2025
    Applicant: Intel Corporation
    Inventors: Jorge Parra, Wei-yu Chen, Kaiyu Chen, Varghese George, Junjie Gu, Chandra Gurram, Guei-Yuan Lueh, Stephen Junkins, Subramaniam Maiyuran, Supratim Pal
  • Publication number: 20250110733
    Abstract: An apparatus to facilitate conversion operations and special value use cases supporting 8-bit floating point format in a graphics architecture is disclosed. The apparatus includes a processor comprising a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is to cause the processor to perform a conversion operation corresponding to an 8-bit floating point format operand; a scheduler to schedule the decoded instruction and provide input data for an input operand of the conversion operation indicated by the decoded instruction; and conversion circuitry to execute the decoded instruction to perform the conversion operation to convert the input operand to an output operand in accordance with the 8-bit floating point format operand, the conversion circuitry comprising hardware circuitry to rescale, normalize, and convert the input operand to the output operand.
    Type: Application
    Filed: September 29, 2023
    Publication date: April 3, 2025
    Applicant: Intel Corporation
    Inventors: Jorge Eduardo Parra Osorio, Fangwen Fu, Guei-Yuan Lueh, Jiasheng Chen, Naveen K. Mellempudi, Kevin Hurd, Alexandre Hadj-Chaib, Elliot Taylor, Marius Cornea-Hasegan
  • Publication number: 20250110741
    Abstract: An apparatus to facilitate supporting 8-bit floating point format for parallel computing and stochastic rounding operations in a graphics architecture is disclosed. The apparatus includes a processor comprising: a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is a matrix instruction that is to operate on 8-bit floating point operands to perform a parallel dot product operation; a scheduler to schedule the decoded instruction and provide input data for the 8-bit floating point operands in accordance with an 8-bit floating data format indicated by the decoded instruction; and circuitry to execute the decoded instruction to perform 32-way dot-product using 8-bit wide dot-product layers, each 8-bit wide dot-product layer comprises one or more sets of interconnected multipliers, shifters, and adders, wherein each set of multipliers, shifters, and adders is to generate a dot product of the 8-bit floating point operands.
    Type: Application
    Filed: September 29, 2023
    Publication date: April 3, 2025
    Applicant: Intel Corporation
    Inventors: Jorge Eduardo Parra Osorio, Fangwen Fu, Guei-Yuan Lueh, Hong Jiang, Jiasheng Chen, Naveen K. Mellempudi, Kevin Hurd, Chunhui Mei, Alexandre Hadj-Chaib, Elliot Taylor, Shuai Mu
  • Publication number: 20250103343
    Abstract: Embodiments described herein provide an apparatus comprising a plurality of processing resources including a first processing resource and a second processing resource, a memory communicatively coupled to the first processing resource and the second processing resource, and a processor to receive data dependencies for one or more tasks comprising one or more producer tasks executing on the first processing resource and one or more consumer tasks executing on the second processing resource and move a data output from one or more producer tasks executing on the first processing resource to a cache memory communicatively coupled to the second processing resource. Other embodiments may be described and claimed.
    Type: Application
    Filed: November 21, 2024
    Publication date: March 27, 2025
    Applicant: INTEL CORPORATION
    Inventors: Christopher J. Hughes, Prasoonkumar Surti, Guei-Yuan Lueh, Adam T. Lake, Jill Boyce, Subramaniam Maiyuran, Lidong Xu, James M. Holland, Vasanth Ranganathan, Nikos Kaburlasos, Altug Koker, Abhishek R. Appu
  • Publication number: 20250095099
    Abstract: One embodiment provides a graphics processor comprising a system interface and circuitry coupled with the system interface. The circuitry includes an execution resource and a preemption status register. The execution resource is configured to execute an instruction. During execution of the instruction, the execution resource is to receive a request to preempt execution of a thread associated with the instruction and, based on a value stored in the preemption status register, execute at least one additional instruction after receipt of the request to preempt execution of the thread.
    Type: Application
    Filed: September 23, 2024
    Publication date: March 20, 2025
    Applicant: Intel Corporation
    Inventors: Altug Koker, Ingo Wald, David Puffer, Subramaniam M. Maiyuran, Prasoonkumar Surti, Balaji Vembu, Guei-Yuan Lueh, Murali Ramadoss, Abhishek R. Appu, Joydeep Ray
  • Publication number: 20250068423
    Abstract: Described herein is a graphics processor comprising first circuitry configured to execute a decoded instruction and second circuitry configured to decode an instruction into the decoded instruction. The second circuitry is configured to determine a number of registers within a register file that are available to a thread of the processing resource and decode the instruction based on that number of registers.
    Type: Application
    Filed: August 22, 2023
    Publication date: February 27, 2025
    Applicant: Intel Corporation
    Inventors: Jorge Eduardo Parra Osorio, Jiasheng Chen, Supratim Pal, Vasanth Ranganathan, Guei-Yuan Lueh, James Valerio, Pradeep Golconda, Brent Schwartz, Fangwen Fu, Sabareesh Ganapathy, Peter Caday, Wei-Yu Chen, Po-Yu Chen, Timothy Bauer, Maxim Kazakov, Stanley Gambarin, Samir Pandya
  • Publication number: 20250036361
    Abstract: Described herein is a graphics processor comprising a memory interface and a graphics processing cluster coupled with the memory interface. The graphics processing cluster includes a multi-lane parallel floating-point unit and a multi-lane parallel integer unit. The multi-lane parallel integer unit includes an integer pipeline including a plurality of parallel integer logic units configured to perform integer compute operations on a plurality of input data elements and a format conversion pipeline including a plurality of parallel format conversion units configured to convert a plurality of input data elements from a first one of a plurality of datatype formats to a second one of the plurality of datatype formats, the plurality of datatype formats including integer and floating-point formats.
    Type: Application
    Filed: July 25, 2023
    Publication date: January 30, 2025
    Applicant: Intel Corporation
    Inventors: Supratim Pal, Jiasheng Chen, Kevin Hurd, Jorge E. Parra Osorio, Christopher Spencer, Guei-Yuan Lueh, Pradeep K. Golconda, Fangwen Fu, Wei Xiong, Hongzheng Li, James Valerio, Mukundan Swaminathan, Nicholas Murphy, Shuai Mu, Clifford Gibson, Buqi Cheng
  • Publication number: 20250036412
    Abstract: Described herein is a graphics processor comprising a memory interface and a graphics processing cluster coupled with the memory interface. The graphics processing cluster includes a plurality of processing resources. A processing resource of the plurality of processing resources includes a source crossbar communicatively coupled with a register file, the source crossbar to reorder data elements of a source operand and a format conversion pipeline to convert a plurality of input data elements specified by the source operand from a first format of a plurality of datatype formats to a second format of the plurality of datatype formats, the plurality of datatype formats including integer and floating-point formats.
    Type: Application
    Filed: July 25, 2023
    Publication date: January 30, 2025
    Applicant: Intel Corporation
    Inventors: Supratim Pal, Jiasheng Chen, Christopher Spencer, Jorge E. Parra Osorio, Kevin Hurd, Guei-Yuan Lueh, Pradeep K. Golconda, Fangwen Fu, Wei Xiong, Hongzheng Li, James Valerio, Mukundan Swaminathan, Nicholas Murphy, Shuai Mu, Clifford Gibson, Buqi Cheng
  • Publication number: 20250037347
    Abstract: Described herein is a graphics processor comprising an instruction cache and a plurality of processing elements coupled with the instruction cache. The plurality of processing elements include functional units configured to provide an integer pipeline to execute instructions to perform operations on integer data elements. The integer pipeline includes a first multiplier and a second multiplier, the first multiplier and the second multiplier configured to execute operations for a single instruction.
    Type: Application
    Filed: July 25, 2023
    Publication date: January 30, 2025
    Applicant: Intel Corporation
    Inventors: Jiasheng Chen, Supratim Pal, Kevin Hurd, Jorge E. Parra Osorio, Christopher Spencer, Takashi Nakagawa, Guei-Yuan Lueh, Pradeep K. Golconda, James Valerio, Mukundan Swaminathan, Nicholas Murphy, Clifford Gibson, Li-An Tang, Fangwen Fu, Kaiyu Chen, Buqi Cheng
  • Patent number: 12210905
    Abstract: Provision of multiple register allocation sizes for threads is described. An example of a system includes one or more processors including a graphics processor, the graphics processor including at least a first local thread dispatcher (TDL) and multiple processing resources, each processing resource including a plurality of registers; and memory for storage of data for processing, wherein the one or more processors are to determine a register size for a first thread; identify one or more processing resources having sufficient register space for the first thread; select a processing resource of the one or more processing resources having sufficient register space to assign the first thread; select an available thread slot of the selected processing resource for the first thread; and allocate registers of the selected processing resource for the first thread.
    Type: Grant
    Filed: June 25, 2021
    Date of Patent: January 28, 2025
    Assignee: INTEL CORPORATION
    Inventors: Chandra Gurram, Wei-Yu Chen, Vikranth Vemulapalli, Subramaniam Maiyuran, Jorge Eduardo Parra Osorio, Shuai Mu, Guei-Yuan Lueh, Supratim Pal
  • Patent number: 12190118
    Abstract: Embodiments described herein provide an apparatus comprising a plurality of processing resources including a first processing resource and a second processing resource, a memory communicatively coupled to the first processing resource and the second processing resource, and a processor to receive data dependencies for one or more tasks comprising one or more producer tasks executing on the first processing resource and one or more consumer tasks executing on the second processing resource and move a data output from one or more producer tasks executing on the first processing resource to a cache memory communicatively coupled to the second processing resource. Other embodiments may be described and claimed.
    Type: Grant
    Filed: June 22, 2023
    Date of Patent: January 7, 2025
    Assignee: INTEL CORPORATION
    Inventors: Christopher J. Hughes, Prasoonkumar Surti, Guei-Yuan Lueh, Adam T. Lake, Jill Boyce, Subramaniam Maiyuran, Lidong Xu, James M. Holland, Vasanth Ranganathan, Nikos Kaburlasos, Altug Koker, Abhishek R. Appu
  • Patent number: 12174783
    Abstract: A processing apparatus includes a processing resource including a general-purpose parallel processing engine and a matrix accelerator. The matrix accelerator includes first circuitry to receive a command to perform operations associated with an instruction, second circuitry to configure the matrix accelerator according to a physical depth of a systolic array within the matrix accelerator and a logical depth associated with the instruction, third circuitry to read operands for the instruction from a register file associated with the systolic array, fourth circuitry to perform operations for the instruction via one or more passes through one or more physical pipeline stages of the systolic array based on a configuration performed by the second circuitry, and fifth circuitry to write output of the operations to the register file associated with the systolic array.
    Type: Grant
    Filed: June 24, 2021
    Date of Patent: December 24, 2024
    Assignee: Intel Corporation
    Inventors: Jorge Parra, Wei-yu Chen, Kaiyu Chen, Varghese George, Junjie Gu, Chandra Gurram, Guei-Yuan Lueh, Stephen Junkins, Subramaniam Maiyuran, Supratim Pal
  • Patent number: 12164430
    Abstract: An apparatus to facilitate data prefetching is disclosed. The apparatus includes a cache, one or more execution units (EUs) to execute program code, prefetch logic to maintain tracking information of memory instructions in the program code that trigger a cache miss and compiler logic to receive the tracking information, insert one or more pre-fetch instructions in updated program code to prefetch data from a memory for execution of one or more of the memory instructions that triggered a cache miss and download the updated program code for execution by the one or more EUs.
    Type: Grant
    Filed: September 20, 2023
    Date of Patent: December 10, 2024
    Assignee: INTEL CORPORATION
    Inventors: Vasileios Porpodas, Guei-Yuan Lueh, Subramaniam Maiyuran, Wei-Yu Chen
  • Publication number: 20240362180
    Abstract: Graphics processors and graphics processing units having dot product accumulate instructions for a hybrid floating point format are disclosed. In one embodiment, a graphics multiprocessor comprises an instruction unit to dispatch instructions and a processing resource coupled to the instruction unit. The processing resource is configured to receive a dot product accumulate instruction from the instruction unit and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
    Type: Application
    Filed: April 26, 2024
    Publication date: October 31, 2024
    Applicant: Intel Corporation
    Inventors: Subramaniam Maiyuran, Shubra Marwaha, Ashutosh Garg, Supratim Pal, Jorge Parra, Chandra Gurram, Varghese George, Darin Starkey, Guei-Yuan Lueh
  • Patent number: 12131402
    Abstract: One embodiment provides a graphics processor comprising a system interface and circuitry coupled with the system interface. The circuitry includes an execution resource and a preemption status register. The execution resource is configured to execute an instruction. During execution of the instruction, the execution resource is to receive a request to preempt execution of a thread associated with the instruction and, based on a value stored in the preemption status register, execute at least one additional instruction after receipt of the request to preempt execution of the thread.
    Type: Grant
    Filed: May 20, 2022
    Date of Patent: October 29, 2024
    Assignee: Intel Corporation
    Inventors: Altug Koker, Ingo Wald, David Puffer, Subramaniam M. Maiyuran, Prasoonkumar Surti, Balaji Vembu, Guei-Yuan Lueh, Murali Ramadoss, Abhishek R. Appu, Joydeep Ray
  • Patent number: 12067641
    Abstract: One embodiment provides a parallel processor comprising a memory interface and a processing array coupled with the memory interface. The processing array is configured to address memory accessed via the memory interface via a virtual address mapping and includes circuitry to resolve a page fault for the virtual address mapping, wherein each of the multiple compute blocks is separately preemptable.
    Type: Grant
    Filed: May 20, 2022
    Date of Patent: August 20, 2024
    Assignee: Intel Corporation
    Inventors: Altug Koker, Ingo Wald, David Puffer, Subramaniam M. Maiyuran, Prasoonkumar Surti, Balaji Vembu, Guei-Yuan Lueh, Murali Ramadoss, Abhishek R. Appu, Joydeep Ray
  • Publication number: 20240231621
    Abstract: Embodiments described herein provide a technique to enable access to entries in a surface state or sampler state using 64-bit virtual addresses. One embodiment provides a graphics core that includes memory access circuitry configured to facilitate access to the memory by functional units of the graphics core. The memory access circuitry is configured to receive a message to access an entry in a surface state or a sampler state associated with a parallel processing operation. The message specifies a base address for a surface state entry or sampler state entry. The circuitry can add the base address and the offset to determine a 64-bit virtual address for the entry in the surface state entry or the sampler state and submit a memory access request to the memory to access the entry of the surface state or sampler state.
    Type: Application
    Filed: October 21, 2022
    Publication date: July 11, 2024
    Applicant: Intel Corporation
    Inventors: Joydeep Ray, Michael Apodaca, Yoav Harel, Guei-Yuan Lueh, John A. Wiegert
  • Publication number: 20240220254
    Abstract: Data multicast in compute core clusters is described. An example of an apparatus includes one or more processors including at least a first processor, the first processor including one or more clusters of cores and a memory, wherein each cluster of cores includes multiple cores, each core including one or more processing resources, shared memory, and broadcast circuitry; and wherein a first core in a first cluster of cores is to request a data element, determine whether any additional cores in the first cluster require the data element, and, upon determining that one or more additional cores in the first cluster require the data element, broadcast the data element to the one or more additional cores via interconnects between the broadcast circuitry of the cores of the first core cluster.
    Type: Application
    Filed: December 30, 2022
    Publication date: July 4, 2024
    Applicant: Intel Corporation
    Inventors: Chunhui Mei, Yongsheng Liu, John A. Wiegert, Vasanth Ranganathan, Ben J. Ashbaugh, Fangwen Fu, Hong Jiang, Guei-Yuan Lueh, James Valerio, Alan M. Curtis, Maxim Kazakov
  • Publication number: 20240220448
    Abstract: A scalable and configurable clustered systolic array is described. An example of an apparatus includes a cluster including multiple cores; and a cache memory coupled with the cluster, wherein each core includes multiple processing resources, a memory coupled with the plurality of processing resources, a systolic array coupled with the memory, and one or more interconnects with one or more other cores of the plurality of cores; and wherein the systolic arrays of the cores are configurable by the apparatus to form a logically combined systolic array for processing of an operation by a cooperative group of threads running on one or more of the plurality of cores in the cluster.
    Type: Application
    Filed: December 30, 2022
    Publication date: July 4, 2024
    Applicant: Intel Corporation
    Inventors: Chunhui Mei, Jiasheng Chen, Ben J. Ashbaugh, Fangwen Fu, Hong Jiang, Guei-Yuan Lueh, Rama S.B. Harihara, Maxim Kazakov
  • Publication number: 20240220335
    Abstract: Synchronization for data multicast in compute core clusters is described. An example of an apparatus includes one or more processors including at least a graphics processing unit (GPU), the GPU including one or more clusters of cores and a memory, wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared local memory, and gateway circuitry, wherein the GPU is to initiate broadcast of a data element from a producer core to one or more consumer cores, and synchronize the broadcast of the data element utilizing the gateway circuitry of the producer core and the one or more consumer cores, and wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element.
    Type: Application
    Filed: December 30, 2022
    Publication date: July 4, 2024
    Applicant: Intel Corporation
    Inventors: Chunhui Mei, Yongsheng Liu, John A. Wiegert, Vasanth Ranganathan, Ben J. Ashbaugh, Fangwen Fu, Hong Jiang, Guei-Yuan Lueh, James Valerio, Alan M. Curtis, Maxim Kazakov