Patents by Inventor Fangwen Fu

Fangwen Fu has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Hierarchical thread scheduling based on multiple barriers

Patent number: 11977895

Abstract: Examples described herein relate to a graphics processing unit (GPU) coupled to the memory device, the GPU configured to: execute an instruction thread; determine if a dual directional signal barrier is associated with the instruction thread; and based on clearance of the dual directional signal barrier for a particular signal barrier identifier and a mode of operation, indicate a clearance of the dual directional signal barrier for the mode of operation, wherein the dual directional signal barrier is to provide a single barrier to gate activity of one or more producers based on activity of one or more consumers or gate activity of one or more consumers based on activity of one or more producers.

Type: Grant

Filed: December 22, 2020

Date of Patent: May 7, 2024

Assignee: Intel Corporation

Inventors: Sabareesh Ganapathy, Fangwen Fu, Hong Jiang, James Valerio
BROADCAST ASYNCHRONOUS LOADS TO SHARED LOCAL MEMORY

Publication number: 20240134797

Abstract: Embodiments described herein provide a technique to facilitate the broadcast or multicast of asynchronous loads to shared local memory of a plurality of graphics cores within a graphics core cluster. One embodiment provides a graphics processor including a cache memory and a graphics core cluster coupled with the cache memory. The graphics core cluster includes a plurality of graphics cores. The plurality of graphics cores includes a graphics core configured to receive a designation as a producer graphics core for a multicast load, read data from the cache memory, and transmit the data read from the cache memory to a consumer graphics core of the plurality of graphics cores.

Type: Application

Filed: October 24, 2022

Publication date: April 25, 2024

Applicant: Intel Corporation

Inventors: John A. Wiegert, Joydeep Ray, Vasanth Ranganathan, Biju George, Fangwen Fu, Abhishek R. Appu, Chunhui Mei, Changwon Rhee
NAMED AND CLUSTER BARRIERS

Publication number: 20240134719

Abstract: Embodiments described herein provide a technique to facilitate the synchronization of workgroups executed on multiple graphics cores of a graphics core cluster. One embodiment provides a graphics core including a cache memory and a graphics core coupled with the cache memory. The graphics core includes execution resources to execute an instruction via a plurality of hardware threads and barrier circuitry to synchronize execution of the plurality of hardware threads, wherein the barrier circuitry is configured to provide a plurality of re-usable named barriers.

Type: Application

Filed: October 24, 2022

Publication date: April 25, 2024

Applicant: Intel Corporation

Inventors: Fangwen Fu, Chunhui Mei, John A. Wiegert, Yongsheng Liu, Ben J. Ashbaugh
DETERMINISTIC BROADCASTING FROM SHARED MEMORY

Publication number: 20240111534

Abstract: Embodiments described herein provide a technique to enable a broadcast load from an L1 cache or shared local memory to register files associated with hardware threads of a graphics core. One embodiment provides a graphics processor comprising a cache memory and a graphics core coupled with the cache memory. The graphics core includes a plurality of hardware threads and memory access circuitry to facilitate access to memory by the plurality of hardware threads. The graphics core is configurable to process a plurality of load requests from the plurality of hardware threads, detect duplicate load requests within the plurality of load requests, perform a single read from the cache memory in response to the duplicate load requests, and transmit data associated with the duplicate load requests to requesting hardware threads.

Type: Application

Filed: September 30, 2022

Publication date: April 4, 2024

Applicant: Intel Corporation

Inventors: Fangwen Fu, Chunhui Mei, Maxim Kazakov, Biju George, Jorge Parra, Supratim Pal
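
The duplicate-load handling described in the abstract above can be illustrated in software. The following is a minimal sketch, not the patented hardware design: the function name `broadcast_loads`, the dictionary-based `memory`, and the request format are all hypothetical. It groups requests by address, performs one read per unique address, and hands the value to every requesting thread.

```python
from collections import defaultdict


def broadcast_loads(requests, memory):
    """Serve (thread_id, address) load requests with one read per unique address.

    Hypothetical software model: `memory` is a dict standing in for the
    cache or shared local memory named in the abstract.
    """
    # Group requesting threads by address so duplicate requests are detected.
    waiters = defaultdict(list)
    for thread_id, address in requests:
        waiters[address].append(thread_id)

    results = {}
    read_count = 0
    for address, thread_ids in waiters.items():
        value = memory[address]  # single read for all duplicates of this address
        read_count += 1
        for thread_id in thread_ids:
            results[thread_id] = value  # broadcast to every requesting thread
    return results, read_count


memory = {0x10: "A", 0x20: "B"}
requests = [(0, 0x10), (1, 0x10), (2, 0x20), (3, 0x10)]
results, read_count = broadcast_loads(requests, memory)
```

In this example four requests touch only two distinct addresses, so the model performs two reads while all four threads still receive their data.
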
HARDWARE ENHANCEMENTS FOR DOUBLE PRECISION SYSTOLIC SUPPORT

Publication number: 20240111826

Abstract: An apparatus to facilitate hardware enhancements for double precision systolic support is disclosed. The apparatus includes matrix acceleration hardware having double-precision (DP) matrix multiplication circuitry including multiplier circuits to multiply pairs of input source operands in a DP floating-point format; adders to receive multiplier outputs from the multiplier circuits and accumulate the multiplier outputs in a high precision intermediate format; an accumulator circuit to accumulate adder outputs from the adders with at least one of a third global source operand on a first pass of the DP matrix multiplication circuitry or an intermediate result from the first pass on a second pass of the DP matrix multiplication circuitry, wherein the accumulator circuit is to generate an accumulator output in the high precision intermediate format; and a down conversion and rounding circuit to down convert and round an output of the second pass as a final result in the DP floating-point format.

Type: Application

Filed: September 30, 2022

Publication date: April 4, 2024

Applicant: Intel Corporation

Inventors: Jiasheng Chen, Kevin Hurd, Changwon Rhee, Jorge Parra, Fangwen Fu, Theo Drane, William Zorn, Peter Caday, Gregory Henry, Guei-Yuan Lueh, Farzad Chehrazi, Amit Karande, Turbo Majumder, Xinmin Tian, Milind Girkar, Hong Jiang
SYNCHRONIZATION UTILIZING LOCAL TEAM BARRIERS FOR THREAD TEAM PROCESSING

Publication number: 20240111609

Abstract: Low-latency synchronization utilizing local team barriers for thread team processing is described. An example of an apparatus includes one or more processors including a graphics processor, the graphics processor including a plurality of processing resources; and memory for storage of data including data for graphics processing, wherein the graphics processor is to receive a request for establishment of a local team barrier for a thread team, the thread team being allocated to a first processing resource, the thread team including multiple threads; determine requirements and designated threads for the local team barrier; and establish the local team barrier in a local register of the first processing resource based at least in part on the requirements and designated threads for the local barrier.

Type: Application

Filed: September 30, 2022

Publication date: April 4, 2024

Applicant: Intel Corporation

Inventors: Biju George, Supratim Pal, James Valerio, Vasanth Ranganathan, Fangwen Fu, Chunhui Mei
SHARED LOCAL REGISTERS FOR THREAD TEAM PROCESSING

Publication number: 20240112295

Abstract: Shared local registers for thread team processing is described. An example of an apparatus includes one or more processors including a graphics processor having multiple processing resources; and memory for storage of data, the graphics processor to allocate a first thread team to a first processing resource, the first thread team including hardware threads to be executed solely by the first processing resource; allocate a shared local register (SLR) space that may be directly referenced in the ISA instructions to the first processing resource, the SLR space being accessible to the threads of the thread team and being inaccessible to threads outside of the thread team; and allocate individual register spaces to the thread team, each of the individual register spaces being accessible to a respective thread of the thread team.

Type: Application

Filed: September 30, 2022

Publication date: April 4, 2024

Applicant: Intel Corporation

Inventors: Biju George, Fangwen Fu, Supratim Pal, Jorge Parra, Chunhui Mei, Maxim Kazakov, Joydeep Ray
ORDERED THREAD DISPATCH FOR THREAD TEAMS

Publication number: 20240111590

Abstract: An apparatus to facilitate ordered thread dispatch for thread teams is disclosed. The apparatus includes one or more processors including a graphics processor, the graphics processor including a plurality of processing resources, and wherein the graphics processor is to: allocate a thread team local identifier (ID) for respective threads of a thread team comprising a plurality of hardware threads that are to be executed solely by a processing resource of the plurality of processing resources; and dispatch the respective threads together into the processing resource, the respective threads having the thread team local ID allocated.

Type: Application

Filed: September 30, 2022

Publication date: April 4, 2024

Applicant: Intel Corporation

Inventors: Biju George, Vasanth Ranganathan, Fangwen Fu, Ben Ashbaugh, Roland Schulz
PREFETCH AWARE LRU CACHE REPLACEMENT POLICY

Publication number: 20240104025

Abstract: Prefetch aware LRU cache replacement policy is described. An example of an apparatus includes one or more processors including a graphics processor, the graphics processor including a load store cache having multiple cache lines (CLs), each including bits for a cache line level (CL level) and one or more sectors for data storage; wherein the graphics processor is to receive one or more data elements for storage in the cache; set a CL level to track each CL receiving data, including setting CL level 1 for a CL receiving data in response to a miss in the cache and setting a CL level 2 for a CL receiving prefetched data in response to a prefetch request; and, upon determining that space is required in the cache to store data, apply a cache replacement policy, the policy being based at least in part on set CL levels for the CLs.

Type: Application

Filed: September 23, 2022

Publication date: March 28, 2024

Applicant: Intel Corporation

Inventors: Biju George, Zamshed I. Chowdhury, Prathamesh Raghunath Shinde, Chunhui Mei, Fangwen Fu
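
The CL-level bookkeeping in the abstract above can be modeled in a few lines. This is a rough sketch under stated assumptions, not the policy claimed in the application: the abstract says only that replacement is based at least in part on the CL levels, so the rule used here (evict the least recently used line among those with the highest level, and promote a prefetched line to level 1 on a demand hit) is an assumption, as are the class and method names.

```python
from collections import OrderedDict

MISS_LEVEL = 1      # CL filled in response to a demand miss (from the abstract)
PREFETCH_LEVEL = 2  # CL filled in response to a prefetch request (from the abstract)


class PrefetchAwareCache:
    """Hypothetical software model of a cache that tags each line with a CL level."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> CL level, least recently used first

    def _evict(self):
        # Assumption: prefer victims with the highest CL level, oldest first.
        worst = max(self.lines.values())
        victim = next(a for a, level in self.lines.items() if level == worst)
        del self.lines[victim]
        return victim

    def _fill(self, address, level):
        evicted = self._evict() if len(self.lines) >= self.capacity else None
        self.lines[address] = level
        return evicted

    def load(self, address):
        """Demand access. Returns the evicted address, or None if nothing was evicted."""
        if address in self.lines:
            self.lines[address] = MISS_LEVEL  # assumption: promote on demand hit
            self.lines.move_to_end(address)   # refresh recency
            return None
        return self._fill(address, MISS_LEVEL)

    def prefetch(self, address):
        """Prefetch fill. Returns the evicted address, or None if nothing was evicted."""
        if address in self.lines:
            return None
        return self._fill(address, PREFETCH_LEVEL)


cache = PrefetchAwareCache(capacity=3)
cache.load(0xA)
cache.prefetch(0xB)
cache.load(0xC)
first_victim = cache.load(0xD)   # cache is full, so one line must go
second_victim = cache.load(0xE)  # no level-2 lines remain at this point
```

Under these assumptions the unused prefetched line is evicted first, and plain LRU order applies once only level-1 lines remain.
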
HARDWARE ENHANCEMENTS FOR MATRIX LOAD/STORE INSTRUCTIONS

Publication number: 20240069914

Abstract: Embodiments described herein provide a system to enable access to an n-dimensional tensor in memory of a graphics processor via a batch of two-dimensional block access messages. One embodiment provides a graphics processor comprising general-purpose graphics execution resources coupled with the system interface, the general-purpose graphics execution resources including a matrix accelerator. The matrix accelerator is configured to perform a matrix operation on a plurality of tensors stored in a memory. Circuitry is included to facilitate access to the memory by the general-purpose graphics execution resources. The circuitry is configured to receive a request to access a tensor of the plurality of tensors and generate a batch of two-dimensional block access messages along a dimension of n>2 of the tensor. The batch of two-dimensional block access messages enables access to the tensor by the matrix accelerator.

Type: Application

Filed: August 23, 2022

Publication date: February 29, 2024

Applicant: Intel Corporation

Inventors: Biju George, Fangwen Fu, Joydeep Ray
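
As an illustration of the batching idea in the abstract above, the sketch below expands one two-dimensional block request into a batch of messages, one per index along the dimensions beyond the innermost two. The function name, the message fields, and the convention that the last two dimensions are the row and column of each 2D surface are all assumptions made for this example; the abstract does not define a message format.

```python
from itertools import product


def batch_2d_block_messages(shape, row, col, height, width):
    """Expand one 2D block request into a batch over the outer tensor dimensions.

    Hypothetical model: `shape` is (outer..., rows, cols) and each message
    describes the same 2D block within one 2D slice of the tensor.
    """
    *outer, rows, cols = shape
    if row + height > rows or col + width > cols:
        raise ValueError("block does not fit inside the 2D surface")

    messages = []
    # One message per combination of indices along the dimensions n > 2.
    for outer_index in product(*(range(n) for n in outer)):
        messages.append({
            "outer": outer_index,
            "row": row,
            "col": col,
            "height": height,
            "width": width,
        })
    return messages


# A 3 x 8 x 8 tensor: the same 4 x 4 block is requested from each of 3 slices.
msgs = batch_2d_block_messages((3, 8, 8), row=0, col=4, height=4, width=4)
```
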
Dot product operations on sparse matrix elements

Patent number: 11842423

Abstract: Embodiments described herein include software, firmware, and hardware logic that provides techniques to perform arithmetic on sparse data via a systolic processing unit. One embodiment provides for data aware sparsity via compressed bitstreams. One embodiment provides for block sparse dot product instructions. One embodiment provides for a depth-wise adapter for a systolic array.

Type: Grant

Filed: December 15, 2020

Date of Patent: December 12, 2023

Assignee: Intel Corporation

Inventors: Abhishek Appu, Subramaniam Maiyuran, Mike Macpherson, Fangwen Fu, Jiasheng Chen, Varghese George, Vasanth Ranganathan, Ashutosh Garg, Joydeep Ray
MATRIX OPERATION OPTIMIZATION MECHANISM

Publication number: 20230289399

Abstract: An apparatus to facilitate machine learning matrix processing is disclosed. The apparatus comprises a memory to store matrix data; one or more processors to execute an instruction to examine a message descriptor included in the instruction to determine a type of matrix layout manipulation operation that is to be executed, examine a message header included in the instruction having a plurality of parameters that define a two-dimensional (2D) memory surface that is to be retrieved, and retrieve one or more blocks of the matrix data from the memory based on the plurality of parameters; and a register file including a plurality of registers, wherein the one or more blocks of the matrix data are stored within a first set of the plurality of registers.

Type: Application

Filed: February 2, 2023

Publication date: September 14, 2023

Applicant: Intel Corporation

Inventors: Joydeep Ray, Fangwen Fu, Dhiraj D. Kalamkar, Sasikanth Avancha
Lossless pixel compression based on inferred control information

Patent number: 11729403

Abstract: A lossless pixel compressor may include technology to detect a format of a pixel memory region, and compress the pixel memory region together with embedded control information which indicates the detected format of the pixel memory region. Other embodiments are disclosed and claimed.

Type: Grant

Filed: December 5, 2017

Date of Patent: August 15, 2023

Assignee: Intel Corporation

Inventors: James Holland, Hiu-Fai Chan, Fangwen Fu, Qian Xu, Sang-Hee Lee, Vidhya Krishnan
GRAPHICS PROCESSOR MEMORY ACCESS ARCHITECTURE WITH ADDRESS SORTING

Publication number: 20230104845

Abstract: One embodiment provides a graphics processor including a processing resource including a register file, memory, a cache, and load/store/cache circuitry to process load, store, and prefetch messages from the processing resource. The circuitry will sort received memory access messages into address sorted lists of reads and writes. The circuitry schedules a first set of address sorted requests from a first request buffer for a first period of time, then schedules a second set of address sorted requests from a second request buffer for a second period of time.

Type: Application

Filed: September 24, 2021

Publication date: April 6, 2023

Applicant: Intel Corporation

Inventors: Joydeep Ray, Abhishek R. Appu, Altug Koker, Aditya Navale, Varghese George, Vasanth Ranganathan, Fangwen Fu, Ben J. Ashbaugh, Vidhya Krishnan, Sabareesh Ganapathy, Prathamesh Raghunath Shinde
EMULATION OF FLOATING POINT CALCULATION

Publication number: 20230086275

Abstract: Emulating floating point calculation using lower precision format calculations is described. An example of a processor includes a floating point unit (FPU) to provide a native floating point operation in a first precision format; and systolic array hardware including multiple data processing units, wherein the processor is to receive data for performance of a matrix multiplication operation in the first precision format; enable an emulated floating point multiplication operation using one or more values with a second precision format, the second precision format having a lower precision than the first precision format, the emulated floating point multiplication including operation of the systolic array hardware; and generate an emulated result for the matrix multiplication operation.

Type: Application

Filed: September 22, 2021

Publication date: March 23, 2023

Applicant: Intel Corporation

Inventors: Jiasheng Chen, Changwon Rhee, Sabareesh Ganapathy, Gregory Henry, Fangwen Fu
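
To make the idea of emulating a higher precision multiplication with lower precision values concrete, here is a small numerical sketch. It is not the method of the application, whose abstract does not say how values are decomposed. It assumes a common splitting approach: each binary64 operand is written as a sum of two binary32 values (`hi` and `lo`), and the partial products are accumulated in a wider format. The helper names are hypothetical.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def split(x):
    """Assumed decomposition: x is approximately hi + lo, both binary32 values."""
    hi = to_f32(x)
    lo = to_f32(x - hi)
    return hi, lo


def emulated_mul(a, b):
    """Approximate a * b using only products of binary32 values.

    Assumption: partial products are accumulated in a wider (binary64)
    accumulator. The small lo * lo term is dropped.
    """
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    return a_hi * b_hi + a_hi * b_lo + a_lo * b_hi


a, b = 1.0 / 3.0, 1.0 / 7.0
exact = a * b                   # native binary64 product
naive = to_f32(a) * to_f32(b)   # operands simply rounded to binary32
emulated = emulated_mul(a, b)   # three lower precision products

naive_error = abs(naive - exact)
emulated_error = abs(emulated - exact)
```

For these operands the emulated product lands far closer to the native result than the product of the rounded operands does, at the cost of three multiplications instead of one.
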
Matrix operation optimization mechanism

Patent number: 11593454

Abstract: An apparatus to facilitate machine learning matrix processing is disclosed. The apparatus comprises a memory to store matrix data; one or more processors to execute an instruction to examine a message descriptor included in the instruction to determine a type of matrix layout manipulation operation that is to be executed, examine a message header included in the instruction having a plurality of parameters that define a two-dimensional (2D) memory surface that is to be retrieved, and retrieve one or more blocks of the matrix data from the memory based on the plurality of parameters; and a register file including a plurality of registers, wherein the one or more blocks of the matrix data are stored within a first set of the plurality of registers.

Type: Grant

Filed: June 2, 2020

Date of Patent: February 28, 2023

Assignee: Intel Corporation

Inventors: Joydeep Ray, Fangwen Fu, Dhiraj D. Kalamkar, Sasikanth Avancha
64-BIT TWO-DIMENSIONAL BLOCK LOAD WITH TRANSPOSE

Publication number: 20220413854

Abstract: An apparatus to facilitate 64-bit two-dimensional (2D) block load with transpose is disclosed. The apparatus includes a processor comprising processing resources; and load store pipeline hardware circuitry coupled to the processing resources, the load store pipeline hardware circuitry to receive a 64-bit two-dimensional (2D) block load message with transpose from the processing resources. The load store pipeline hardware circuitry comprising a load store pipeline sequencer to map rows of a block of memory corresponding to the 64-bit 2D block load message with transpose to 64-bit standard load messages; and load store pipeline return circuitry to: sequentially number general register files (GRFs) used for returning elements of the block of memory accessed by the 64-bit standard load messages to the processing resources; and return, to the processing resources, the sequentially numbered GRFs in response to the 64-bit 2D block load message with transpose.

Type: Application

Filed: June 25, 2021

Publication date: December 29, 2022

Applicant: Intel Corporation

Inventors: Joydeep Ray, Supratim Pal, Prathamesh Raghunath Shinde, Ben J. Ashbaugh, Changwon Rhee, Hong Jiang, Fangwen Fu
SYSTOLIC ARRAY HAVING SUPPORT FOR OUTPUT SPARSITY

Publication number: 20220413803

Abstract: A processing apparatus is described herein that includes a general-purpose parallel processing engine comprising a matrix accelerator including one or more systolic arrays, at least one of the one or more systolic arrays comprising multiple pipeline stages, each pipeline stage of the multiple pipeline stages including multiple processing elements, the multiple processing elements configured to perform processing operations on input matrix elements based on output sparsity metadata. The output sparsity metadata indicates to the multiple processing elements to bypass multiplication for a first row of elements of a second matrix and multiply a second row of elements of the second matrix with a column of matrix elements of a first matrix.

Type: Application

Filed: June 25, 2021

Publication date: December 29, 2022

Applicant: Intel Corporation

Inventors: Jorge Parra, Fangwen Fu, Subramaniam Maiyuran, Varghese George, Mike Macpherson, Supratim Pal, Chandra Gurram, Sabareesh Ganapathy, Sasikanth Avancha, Dharma Teja Vooturi, Naveen Mellempudi, Dipankar Das
DUAL PIPELINE PARALLEL SYSTOLIC ARRAY

Publication number: 20220414054

Abstract: A processing apparatus described herein includes a general-purpose parallel processing engine comprising a systolic array having multiple pipelines, each of the multiple pipelines including multiple pipeline stages, wherein the multiple pipelines include a first pipeline, a second pipeline, and a common input shared between the first pipeline and the second pipeline.

Type: Application

Filed: June 25, 2021

Publication date: December 29, 2022

Applicant: Intel Corporation

Inventors: Jorge Parra, Jiasheng Chen, Supratim Pal, Fangwen Fu, Sabareesh Ganapathy, Chandra Gurram, Chunhui Mei, Yue Qi
REGISTER FILE FOR SYSTOLIC ARRAY

Publication number: 20220413851

Abstract: A processing apparatus includes a general-purpose parallel processing engine including a set of multiple processing elements including a single precision floating-point unit, a double precision floating point unit, and an integer unit; a matrix accelerator including one or more systolic arrays; a first register file coupled with a first read control circuit, wherein the first read control circuit couples with the set of multiple processing elements and the matrix accelerator to arbitrate read requests to the first register file from the set of multiple processing elements and the matrix accelerator; and a second register file coupled with a second read control circuit, wherein the second read control circuit couples with the matrix accelerator to arbitrate read requests to the second register file from the matrix accelerator and limit access to the second register file by the set of multiple processing elements.

Type: Application

Filed: June 25, 2021

Publication date: December 29, 2022

Applicant: Intel Corporation

Inventors: Chandra Gurram, Wei-yu Chen, Fangwen Fu, Sabareesh Ganapathy, Varghese George, Guei-Yuan Lueh, Subramaniam Maiyuran, Mike Macpherson, Supratim Pal, Jorge Parra
