Patents by Inventor Fangwen Fu

Fangwen Fu has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11977895
    Abstract: Examples described herein relate to a graphics processing unit (GPU) coupled to the memory device, the GPU configured to: execute an instruction thread; determine if a dual directional signal barrier is associated with the instruction thread; and based on clearance of the dual directional signal barrier for a particular signal barrier identifier and a mode of operation, indicate a clearance of the dual directional signal barrier for the mode of operation, wherein the dual directional signal barrier is to provide a single barrier to gate activity of one or more producers based on activity of one or more consumers or gate activity of one or more consumers based on activity of one or more producers.
    Type: Grant
    Filed: December 22, 2020
    Date of Patent: May 7, 2024
    Assignee: Intel Corporation
    Inventors: Sabareesh Ganapathy, Fangwen Fu, Hong Jiang, James Valerio
  • Publication number: 20240134797
    Abstract: Embodiments described herein provide a technique to facilitate the broadcast or multicast of asynchronous loads to shared local memory of a plurality of graphics cores within a graphics core cluster. One embodiment provides a graphics processor including a cache memory a graphics core cluster coupled with the cache memory. The graphics core cluster includes a plurality of graphics cores. The plurality of graphics cores includes a graphics core configured to receive a designation as a producer graphics core for a multicast load, read data from the cache memory; and transmit the data read from the cache memory to a consumer graphics core of the plurality of graphics cores.
    Type: Application
    Filed: October 24, 2022
    Publication date: April 25, 2024
    Applicant: Intel Corporation
    Inventors: John A. Wiegert, Joydeep Ray, Vasanth Ranganathan, Biju George, Fangwen Fu, Abhishek R. Appu, Chunhui Mei, Changwon Rhee
  • Publication number: 20240134719
    Abstract: Embodiments described herein provide a technique to facilitate the synchronization of workgroups executed on multiple graphics cores of a graphics core cluster. One embodiment provides a graphics core including a cache memory and a graphics core coupled with the cache memory. The graphics core includes execution resources to execute an instruction via a plurality of hardware threads and barrier circuitry to synchronize execution of the plurality of hardware threads, wherein the barrier circuitry is configured to provide a plurality of re-usable named barriers.
    Type: Application
    Filed: October 24, 2022
    Publication date: April 25, 2024
    Applicant: Intel Corporation
    Inventors: Fangwen Fu, Chunhui Mei, John A. Wiegert, Yongsheng Liu, Ben J. Ashbaugh
  • Publication number: 20240111534
    Abstract: Embodiments described herein provide a technique enable a broadcast load from an L1 cache or shared local memory to register files associated with hardware threads of a graphics core. One embodiment provides a graphics processor comprising a cache memory and a graphics core coupled with the cache memory. The graphics core includes a plurality of hardware threads and memory access circuitry to facilitate access to memory by the plurality of hardware threads. The graphics core is configurable to process a plurality of load request from the plurality of hardware threads, detect duplicate load requests within the plurality of load requests, perform a single read from the cache memory in response to the duplicate load requests, and transmit data associated with the duplicate load requests to requesting hardware threads.
    Type: Application
    Filed: September 30, 2022
    Publication date: April 4, 2024
    Applicant: Intel Corporation
    Inventors: Fangwen Fu, Chunhui Mei, Maxim Kazakov, Biju George, Jorge Parra, Supratim Pal
  • Publication number: 20240111826
    Abstract: An apparatus to facilitate hardware enhancements for double precision systolic support is disclosed. The apparatus includes matrix acceleration hardware having double-precision (DP) matrix multiplication circuitry including a multiplier circuits to multiply pairs of input source operands in a DP floating-point format; adders to receive multiplier outputs from the multiplier circuits and accumulate the multiplier outputs in a high precision intermediate format; an accumulator circuit to accumulate adder outputs from the adders with at least one of a third global source operand on a first pass of the DP matrix multiplication circuitry or an intermediate result from the first pass on a second pass of the DP matrix multiplication circuitry, wherein the accumulator circuit to generate an accumulator output in the high precision intermediate format; and a down conversion and rounding circuit to down convert and round an output of the second pass as final result in the DP floating-point format.
    Type: Application
    Filed: September 30, 2022
    Publication date: April 4, 2024
    Applicant: Intel Corporation
    Inventors: Jiasheng Chen, Kevin Hurd, Changwon Rhee, Jorge Parra, Fangwen Fu, Theo Drane, William Zorn, Peter Caday, Gregory Henry, Guei-Yuan Lueh, Farzad Chehrazi, Amit Karande, Turbo Majumder, Xinmin Tian, Milind Girkar, Hong Jiang
  • Publication number: 20240111609
    Abstract: Low-latency synchronization utilizing local team barriers for thread team processing is described. An example of an apparatus includes one or more processors including a graphics processor, the graphics processor including a plurality of processing resources; and memory for storage of data including data for graphics processing, wherein the graphics processor is to receive a request for establishment of a local team barrier for a thread team, the thread team being allocated to a first processing resource, the thread team including multiple threads; determine requirements and designated threads for the local team barrier; and establish the local team barrier in a local register of the first processing resource based at least in part on the requirements and designated threads for the local barrier.
    Type: Application
    Filed: September 30, 2022
    Publication date: April 4, 2024
    Applicant: Intel Corporation
    Inventors: Biju George, Supratim Pal, James Valerio, Vasanth Ranganathan, Fangwen Fu, Chunhui Mei
  • Publication number: 20240112295
    Abstract: Shared local registers for thread team processing is described. An example of an apparatus includes one or more processors including a graphic processor having multiple processing resources; and memory for storage of data, the graphics processor to allocate a first thread team to a first processing resource, the first thread team including hardware threads to be executed solely by the first processing resource; allocate a shared local register (SLR) space that may be directly reference in the ISA instructions to the first processing resource, the SLR space being accessible to the threads of the thread team and being inaccessible to threads outside of the thread team; and allocate individual register spaces to the thread team, each of the individual register spaces being accessible to a respective thread of the thread team.
    Type: Application
    Filed: September 30, 2022
    Publication date: April 4, 2024
    Applicant: Intel Corporation
    Inventors: Biju George, Fangwen Fu, Supratim Pal, Jorge Parra, Chunhui Mei, Maxim Kazakov, Joydeep Ray
  • Publication number: 20240111590
    Abstract: An apparatus to facilitate ordered thread dispatch for thread teams is disclosed. The apparatus includes one or more processors including a graphic processor, the graphics processor including a plurality of processing resources, and wherein the graphics processor is to: allocate a thread team local identifier (ID) for respective threads of a thread team comprising a plurality of hardware threads that are to be executed solely by a processing resource of the plurality of processing resources; and dispatch the respective threads together into the processing resource, the respective threads having the thread team local ID allocated.
    Type: Application
    Filed: September 30, 2022
    Publication date: April 4, 2024
    Applicant: Intel Corporation
    Inventors: Biju George, Vasanth Ranganathan, Fangwen Fu, Ben Ashbaugh, Roland Schulz
  • Publication number: 20240104025
    Abstract: Prefetch aware LRU cache replacement policy is described. An example of an apparatus includes one or more processors including a graphic processor, the graphics processor including a load store cache having multiple cache lines (CLs), each including bits for a cache line level (CL level) and one or more sectors for data storage; wherein the graphics processor is to receive one or more data elements for storage in the cache; set a CL level to track each CL receiving data, including setting CL level 1 for a CL receiving data in response to a miss in the cache and setting a CL level 2 for a CL receiving prefetched data in response to a prefetch request, and, upon determining that space is required in the cache to store data, apply a cache replacement policy, the policy being based at least in part on set CL levels for the CLs.
    Type: Application
    Filed: September 23, 2022
    Publication date: March 28, 2024
    Applicant: Intel Corporation
    Inventors: Biju George, Zamshed I. Chowdhury, Prathamesh Raghunath Shinde, Chunhui Mei, Fangwen Fu
  • Publication number: 20240069914
    Abstract: Embodiments described herein provide a system to enable access to an n-dimensional tensor in memory of a graphics processor via a batch of two-dimensional block access messages. One embodiment provides a graphics processor comprising general-purpose graphics execution resources coupled with the system interface, the general-purpose graphics execution resources including a matrix accelerator. The matrix accelerator is configured to perform a matrix operation on a plurality of tensors stored in a memory. Circuitry is included to facilitate access to the memory by the general-purpose graphics execution resources. The circuitry is configured to receive a request to access a tensor of the plurality of tensors and generate a batch of two-dimensional block access messages along a dimension of n>2 of the tensor. The batch of two-dimensional block access messages enables access to the tensor by the matrix accelerator.
    Type: Application
    Filed: August 23, 2022
    Publication date: February 29, 2024
    Applicant: Intel Corporation
    Inventors: Biju George, Fangwen Fu, Joydeep Ray
  • Patent number: 11842423
    Abstract: Embodiments described herein include software, firmware, and hardware logic that provides techniques to perform arithmetic on sparse data via a systolic processing unit. One embodiment provides for data aware sparsity via compressed bitstreams. One embodiment provides for block sparse dot product instructions. One embodiment provides for a depth-wise adapter for a systolic array.
    Type: Grant
    Filed: December 15, 2020
    Date of Patent: December 12, 2023
    Assignee: Intel Corporation
    Inventors: Abhishek Appu, Subramaniam Maiyuran, Mike Macpherson, Fangwen Fu, Jiasheng Chen, Varghese George, Vasanth Ranganathan, Ashutosh Garg, Joydeep Ray
  • Publication number: 20230289399
    Abstract: An apparatus to facilitate machine learning matrix processing is disclosed. The apparatus comprises a memory to store matrix data one or more processors to execute an instruction to examine a message descriptor included in the instruction to determine a type of matrix layout manipulation operation that is to be executed, examine a message header included in the instruction having a plurality of parameters that define a two-dimensional (2D) memory surface that is to be retrieved, retrieve one or more blocks of the matrix data from the memory based on the plurality of parameters and a register file including a plurality of registers, wherein the one or more blocks of the matrix data is stored within a first set of the plurality of registers.
    Type: Application
    Filed: February 2, 2023
    Publication date: September 14, 2023
    Applicant: Intel Corporation
    Inventors: Joydeep Ray, Fangwen Fu, Dhiraj D. Kalamkar, Sasikanth Avancha
  • Patent number: 11729403
    Abstract: A lossless pixel compressor may include technology to detect a format of a pixel memory region, and compress the pixel memory region together with embedded control information which indicates the detected format of the pixel memory region. Other embodiments are disclosed and claimed.
    Type: Grant
    Filed: December 5, 2017
    Date of Patent: August 15, 2023
    Assignee: Intel Corporation
    Inventors: James Holland, Hiu-Fai Chan, Fangwen Fu, Qian Xu, Sang-Hee Lee, Vidhya Krishnan
  • Publication number: 20230104845
    Abstract: One embodiment provides a graphics processor including a processing resource including a register file, memory, a cache, and load/store/cache circuitry to process load, store, and prefetch messages from the processing resource. The circuitry will sort received memory access messages into address sorted lists of reads and writes. The circuitry schedules a first set of address sorted requests from a first request buffer for a first period of time, then schedules a second set of address sorted requests from a second request buffer for a second period of time.
    Type: Application
    Filed: September 24, 2021
    Publication date: April 6, 2023
    Applicant: Intel Corporation
    Inventors: Joydeep Ray, Abhishek R. Appu, Altug Koker, Aditya Navale, Varghese George, Vasanth Ranganathan, Fangwen Fu, Ben J. Ashbaugh, Vidhya Krishnan, Sabareesh Ganapathy, Prathamesh Raghunath Shinde
  • Publication number: 20230086275
    Abstract: Emulating floating point calculation using lower precision format calculations is described. An example of a processor includes a floating point unit (FPU) to provide a native floating point operation in a first precision format; and systolic array hardware including multiple data processing units, wherein the processor is to receive data for performance of a matrix multiplication operation in the first precision format; enable an emulated floating point multiplication operation using one or more values with a second precision format, the second precision format having a lower precision than the first precision format, the emulated floating point multiplication including operation of the systolic array hardware; and generate an emulated result for the matrix multiplication operation.
    Type: Application
    Filed: September 22, 2021
    Publication date: March 23, 2023
    Applicant: Intel Corporation
    Inventors: Jiasheng Chen, Changwon Rhee, Sabareesh Ganapathy, Gregory Henry, Fangwen Fu
  • Patent number: 11593454
    Abstract: An apparatus to facilitate machine learning matrix processing is disclosed. The apparatus comprises a memory to store matrix data one or more processors to execute an instruction to examine a message descriptor included in the instruction to determine a type of matrix layout manipulation operation that is to be executed, examine a message header included in the instruction having a plurality of parameters that define a two-dimensional (2D) memory surface that is to be retrieved, retrieve one or more blocks of the matrix data from the memory based on the plurality of parameters and a register file including a plurality of registers, wherein the one or more blocks of the matrix data is stored within a first set of the plurality of registers.
    Type: Grant
    Filed: June 2, 2020
    Date of Patent: February 28, 2023
    Assignee: Intel Corporation
    Inventors: Joydeep Ray, Fangwen Fu, Dhiraj D. Kalamkar, Sasikanth Avancha
  • Publication number: 20220413854
    Abstract: An apparatus to facilitate 64-bit two-dimensional (2D) block load with transpose is disclosed. The apparatus includes a processor comprising processing resources; and load store pipeline hardware circuitry coupled to the processing resources, the load store pipeline hardware circuitry to receive a 64-bit two-dimensional (2D) block load message with transpose from the processing resources. The load store pipeline hardware circuitry comprising a load store pipeline sequencer to map rows of a block of memory corresponding to the 64-bit 2D block load message with transpose to 64-bit standard load messages; and load store pipeline return circuitry to: sequentially number general register files (GRFs) used for returning elements of the block of memory accessed by the 64-bit standard load messages to the processing resources; and return, to the processing resources, the sequentially numbered GRFs in response to the 64-bit 2D block load message with transpose.
    Type: Application
    Filed: June 25, 2021
    Publication date: December 29, 2022
    Applicant: Intel Corporation
    Inventors: Joydeep Ray, Supratim Pal, Prathamesh Raghunath Shinde, Ben J. Ashbaugh, Changwon Rhee, Hong Jiang, FangWen Fu
  • Publication number: 20220413803
    Abstract: A processing apparatus is described herein that includes a general-purpose parallel processing engine comprising a matrix accelerator including one or more systolic arrays, at least one of the one or more systolic arrays comprising multiple pipeline stages, each pipeline stage of the multiple pipeline stages including multiple processing elements, the multiple processing elements configured to perform processing operations on input matrix elements based on output sparsity metadata. The output sparsity metadata indicates to the multiple processing elements to bypass multiplication for a first row of elements of a second matrix and multiply a second row of elements of the second matrix with a column of matrix elements of a first matrix.
    Type: Application
    Filed: June 25, 2021
    Publication date: December 29, 2022
    Applicant: Intel Corporation
    Inventors: Jorge Parra, Fangwen Fu, Subramaniam Maiyuran, Varghese George, Mike Macpherson, Supratim Pal, Chandra Gurram, Sabareesh Ganapathy, Sasikanth Avancha, Dharma Teja Vooturi, Naveen Mellempudi, Dipankar Das
  • Publication number: 20220414054
    Abstract: A processing apparatus described herein includes a general-purpose parallel processing engine comprising a systolic array having multiple pipelines, each of the multiple pipelines including multiple pipeline stages, wherein the multiple pipelines include a first pipeline, a second pipeline, and a common input shared between the first pipeline and the second pipeline.
    Type: Application
    Filed: June 25, 2021
    Publication date: December 29, 2022
    Applicant: Intel Corporation
    Inventors: Jorge Parra, Jiasheng Chen, Supratim Pal, Fangwen Fu, Sabareesh Ganapathy, Chandra Gurram, Chunhui Mei, Yue Qi
  • Publication number: 20220413851
    Abstract: A processing apparatus includes a general-purpose parallel processing engine including a set of multiple processing elements including a single precision floating-point unit, a double precision floating point unit, and an integer unit; a matrix accelerator including one or more systolic arrays; a first register file coupled with a first read control circuit, wherein the first read control circuit couples with the set of multiple processing elements and the matrix accelerator to arbitrate read requests to the first register file from the set of multiple processing elements and the matrix accelerator; and a second register file coupled with a second read control circuit, wherein the second read control circuit couples with the matrix accelerator to arbitrate read requests to the second register file from the matrix accelerator and limit access to the second register file by the set of multiple processing elements.
    Type: Application
    Filed: June 25, 2021
    Publication date: December 29, 2022
    Applicant: Intel Corporation
    Inventors: Chandra Gurram, Wei-yu Chen, Fangwen Fu, Sabareesh Ganapathy, Varghese George, Guei-Yuan Lueh, Subramaniam Maiyuran, Mike Macpherson, Supratim Pal, Jorge Parra