Patents by Inventor Chunhui MEI

Chunhui MEI has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20240160478
    Abstract: An apparatus to facilitate increasing processing resources in processing cores of a graphics environment is disclosed. The apparatus includes a plurality of processing resources to execute one or more execution threads; a plurality of message arbiter-processing resource (MA-PR) routers, wherein a respective MA-PR router of the plurality of MA-PR routers corresponds to a pair of processing resources of the plurality of processing resources and is to arbitrate routing of a thread control message from a message arbiter between the pair of processing resources; a plurality of local shared cache (LSC) sequencers to provide an interface between at least one LSC of the processing core and the plurality of processing resources; and a plurality of instruction caches (ICs) to store instructions of the one or more execution threads, wherein a respective IC of the plurality of ICs interfaces with a portion of the plurality of processing resources.
    Type: Application
    Filed: November 15, 2022
    Publication date: May 16, 2024
    Applicant: Intel Corporation
    Inventors: Jiasheng Chen, Chunhui Mei, Ben J. Ashbaugh, Naveen Matam, Joydeep Ray, Timothy Bauer, Guei-Yuan Lueh, Vasanth Ranganathan, Prashant Chaudhari, Vikranth Vemulapalli, Nishanth Reddy Pendluru, Piotr Reiter, Jain Philip, Marek Rudniewski, Christopher Spencer, Parth Damani, Prathamesh Raghunath Shinde, John Wiegert, Fataneh Ghodrat
  • Patent number: 11977885
    Abstract: An apparatus to facilitate utilizing structured sparsity in systolic arrays is disclosed. The apparatus includes a processor comprising a systolic array to receive data from a plurality of source registers, the data comprising unpacked source data, structured source data that is packed based on sparsity, and metadata corresponding to the structured source data; identify portions of the unpacked source data to multiply with the structured source data, the portions of the unpacked source data identified based on the metadata; and output, to a destination register, a result of multiplication of the portions of the unpacked source data and the structured source data.
    Type: Grant
    Filed: November 30, 2020
    Date of Patent: May 7, 2024
    Assignee: Intel Corporation
    Inventors: Subramaniam Maiyuran, Jorge Parra, Ashutosh Garg, Chandra Gurram, Chunhui Mei, Durgesh Borkar, Shubra Marwaha, Supratim Pal, Varghese George, Wei Xiong, Yan Li, Yongsheng Liu, Dipankar Das, Sasikanth Avancha, Dharma Teja Vooturi, Naveen K. Mellempudi
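The structured-sparsity scheme described in the abstract above (packed nonzero values plus metadata identifying which dense operands they pair with) can be sketched in software. This is a minimal illustration of the idea, assuming a 2:4-style sparsity pattern; the function name and parameters are illustrative, not taken from the patent:

```python
def sparse_dot(unpacked, packed, metadata, group=4, keep=2):
    """Multiply dense 'unpacked' values with structured sparse data.

    'packed' holds only the surviving nonzero values; 'metadata' holds,
    for each group of `group` original positions, the indices of the
    `keep` values that survived. Zeros are never multiplied.
    """
    total = 0
    for g, kept_indices in enumerate(metadata):
        for k, idx in enumerate(kept_indices):
            dense_val = unpacked[g * group + idx]   # select matching dense element
            total += dense_val * packed[g * keep + k]
    return total
```

For example, the sparse vector [0, 2, 0, 4, 5, 0, 0, 8] packs to [2, 4, 5, 8] with metadata [(1, 3), (0, 3)], and only the four nonzero products are computed.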
  • Publication number: 20240134797
    Abstract: Embodiments described herein provide a technique to facilitate the broadcast or multicast of asynchronous loads to shared local memory of a plurality of graphics cores within a graphics core cluster. One embodiment provides a graphics processor including a cache memory and a graphics core cluster coupled with the cache memory. The graphics core cluster includes a plurality of graphics cores. The plurality of graphics cores includes a graphics core configured to receive a designation as a producer graphics core for a multicast load; read data from the cache memory; and transmit the data read from the cache memory to a consumer graphics core of the plurality of graphics cores.
    Type: Application
    Filed: October 24, 2022
    Publication date: April 25, 2024
    Applicant: Intel Corporation
    Inventors: John A. Wiegert, Joydeep Ray, Vasanth Ranganathan, Biju George, Fangwen Fu, Abhishek R. Appu, Chunhui Mei, Changwon Rhee
  • Publication number: 20240134719
    Abstract: Embodiments described herein provide a technique to facilitate the synchronization of workgroups executed on multiple graphics cores of a graphics core cluster. One embodiment provides a graphics processor including a cache memory and a graphics core coupled with the cache memory. The graphics core includes execution resources to execute an instruction via a plurality of hardware threads and barrier circuitry to synchronize execution of the plurality of hardware threads, wherein the barrier circuitry is configured to provide a plurality of re-usable named barriers.
    Type: Application
    Filed: October 24, 2022
    Publication date: April 25, 2024
    Applicant: Intel Corporation
    Inventors: Fangwen Fu, Chunhui Mei, John A. Wiegert, Yongsheng Liu, Ben J. Ashbaugh
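The "re-usable named barriers" in the abstract above can be illustrated with a small software analogue. This sketch is purely illustrative (the class and method names are mine, not Intel's): a pool of barriers keyed by name, where each barrier resets and can be reused after every synchronization cycle.

```python
import threading

class NamedBarriers:
    """A pool of re-usable barriers addressed by name (illustrative only)."""

    def __init__(self):
        self._barriers = {}
        self._lock = threading.Lock()

    def get(self, name, parties):
        # Lazily create the named barrier on first use.
        with self._lock:
            if name not in self._barriers:
                self._barriers[name] = threading.Barrier(parties)
            return self._barriers[name]

    def wait(self, name, parties):
        # threading.Barrier resets after each cycle, mirroring the
        # "re-usable" property named in the abstract.
        self.get(name, parties).wait()
```

A thread that calls `wait("sync0", 2)` blocks until a second thread reaches the same named barrier, after which both proceed and the barrier is ready for reuse.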
  • Publication number: 20240111534
    Abstract: Embodiments described herein provide a technique to enable a broadcast load from an L1 cache or shared local memory to register files associated with hardware threads of a graphics core. One embodiment provides a graphics processor comprising a cache memory and a graphics core coupled with the cache memory. The graphics core includes a plurality of hardware threads and memory access circuitry to facilitate access to memory by the plurality of hardware threads. The graphics core is configurable to process a plurality of load requests from the plurality of hardware threads, detect duplicate load requests within the plurality of load requests, perform a single read from the cache memory in response to the duplicate load requests, and transmit data associated with the duplicate load requests to requesting hardware threads.
    Type: Application
    Filed: September 30, 2022
    Publication date: April 4, 2024
    Applicant: Intel Corporation
    Inventors: Fangwen Fu, Chunhui Mei, Maxim Kazakov, Biju George, Jorge Parra, Supratim Pal
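The duplicate-detection and single-read behavior in the abstract above can be sketched as a simple coalescing routine. This is an assumption-laden software analogue, not the hardware mechanism itself; the function name and request format are illustrative:

```python
def coalesce_loads(requests, read_cache_line):
    """requests: list of (thread_id, address) pairs.

    Performs one cache read per distinct address (duplicate requests are
    detected and served from the first read), returning {thread_id: data}
    and the number of reads actually issued.
    """
    fetched = {}    # address -> data already read
    results = {}
    reads = 0
    for tid, addr in requests:
        if addr not in fetched:          # duplicate detection
            fetched[addr] = read_cache_line(addr)
            reads += 1                   # single read per distinct address
        results[tid] = fetched[addr]     # broadcast to each requesting thread
    return results, reads
```

With four threads requesting two distinct addresses, only two cache reads are issued while all four threads receive their data.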
  • Publication number: 20240112295
    Abstract: Shared local registers for thread team processing is described. An example of an apparatus includes one or more processors including a graphics processor having multiple processing resources; and memory for storage of data, the graphics processor to allocate a first thread team to a first processing resource, the first thread team including hardware threads to be executed solely by the first processing resource; allocate a shared local register (SLR) space, which may be directly referenced in ISA instructions, to the first processing resource, the SLR space being accessible to the threads of the thread team and being inaccessible to threads outside of the thread team; and allocate individual register spaces to the thread team, each of the individual register spaces being accessible to a respective thread of the thread team.
    Type: Application
    Filed: September 30, 2022
    Publication date: April 4, 2024
    Applicant: Intel Corporation
    Inventors: Biju George, Fangwen Fu, Supratim Pal, Jorge Parra, Chunhui Mei, Maxim Kazakov, Joydeep Ray
  • Publication number: 20240111609
    Abstract: Low-latency synchronization utilizing local team barriers for thread team processing is described. An example of an apparatus includes one or more processors including a graphics processor, the graphics processor including a plurality of processing resources; and memory for storage of data including data for graphics processing, wherein the graphics processor is to receive a request for establishment of a local team barrier for a thread team, the thread team being allocated to a first processing resource, the thread team including multiple threads; determine requirements and designated threads for the local team barrier; and establish the local team barrier in a local register of the first processing resource based at least in part on the requirements and designated threads for the local barrier.
    Type: Application
    Filed: September 30, 2022
    Publication date: April 4, 2024
    Applicant: Intel Corporation
    Inventors: Biju George, Supratim Pal, James Valerio, Vasanth Ranganathan, Fangwen Fu, Chunhui Mei
  • Publication number: 20240104025
    Abstract: A prefetch-aware LRU cache replacement policy is described. An example of an apparatus includes one or more processors including a graphics processor, the graphics processor including a load store cache having multiple cache lines (CLs), each including bits for a cache line level (CL level) and one or more sectors for data storage; wherein the graphics processor is to receive one or more data elements for storage in the cache; set a CL level to track each CL receiving data, including setting CL level 1 for a CL receiving data in response to a miss in the cache and setting CL level 2 for a CL receiving prefetched data in response to a prefetch request; and, upon determining that space is required in the cache to store data, apply a cache replacement policy, the policy being based at least in part on the set CL levels for the CLs.
    Type: Application
    Filed: September 23, 2022
    Publication date: March 28, 2024
    Applicant: Intel Corporation
    Inventors: Biju George, Zamshed I. Chowdhury, Prathamesh Raghunath Shinde, Chunhui Mei, Fangwen Fu
  • Publication number: 20230205587
    Abstract: Thread group dispatch in a clustered graphics architecture is described. An example of an apparatus includes compute front end (CFE) clusters to receive dispatched thread groups, the CFE clusters including at least a first CFE cluster and a second CFE cluster; processing resources coupled with the CFE clusters to execute threads within thread groups; and cache clusters to cache data including thread groups, wherein the apparatus is to receive thread groups for dispatch, and to dispatch the thread groups to the CFE clusters according to a dispatch operation, the dispatch operation including dispatching multiple thread groups to each of multiple CFEs in the first CFE cluster and multiple thread groups to each of multiple CFEs in the second CFE cluster.
    Type: Application
    Filed: December 23, 2021
    Publication date: June 29, 2023
    Applicant: Intel Corporation
    Inventors: Zamshed Iqbal Chowdhury, Joydeep Ray, Chunhui Mei, Yongsheng Liu, Vasanth Ranganathan, Abhishek R. Appu, Aravindh Anantaraman
  • Publication number: 20230153176
    Abstract: An apparatus to facilitate a forward progress guarantee using single-level synchronization at individual thread granularity is disclosed. The apparatus includes a processor comprising barrier synchronization hardware circuitry to assign a set of global named barrier identifiers (IDs) to individual execution threads of a plurality of execution threads and synchronize execution of the individual execution threads on a single level via the set of global named barrier IDs; and a plurality of processing resources to execute the plurality of execution threads and comprising divergent barrier scheduling hardware circuitry to facilitate execution flow switching from a first divergent branch executed by a first thread to a second divergent branch executed by a second thread, the execution flow switching performed responsive to the first thread stalling to wait on a named barrier of the set of global named barrier IDs.
    Type: Application
    Filed: November 17, 2021
    Publication date: May 18, 2023
    Applicant: Intel Corporation
    Inventors: Chunhui Mei, James Valerio, Supratim Pal, Guei-Yuan Lueh, Hong Jiang
  • Publication number: 20220414054
    Abstract: A processing apparatus described herein includes a general-purpose parallel processing engine comprising a systolic array having multiple pipelines, each of the multiple pipelines including multiple pipeline stages, wherein the multiple pipelines include a first pipeline, a second pipeline, and a common input shared between the first pipeline and the second pipeline.
    Type: Application
    Filed: June 25, 2021
    Publication date: December 29, 2022
    Applicant: Intel Corporation
    Inventors: Jorge Parra, Jiasheng Chen, Supratim Pal, Fangwen Fu, Sabareesh Ganapathy, Chandra Gurram, Chunhui Mei, Yue Qi
  • Patent number: 11494163
    Abstract: An apparatus to facilitate a computer number format conversion is disclosed. The apparatus comprises a control unit to receive data format information indicating a first precision data format in which input data is to be received, and converter hardware to receive the input data and convert the first precision data format to a second precision data format based on the data format information.
    Type: Grant
    Filed: September 6, 2019
    Date of Patent: November 8, 2022
    Assignee: Intel Corporation
    Inventors: Naveen Mellempudi, Dipankar Das, Chunhui Mei, Kristopher Wong, Dhiraj D. Kalamkar, Hong H. Jiang, Subramaniam Maiyuran, Varghese George
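The precision-conversion idea above can be illustrated with one common pair of formats, fp32 to bfloat16. The patent abstract is generic about which formats are involved; choosing bfloat16, and the function names below, are my assumptions for illustration:

```python
import struct

def fp32_to_bf16_bits(value, round_nearest=True):
    """Convert a float (treated as fp32) to bfloat16 bits.

    bfloat16 keeps the top 16 bits of the fp32 encoding (same exponent,
    truncated mantissa), optionally with round-to-nearest-even.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    if round_nearest:
        # Add rounding bias; the extra LSB term implements ties-to-even.
        bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_fp32(bits16):
    """Widen bfloat16 bits back to fp32 by zero-filling the low mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]
```

Values exactly representable in bfloat16, such as 1.0, round-trip unchanged; other values lose low mantissa bits.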
  • Publication number: 20220309124
    Abstract: Matrix multiply units can take advantage of input sparsity by zero gating ALUs, which saves power consumption, but compute throughput does not increase. To improve compute throughput from sparsity, processing resources in a matrix accelerator can skip computation with zero involved in input or output. If zeros in input can be skipped, the processing units can focus calculations on generating meaningful non-zero output.
    Type: Application
    Filed: March 24, 2021
    Publication date: September 29, 2022
    Applicant: Intel Corporation
    Inventors: Chunhui Mei, Hong Jiang, Jiasheng Chen, Yongsheng Liu, Yan Li
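The throughput argument above (skipping zero-involved computation rather than merely zero-gating it) can be made concrete with a small sketch. This is an illustrative software analogue, with names of my choosing, not the accelerator's actual datapath:

```python
def sparse_matmul_skip_zeros(A, B):
    """Multiply matrices A (m x k) and B (k x n), skipping every multiply
    whose A operand is zero. Returns the product and the number of
    multiply-accumulates actually performed."""
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0] * n for _ in range(m)]
    macs = 0
    for i in range(m):
        for p in range(k):
            a = A[i][p]
            if a == 0:
                continue              # skip the work entirely, don't just gate it
            for j in range(n):
                C[i][j] += a * B[p][j]
                macs += 1
    return C, macs
```

With a 50%-sparse 2x2 input, only half the dense MAC count is performed while the result is unchanged, which is the throughput gain the abstract contrasts with zero-gating.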
  • Publication number: 20210081201
    Abstract: An apparatus to facilitate utilizing structured sparsity in systolic arrays is disclosed. The apparatus includes a processor comprising a systolic array to receive data from a plurality of source registers, the data comprising unpacked source data, structured source data that is packed based on sparsity, and metadata corresponding to the structured source data; identify portions of the unpacked source data to multiply with the structured source data, the portions of the unpacked source data identified based on the metadata; and output, to a destination register, a result of multiplication of the portions of the unpacked source data and the structured source data.
    Type: Application
    Filed: November 30, 2020
    Publication date: March 18, 2021
    Applicant: Intel Corporation
    Inventors: Subramaniam Maiyuran, Jorge Parra, Ashutosh Garg, Chandra Gurram, Chunhui Mei, Durgesh Borkar, Shubra Marwaha, Supratim Pal, Varghese George, Wei Xiong, Yan Li, Yongsheng Liu, Dipankar Das, Sasikanth Avancha, Dharma Teja Vooturi, Naveen K. Mellempudi
  • Publication number: 20210072955
    Abstract: An apparatus to facilitate a computer number format conversion is disclosed. The apparatus comprises a control unit to receive data format information indicating a first precision data format in which input data is to be received, and converter hardware to receive the input data and convert the first precision data format to a second precision data format based on the data format information.
    Type: Application
    Filed: September 6, 2019
    Publication date: March 11, 2021
    Applicant: Intel Corporation
    Inventors: Naveen Mellempudi, Dipankar Das, Chunhui Mei, Kristopher Wong, Dhiraj D. Kalamkar, Hong H. Jiang, Subramaniam Maiyuran, Varghese George
  • Patent number: 10504462
    Abstract: A device for graphics processing includes a memory and at least one processor. The at least one processor is configured to generate image data for an image, fetch, for each two-dimensional matrix of multiple two-dimensional matrices of units of the image, a respective portion of the image data, and process each two-dimensional matrix of the multiple two-dimensional matrices based on the respective portion of the image data to generate pixel data for the image. To process each two-dimensional matrix of the multiple two-dimensional matrices, the at least one processor is configured to process multiple units arranged in a first two-dimensional matrix of the multiple two-dimensional matrices and process, after processing the multiple units arranged in the first two-dimensional matrix, multiple units arranged in a second two-dimensional matrix of the multiple two-dimensional matrices.
    Type: Grant
    Filed: January 25, 2018
    Date of Patent: December 10, 2019
    Assignee: QUALCOMM Incorporated
    Inventor: Chunhui Mei
  • Publication number: 20190228723
    Abstract: A device for graphics processing includes a memory and at least one processor. The at least one processor is configured to generate image data for an image, fetch, for each two-dimensional matrix of multiple two-dimensional matrices of units of the image, a respective portion of the image data, and process each two-dimensional matrix of the multiple two-dimensional matrices based on the respective portion of the image data to generate pixel data for the image. To process each two-dimensional matrix of the multiple two-dimensional matrices, the at least one processor is configured to process multiple units arranged in a first two-dimensional matrix of the multiple two-dimensional matrices and process, after processing the multiple units arranged in the first two-dimensional matrix, multiple units arranged in a second two-dimensional matrix of the multiple two-dimensional matrices.
    Type: Application
    Filed: January 25, 2018
    Publication date: July 25, 2019
    Inventor: Chunhui Mei
  • Patent number: 10089708
    Abstract: A texture unit of a graphics processing unit (GPU) may receive texture data from the memory. The texture unit may also multiply, by a multiplier circuit of the texture unit, the texture data by at least one constant, where the constant is not associated with a filtering operation, and where the texture data comprises at least one texel. The texture unit may also output a result of multiplying the texture data by the at least one constant.
    Type: Grant
    Filed: April 28, 2016
    Date of Patent: October 2, 2018
    Assignee: QUALCOMM Incorporated
    Inventors: Andrew Evan Gruber, Lin Chen, Liang Li, Chunhui Mei
  • Patent number: 10026145
    Abstract: Techniques for allowing concurrent execution of multiple different tasks and preempted, prioritized execution of tasks on a shader processor. In an example operation, a driver executed by a central processing unit (CPU) configures GPU resources based on needs of a first "host" shader to allow the first shader to execute "normally" on the GPU. The GPU may observe two sets of tasks: "host" tasks and "guest" tasks. Based on, for example, detecting an availability of resources, the GPU may determine a "guest" task may be run while the "host" task is running. A second "guest" shader executes on a GPU by using resources that were configured for the first "host" shader if there are available resources and, in some examples, additional resources are obtained through software-programmable means.
    Type: Grant
    Filed: December 13, 2016
    Date of Patent: July 17, 2018
    Assignee: QUALCOMM Incorporated
    Inventors: Alexei Vladimirovich Bourd, Maxim Kazakov, Chunhui Mei, Sumesh Udayakumaran
  • Publication number: 20180165786
    Abstract: Techniques for allowing concurrent execution of multiple different tasks and preempted, prioritized execution of tasks on a shader processor. In an example operation, a driver executed by a central processing unit (CPU) configures GPU resources based on needs of a first "host" shader to allow the first shader to execute "normally" on the GPU. The GPU may observe two sets of tasks: "host" tasks and "guest" tasks. Based on, for example, detecting an availability of resources, the GPU may determine a "guest" task may be run while the "host" task is running. A second "guest" shader executes on a GPU by using resources that were configured for the first "host" shader if there are available resources and, in some examples, additional resources are obtained through software-programmable means.
    Type: Application
    Filed: December 13, 2016
    Publication date: June 14, 2018
    Inventors: Alexei Vladimirovich Bourd, Maxim Kazakov, Chunhui Mei, Sumesh Udayakumaran