Patents by Inventor Guokai Ma

Guokai Ma has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20240281667
    Abstract: Provided herein are apparatus and methods for batch rebalance in distributed data parallel DNN training. An apparatus includes interface circuitry; and processor circuitry coupled with the interface circuitry, wherein the processor circuitry is to: obtain sorted samples of a mini batch via the interface circuitry, wherein the sorted samples are in ascending or descending order based on a volume of each of the samples; and assign the sorted samples to each of a plurality of local batches one by one in an order from a first local batch to a last local batch of the plurality of local batches and then from the last local batch to the first local batch until all of the sorted samples are assigned. Other embodiments may also be disclosed and claimed.
    Type: Application
    Filed: October 18, 2021
    Publication date: August 22, 2024
    Inventors: Guokai Ma, Jiong Gong, Hongzhen Liu
  • Publication number: 20240037378
    Abstract: Systems, apparatuses and methods may provide for technology that identifies an embedding table associated with a neural network. The neural network is associated with a plurality of compute nodes. The technology further identifies a number of entries of the embedding table, and determines whether to process gradients associated with the embedding table as dense gradients or sparse gradients based on the number of entries.
    Type: Application
    Filed: December 24, 2020
    Publication date: February 1, 2024
    Applicant: Intel Corporation
    Inventors: Guokai Ma, Jiong Gong, Dhiraj Kalamkar, Rachitha Prem Seelin, Hongzhen Liu, Akshay Jain, Liangang Zhang
  • Publication number: 20230315654
    Abstract: A method of performing ring allreduce operations is disclosed. The method includes sending a chunk of a message in a receive buffer at a current index of a send buffer to a next node in a virtual ring of nodes, receiving a chunk of the message from a previous node in the virtual ring of nodes and storing the chunk at the current index of the receive buffer, and reducing a chunk in a send buffer at a previous index of the receive buffer and a chunk in the receive buffer at a previous index of the receive buffer and storing a result at the previous index of the receive buffer. The method includes repeating the sending, receiving and storing, and reducing and storing steps until all chunks of the message are reduced, and sending reduced chunks to the next node and receiving reduced chunks from the previous node.
    Type: Application
    Filed: November 30, 2020
    Publication date: October 5, 2023
    Applicant: Intel Corporation
    Inventors: Guokai Ma, Zhouhai Ye, Feng Zou, Xiaojie Deng
  • Patent number: 9141362
    Abstract: A method and system to support scheduling of memory store instructions across atomic regions in binary translation in a processing unit or processor. In one embodiment of the invention, the processing unit has a store buffer that allows store instructions to be issued in different order than the source binary program order but still retire in source binary program order. This facilitates a small atomic region that maps to each iteration of a source binary code and these atomic regions are joined together into a pipelined region. In one embodiment of the invention, the processing unit executes commit instruction(s) once every loop iteration instead of executing the commit instruction(s) once after the loop exit.
    Type: Grant
    Filed: September 27, 2012
    Date of Patent: September 22, 2015
    Assignee: Intel Corporation
    Inventors: Guokai Ma, Yihua Jin, Daniel M. Lavery, Jianhui Li
  • Publication number: 20140282437
    Abstract: A method and system to support scheduling of memory store instructions across atomic regions in binary translation in a processing unit or processor. In one embodiment of the invention, the processing unit has a store buffer that allows store instructions to be issued in different order than the source binary program order but still retire in source binary program order. This facilitates a small atomic region that maps to each iteration of a source binary code and these atomic regions are joined together into a pipelined region. In one embodiment of the invention, the processing unit executes commit instruction(s) once every loop iteration instead of executing the commit instruction(s) once after the loop exit.
    Type: Application
    Filed: September 27, 2012
    Publication date: September 18, 2014
    Inventors: Guokai Ma, Yihua Jin, Daniel M. Lavery, Jianhui Li
  • Publication number: 20060288188
    Abstract: A technique includes performing multiple aligned accesses to a memory to retrieve data of a string misaligned with respect to boundaries of the memory by an offset. Based on the offset, a subset of the data is selected, and the subset is stored in a register.
    Type: Application
    Filed: June 17, 2005
    Publication date: December 21, 2006
    Inventors: Guokai Ma, Jianhui Li
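
The assignment order described in the abstract of publication 20240281667 (first local batch to last, then last to first, repeated) can be illustrated with a minimal Python sketch. The function name `snake_assign`, the use of a plain list of numbers as per-sample "volumes", and the example data are assumptions made for illustration; they are not taken from the application itself.

```python
def snake_assign(volumes, num_local_batches, descending=True):
    """Deal samples, sorted by volume, to local batches in a back-and-forth order.

    Hypothetical helper: `volumes[i]` is the volume of sample i.
    Returns one list of sample indices per local batch.
    """
    # Sort sample indices by volume (the abstract allows either direction).
    order = sorted(range(len(volumes)), key=lambda i: volumes[i], reverse=descending)
    batches = [[] for _ in range(num_local_batches)]
    for pos, sample in enumerate(order):
        sweep, offset = divmod(pos, num_local_batches)
        # Even sweeps run first -> last, odd sweeps run last -> first.
        target = offset if sweep % 2 == 0 else num_local_batches - 1 - offset
        batches[target].append(sample)
    return batches


# Made-up volumes for eight samples, split across two local batches.
volumes = [9, 1, 7, 3, 5, 2, 8, 4]
batches = snake_assign(volumes, num_local_batches=2)
totals = [sum(volumes[i] for i in b) for b in batches]
print(batches, totals)  # [[0, 4, 7, 1], [6, 2, 3, 5]] [19, 20]
```

In this example the per-batch volume totals come out nearly equal (19 and 20), which is the kind of balance the back-and-forth order aims for.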
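
For publication 20230315654, the sketch below simulates a textbook ring allreduce (a reduce-scatter pass followed by an allgather pass) in a single Python process. It is an assumption-laden illustration of the general technique only: it does not reproduce the send-buffer and receive-buffer indexing described in the abstract, and the name `ring_allreduce`, the one-number-per-chunk layout, and the use of addition as the reduction are all hypothetical choices.

```python
def ring_allreduce(node_data):
    """Simulate a sum allreduce over a virtual ring of len(node_data) nodes.

    Assumption: each node's message is already split into n chunks, one number each.
    Returns the final buffer of every node.
    """
    n = len(node_data)
    bufs = [list(d) for d in node_data]

    # Reduce-scatter: after n-1 steps, node r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot what every node sends before anyone updates its buffer.
        sent = [bufs[r][(r - step) % n] for r in range(n)]
        for r in range(n):
            prev = (r - 1) % n
            idx = (prev - step) % n  # chunk index the previous node just sent
            bufs[r][idx] += sent[prev]

    # Allgather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sent = [bufs[r][(r + 1 - step) % n] for r in range(n)]
        for r in range(n):
            prev = (r - 1) % n
            idx = (prev + 1 - step) % n
            bufs[r][idx] = sent[prev]
    return bufs


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)  # every node ends with [111, 222, 333]
```

Each node sends and receives one chunk per step, so the per-node traffic stays roughly constant as the ring grows, which is the usual motivation for the ring layout.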
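
The idea in the abstract of publication 20060288188 (several aligned accesses, then selecting a subset of the data based on the offset) can be modeled in software as follows. This is only a sketch under stated assumptions: the application concerns memory and registers, whereas this example uses a Python `bytes` object as "memory", an 8-byte access width, and a hypothetical function name `load_misaligned`.

```python
def load_misaligned(memory: bytes, address: int, width: int = 8) -> bytes:
    """Return `width` bytes starting at `address` using only aligned reads.

    Assumption: `memory` extends at least to the end of the second aligned block.
    """
    offset = address % width          # misalignment relative to the boundary
    base = address - offset           # start of the aligned block holding `address`
    low = memory[base:base + width]   # first aligned access
    if offset == 0:
        return low                    # already aligned, one access suffices
    high = memory[base + width:base + 2 * width]  # second aligned access
    # Select the subset of the combined data that starts at the offset.
    return (low + high)[offset:offset + width]


memory = bytes(range(32))
print(list(load_misaligned(memory, 3)))  # [3, 4, 5, 6, 7, 8, 9, 10]
```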