Patents by Inventor Guokai Ma

Guokai Ma has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

APPARATUS AND METHOD FOR BATCH REBALANCE IN DISTRIBUTED DATA PARALLEL DNN TRAINING

Publication number: 20240281667

Abstract: Provided herein are apparatus and methods for batch rebalance in distributed data parallel DNN training. An apparatus includes interface circuitry; and processor circuitry coupled with the interface circuitry, wherein the processor circuitry is to: obtain sorted samples of a mini batch via the interface circuitry, wherein the sorted samples are in an ascending or descending order based on a volume of each of the samples; and assign the sorted samples to each of a plurality of local batches one by one in an order from a first local batch to a last local batch of the plurality of local batches and then from the last local batch to the first local batch until all of the sorted samples are assigned. Other embodiments may also be disclosed and claimed.

Type: Application

Filed: October 18, 2021

Publication date: August 22, 2024

Inventors: Guokai Ma, Jiong Gong, Hongzhen Liu
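
Illustrative sketch (not part of the patent record): the back-and-forth assignment order described in the abstract above can be written in a few lines of Python. The function name `rebalance`, the use of `len` as the default "volume" measure, and the choice of descending order are assumptions made for this example only; the abstract allows either sort direction and does not define how volume is measured.

```python
def rebalance(samples, num_local_batches, volume=len, descending=True):
    """Assign sorted samples to local batches in a back-and-forth order.

    Hypothetical helper: sorts by `volume`, then walks the local batches
    first -> last, then last -> first, repeating until samples run out.
    """
    ordered = sorted(samples, key=volume, reverse=descending)
    batches = [[] for _ in range(num_local_batches)]
    for i, sample in enumerate(ordered):
        pass_no, pos = divmod(i, num_local_batches)
        # Even passes go forward, odd passes go backward.
        target = pos if pass_no % 2 == 0 else num_local_batches - 1 - pos
        batches[target].append(sample)
    return batches


if __name__ == "__main__":
    # Volumes are the integers themselves in this toy run.
    print(rebalance([1, 2, 3, 4, 5, 6, 7, 8], 3, volume=lambda v: v))
    # -> [[8, 3, 2], [7, 4, 1], [6, 5]]
```

In this toy run the per-batch volume totals are 13, 12, and 11, closer together than the 15, 12, 9 that plain first-to-last dealing of the same sorted list would give.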

ACCELERATED SCALE-OUT PERFORMANCE OF DEEP LEARNING TRAINING WORKLOAD WITH EMBEDDING TABLES

Publication number: 20240037378

Abstract: Systems, apparatuses and methods may provide for technology that identifies an embedding table associated with a neural network. The neural network is associated with a plurality of compute nodes. The technology further identifies a number of entries of the embedding table, and determines whether to process gradients associated with the embedding table as dense gradients or sparse gradients based on the number of entries.

Type: Application

Filed: December 24, 2020

Publication date: February 1, 2024

Applicant: Intel Corporation

Inventors: Guokai Ma, Jiong Gong, Dhiraj Kalamkar, Rachitha Prem Seelin, Hongzhen Liu, Akshay Jain, Liangang Zhang
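
Illustrative sketch (not part of the patent record): the abstract above says only that the dense-or-sparse decision depends on the number of entries in the embedding table. The Python below assumes one plausible rule, a fixed cutoff where small tables are treated as dense and large ones as sparse. The name `gradient_mode`, the `dense_threshold` parameter, its default value, and the direction of the comparison are all assumptions for illustration and are not taken from the application.

```python
def gradient_mode(num_entries, dense_threshold=100_000):
    """Pick how to communicate an embedding table's gradients.

    Hypothetical rule: tables with at most `dense_threshold` entries are
    handled as dense gradients, larger tables as sparse gradients.
    """
    if num_entries < 0:
        raise ValueError("num_entries must be non-negative")
    return "dense" if num_entries <= dense_threshold else "sparse"


def plan_tables(table_sizes, dense_threshold=100_000):
    """Map each table name to the mode chosen by gradient_mode."""
    return {
        name: gradient_mode(size, dense_threshold)
        for name, size in table_sizes.items()
    }


if __name__ == "__main__":
    # Table names and sizes are made up for the example.
    print(plan_tables({"country": 250, "user_id": 40_000_000}))
```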

METHOD OF RING ALLREDUCE PROCESSING

Publication number: 20230315654

Abstract: A method of performing ring allreduce operations is disclosed. The method includes sending a chunk of a message in a receive buffer at a current index of a send buffer to a next node in a virtual ring of nodes, receiving a chunk of the message from a previous node in the virtual ring of nodes and storing the chunk at the current index of the receive buffer, and reducing a chunk in a send buffer at a previous index of the receive buffer and a chunk in the receive buffer at a previous index of the receive buffer and storing a result at the previous index of the receive buffer. The method includes repeating the sending, receiving and storing, and reducing and storing steps until all chunks of the message are reduced, and sending reduced chunks to the next node and receiving reduced chunks from the previous node.

Type: Application

Filed: November 30, 2020

Publication date: October 5, 2023

Applicant: Intel Corporation

Inventors: Guokai Ma, Zhouhai Ye, Feng Zou, Xiaojie Deng
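
Illustrative sketch (not part of the patent record): for readers unfamiliar with the term, the Python below simulates the generic textbook ring allreduce (a reduce phase followed by a distribution phase) on a single machine. It does not reproduce the specific send-buffer and receive-buffer indexing scheme of the application above. The function name `ring_allreduce`, the use of summation as the reduction, and the one-chunk-per-node layout are assumptions for the example.

```python
def ring_allreduce(node_data):
    """Simulate a sum allreduce over a virtual ring of n nodes.

    node_data: list of n lists, each holding n numeric chunks.
    Returns the per-node buffers after every node holds all reduced chunks.
    """
    n = len(node_data)
    bufs = [list(chunks) for chunks in node_data]

    # Phase 1: n - 1 steps of send, receive, and reduce.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all nodes "send" at the same time.
        sends = [((r - step) % n, bufs[r][(r - step) % n]) for r in range(n)]
        for r, (idx, value) in enumerate(sends):
            bufs[(r + 1) % n][idx] += value
    # Node r now holds the fully reduced chunk at index (r + 1) % n.

    # Phase 2: n - 1 steps passing reduced chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, bufs[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (idx, value) in enumerate(sends):
            bufs[(r + 1) % n][idx] = value
    return bufs


if __name__ == "__main__":
    print(ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
    # -> [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

Each node sends and receives exactly one chunk per step, which is why the per-node traffic of a ring allreduce stays roughly constant as nodes are added.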

Method and apparatus to schedule store instructions across atomic regions in binary translation

Patent number: 9141362

Abstract: A method and system to support scheduling of memory store instructions across atomic regions in binary translation in a processing unit or processor. In one embodiment of the invention, the processing unit has a store buffer that allows store instructions to be issued in different order than the source binary program order but still retire in source binary program order. This facilitates a small atomic region that maps to each iteration of a source binary code and these atomic regions are joined together into a pipelined region. In one embodiment of the invention, the processing unit executes commit instruction(s) once every loop iteration instead of executing the commit instruction(s) once after the loop exit.

Type: Grant

Filed: September 27, 2012

Date of Patent: September 22, 2015

Assignee: Intel Corporation

Inventors: Guokai Ma, Yihua Jin, Daniel M. Lavery, Jianhui Li
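
Illustrative sketch (not part of the patent record): the abstract above describes a store buffer that lets stores issue in a different order than the source program order while still retiring in program order. The toy Python model below shows only that ordering property. The class name `StoreBuffer`, its methods, and the use of integer sequence numbers to stand for program order are assumptions for the example; atomic regions and commit instructions are not modeled.

```python
class StoreBuffer:
    """Toy model: stores may be issued in any order, but they reach
    memory strictly in program order (by sequence number)."""

    def __init__(self):
        self.pending = {}      # seq -> (address, value), issued but not retired
        self.next_retire = 0   # program-order position of the next store to retire
        self.memory = {}       # address -> value, visible architectural state

    def issue(self, seq, address, value):
        """Record a store; `seq` is its position in source program order."""
        self.pending[seq] = (address, value)

    def retire_ready(self):
        """Retire every store whose predecessors have all retired."""
        retired = []
        while self.next_retire in self.pending:
            address, value = self.pending.pop(self.next_retire)
            self.memory[address] = value
            retired.append(self.next_retire)
            self.next_retire += 1
        return retired


if __name__ == "__main__":
    sb = StoreBuffer()
    sb.issue(2, 0x10, "c")    # issued early by the scheduler
    sb.issue(0, 0x10, "a")
    print(sb.retire_ready())  # only store 0 can retire so far
    sb.issue(1, 0x10, "b")
    print(sb.retire_ready())  # stores 1 and 2 now retire in order
    print(sb.memory[0x10])    # last store in program order wins
```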

METHOD AND APPARATUS TO SCHEDULE STORE INSTRUCTIONS ACROSS ATOMIC REGIONS IN BINARY TRANSLATION

Publication number: 20140282437

Abstract: A method and system to support scheduling of memory store instructions across atomic regions in binary translation in a processing unit or processor. In one embodiment of the invention, the processing unit has a store buffer that allows store instructions to be issued in different order than the source binary program order but still retire in source binary program order. This facilitates a small atomic region that maps to each iteration of a source binary code and these atomic regions are joined together into a pipelined region. In one embodiment of the invention, the processing unit executes commit instruction(s) once every loop iteration instead of executing the commit instruction(s) once after the loop exit.

Type: Application

Filed: September 27, 2012

Publication date: September 18, 2014

Inventors: Guokai Ma, Yihua Jin, Daniel M. Lavery, Jianhui Li

Translating a string operation

Publication number: 20060288188

Abstract: A technique includes performing multiple aligned accesses to a memory to retrieve data of a string misaligned with respect to boundaries of the memory by an offset. Based on the offset, a subset of the data is selected, and the subset is stored in a register.

Type: Application

Filed: June 17, 2005

Publication date: December 21, 2006

Inventors: Guokai Ma, Jianhui Li
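
Illustrative sketch (not part of the patent record): the idea in the abstract above, reading misaligned data with aligned accesses and then selecting a subset based on the offset, can be shown with Python byte strings. The 8-byte access width, the function name `load_unaligned`, and the assumption that memory extends at least one full aligned word past the requested data are choices made for this example only.

```python
WORD = 8  # assumed aligned access width in bytes


def load_unaligned(memory: bytes, addr: int, word: int = WORD) -> bytes:
    """Return `word` bytes starting at `addr` using only aligned reads."""
    offset = addr % word          # misalignment relative to a word boundary
    base = addr - offset          # aligned address at or below addr
    low = memory[base:base + word]                 # first aligned access
    if offset == 0:
        return low                                 # already aligned
    high = memory[base + word:base + 2 * word]     # second aligned access
    # Select the subset of the two words that starts at the offset.
    return (low + high)[offset:offset + word]


if __name__ == "__main__":
    mem = bytes(range(32))
    print(list(load_unaligned(mem, 3)))
    # -> [3, 4, 5, 6, 7, 8, 9, 10]
```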
