Patents by Inventor Ahmed Mohammed ElShafiey Mohammed Eltantawy

Ahmed Mohammed ElShafiey Mohammed Eltantawy has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Method and apparatus for retaining optimal width vector operations in arbitrary/flexible vector width architecture

Patent number: 11714619

Abstract: A method and apparatus to optimize a list of vector instructions using dynamic programming, in particular memoization, by generating a table containing instruction subvectors having individual (parts), contiguous (superparts) and repeated (broadcasts) lanes. Because the instructions in the table are subvectors selected to have individual, contiguous and repeated lanes in the registers, compiler optimizations can be enhanced. Introduction of such dynamic programming allows for speculative lane optimizations, as well as improved analysis-guided optimizations, either of which can be performed alone or in combination with other optimizations, whether or not they make use of dynamic programming.
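
The abstract does not say how the table is built, so the following is only a rough illustration. This Python sketch assumes a hypothetical lane model in which each lane records the (register, element) pair it reads. It classifies every subvector as a part, a superpart or a broadcast. Memoization lets each classification reuse the cached result for the subvector that is one lane shorter. The names `Lane` and `build_subvector_table` are invented for this example and do not come from the patent.

```python
from functools import lru_cache
from typing import Dict, Optional, Tuple

# Hypothetical lane model: each lane records the (register, element) it reads.
Lane = Tuple[str, int]


def build_subvector_table(lanes: Tuple[Lane, ...]) -> Dict[Tuple[int, int], Optional[str]]:
    """Classify every subvector lanes[i:j] as a part, superpart or broadcast."""

    @lru_cache(maxsize=None)
    def classify(i: int, j: int) -> Optional[str]:
        if j - i == 1:
            return "part"  # a single, individual lane
        shorter = classify(i, j - 1)  # memoized result for one lane fewer
        last, before = lanes[j - 1], lanes[j - 2]
        if shorter in ("part", "broadcast") and last == before:
            return "broadcast"  # every lane repeats the same element
        if (shorter in ("part", "superpart") and last[0] == before[0]
                and last[1] == before[1] + 1):
            return "superpart"  # consecutive elements of one register
        return None  # no single cheap form for this subvector

    n = len(lanes)
    return {(i, j): classify(i, j) for i in range(n) for j in range(i + 1, n + 1)}


lanes = (("v0", 0), ("v0", 1), ("v0", 2), ("v1", 5), ("v1", 5))
table = build_subvector_table(lanes)
print(table[(0, 3)], table[(3, 5)], table[(0, 4)])  # superpart broadcast None
```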

Type: Grant

Filed: December 17, 2020

Date of Patent: August 1, 2023

Assignee: HUAWEI TECHNOLOGIES CO., LTD.

Inventors: Amruth Sandhupatla, Ramshankar Ramanarayanan, Boris Kravchenko, Ahmed Mohammed Elshafiey Mohammed Eltantawy
Method and system for parallel processing of tasks in multiple thread computing

Patent number: 11625250

Abstract: The disclosed systems, structures, and methods are directed to parallel processing of tasks in a multiple thread computing system. Execution of an instruction sequence of a thread allocated to a first task proceeds until an exit point of the instruction sequence is reached. The execution of the instruction sequence of the thread for the first task is terminated at a convergence point of the instruction sequence. The thread is selectively reallocated to process a second task.
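
The following is a minimal simulation of the reallocation idea. It assumes that threads advance in lock-step, one step per iteration, and that each task is simply a count of steps. The function `run_with_reallocation` and this task model are hypothetical. A thread that reaches its exit point stops that task at the next convergence point and is reassigned to the next queued task instead of idling.

```python
from collections import deque
from typing import Dict, List, Tuple


def run_with_reallocation(task_lengths: List[int], num_threads: int) -> Tuple[Dict[int, List[int]], int]:
    """Simulate threads that take a new task at the convergence point after
    finishing their current one. Returns the tasks each thread ran and the
    number of lock-step steps taken."""
    if any(n < 1 for n in task_lengths):
        raise ValueError("every task needs at least one step")
    queue = deque(range(len(task_lengths)))
    history: Dict[int, List[int]] = {t: [] for t in range(num_threads)}
    remaining: Dict[int, int] = {}  # thread id -> steps left in its current task

    def allocate(thread: int) -> None:
        if queue:
            task = queue.popleft()
            remaining[thread] = task_lengths[task]
            history[thread].append(task)

    for t in range(num_threads):
        allocate(t)
    steps = 0
    while remaining:
        steps += 1
        for t in remaining:  # all active threads execute one step together
            remaining[t] -= 1
        for t in [t for t, left in remaining.items() if left == 0]:
            # Exit point reached: stop this task at the convergence point
            # and reallocate the thread to the next queued task, if any.
            del remaining[t]
            allocate(t)
    return history, steps


history, steps = run_with_reallocation([3, 1, 1, 1], num_threads=2)
print(history, steps)  # {0: [0], 1: [1, 2, 3]} 3
```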

Type: Grant

Filed: January 29, 2021

Date of Patent: April 11, 2023

Assignee: HUAWEI TECHNOLOGIES CO., LTD.

Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Yan Luo, Tyler Bryce Nowicki
DEVICES, METHODS, AND MEDIA FOR EFFICIENT DATA DEPENDENCY MANAGEMENT FOR IN-ORDER ISSUE PROCESSORS

Publication number: 20230101571

Abstract: Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.
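
The abstract does not give the cost function. This sketch assumes a simple, hypothetical model in which a reader of a write must wait for the slowest write that shares the same wait station. Under that model, free stations are filled greedily before any station is shared. It collapses the two phases described above (initial allocation, then conflict resolution) into a single pass and is not the claimed algorithm. The name `allocate_wait_stations` is invented.

```python
from typing import Dict


def allocate_wait_stations(latencies: Dict[str, int], num_stations: int) -> Dict[str, int]:
    """Greedily map write instructions to a limited number of wait stations.

    Hypothetical cost model: a reader of a write must wait for the slowest
    write sharing the same station, so sharing with a long-latency write is costly.
    """
    if num_stations < 1:
        raise ValueError("need at least one wait station")
    slowest = [0] * num_stations  # longest latency on each station so far
    used = [False] * num_stations
    assignment: Dict[str, int] = {}
    # Visit long-latency writes first so they claim their own stations.
    for name in sorted(latencies, key=lambda n: (-latencies[n], n)):
        lat = latencies[name]
        # Prefer a free station; otherwise pick the one whose slowest write
        # adds the least stall for readers of this write.
        best = min(range(num_stations),
                   key=lambda s: (used[s], max(slowest[s], lat), s))
        assignment[name] = best
        slowest[best] = max(slowest[best], lat)
        used[best] = True
    return assignment


print(allocate_wait_stations({"ld0": 100, "ld1": 90, "mul": 4, "add": 1}, 2))
# {'ld0': 0, 'ld1': 1, 'mul': 1, 'add': 1}
```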

Type: Application

Filed: October 28, 2022

Publication date: March 30, 2023

Inventors: Hazem A. ABDELHAFEZ, Ning XIE, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
Systems and methods for extending a live range of a virtual scalar register

Patent number: 11556319

Abstract: Systems and methods are described for extending a live range for a virtual scalar register during compilation of a program, comprising: receiving an intermediate representation (IR) of source code configured for implementing single-instruction-multiple-thread (SIMT) execution, the IR representing the source code as a control flow graph including a plurality of basic blocks (BB); and when a virtual scalar register defined in a first BB of the IR is last used in a second BB of the IR that is a divergent BB, modifying the IR to extend the live range of the virtual scalar register.
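
To make the divergence case concrete, this sketch models a control flow graph as a successor map. When the last use sits in a divergent block, the live range is extended over the whole region between the divergent branch and its reconvergence point. The `divergent` map (from a block to its branch and reconvergence blocks) is an assumption supplied by the caller. A real compiler would derive it from divergence analysis and post-dominators. All function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

CFG = Dict[str, List[str]]  # basic block -> successor blocks


def reachable(cfg: CFG, start: str) -> Set[str]:
    seen, stack = {start}, [start]
    while stack:
        for succ in cfg.get(stack.pop(), []):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def blocks_between(cfg: CFG, src: str, dst: str) -> Set[str]:
    """Blocks lying on some path from src to dst (inclusive)."""
    preds: CFG = {}
    for bb, succs in cfg.items():
        for succ in succs:
            preds.setdefault(succ, []).append(bb)
    return reachable(cfg, src) & reachable(preds, dst)


def scalar_live_blocks(cfg: CFG, def_bb: str, last_use_bb: str,
                       divergent: Dict[str, Tuple[str, str]]) -> Set[str]:
    """Blocks where a virtual scalar register must stay allocated.

    `divergent` maps a divergent block to (branch block, reconvergence block).
    """
    live = blocks_between(cfg, def_bb, last_use_bb)
    if last_use_bb in divergent:
        # Under SIMT the sibling path may run before or after this one, so the
        # shared scalar register must not be reused anywhere in the region.
        branch, join = divergent[last_use_bb]
        live |= blocks_between(cfg, branch, join)
    return live


cfg = {"entry": ["cond"], "cond": ["then", "else"], "then": ["join"],
       "else": ["join"], "join": ["exit"]}
print(sorted(scalar_live_blocks(cfg, "entry", "then", {"then": ("cond", "join")})))
# ['cond', 'else', 'entry', 'join', 'then']
```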

Type: Grant

Filed: September 1, 2020

Date of Patent: January 17, 2023

Assignee: HUAWEI TECHNOLOGIES CO., LTD.

Inventors: Abraham Davidson Fai Chung Chan, Tyler Bryce Nowicki, Guansong Zhang, Ahmed Mohammed ElShafiey Mohammed Eltantawy
Devices, methods, and media for efficient data dependency management for in-order issue processors

Patent number: 11500641

Abstract: Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.

Type: Grant

Filed: October 7, 2020

Date of Patent: November 15, 2022

Assignee: HUAWEI TECHNOLOGIES CO., LTD.

Inventors: Hazem A. Abdelhafez, Ning Xie, Ahmed Mohammed ElShafiey Mohammed Eltantawy
Method of deadlock detection and synchronization-aware optimizations on asynchronous architectures

Patent number: 11429359

Abstract: A method is described for improving the performance of applications executed on asynchronous processor architectures. In an embodiment, a method for improving the execution time of compiled synchronized source code on an asynchronous processor architecture includes receiving, by a processing system, synchronized source code comprising synchronization instructions that synchronize execution of the synchronized source code on different pipelines of the asynchronous processor architecture. The method also includes analyzing, by the processing system, the synchronized source code to determine whether it includes a broken code condition.
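
One simple kind of broken code condition is a deadlock, in which a pipeline waits for an event that no other pipeline will ever set. The following sketch assumes a hypothetical per-pipeline instruction format of set, wait and ordinary operations. It detects such a deadlock by running all pipelines until none can advance. The patent's actual analysis may differ, and `has_deadlock` is an invented name.

```python
from typing import Dict, List, Tuple

# Hypothetical instruction format: ("set", event), ("wait", event) or ("op", name).
Instr = Tuple[str, str]


def has_deadlock(pipelines: Dict[str, List[Instr]]) -> bool:
    """Run every pipeline until none can advance; report unfinished pipelines."""
    pc = {name: 0 for name in pipelines}
    pending: Dict[str, int] = {}  # event -> sets not yet consumed by a wait
    progress = True
    while progress:
        progress = False
        for name, program in pipelines.items():
            while pc[name] < len(program):
                kind, arg = program[pc[name]]
                if kind == "wait":
                    if pending.get(arg, 0) == 0:
                        break  # blocked until another pipeline sets this event
                    pending[arg] -= 1
                elif kind == "set":
                    pending[arg] = pending.get(arg, 0) + 1
                pc[name] += 1
                progress = True
    # A pipeline stuck on a wait that can never be satisfied signals a deadlock.
    return any(pc[name] < len(program) for name, program in pipelines.items())


ok = {"vector": [("op", "load"), ("set", "e0")], "scalar": [("wait", "e0"), ("op", "add")]}
broken = {"vector": [("wait", "e1"), ("set", "e0")], "scalar": [("wait", "e0"), ("set", "e1")]}
print(has_deadlock(ok), has_deadlock(broken))  # False True
```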

Type: Grant

Filed: July 20, 2020

Date of Patent: August 30, 2022

Assignee: Huawei Technologies Co., Ltd.

Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Yaoqing Gao, Christopher Rodrigues, Lijuan Hai
METHOD AND SYSTEM FOR PARALLEL PROCESSING OF TASKS IN MULTIPLE THREAD COMPUTING

Publication number: 20220244962

Abstract: The disclosed systems, structures, and methods are directed to parallel processing of tasks in a multiple thread computing system. Execution of an instruction sequence of a thread allocated to a first task proceeds until an exit point of the instruction sequence is reached. The execution of the instruction sequence of the thread for the first task is terminated at a convergence point of the instruction sequence. The thread is selectively reallocated to process a second task.

Type: Application

Filed: January 29, 2021

Publication date: August 4, 2022

Inventors: Ahmed Mohammed ElShafiey Mohammed ELTANTAWY, Yan LUO, Tyler Bryce NOWICKI
Methods and apparatuses for coalescing function calls for ray-tracing

Patent number: 11397615

Abstract: Methods and systems are described for executing threads in a thread-group, for example in ray-tracing. The threads are processed to collect, for each thread, a respective set of function call indicators over a respective number of call instances. The function call indicators are reordered across all threads and all call instances so that identical function call indicators are coalesced to a common call instance and non-identical function call indicators are reordered to different call instances. Function calls are then executed across the threads of the thread-group according to the reordered and coalesced function call indicators. In ray-tracing applications, the threads represent rays, each call instance is a ray hit, and each function call is a shader call.
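
Here is a small sketch of the reordering step. It assumes that the calls within one thread are independent and can run in any order, which the reordering described above implies. Each returned call instance runs a single function over a list of threads. `coalesce_calls` and the shader names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def coalesce_calls(calls: Dict[int, List[str]]) -> List[Tuple[str, List[int]]]:
    """Regroup per-thread call indicators so each call instance runs one function."""
    groups: Dict[Tuple[str, int], List[int]] = defaultdict(list)
    for tid in sorted(calls):
        seen: Counter = Counter()
        for func in calls[tid]:
            # Repeated calls to the same function by one thread land in
            # separate groups, so no thread appears twice in an instance.
            groups[(func, seen[func])].append(tid)
            seen[func] += 1
    return [(func, tids) for (func, _), tids in sorted(groups.items())]


calls = {0: ["hitA", "hitB"], 1: ["hitB", "hitA"], 2: ["hitA"]}
print(coalesce_calls(calls))  # [('hitA', [0, 1, 2]), ('hitB', [0, 1])]
```

In the original order, both call instances mix `hitA` and `hitB` across threads. After reordering, each instance runs a single shader.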

Type: Grant

Filed: August 31, 2020

Date of Patent: July 26, 2022

Assignee: HUAWEI TECHNOLOGIES CO., LTD.

Inventors: Tyler Bryce Nowicki, Ahmed Mohammed Elshafiey Mohammed Eltantawy
Methods, devices, and media for reducing register pressure in flexible vector processors

Patent number: 11397580

Abstract: Methods, devices and media for reducing register pressure in flexible vector processors are described. In various embodiments described herein, methods, devices and media are disclosed that selectively re-scalarize vector instructions in a sequence of instructions such that register pressure is reduced and thread-level parallelism is increased. A compiler may perform a first method to partially or fully scalarize the vectorized instructions of a code region with high register pressure. A compiler may perform a second method to fully scalarize a sequence of vectorized instructions while preserving the association of each scalar instruction with its original vectorized instruction; the scalar instructions may then be scheduled and selectively re-vectorized. Devices executing code compiled with either method are described, as are processor-readable media storing code compiled by either method.
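
The round trip in the second method can be sketched as follows, assuming each vector instruction is described only by its opcode and lane count. Each scalar keeps the index of the vector instruction it came from, so consecutive scalars can later be regrouped. `max_width` is a hypothetical stand-in for whatever width limit a register-pressure analysis would choose, and the scheduling step is omitted. All class and function names are invented for this example.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass(frozen=True)
class VecInstr:
    op: str
    width: int  # number of lanes


@dataclass(frozen=True)
class ScalarInstr:
    op: str
    origin: int  # index of the vector instruction this lane came from
    lane: int


def scalarize(seq: List[VecInstr]) -> List[ScalarInstr]:
    """Split each vector instruction into per-lane scalars, keeping the association."""
    return [ScalarInstr(v.op, i, lane) for i, v in enumerate(seq) for lane in range(v.width)]


def revectorize(scalars: List[ScalarInstr], max_width: int) -> List[VecInstr]:
    """Merge runs of scalars from the same vector instruction, at most max_width lanes each."""
    if max_width < 1:
        raise ValueError("max_width must be positive")
    out: List[VecInstr] = []
    for _, run in groupby(scalars, key=lambda s: s.origin):
        lanes = list(run)
        for start in range(0, len(lanes), max_width):
            chunk = lanes[start:start + max_width]
            out.append(VecInstr(chunk[0].op, len(chunk)))
    return out


scalars = scalarize([VecInstr("fmul", 4), VecInstr("fadd", 2)])
print(revectorize(scalars, max_width=2))
# [VecInstr(op='fmul', width=2), VecInstr(op='fmul', width=2), VecInstr(op='fadd', width=2)]
```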

Type: Grant

Filed: September 17, 2020

Date of Patent: July 26, 2022

Assignee: Huawei Technologies Co., Ltd.

Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Ning Xie
METHOD AND APPARATUS FOR RETAINING OPTIMAL WIDTH VECTOR OPERATIONS IN ARBITRARY/FLEXIBLE VECTOR WIDTH ARCHITECTURE

Publication number: 20220197614

Abstract: A method and apparatus to optimize a list of vector instructions using dynamic programming, in particular memoization, by generating a table containing instruction subvectors having individual (parts), contiguous (superparts) and repeated (broadcasts) lanes. Because the instructions in the table are subvectors selected to have individual, contiguous and repeated lanes in the registers, compiler optimizations can be enhanced. Introduction of such dynamic programming allows for speculative lane optimizations, as well as improved analysis-guided optimizations, either of which can be performed alone or in combination with other optimizations, whether or not they make use of dynamic programming.

Type: Application

Filed: December 17, 2020

Publication date: June 23, 2022

Applicant: HUAWEI TECHNOLOGIES CO., LTD.

Inventors: Amruth SANDHUPATLA, Ramshankar RAMANARAYANAN, Boris KRAVCHENKO, Ahmed Mohammed Elshafiey Mohammed ELTANTAWY
Method and apparatus for balancing binary instruction burstization and chaining

Patent number: 11327760

Abstract: A method for grouping computer instructions includes receiving a set of computer instructions, grouping the set of computer instructions by register dependencies, identifying a plurality of single-definition-use flow (SDF) bundles based on a burstization criterion and a chaining criterion, and transforming the set of computer instructions based on the SDF bundles. The transformation may include splitting one instruction of the set and setting a burst parameter for that instruction. The transformation may also include grouping a plurality of instructions of the set and replacing a pair of register file accesses with a pair of temporary register accesses.
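
The chaining half of the transformation can be illustrated with a simplified, hypothetical reading of an SDF bundle: a value that is defined once and read once, by the immediately following instruction. This sketch covers chaining only, not burstization. It assumes a three-address tuple format, a single temporary register named `tmp`, and that no register is live outside the listed instructions. `chain_single_use` is an invented name.

```python
from collections import Counter
from typing import List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (opcode, destination, sources)


def chain_single_use(code: List[Instr], temp: str = "tmp") -> List[Instr]:
    """Route a value through a temporary register when it is defined once and
    read once, by the very next instruction (assumes nothing is live-out)."""
    defs = Counter(dst for _, dst, _ in code)
    uses = Counter(src for _, _, srcs in code for src in srcs)
    out = list(code)
    for i in range(len(code) - 1):
        dst = code[i][1]  # original destination, even if already renamed in out
        op, _, srcs = out[i]
        next_op, next_dst, next_srcs = out[i + 1]
        if defs[dst] == 1 and uses[dst] == 1 and dst in next_srcs:
            # Replace the register-file write/read pair with temporary accesses.
            out[i] = (op, temp, srcs)
            out[i + 1] = (next_op, next_dst, tuple(temp if s == dst else s for s in next_srcs))
    return out


code = [("load", "r1", ("a",)), ("mul", "r2", ("r1", "r3")), ("add", "r4", ("r2", "r1"))]
print(chain_single_use(code))
# [('load', 'r1', ('a',)), ('mul', 'tmp', ('r1', 'r3')), ('add', 'r4', ('tmp', 'r1'))]
```

Here `r1` is read twice, so it stays in the register file, while `r2` is forwarded through the temporary register.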

Type: Grant

Filed: April 9, 2020

Date of Patent: May 10, 2022

Assignee: HUAWEI TECHNOLOGIES CO., LTD.

Inventors: Andrew Siu Doug Lee, Ahmed Mohammed Elshafiey Mohammed Eltantawy
DEVICES, METHODS, AND MEDIA FOR EFFICIENT DATA DEPENDENCY MANAGEMENT FOR IN-ORDER ISSUE PROCESSORS

Publication number: 20220107811

Abstract: Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.

Type: Application

Filed: October 7, 2020

Publication date: April 7, 2022

Inventors: Hazem A. ABDELHAFEZ, Ning XIE, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
METHODS, DEVICES, AND MEDIA FOR REDUCING REGISTER PRESSURE IN FLEXIBLE VECTOR PROCESSORS

Publication number: 20220083337

Abstract: Methods, devices and media for reducing register pressure in flexible vector processors are described. In various embodiments described herein, methods, devices and media are disclosed that selectively re-scalarize vector instructions in a sequence of instructions such that register pressure is reduced and thread-level parallelism is increased. A compiler may perform a first method to partially or fully scalarize the vectorized instructions of a code region with high register pressure. A compiler may perform a second method to fully scalarize a sequence of vectorized instructions while preserving the association of each scalar instruction with its original vectorized instruction; the scalar instructions may then be scheduled and selectively re-vectorized. Devices executing code compiled with either method are described, as are processor-readable media storing code compiled by either method.

Type: Application

Filed: September 17, 2020

Publication date: March 17, 2022

Inventors: Ahmed Mohammed ElShafiey Mohammed ELTANTAWY, Ning XIE
METHODS AND APPARATUSES FOR COALESCING FUNCTION CALLS FOR RAY-TRACING

Publication number: 20220066819

Abstract: Methods and systems are described for executing threads in a thread-group, for example in ray-tracing. The threads are processed to collect, for each thread, a respective set of function call indicators over a respective number of call instances. The function call indicators are reordered across all threads and all call instances so that identical function call indicators are coalesced to a common call instance and non-identical function call indicators are reordered to different call instances. Function calls are then executed across the threads of the thread-group according to the reordered and coalesced function call indicators. In ray-tracing applications, the threads represent rays, each call instance is a ray hit, and each function call is a shader call.

Type: Application

Filed: August 31, 2020

Publication date: March 3, 2022

Inventors: Tyler Bryce NOWICKI, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
SYSTEMS AND METHODS FOR EXTENDING A LIVE RANGE OF A VIRTUAL SCALAR REGISTER

Publication number: 20220066783

Abstract: Systems and methods are described for extending a live range for a virtual scalar register during compilation of a program, comprising: receiving an intermediate representation (IR) of source code configured for implementing single-instruction-multiple-thread (SIMT) execution, the IR representing the source code as a control flow graph including a plurality of basic blocks (BB); and when a virtual scalar register defined in a first BB of the IR is last used in a second BB of the IR that is a divergent BB, modifying the IR to extend the live range of the virtual scalar register.

Type: Application

Filed: September 1, 2020

Publication date: March 3, 2022

Inventors: Abraham Davidson Fai Chung CHAN, Tyler Bryce NOWICKI, Guansong ZHANG, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
Method and apparatus for reusable and relative indexed register resource allocation in function calls

Patent number: 11188315

Abstract: The disclosed systems, apparatuses and methods are directed to optimizing, by a compiler, register resource allocation for the functions of a module using a Register File comprising a limited number of registers. After performing interprocedural analysis on the module, the compiler computes the number of registers used by each function and compiles each function to final machine code, except at callsites where a call to another function is detected. At each callsite and for each called function, the compiler expands the call instructions to final machine code after computing and setting a relative index that the called function uses to run in an available part of the Register File. The relative index optimizes register resource allocation by minimizing the number of registers spilled before a function is called.
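
This is a minimal sketch of relative indexing. It assumes an acyclic call graph and a known register count per function, both of which are hypothetical inputs here. Each callee's registers are placed immediately above its caller's, and any registers that would overflow the register file are counted as spills. In this model, the same callee can run at different bases from different callsites. `plan_calls` is an invented name, and the spill count is an illustrative cost model, not the patent's.

```python
from typing import Dict, List, Tuple


def plan_calls(call_graph: Dict[str, List[str]], regs_used: Dict[str, int],
               root: str, file_size: int) -> List[Tuple[str, str, int, int]]:
    """For every callsite on every call path (call graph assumed acyclic), return
    (caller, callee, base index for the callee, registers to spill)."""
    plan: List[Tuple[str, str, int, int]] = []

    def visit(fn: str, base: int) -> None:
        for callee in call_graph.get(fn, []):
            # The callee's register 0 maps just above the caller's registers.
            callee_base = base + regs_used[fn]
            # Registers that would fall past the end of the register file must be spilled.
            spill = max(0, callee_base + regs_used[callee] - file_size)
            plan.append((fn, callee, callee_base, spill))
            visit(callee, callee_base)

    visit(root, 0)
    return plan


print(plan_calls({"main": ["f", "g"], "f": ["g"]}, {"main": 8, "f": 6, "g": 4}, "main", 16))
# [('main', 'f', 8, 0), ('f', 'g', 14, 2), ('main', 'g', 8, 0)]
```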

Type: Grant

Filed: September 4, 2020

Date of Patent: November 30, 2021

Assignee: HUAWEI TECHNOLOGIES CO., LTD.

Inventors: Yan Luo, Ahmed Mohammed ElShafiey Mohammed Eltantawy, Tyler Bryce Nowicki
METHOD AND APPARATUS FOR BALANCING BINARY INSTRUCTION BURSTIZATION AND CHAINING

Publication number: 20210318875

Abstract: A method for grouping computer instructions includes receiving a set of computer instructions, grouping the set of computer instructions by register dependencies, identifying a plurality of single-definition-use flow (SDF) bundles based on a burstization criterion and a chaining criterion, and transforming the set of computer instructions based on the SDF bundles. The transformation may include splitting one instruction of the set and setting a burst parameter for that instruction. The transformation may also include grouping a plurality of instructions of the set and replacing a pair of register file accesses with a pair of temporary register accesses.

Type: Application

Filed: April 9, 2020

Publication date: October 14, 2021

Applicant: HUAWEI TECHNOLOGIES CO., LTD.

Inventors: Andrew Siu Doug LEE, Ahmed Mohammed Elshafiey Mohammed ELTANTAWY
Method of Deadlock Detection and Synchronization-Aware Optimizations on Asynchronous Architectures

Publication number: 20210004213

Abstract: A method is described for improving the performance of applications executed on asynchronous processor architectures. In an embodiment, a method for improving the execution time of compiled synchronized source code on an asynchronous processor architecture includes receiving, by a processing system, synchronized source code comprising synchronization instructions that synchronize execution of the synchronized source code on different pipelines of the asynchronous processor architecture. The method also includes analyzing, by the processing system, the synchronized source code to determine whether it includes a broken code condition.

Type: Application

Filed: July 20, 2020

Publication date: January 7, 2021

Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Yaoqing Gao, Christopher Rodrigues, Lijuan Hai