Patents by Inventor Ahmed Mohammed ElShafiey Mohammed Eltantawy
Ahmed Mohammed ElShafiey Mohammed Eltantawy has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11714619
Abstract: A method and apparatus to optimize a list of vector instructions using dynamic programming, in particular memoization, by generating a table containing instruction subvectors having individual (parts), contiguous (superparts) and repeated (broadcasts) lanes. Because the instructions in the table are subvectors selected to have individual, contiguous and repeated lanes in the registers, compiler optimizations can be enhanced. Introduction of such dynamic programming allows for speculative lane optimizations, as well as improved analysis-guided optimizations, either of which can be performed alone or in combination with other optimizations, whether or not they make use of dynamic programming.
Type: Grant
Filed: December 17, 2020
Date of Patent: August 1, 2023
Assignee: HUAWEI TECHNOLOGIES CO., LTD.
Inventors: Amruth Sandhupatla, Ramshankar Ramanarayanan, Boris Kravchenko, Ahmed Mohammed Elshafiey Mohammed Eltantawy
-
Patent number: 11625250
Abstract: The disclosed systems, structures, and methods are directed to parallel processing of tasks in a multiple thread computing system. Execution of an instruction sequence of a thread allocated to a first task proceeds until an exit point of the instruction sequence is reached. The execution of the instruction sequence of the thread for the first task is terminated at a convergence point of the instruction sequence. The thread is selectively reallocated to process a second task.
Type: Grant
Filed: January 29, 2021
Date of Patent: April 11, 2023
Assignee: HUAWEI TECHNOLOGIES CO., LTD.
Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Yan Luo, Tyler Bryce Nowicki
-
Publication number: 20230101571
Abstract: Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.
Type: Application
Filed: October 28, 2022
Publication date: March 30, 2023
Inventors: Hazem A. ABDELHAFEZ, Ning XIE, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
-
Patent number: 11556319
Abstract: Systems and methods are described for extending a live range for a virtual scalar register during compiling of a program, comprising: receiving an intermediate representation (IR) of a source code configured for implementing single-instruction-multiple-thread (SIMT) execution, the IR representing the source code as control flow graph including a plurality of basic blocks (BB); and when a virtual scalar register defined in a first BB of the IR is last used in a second BB of the IR that is a divergent BB, modifying the IR to extend the live range of the virtual scalar register.
Type: Grant
Filed: September 1, 2020
Date of Patent: January 17, 2023
Assignee: HUAWEI TECHNOLOGIES CO., LTD.
Inventors: Abraham Davidson Fai Chung Chan, Tyler Bryce Nowicki, Guansong Zhang, Ahmed Mohammed ElShafiey Mohammed Eltantawy
-
Patent number: 11500641
Abstract: Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.
Type: Grant
Filed: October 7, 2020
Date of Patent: November 15, 2022
Assignee: HUAWEI TECHNOLOGIES CO., LTD.
Inventors: Hazem A. Abdelhafez, Ning Xie, Ahmed Mohammed ElShafiey Mohammed Eltantawy
-
Patent number: 11429359
Abstract: A method for improving the performance of applications executed within asynchronous processor architectures. In an embodiment, a method for improving execution time of compiled synchronized source code on an asynchronous processor architecture includes receiving, by a processing system, synchronized source code comprising synchronization instructions to synchronize execution of the synchronized source code on different pipelines of the asynchronous processor architecture. The method also includes analyzing, by the processing system, the synchronized source code to determine whether the synchronized source code includes a broken code condition.
Type: Grant
Filed: July 20, 2020
Date of Patent: August 30, 2022
Assignee: Huawei Technologies Co., Ltd.
Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Yaoqing Gao, Christopher Rodrigues, Lijuan Hai
-
Publication number: 20220244962
Abstract: The disclosed systems, structures, and methods are directed to parallel processing of tasks in a multiple thread computing system. Execution of an instruction sequence of a thread allocated to a first task proceeds until an exit point of the instruction sequence is reached. The execution of the instruction sequence of the thread for the first task is terminated at a convergence point of the instruction sequence. The thread is selectively reallocated to process a second task.
Type: Application
Filed: January 29, 2021
Publication date: August 4, 2022
Inventors: Ahmed Mohammed ElShafiey Mohammed ELTANTAWY, Yan LUO, Tyler Bryce NOWICKI
-
Patent number: 11397580
Abstract: Methods, devices and media for reducing register pressure in flexible vector processors are described. In various embodiments described herein, methods, devices and media are disclosed that selectively re-scalarize vector instructions in a sequence of instructions such that register pressure is reduced and thread level parallelism is increased. A compiler may be used to perform a first method to partially or fully scalarize vectorized instructions of a code region of high register pressure. A compiler may be used to perform a second method to fully scalarize a sequence of vectorized instructions while preserving associations of the scalar instructions with their original vectorized instructions; the scalar instructions may then be scheduled and selectively re-vectorized. Devices executing code compiled with either method are described, as are processor-readable media storing code compiled by either method.
Type: Grant
Filed: September 17, 2020
Date of Patent: July 26, 2022
Assignee: Huawei Technologies Co., Ltd.
Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Ning Xie
-
Patent number: 11397615
Abstract: Methods and systems for executing threads in a thread-group, for example for ray-tracing. The threads are processed to collect, for each thread, a respective set of function call indicators over a respective number of call instances. The function call indicators are reordered across all threads and all call instances, to coalesce identical function call indicators to a common call instance, and non-identical function call indicators are reordered to different call instances. Function calls are executed across the threads of the thread-group, according to the reordered and coalesced function call indicators. In ray-tracing applications, the threads represent rays, each call instance is a ray-hit of a ray, and each function call is a shader call.
Type: Grant
Filed: August 31, 2020
Date of Patent: July 26, 2022
Assignee: HUAWEI TECHNOLOGIES CO., LTD.
Inventors: Tyler Bryce Nowicki, Ahmed Mohammed Elshafiey Mohammed Eltantawy
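Illustration (not part of the patent record): the reordering described in the abstract above can be pictured with a minimal Python sketch. It is not the patented method. The function name `coalesce_calls`, the list-of-lists input, and the assumption that a thread's calls may be freely reordered are all hypothetical choices made for this example. Each distinct function call indicator gets its own call instance, and every thread that needs that indicator executes it there.

```python
from collections import Counter


def coalesce_calls(per_thread_calls):
    """Group identical function call indicators into common call instances.

    per_thread_calls: one list of indicators (e.g. shader IDs) per thread.
    Returns a list of (indicator, [thread indices]) pairs, one per call
    instance. Assumes, for illustration only, that calls within a thread
    are independent and may be reordered.
    """
    counts = [Counter(calls) for calls in per_thread_calls]

    # Distinct indicators in order of first appearance across all threads.
    order = []
    for calls in per_thread_calls:
        for indicator in calls:
            if indicator not in order:
                order.append(indicator)

    schedule = []
    for indicator in order:
        # A thread calling the same indicator k times needs k instances.
        rounds = max(count[indicator] for count in counts)
        for r in range(rounds):
            threads = [t for t, count in enumerate(counts) if count[indicator] > r]
            schedule.append((indicator, threads))
    return schedule


if __name__ == "__main__":
    # Three rays whose hits invoke shaders in different orders.
    for indicator, threads in coalesce_calls([["A", "B"], ["B", "A"], ["C", "A"]]):
        print(indicator, threads)
```

In this toy input every thread that calls shader "A" does so in the same call instance, so no instance mixes different shaders across the thread-group.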
-
Publication number: 20220197614
Abstract: A method and apparatus to optimize a list of vector instructions using dynamic programming, in particular memoization, by generating a table containing instruction subvectors having individual (parts), contiguous (superparts) and repeated (broadcasts) lanes. Because the instructions in the table are subvectors selected to have individual, contiguous and repeated lanes in the registers, compiler optimizations can be enhanced. Introduction of such dynamic programming allows for speculative lane optimizations, as well as improved analysis-guided optimizations, either of which can be performed alone or in combination with other optimizations, whether or not they make use of dynamic programming.
Type: Application
Filed: December 17, 2020
Publication date: June 23, 2022
Applicant: HUAWEI TECHNOLOGIES CO., LTD.
Inventors: Amruth SANDHUPATLA, Ramshankar RAMANARAYANAN, Boris KRAVCHENKO, Ahmed Mohammed Elshafiey Mohammed ELTANTAWY
-
Patent number: 11327760
Abstract: A method for grouping computer instructions includes receiving a set of computer instructions, grouping the set of computer instructions by register dependencies, identifying a plurality of single-definition-use flow (SDF) bundles based on a burstization criteria and a chaining criteria; and based on the SDF bundles, transforming the set of computer instructions. The transformation may include splitting one of the set of computer instructions and setting a burst parameter for the one of the set of computer instruction. The transformation may include grouping a plurality of the set of computer instructions and replacing a pair of register file accesses with a pair of temporary register accesses.
Type: Grant
Filed: April 9, 2020
Date of Patent: May 10, 2022
Assignee: HUAWEI TECHNOLOGIES CO., LTD.
Inventors: Andrew Siu Doug Lee, Ahmed Mohammed Elshafiey Mohammed Eltantawy
-
Publication number: 20220107811
Abstract: Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.
Type: Application
Filed: October 7, 2020
Publication date: April 7, 2022
Inventors: Hazem A. ABDELHAFEZ, Ning XIE, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
-
Publication number: 20220083337
Abstract: Methods, devices and media for reducing register pressure in flexible vector processors are described. In various embodiments described herein, methods, devices and media are disclosed that selectively re-scalarize vector instructions in a sequence of instructions such that register pressure is reduced and thread level parallelism is increased. A compiler may be used to perform a first method to partially or fully scalarize vectorized instructions of a code region of high register pressure. A compiler may be used to perform a second method to fully scalarize a sequence of vectorized instructions while preserving associations of the scalar instructions with their original vectorized instructions; the scalar instructions may then be scheduled and selectively re-vectorized. Devices executing code compiled with either method are described, as are processor-readable media storing code compiled by either method.
Type: Application
Filed: September 17, 2020
Publication date: March 17, 2022
Inventors: Ahmed Mohammed ElShafiey Mohammed ELTANTAWY, Ning XIE
-
Publication number: 20220066783
Abstract: Systems and methods are described for extending a live range for a virtual scalar register during compiling of a program, comprising: receiving an intermediate representation (IR) of a source code configured for implementing single-instruction-multiple-thread (SIMT) execution, the IR representing the source code as control flow graph including a plurality of basic blocks (BB); and when a virtual scalar register defined in a first BB of the IR is last used in a second BB of the IR that is a divergent BB, modifying the IR to extend the live range of the virtual scalar register.
Type: Application
Filed: September 1, 2020
Publication date: March 3, 2022
Inventors: Abraham Davidson Fai Chung CHAN, Tyler Bryce NOWICKI, Guansong ZHANG, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
-
Publication number: 20220066819
Abstract: Methods and systems for executing threads in a thread-group, for example for ray-tracing. The threads are processed to collect, for each thread, a respective set of function call indicators over a respective number of call instances. The function call indicators are reordered across all threads and all call instances, to coalesce identical function call indicators to a common call instance, and non-identical function call indicators are reordered to different call instances. Function calls are executed across the threads of the thread-group, according to the reordered and coalesced function call indicators. In ray-tracing applications, the threads represent rays, each call instance is a ray-hit of a ray, and each function call is a shader call.
Type: Application
Filed: August 31, 2020
Publication date: March 3, 2022
Inventors: Tyler Bryce NOWICKI, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
-
Patent number: 11188315
Abstract: The disclosed systems, apparatuses and methods are directed to optimizing by a compiler register resource allocation for functions of a module, using a Register File comprising a limited number of registers. After performing interprocedural analysis in the module, the compiler computes the number of registers used by each function, and compiles the function to final machine code, except at callsites where a call is detected to be made to another function. At each callsite and for each called function, the compiler expands call instructions to final machine code after computing and setting a relative index to be used by a called function for running in an available part of the Register File. The relative index optimizes register resource allocation by minimizing the number of spilled registers before a function is called.
Type: Grant
Filed: September 4, 2020
Date of Patent: November 30, 2021
Assignee: HUAWEI TECHNOLOGIES CO., LTD.
Inventors: Yan Luo, Ahmed Mohammed ElShafiey Mohammed Eltantawy, Tyler Bryce Nowicki
-
Publication number: 20210318875
Abstract: A method for grouping computer instructions includes receiving a set of computer instructions, grouping the set of computer instructions by register dependencies, identifying a plurality of single-definition-use flow (SDF) bundles based on a burstization criteria and a chaining criteria; and based on the SDF bundles, transforming the set of computer instructions. The transformation may include splitting one of the set of computer instructions and setting a burst parameter for the one of the set of computer instruction. The transformation may include grouping a plurality of the set of computer instructions and replacing a pair of register file accesses with a pair of temporary register accesses.
Type: Application
Filed: April 9, 2020
Publication date: October 14, 2021
Applicant: HUAWEI TECHNOLOGIES CO., LTD.
Inventors: Andrew Siu Doug LEE, Ahmed Mohammed Elshafiey Mohammed ELTANTAWY
-
Publication number: 20210004213
Abstract: A method for improving the performance of applications executed within asynchronous processor architectures. In an embodiment, a method for improving execution time of compiled synchronized source code on an asynchronous processor architecture includes receiving, by a processing system, synchronized source code comprising synchronization instructions to synchronize execution of the synchronized source code on different pipelines of the asynchronous processor architecture. The method also includes analyzing, by the processing system, the synchronized source code to determine whether the synchronized source code includes a broken code condition.
Type: Application
Filed: July 20, 2020
Publication date: January 7, 2021
Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Yaoqing Gao, Christopher Rodrigues, Lijuan Hai