Patents by Inventor Ahmed Mohammed ElShafiey Mohammed Eltantawy

Ahmed Mohammed ElShafiey Mohammed Eltantawy has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11714619
    Abstract: A method and apparatus to optimize a list of vector instructions using dynamic programming, in particular memoization, by generating a table containing instruction subvectors having individual (parts), contiguous (superparts) and repeated (broadcasts) lanes. Because the instructions in the table are subvectors selected to have individual, contiguous and repeated lanes in the registers, compiler optimizations can be enhanced. Introduction of such dynamic programming allows for speculative lane optimizations, as well as improved analysis-guided optimizations, either of which can be performed alone or in combination with other optimizations, whether or not they make use of dynamic programming.
    Type: Grant
    Filed: December 17, 2020
    Date of Patent: August 1, 2023
    Assignee: HUAWEI TECHNOLOGIES CO., LTD.
    Inventors: Amruth Sandhupatla, Ramshankar Ramanarayanan, Boris Kravchenko, Ahmed Mohammed ElShafiey Mohammed Eltantawy
  • Patent number: 11625250
    Abstract: The disclosed systems, structures, and methods are directed to parallel processing of tasks in a multiple thread computing system. Execution of an instruction sequence of a thread allocated to a first task proceeds until an exit point of the instruction sequence is reached. The execution of the instruction sequence of the thread for the first task is terminated at a convergence point of the instruction sequence. The thread is selectively reallocated to process a second task.
    Type: Grant
    Filed: January 29, 2021
    Date of Patent: April 11, 2023
    Assignee: HUAWEI TECHNOLOGIES CO., LTD.
    Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Yan Luo, Tyler Bryce Nowicki
  • Publication number: 20230101571
    Abstract: Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.
    Type: Application
    Filed: October 28, 2022
    Publication date: March 30, 2023
    Inventors: Hazem A. ABDELHAFEZ, Ning XIE, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
  • Patent number: 11556319
    Abstract: Systems and methods are described for extending a live range for a virtual scalar register during compiling of a program, comprising: receiving an intermediate representation (IR) of a source code configured for implementing single-instruction-multiple-thread (SIMT) execution, the IR representing the source code as a control flow graph including a plurality of basic blocks (BB); and when a virtual scalar register defined in a first BB of the IR is last used in a second BB of the IR that is a divergent BB, modifying the IR to extend the live range of the virtual scalar register.
    Type: Grant
    Filed: September 1, 2020
    Date of Patent: January 17, 2023
    Assignee: HUAWEI TECHNOLOGIES CO., LTD.
    Inventors: Abraham Davidson Fai Chung Chan, Tyler Bryce Nowicki, Guansong Zhang, Ahmed Mohammed ElShafiey Mohammed Eltantawy
  • Patent number: 11500641
    Abstract: Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.
    Type: Grant
    Filed: October 7, 2020
    Date of Patent: November 15, 2022
    Assignee: HUAWEI TECHNOLOGIES CO., LTD.
    Inventors: Hazem A. Abdelhafez, Ning Xie, Ahmed Mohammed ElShafiey Mohammed Eltantawy
  • Patent number: 11429359
    Abstract: A method for improving the performance of applications executed within asynchronous processor architectures. In an embodiment, a method for improving execution time of compiled synchronized source code on an asynchronous processor architecture includes receiving, by a processing system, synchronized source code comprising synchronization instructions to synchronize execution of the synchronized source code on different pipelines of the asynchronous processor architecture. The method also includes analyzing, by the processing system, the synchronized source code to determine whether the synchronized source code includes a broken code condition.
    Type: Grant
    Filed: July 20, 2020
    Date of Patent: August 30, 2022
    Assignee: Huawei Technologies Co., Ltd.
    Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Yaoqing Gao, Christopher Rodrigues, Lijuan Hai
  • Publication number: 20220244962
    Abstract: The disclosed systems, structures, and methods are directed to parallel processing of tasks in a multiple thread computing system. Execution of an instruction sequence of a thread allocated to a first task proceeds until an exit point of the instruction sequence is reached. The execution of the instruction sequence of the thread for the first task is terminated at a convergence point of the instruction sequence. The thread is selectively reallocated to process a second task.
    Type: Application
    Filed: January 29, 2021
    Publication date: August 4, 2022
    Inventors: Ahmed Mohammed ElShafiey Mohammed ELTANTAWY, Yan LUO, Tyler Bryce NOWICKI
  • Patent number: 11397615
    Abstract: Methods and systems for executing threads in a thread-group, for example for ray-tracing. The threads are processed to collect, for each thread, a respective set of function call indicators over a respective number of call instances. The function call indicators are reordered across all threads and all call instances, to coalesce identical function call indicators to a common call instance, and non-identical function call indicators are reordered to different call instances. Function calls are executed across the threads of the thread-group, according to the reordered and coalesced function call indicators. In ray-tracing applications, the threads represent rays, each call instance is a ray-hit of a ray, and each function call is a shader call.
    Type: Grant
    Filed: August 31, 2020
    Date of Patent: July 26, 2022
    Assignee: HUAWEI TECHNOLOGIES CO., LTD.
    Inventors: Tyler Bryce Nowicki, Ahmed Mohammed ElShafiey Mohammed Eltantawy
  • Patent number: 11397580
    Abstract: Methods, devices and media for reducing register pressure in flexible vector processors are described. In various embodiments described herein, methods, devices and media are disclosed that selectively re-scalarize vector instructions in a sequence of instructions such that register pressure is reduced and thread level parallelism is increased. A compiler may be used to perform a first method to partially or fully scalarize vectorized instructions of a code region of high register pressure. A compiler may be used to perform a second method to fully scalarize a sequence of vectorized instructions while preserving associations of the scalar instructions with their original vectorized instructions; the scalar instructions may then be scheduled and selectively re-vectorized. Devices executing code compiled with either method are described, as are processor-readable media storing code compiled by either method.
    Type: Grant
    Filed: September 17, 2020
    Date of Patent: July 26, 2022
    Assignee: Huawei Technologies Co., Ltd.
    Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Ning Xie
  • Publication number: 20220197614
    Abstract: A method and apparatus to optimize a list of vector instructions using dynamic programming, in particular memoization, by generating a table containing instruction subvectors having individual (parts), contiguous (superparts) and repeated (broadcasts) lanes. Because the instructions in the table are subvectors selected to have individual, contiguous and repeated lanes in the registers, compiler optimizations can be enhanced. Introduction of such dynamic programming allows for speculative lane optimizations, as well as improved analysis-guided optimizations, either of which can be performed alone or in combination with other optimizations, whether or not they make use of dynamic programming.
    Type: Application
    Filed: December 17, 2020
    Publication date: June 23, 2022
    Applicant: HUAWEI TECHNOLOGIES CO., LTD.
    Inventors: Amruth SANDHUPATLA, Ramshankar RAMANARAYANAN, Boris KRAVCHENKO, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
  • Patent number: 11327760
    Abstract: A method for grouping computer instructions includes receiving a set of computer instructions, grouping the set of computer instructions by register dependencies, identifying a plurality of single-definition-use flow (SDF) bundles based on a burstization criteria and a chaining criteria, and, based on the SDF bundles, transforming the set of computer instructions. The transformation may include splitting one of the set of computer instructions and setting a burst parameter for the one of the set of computer instructions. The transformation may include grouping a plurality of the set of computer instructions and replacing a pair of register file accesses with a pair of temporary register accesses.
    Type: Grant
    Filed: April 9, 2020
    Date of Patent: May 10, 2022
    Assignee: HUAWEI TECHNOLOGIES CO., LTD.
    Inventors: Andrew Siu Doug Lee, Ahmed Mohammed ElShafiey Mohammed Eltantawy
  • Publication number: 20220107811
    Abstract: Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.
    Type: Application
    Filed: October 7, 2020
    Publication date: April 7, 2022
    Inventors: Hazem A. ABDELHAFEZ, Ning XIE, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
  • Publication number: 20220083337
    Abstract: Methods, devices and media for reducing register pressure in flexible vector processors are described. In various embodiments described herein, methods, devices and media are disclosed that selectively re-scalarize vector instructions in a sequence of instructions such that register pressure is reduced and thread level parallelism is increased. A compiler may be used to perform a first method to partially or fully scalarize vectorized instructions of a code region of high register pressure. A compiler may be used to perform a second method to fully scalarize a sequence of vectorized instructions while preserving associations of the scalar instructions with their original vectorized instructions; the scalar instructions may then be scheduled and selectively re-vectorized. Devices executing code compiled with either method are described, as are processor-readable media storing code compiled by either method.
    Type: Application
    Filed: September 17, 2020
    Publication date: March 17, 2022
    Inventors: Ahmed Mohammed ElShafiey Mohammed ELTANTAWY, Ning XIE
  • Publication number: 20220066819
    Abstract: Methods and systems for executing threads in a thread-group, for example for ray-tracing. The threads are processed to collect, for each thread, a respective set of function call indicators over a respective number of call instances. The function call indicators are reordered across all threads and all call instances, to coalesce identical function call indicators to a common call instance, and non-identical function call indicators are reordered to different call instances. Function calls are executed across the threads of the thread-group, according to the reordered and coalesced function call indicators. In ray-tracing applications, the threads represent rays, each call instance is a ray-hit of a ray, and each function call is a shader call.
    Type: Application
    Filed: August 31, 2020
    Publication date: March 3, 2022
    Inventors: Tyler Bryce NOWICKI, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
  • Publication number: 20220066783
    Abstract: Systems and methods are described for extending a live range for a virtual scalar register during compiling of a program, comprising: receiving an intermediate representation (IR) of a source code configured for implementing single-instruction-multiple-thread (SIMT) execution, the IR representing the source code as a control flow graph including a plurality of basic blocks (BB); and when a virtual scalar register defined in a first BB of the IR is last used in a second BB of the IR that is a divergent BB, modifying the IR to extend the live range of the virtual scalar register.
    Type: Application
    Filed: September 1, 2020
    Publication date: March 3, 2022
    Inventors: Abraham Davidson Fai Chung CHAN, Tyler Bryce NOWICKI, Guansong ZHANG, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
  • Patent number: 11188315
    Abstract: The disclosed systems, apparatuses and methods are directed to optimizing, by a compiler, register resource allocation for functions of a module, using a Register File comprising a limited number of registers. After performing interprocedural analysis in the module, the compiler computes the number of registers used by each function, and compiles the function to final machine code, except at callsites where a call is detected to be made to another function. At each callsite and for each called function, the compiler expands call instructions to final machine code after computing and setting a relative index to be used by a called function for running in an available part of the Register File. The relative index optimizes register resource allocation by minimizing the number of spilled registers before a function is called.
    Type: Grant
    Filed: September 4, 2020
    Date of Patent: November 30, 2021
    Assignee: HUAWEI TECHNOLOGIES CO., LTD.
    Inventors: Yan Luo, Ahmed Mohammed ElShafiey Mohammed Eltantawy, Tyler Bryce Nowicki
  • Publication number: 20210318875
    Abstract: A method for grouping computer instructions includes receiving a set of computer instructions, grouping the set of computer instructions by register dependencies, identifying a plurality of single-definition-use flow (SDF) bundles based on a burstization criteria and a chaining criteria, and, based on the SDF bundles, transforming the set of computer instructions. The transformation may include splitting one of the set of computer instructions and setting a burst parameter for the one of the set of computer instructions. The transformation may include grouping a plurality of the set of computer instructions and replacing a pair of register file accesses with a pair of temporary register accesses.
    Type: Application
    Filed: April 9, 2020
    Publication date: October 14, 2021
    Applicant: HUAWEI TECHNOLOGIES CO., LTD.
    Inventors: Andrew Siu Doug LEE, Ahmed Mohammed ElShafiey Mohammed ELTANTAWY
  • Publication number: 20210004213
    Abstract: A method for improving the performance of applications executed within asynchronous processor architectures. In an embodiment, a method for improving execution time of compiled synchronized source code on an asynchronous processor architecture includes receiving, by a processing system, synchronized source code comprising synchronization instructions to synchronize execution of the synchronized source code on different pipelines of the asynchronous processor architecture. The method also includes analyzing, by the processing system, the synchronized source code to determine whether the synchronized source code includes a broken code condition.
    Type: Application
    Filed: July 20, 2020
    Publication date: January 7, 2021
    Inventors: Ahmed Mohammed ElShafiey Mohammed Eltantawy, Yaoqing Gao, Christopher Rodrigues, Lijuan Hai
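
Illustrative sketch (not taken from any of the patents listed): the abstract of patent 11397615 describes reordering per-thread function call indicators so that identical calls across a thread-group land on a common call instance. The Python below is one simple way such coalescing could look, under assumptions that are not stated in the abstract: the calls of a thread are treated as independent of one another (so they may be reordered freely), and the function name coalesce_calls, the string shader IDs, and the sorted visiting order are all hypothetical choices made for the example.

```python
from collections import Counter


def coalesce_calls(calls_per_thread):
    """Group identical function call indicators into shared call instances.

    calls_per_thread: one list of function IDs per thread (hypothetical
    input format). Returns a list of (function_id, thread_indices) pairs,
    one pair per coalesced call instance.

    Assumption: calls within a thread are order-independent.
    """
    # Count how many times each thread wants to call each function.
    counts = [Counter(calls) for calls in calls_per_thread]
    # Visit function IDs in a fixed (sorted) order for determinism.
    function_ids = sorted({f for c in counts for f in c})

    schedule = []
    for func in function_ids:
        # The busiest thread decides how many instances this function needs.
        repeats = max(c[func] for c in counts)
        for k in range(repeats):
            # Threads that still have a pending call to func join instance k.
            threads = [t for t, c in enumerate(counts) if c[func] > k]
            schedule.append((func, threads))
    return schedule


# Three "rays", each hitting two surfaces with different shaders.
rays = [["A", "B"], ["B", "A"], ["C", "A"]]
for func, threads in coalesce_calls(rays):
    print(func, threads)
```

In this example the original ordering mixes three different shaders in the first call instance and two in the second, while the coalesced schedule runs each shader once over all the threads that need it. A real implementation would also have to respect any ordering constraints between the hits of a ray, which this sketch deliberately ignores.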