Patents by Inventor Muawya M. Al-Otoom

Muawya M. Al-Otoom has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11630670
    Abstract: Techniques are disclosed relating to signature-based instruction prefetching. In some embodiments, processor pipeline circuitry executes a computer program that includes control transfer instructions, such that the execution follows a taken path through the computer program. First signature prefetch table circuitry indicates prefetch addresses for signatures generated using a first signature generation technique and second signature prefetch table circuitry indicates prefetch addresses for signatures generated using a second, different signature generation technique. Signature prefetch circuitry, in response to a prefetch training event, determines a first signature according to the first technique and a second signature according to the second technique and selects one but not both of the first and second signature prefetch tables to train using the first signature or the second signature.
    Type: Grant
    Filed: July 21, 2021
    Date of Patent: April 18, 2023
    Assignee: Apple Inc.
    Inventors: Douglas C. Holman, Ian D. Kountanis, Amit Kumar, Muawya M. Al-Otoom
  • Publication number: 20230023860
    Abstract: Techniques are disclosed relating to signature-based instruction prefetching. In some embodiments, processor pipeline circuitry executes a computer program that includes control transfer instructions, such that the execution follows a taken path through the computer program. First signature prefetch table circuitry indicates prefetch addresses for signatures generated using a first signature generation technique and second signature prefetch table circuitry indicates prefetch addresses for signatures generated using a second, different signature generation technique. Signature prefetch circuitry, in response to a prefetch training event, determines a first signature according to the first technique and a second signature according to the second technique and selects one but not both of the first and second signature prefetch tables to train using the first signature or the second signature.
    Type: Application
    Filed: July 21, 2021
    Publication date: January 26, 2023
    Inventors: Douglas C. Holman, Ian D. Kountanis, Amit Kumar, Muawya M. Al-Otoom
  • Patent number: 11416254
    Abstract: Systems, apparatuses, and methods for implementing zero cycle load bypass operations are described. A system includes a processor with at least a decode unit, control logic, mapper, and free list. When a load operation is detected, the control logic determines if the load operation qualifies to be converted to a zero cycle load bypass operation. Conditions for qualifying include the load operation being in the same decode group as an older store operation to the same address. Qualifying load operations are converted to zero cycle load bypass operations. A lookup of the free list is prevented for a zero cycle load bypass operation and a destination operand of the load is renamed with a same physical register identifier used for a source operand of the store. Also, the data of the store is bypassed to the load.
    Type: Grant
    Filed: December 5, 2019
    Date of Patent: August 16, 2022
    Assignee: Apple Inc.
    Inventors: Deepankar Duggal, Kulin N. Kothari, Conrado Blasco, Muawya M. Al-Otoom
  • Patent number: 11379240
    Abstract: In an embodiment, an indirect branch predictor generates indirect branch predictions based on one or more register values. The register values may be the contents of registers on which the indirect branch instruction is directly or indirectly dependent for generating the branch target address, for example. In an embodiment, at least one of the registers may be a source for a load instruction, and the indirect branch may be dependent (directly or indirectly) on the target of the load. In an embodiment, the indirect branch predictor may be one of at least two indirect branch predictors in a processor. The other indirect branch predictor may be based on a fetch address, or PC, associated with the indirect branch instruction. The other indirect branch predictor may generate a first predicted target address, and the indirect branch predictor may generate a second predicted target address for the same indirect branch instruction.
    Type: Grant
    Filed: January 31, 2020
    Date of Patent: July 5, 2022
    Assignee: Apple Inc.
    Inventors: Muawya M. Al-Otoom, Ian D. Kountanis, Conrado Blasco, Haoyan Jia, Amit Kumar
  • Patent number: 11200062
    Abstract: Systems, apparatuses, and methods for implementing a physical register last reference scheme are described. A system includes a processor with a mapper, history file, and freelist. When an entry in the mapper is updated with a new architectural register-to-physical register mapping, the processor creates a new history file entry for the given instruction that caused the update. The processor also searches the mapper to determine if the old physical register that was previously stored in the mapper entry is referenced by any other mapper entries. If there are no other mapper entries that reference this old physical register, then a last reference indicator is stored in the new history file entry. When the given instruction retires, the processor checks the last reference indicator in the history file entry to determine whether the old physical register can be returned to the freelist of available physical registers.
    Type: Grant
    Filed: August 26, 2019
    Date of Patent: December 14, 2021
    Assignee: Apple Inc.
    Inventors: Deepankar Duggal, Conrado Blasco, Muawya M. Al-Otoom, Richard F. Russo
  • Publication number: 20210240477
    Abstract: In an embodiment, an indirect branch predictor generates indirect branch predictions based on one or more register values. The register values may be the contents of registers on which the indirect branch instruction is directly or indirectly dependent for generating the branch target address, for example. In an embodiment, at least one of the registers may be a source for a load instruction, and the indirect branch may be dependent (directly or indirectly) on the target of the load. In an embodiment, the indirect branch predictor may be one of at least two indirect branch predictors in a processor. The other indirect branch predictor may be based on a fetch address, or PC, associated with the indirect branch instruction. The other indirect branch predictor may generate a first predicted target address, and the indirect branch predictor may generate a second predicted target address for the same indirect branch instruction.
    Type: Application
    Filed: January 31, 2020
    Publication date: August 5, 2021
    Inventors: Muawya M. Al-Otoom, Ian D. Kountanis, Conrado Blasco, Haoyan Jia, Amit Kumar
  • Publication number: 20210173654
    Abstract: Systems, apparatuses, and methods for implementing zero cycle load bypass operations are described. A system includes a processor with at least a decode unit, control logic, mapper, and free list. When a load operation is detected, the control logic determines if the load operation qualifies to be converted to a zero cycle load bypass operation. Conditions for qualifying include the load operation being in the same decode group as an older store operation to the same address. Qualifying load operations are converted to zero cycle load bypass operations. A lookup of the free list is prevented for a zero cycle load bypass operation and a destination operand of the load is renamed with a same physical register identifier used for a source operand of the store. Also, the data of the store is bypassed to the load.
    Type: Application
    Filed: December 5, 2019
    Publication date: June 10, 2021
    Inventors: Deepankar Duggal, Kulin N. Kothari, Conrado Blasco, Muawya M. Al-Otoom
  • Publication number: 20210064376
    Abstract: Systems, apparatuses, and methods for implementing a physical register last reference scheme are described. A system includes a processor with a mapper, history file, and freelist. When an entry in the mapper is updated with a new architectural register-to-physical register mapping, the processor creates a new history file entry for the given instruction that caused the update. The processor also searches the mapper to determine if the old physical register that was previously stored in the mapper entry is referenced by any other mapper entries. If there are no other mapper entries that reference this old physical register, then a last reference indicator is stored in the new history file entry. When the given instruction retires, the processor checks the last reference indicator in the history file entry to determine whether the old physical register can be returned to the freelist of available physical registers.
    Type: Application
    Filed: August 26, 2019
    Publication date: March 4, 2021
    Inventors: Deepankar Duggal, Conrado Blasco, Muawya M. Al-Otoom, Richard F. Russo
  • Patent number: 10838729
    Abstract: A system and method for efficiently reducing the latency and power of memory access operations. A processor includes a stack pointer (SP) load-store dependence (LSD) predictor which predicts whether a memory dependence exists on a store instruction. The processor also includes a register file (RF) LSD predictor which predicts whether a memory dependence exists on a store instruction or a load instruction by a subsequent load instruction in program order. Each of the SP-LSD predictor and the RF-LSD predictor predicts and performs register renaming in a pipeline stage earlier than a renaming pipeline stage. The RF-LSD predictor also determines whether any intervening instructions between a producer memory instruction and a consumer memory instruction modify a predicted dependence.
    Type: Grant
    Filed: March 21, 2018
    Date of Patent: November 17, 2020
    Assignee: Apple Inc.
    Inventors: Muawya M. Al-Otoom, Conrado Blasco, Deepankar Duggal, Kulin N. Kothari, Richard F. Russo
  • Patent number: 10719327
    Abstract: In some embodiments, a branch prediction unit includes a plurality of branch prediction circuits and selection logic. At least two of the branch prediction circuits are configured, based on an address of a branch instruction and different sets of history information, to provide a corresponding branch prediction for the branch instruction. At least one storage element of the at least two branch prediction circuits is set associative. The selection logic is configured to select a particular branch prediction output by one of the branch prediction circuits as a current branch prediction output of the branch prediction unit. In some instances, the branch prediction unit may be less likely to replace branch prediction information, as compared to a different branch prediction unit that does not include a set associative storage element. In some embodiments, this arrangement may lead to increased performance of the branch prediction unit.
    Type: Grant
    Filed: May 19, 2015
    Date of Patent: July 21, 2020
    Assignee: Apple Inc.
    Inventors: Muawya M. Al-Otoom, Ian D. Kountanis, Conrado Blasco
  • Patent number: 10209989
    Abstract: A vector reduction instruction is executed by a processor to provide efficient reduction operations on an array of data elements. The processor includes vector registers. Each vector register is divided into a plurality of lanes, and each lane stores the same number of data elements. The processor also includes execution circuitry that receives the vector reduction instruction to reduce the array of data elements stored in a source operand into a result in a destination operand using a reduction operator. Each of the source operand and the destination operand is one of the vector registers. Responsive to the vector reduction instruction, the execution circuitry applies the reduction operator to two of the data elements in each lane, and shifts one or more remaining data elements when there is at least one of the data elements remaining in each lane.
    Type: Grant
    Filed: March 7, 2017
    Date of Patent: February 19, 2019
    Assignee: Intel Corporation
    Inventors: Paul Caprioli, Abhay S. Kanhere, Jeffrey J. Cook, Muawya M. Al-Otoom
  • Publication number: 20170242699
    Abstract: A vector reduction instruction is executed by a processor to provide efficient reduction operations on an array of data elements. The processor includes vector registers. Each vector register is divided into a plurality of lanes, and each lane stores the same number of data elements. The processor also includes execution circuitry that receives the vector reduction instruction to reduce the array of data elements stored in a source operand into a result in a destination operand using a reduction operator. Each of the source operand and the destination operand is one of the vector registers. Responsive to the vector reduction instruction, the execution circuitry applies the reduction operator to two of the data elements in each lane, and shifts one or more remaining data elements when there is at least one of the data elements remaining in each lane.
    Type: Application
    Filed: March 7, 2017
    Publication date: August 24, 2017
    Inventors: Paul Caprioli, Abhay S. Kanhere, Jeffrey J. Cook, Muawya M. Al-Otoom
  • Publication number: 20170212825
    Abstract: A hardware profiling mechanism implemented by performance monitoring hardware enables page level automatic binary translation. The hardware during runtime identifies a code page in memory containing potentially optimizable instructions. The hardware requests allocation of a new page in memory associated with the code page, where the new page contains a collection of counters and each of the counters corresponds to one of the instructions in the code page. When the hardware detects a branch instruction having a branch target within the code page, it increments one of the counters that has the same position in the new page as the branch target in the code page. The execution of the code page is repeated and the counters are incremented when branch targets fall within the code page. The hardware then provides the counter values in the new page to a binary translator for binary translation.
    Type: Application
    Filed: January 10, 2017
    Publication date: July 27, 2017
    Inventors: Paul Caprioli, Matthew C. Merten, Muawya M. Al-Otoom, Omar M. Shaikh, Abhay S. Kanhere, Suresh Srinivas, Koichi Yamada, Vivek Thakkar, Pawel Osciak
  • Patent number: 9652234
    Abstract: A dynamic optimization of code for a processor-specific dynamic binary translation of hot code pages (e.g., frequently executed code pages) may be provided by a run-time translation layer. A method may be provided to use an instruction look-aside buffer (iTLB) to map original code pages and translated code pages. The method may comprise fetching an instruction from an original code page, determining whether the fetched instruction is a first instruction of a new code page and whether the original code page is deprecated. If both determinations return yes, the method may further comprise fetching a next instruction from a translated code page. If either determination returns no, the method may further comprise decoding the instruction and fetching the next instruction from the original code page.
    Type: Grant
    Filed: September 30, 2011
    Date of Patent: May 16, 2017
    Assignee: Intel Corporation
    Inventors: Paul Caprioli, Martin G. Dixon, Brett L. Toll, Muawya M. Al-Otoom, Omar M. Shaikh
  • Patent number: 9632791
    Abstract: Techniques are disclosed relating to a cache for patterns of instructions. In some embodiments, an apparatus includes an instruction cache and is configured to detect a pattern of execution of instructions by an instruction processing pipeline. The pattern of execution may involve execution of only instructions in a particular group of instructions. The instructions may include multiple backward control transfers and/or a control transfer instruction that is taken in one iteration of the pattern and not taken in another iteration of the pattern. The apparatus may be configured to store the instructions in the instruction cache and fetch and execute the instructions from the instruction cache. The apparatus may include a branch predictor dedicated to predicting the direction of control transfer instructions for the instruction cache. Various embodiments may reduce power consumption associated with instruction processing.
    Type: Grant
    Filed: January 21, 2014
    Date of Patent: April 25, 2017
    Assignee: Apple Inc.
    Inventors: Muawya M. Al-Otoom, Ian D. Kountanis, Ronald P. Hall, Michael L. Karm
  • Publication number: 20170097891
    Abstract: Systems, apparatuses, and methods for improving transactional memory (TM) throughput using a TM region indicator (or color) are described. Through the use of TM region indicators, younger TM regions can have their instructions retired while waiting for older TM regions to commit.
    Type: Application
    Filed: December 16, 2016
    Publication date: April 6, 2017
    Inventors: Omar M. Shaikh, Ravi Rajwar, Paul Caprioli, Muawya M. Al-Otoom
  • Publication number: 20170097826
    Abstract: Systems, apparatuses, and methods for improving transactional memory (TM) throughput using a TM region indicator (or color) are described. Through the use of TM region indicators, younger TM regions can have their instructions retired while waiting for older TM regions to commit.
    Type: Application
    Filed: December 16, 2016
    Publication date: April 6, 2017
    Inventors: Omar M. Shaikh, Ravi Rajwar, Paul Caprioli, Muawya M. Al-Otoom
  • Patent number: 9588766
    Abstract: A vector reduction instruction is executed by a processor to provide efficient reduction operations on an array of data elements. The processor includes vector registers. Each vector register is divided into a plurality of lanes, and each lane stores the same number of data elements. The processor also includes execution circuitry that receives the vector reduction instruction to reduce the array of data elements stored in a source operand into a result in a destination operand using a reduction operator. Each of the source operand and the destination operand is one of the vector registers. Responsive to the vector reduction instruction, the execution circuitry applies the reduction operator to two of the data elements in each lane, and shifts one or more remaining data elements when there is at least one of the data elements remaining in each lane.
    Type: Grant
    Filed: September 28, 2012
    Date of Patent: March 7, 2017
    Assignee: Intel Corporation
    Inventors: Paul Caprioli, Abhay S. Kanhere, Jeffrey J. Cook, Muawya M. Al-Otoom
  • Patent number: 9542191
    Abstract: A hardware profiling mechanism implemented by performance monitoring hardware enables page level automatic binary translation. The hardware during runtime identifies a code page in memory containing potentially optimizable instructions. The hardware requests allocation of a new page in memory associated with the code page, where the new page contains a collection of counters and each of the counters corresponds to one of the instructions in the code page. When the hardware detects a branch instruction having a branch target within the code page, it increments one of the counters that has the same position in the new page as the branch target in the code page. The execution of the code page is repeated and the counters are incremented when branch targets fall within the code page. The hardware then provides the counter values in the new page to a binary translator for binary translation.
    Type: Grant
    Filed: March 30, 2012
    Date of Patent: January 10, 2017
    Assignee: Intel Corporation
    Inventors: Paul Caprioli, Matthew C. Merten, Muawya M. Al-Otoom, Omar M. Shaikh, Abhay S. Kanhere, Suresh Srinivas, Koichi Yamada, Vivek Thakkar, Pawel Osciak
  • Publication number: 20160350221
    Abstract: Systems, apparatuses, and methods for improving transactional memory (TM) throughput using a TM region indicator (or color) are described. Through the use of TM region indicators, younger TM regions can have their instructions retired while waiting for older TM regions to commit.
    Type: Application
    Filed: August 9, 2016
    Publication date: December 1, 2016
    Inventors: Omar M. Shaikh, Ravi Rajwar, Paul Caprioli, Muawya M. Al-Otoom
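
Illustrative sketch (not part of the patent record): the physical register last reference scheme described in the abstract for patent 11200062 (publication 20210064376) can be modeled in a few lines of software. The class name `RenameSketch`, its methods, the list-based mapper and freelist, and the option of sharing an existing physical register are all assumptions made for illustration; the abstract does not specify any of them, and a real design would also have to handle flushes and speculation, which this sketch ignores.

```python
class RenameSketch:
    """Toy model of a mapper, history file, and freelist (hypothetical names)."""

    def __init__(self, num_arch, num_phys):
        # Assumption: architectural register i starts mapped to physical register i.
        self.mapper = list(range(num_arch))
        self.freelist = list(range(num_arch, num_phys))
        self.history = []  # history file entries, oldest first

    def rename_dest(self, arch_reg, new_phys=None):
        """Update one mapper entry and record a history file entry.

        If new_phys is None, a register is taken from the freelist;
        otherwise an existing physical register is shared (assumed to
        model cases where two architectural registers alias one
        physical register).
        """
        if new_phys is None:
            new_phys = self.freelist.pop(0)
        old_phys = self.mapper[arch_reg]
        # Search the other mapper entries for references to the old register.
        still_referenced = any(
            phys == old_phys
            for reg, phys in enumerate(self.mapper)
            if reg != arch_reg
        )
        self.mapper[arch_reg] = new_phys
        entry = {
            "arch": arch_reg,
            "old_phys": old_phys,
            "last_ref": not still_referenced,  # last reference indicator
        }
        self.history.append(entry)
        return entry

    def retire_oldest(self):
        """Retire the oldest instruction; free its old register if flagged."""
        entry = self.history.pop(0)
        if entry["last_ref"]:
            self.freelist.append(entry["old_phys"])
        return entry


rs = RenameSketch(num_arch=4, num_phys=8)
first = rs.rename_dest(1, new_phys=0)  # r1 now shares physical register 0 with r0
second = rs.rename_dest(0)             # r0 remapped; register 0 is still used by r1
third = rs.rename_dest(1)              # r1 remapped; no mapper entry uses register 0
for _ in range(3):
    rs.retire_oldest()
```

In this run, the second update does not set the last reference indicator because another mapper entry still points at the old physical register, so that register returns to the freelist only when the third instruction retires.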