Patents by Inventor Muawya M. Al-Otoom
Muawya M. Al-Otoom has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11630670Abstract: Techniques are disclosed relating to signature-based instruction prefetching. In some embodiments, processor pipeline circuitry executes a computer program that includes control transfer instructions, such that the execution follows a taken path through the computer program. First signature prefetch table circuitry indicates prefetch addresses for signatures generated using a first signature generation technique and second signature prefetch table circuitry indicates prefetch addresses for signatures generated using a second, different signature generation technique. Signature prefetch circuitry, in response to a prefetch training event, determines a first signature according to the first technique and a second signature according to the second technique and selects one but not both of the first and second signature prefetch tables to train using the first signature or the second signature.Type: GrantFiled: July 21, 2021Date of Patent: April 18, 2023Assignee: Apple Inc.Inventors: Douglas C. Holman, Ian D. Kountanis, Amit Kumar, Muawya M. Al-Otoom
-
Publication number: 20230023860Abstract: Techniques are disclosed relating to signature-based instruction prefetching. In some embodiments, processor pipeline circuitry executes a computer program that includes control transfer instructions, such that the execution follows a taken path through the computer program. First signature prefetch table circuitry indicates prefetch addresses for signatures generated using a first signature generation technique and second signature prefetch table circuitry that indicates prefetch addresses for signatures generated using a second, different signature generation technique. Signature prefetch circuitry, in response to a prefetch training event: determines a first signature according to the first technique and a second signature according to the second technique and selects one but not both of the first and second signature prefetch tables to train using the first signature or the second signature.Type: ApplicationFiled: July 21, 2021Publication date: January 26, 2023Inventors: Douglas C. Holman, Ian D. Kountanis, Amit Kumar, Muawya M. Al-Otoom
-
Patent number: 11416254Abstract: Systems, apparatuses, and methods for implementing zero cycle load bypass operations are described. A system includes a processor with at least a decode unit, control logic, mapper, and free list. When a load operation is detected, the control logic determines if the load operation qualifies to be converted to a zero cycle load bypass operation. Conditions for qualifying include the load operation being in the same decode group as an older store operation to the same address. Qualifying load operations are converted to zero cycle load bypass operations. A lookup of the free list is prevented for a zero cycle load bypass operation and a destination operand of the load is renamed with a same physical register identifier used for a source operand of the store. Also, the data of the store is bypassed to the load.Type: GrantFiled: December 5, 2019Date of Patent: August 16, 2022Assignee: Apple Inc.Inventors: Deepankar Duggal, Kulin N. Kothari, Conrado Blasco, Muawya M. Al-Otoom
-
Patent number: 11379240Abstract: In an embodiment, an indirect branch predictor generates indirect branch predictions based on one or more register values. The register values may be the contents of registers on which the indirect branch instruction is directly or indirectly dependent for generating the branch target address, for example. In an embodiment, at least one of the registers may be a source for a load instruction, and the indirect branch may be dependent (directly or indirectly) on the target of the load. In an embodiment, the indirect branch predictor may be one of at least two indirect branch predictors in a processor. The other indirect branch predictor may be based on a fetch address, or PC, associated with the indirect branch instruction. The other indirect branch predictor may generate a first predicted target address, and the indirect branch predictor may generate a second predicted target address for the same indirect branch instruction.Type: GrantFiled: January 31, 2020Date of Patent: July 5, 2022Assignee: Apple Inc.Inventors: Muawya M. Al-Otoom, Ian D. Kountanis, Conrado Blasco, Haoyan Jia, Amit Kumar
-
Patent number: 11200062Abstract: Systems, apparatuses, and methods for implementing a physical register last reference scheme are described. A system includes a processor with a mapper, history file, and freelist. When an entry in the mapper is updated with a new architectural register-to-physical register mapping, the processor creates a new history file entry for the given instruction that caused the update. The processor also searches the mapper to determine if the old physical register that was previously stored in the mapper entry is referenced by any other mapper entries. If there are no other mapper entries that reference this old physical register, then a last reference indicator is stored in the new history file entry. When the given instruction retires, the processor checks the last reference indicator in the history file entry to determine whether the old physical register can be returned to the freelist of available physical registers.Type: GrantFiled: August 26, 2019Date of Patent: December 14, 2021Assignee: Apple Inc.Inventors: Deepankar Duggal, Conrado Blasco, Muawya M. Al-Otoom, Richard F. Russo
-
Publication number: 20210240477Abstract: In an embodiment, an indirect branch predictor generates indirect branch predictions based on one or more register values. The register values may be the contents of registers on which the indirect branch instruction is directly or indirectly dependent for generating the branch target address, for example. In an embodiment, at least one of the registers may be a source for a load instruction, and the indirect branch may be dependent (directly or indirectly) on the target of the load. In an embodiment, the indirect branch predictor may be one of at least two indirect branch predictors in a processor. The other indirect branch predictor may be based on a fetch address, or PC, associated with the indirect branch instruction. The other indirect branch predictor may generate a first predicted target address, and the indirect branch predictor may generate a second predicted target address for the same indirect branch instruction.Type: ApplicationFiled: January 31, 2020Publication date: August 5, 2021Inventors: Muawya M. Al-Otoom, Ian D. Kountanis, Conrado Blasco, Haoyan Jia, Amit Kumar
-
Publication number: 20210173654Abstract: Systems, apparatuses, and methods for implementing zero cycle load bypass operations are described. A system includes a processor with at least a decode unit, control logic, mapper, and free list. When a load operation is detected, the control logic determines if the load operation qualifies to be converted to a zero cycle load bypass operation. Conditions for qualifying include the load operation being in the same decode group as an older store operation to the same address. Qualifying load operations are converted to zero cycle load bypass operations. A lookup of the free list is prevented for a zero cycle load bypass operation and a destination operand of the load is renamed with a same physical register identifier used for a source operand of the store. Also, the data of the store is bypassed to the load.Type: ApplicationFiled: December 5, 2019Publication date: June 10, 2021Inventors: Deepankar Duggal, Kulin N. Kothari, Conrado Blasco, Muawya M. Al-Otoom
-
Publication number: 20210064376Abstract: Systems, apparatuses, and methods for implementing a physical register last reference scheme are described. A system includes a processor with a mapper, history file, and freelist. When an entry in the mapper is updated with a new architectural register-to-physical register mapping, the processor creates a new history file entry for the given instruction that caused the update. The processor also searches the mapper to determine if the old physical register that was previously stored in the mapper entry is referenced by any other mapper entries. If there are no other mapper entries that reference this old physical register, then a last reference indicator is stored in the new history file entry. When the given instruction retires, the processor checks the last reference indicator in the history file entry to determine whether the old physical register can be returned to the freelist of available physical registers.Type: ApplicationFiled: August 26, 2019Publication date: March 4, 2021Inventors: Deepankar Duggal, Conrado Blasco, Muawya M. Al-Otoom, Richard F. Russo
-
Patent number: 10838729Abstract: A system and method for efficiently reducing the latency and power of memory access operations. A processor includes a stack pointer (SP) load-store dependence (LSD) predictor which predicts whether a memory dependence exists on a store instruction. The processor also includes a register file (RF) LSD predictor which predicts whether a memory dependence exists on a store instruction or a load instruction by a subsequent load instruction in program order. Each of the SP-LSD predictor and the RF-LSD predictor predicts and performs register renaming in a pipeline stage earlier than a renaming pipeline stage. The RF-LSD predictor also determines whether any intervening instructions between a producer memory instruction and a consumer memory instruction modify a predicted dependence.Type: GrantFiled: March 21, 2018Date of Patent: November 17, 2020Assignee: Apple Inc.Inventors: Muawya M. Al-Otoom, Conrado Blasco, Deepankar Duggal, Kulin N. Kothari, Richard F. Russo
-
Patent number: 10719327Abstract: In some embodiments, a branch prediction unit includes a plurality of branch prediction circuits and selection logic. At least two of the branch prediction circuits are configured, based on an address of a branch instruction and different sets of history information, to provide a corresponding branch prediction for the branch instruction. At least one storage element of the at least two branch prediction circuits is set associative. The selection logic is configured to select a particular branch prediction output by one of the branch prediction circuits as a current branch prediction output of the branch prediction unit. In some instances, the branch prediction unit may be less likely to replace branch prediction information, as compared to a different branch prediction unit that does not include a set associative storage element. In some embodiments, this arrangement may lead to increased performance of the branch prediction unit.Type: GrantFiled: May 19, 2015Date of Patent: July 21, 2020Assignee: Apple Inc.Inventors: Muawya M. Al-Otoom, Ian D. Kountanis, Conrado Blasco
-
Patent number: 10209989Abstract: A vector reduction instruction is executed by a processor to provide efficient reduction operations on an array of data elements. The processor includes vector registers. Each vector register is divided into a plurality of lanes, and each lane stores the same number of data elements. The processor also includes execution circuitry that receives the vector reduction instruction to reduce the array of data elements stored in a source operand into a result in a destination operand using a reduction operator. Each of the source operand and the destination operand is one of the vector registers. Responsive to the vector reduction instruction, the execution circuitry applies the reduction operator to two of the data elements in each lane, and shifts one or more remaining data elements when there is at least one of the data elements remaining in each lane.Type: GrantFiled: March 7, 2017Date of Patent: February 19, 2019Assignee: Intel CorporationInventors: Paul Caprioli, Abhay S. Kanhere, Jeffrey J. Cook, Muawya M. Al-Otoom
-
Publication number: 20170242699Abstract: A vector reduction instruction is executed by a processor to provide efficient reduction operations on an array of data elements. The processor includes vector registers. Each vector register is divided into a plurality of lanes, and each lane stores the same number of data elements. The processor also includes execution circuitry that receives the vector reduction instruction to reduce the array of data elements stored in a source operand into a result in a destination operand using a reduction operator. Each of the source operand and the destination operand is one of the vector registers. Responsive to the vector reduction instruction, the execution circuitry applies the reduction operator to two of the data elements in each lane, and shifts one or more remaining data elements when there is at least one of the data elements remaining in each lane.Type: ApplicationFiled: March 7, 2017Publication date: August 24, 2017Inventors: PAUL CAPRIOLI, ABHAY S. KANHERE, JEFFREY J. COOK, MUAWYA M. AL-OTOOM
-
Publication number: 20170212825Abstract: A hardware profiling mechanism implemented by performance monitoring hardware enables page level automatic binary translation. The hardware during runtime identifies a code page in memory containing potentially optimizable instructions. The hardware requests allocation of a new page in memory associated with the code page, where the new page contains a collection of counters and each of the counters corresponds to one of the instructions in the code page. When the hardware detects a branch instruction having a branch target within the code page, it increments one of the counters that has the same position in the new page as the branch target in the code page. The execution of the code page is repeated and the counters are incremented when branch targets fall within the code page. The hardware then provides the counter values in the new page to a binary translator for binary translation.Type: ApplicationFiled: January 10, 2017Publication date: July 27, 2017Inventors: Paul Caprioli, Matthew C. Merten, Muawya M. Al-Otoom, Omar M. Shaikh, Abhay S. Kanhere, Suresh Srinivas, Koichi Yamada, Vivek Thakkar, Pawel Osciak
-
Patent number: 9652234Abstract: A dynamic optimization of code for a processor-specific dynamic binary translation of hot code pages (e.g., frequently executed code pages) may be provided by a run-time translation layer. A method may be provided to use an instruction look-aside buffer (iTLB) to map original code pages and translated code pages. The method may comprise fetching an instruction from an original code page, determining whether the fetched instruction is a first instruction of a new code page and whether the original code page is deprecated. If both determinations return yes, the method may further comprise fetching a next instruction from a translated code page. If either determinations returns no, the method may further comprise decoding the instruction and fetching the next instruction from the original code page.Type: GrantFiled: September 30, 2011Date of Patent: May 16, 2017Assignee: Intel CorporationInventors: Paul Caprioli, Martin G. Dixon, Brett L. Toll, Muawya M. Al-Otoom, Omar M. Shaikh
-
Patent number: 9632791Abstract: Techniques are disclosed relating to a cache for patterns of instructions. In some embodiments, an apparatus includes an instruction cache and is configured to detect a pattern of execution of instructions by an instruction processing pipeline. The pattern of execution may involve execution of only instructions in a particular group of instructions. The instructions may include multiple backward control transfers and/or a control transfer instruction that is taken in one iteration of the pattern and not taken in another iteration of the pattern. The apparatus may be configured to store the instructions in the instruction cache and fetch and execute the instructions from the instruction cache. The apparatus may include a branch predictor dedicated to predicting the direction of control transfer instructions for the instruction cache. Various embodiments may reduce power consumption associated with instruction processing.Type: GrantFiled: January 21, 2014Date of Patent: April 25, 2017Assignee: Apple Inc.Inventors: Muawya M. Al-Otoom, Ian D. Kountanis, Ronald P. Hall, Michael L. Karm
-
Publication number: 20170097891Abstract: Systems, apparatuses, and methods for improving TM throughput using a TM region indicator (or color) are described. Through the use of TM region indicators younger TM regions can have their instructions retired while waiting for older TM regions to commit.Type: ApplicationFiled: December 16, 2016Publication date: April 6, 2017Inventors: Omar M. Shaikh, Ravi Rajwar, Paul Caprioli, Muawya M. Al-Otoom
-
Publication number: 20170097826Abstract: Systems, apparatuses, and methods for improving TM throughput using a TM region indicator (or color) are described. Through the use of TM region indicators younger TM regions can have their instructions retired while waiting for older TM regions to commit.Type: ApplicationFiled: December 16, 2016Publication date: April 6, 2017Inventors: Omar M. Shaikh, Ravi Rajwar, Paul Caprioli, Muawya M. Al-Otoom
-
Patent number: 9588766Abstract: A vector reduction instruction is executed by a processor to provide efficient reduction operations on an array of data elements. The processor includes vector registers. Each vector register is divided into a plurality of lanes, and each lane stores the same number of data elements. The processor also includes execution circuitry that receives the vector reduction instruction to reduce the array of data elements stored in a source operand into a result in a destination operand using a reduction operator. Each of the source operand and the destination operand is one of the vector registers. Responsive to the vector reduction instruction, the execution circuitry applies the reduction operator to two of the data elements in each lane, and shifts one or more remaining data elements when there is at least one of the data elements remaining in each lane.Type: GrantFiled: September 28, 2012Date of Patent: March 7, 2017Assignee: Intel CorporationInventors: Paul Caprioli, Abhay S. Kanhere, Jeffrey J. Cook, Muawya M. Al-Otoom
-
Patent number: 9542191Abstract: A hardware profiling mechanism implemented by performance monitoring hardware enables page level automatic binary translation. The hardware during runtime identifies a code page in memory containing potentially optimizable instructions. The hardware requests allocation of a new page in memory associated with the code page, where the new page contains a collection of counters and each of the counters corresponds to one of the instructions in the code page. When the hardware detects a branch instruction having a branch target within the code page, it increments one of the counters that has the same position in the new page as the branch target in the code page. The execution of the code page is repeated and the counters are incremented when branch targets fall within the code page. The hardware then provides the counter values in the new page to a binary translator for binary translation.Type: GrantFiled: March 30, 2012Date of Patent: January 10, 2017Assignee: Intel CorporationInventors: Paul Caprioli, Matthew C. Merten, Muawya M. Al-Otoom, Omar M. Shaikh, Abhay S. Kanhere, Suresh Srinivas, Koichi Yamada, Vivek Thakkar, Pawel Osciak
-
Publication number: 20160350221Abstract: Systems, apparatuses, and methods for improving TM throughput using a TM region indicator (or color) are described. Through the use of TM region indicators younger TM regions can have their instructions retired while waiting for older TM regions to commit.Type: ApplicationFiled: August 9, 2016Publication date: December 1, 2016Inventors: Omar M. Shaikh, Ravi Rajwar, Paul Caprioli, Muawya M. Al-Otoom