Patents by Inventor Mark Luttrell

Mark Luttrell has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20140365734
    Abstract: Systems and methods for reliably using data storage media. Multiple processors are configured to access a persistent memory. For a given data block corresponding to a write access request from a first processor to the persistent memory, a cache controller prevents any read access of a copy of the given data block in an associated cache. The cache controller prevents any read access until it receives an acknowledgment that the given data block is stored in the persistent memory. Until the acknowledgment is received, the cache controller allows write access of the copy of the given data block in the associated cache only for a thread in the first processor that originally sent the write access request. The cache controller invalidates any copy of the given data block in any cache levels below the associated cache.
    Type: Application
    Filed: June 10, 2013
    Publication date: December 11, 2014
    Inventors: William H. Bridge, Jr., Paul Loewenstein, Mark A. Luttrell
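
The gating described in this entry can be illustrated with a small model. The following is a minimal Python sketch, not the patented implementation: the class name CacheController, its method names, and the dictionary-based cache are assumptions, and the invalidation of copies in lower cache levels is elided.

```python
class CacheController:
    """Toy model of 20140365734: a cached block with an unacknowledged write
    to persistent memory is unreadable by every thread, and writable only
    by the thread that issued the original write."""

    def __init__(self):
        self.cache = {}     # addr -> data (the associated cache)
        self.pending = {}   # addr -> id of the thread awaiting an ack

    def write(self, thread_id, addr, data):
        if self.pending.get(addr, thread_id) != thread_id:
            raise PermissionError("only the original writer may update this block")
        self.cache[addr] = data
        self.pending[addr] = thread_id   # hide the block from all readers

    def read(self, thread_id, addr):
        if addr in self.pending:
            raise PermissionError("write not yet acknowledged by persistent memory")
        return self.cache[addr]

    def acknowledge(self, addr):
        self.pending.pop(addr, None)     # ack received: block readable again
```
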
  • Publication number: 20140075163
    Abstract: Techniques are disclosed relating to suspending execution of a processor thread while monitoring for a write to a specified memory location. An execution subsystem may be configured to perform a load instruction that causes the processor to retrieve data from a specified memory location and atomically begin monitoring for a write to the specified location. The load instruction may be a load-monitor instruction. The execution subsystem may be further configured to perform a wait instruction that causes the processor to suspend execution of a processor thread during at least a portion of an interval specified by the wait instruction and to resume execution of the processor thread at the end of the interval. The wait instruction may be a monitor-wait instruction. The processor may be further configured to resume execution of the processor thread in response to detecting a write to a memory location specified by a previous monitor instruction.
    Type: Application
    Filed: September 7, 2012
    Publication date: March 13, 2014
    Inventors: Paul N. Loewenstein, Mark A. Luttrell, Paul J. Jordan
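
As a rough software analogue of the instruction pair in this entry, the sketch below models the monitor with threading.Event; the class MonitoredMemory and all method names are invented for illustration, and real hardware arms the monitor as part of the load itself.

```python
import threading

class MonitoredMemory:
    """Sketch of load-monitor / monitor-wait (20140075163): load_monitor
    reads a location and atomically arms a monitor on it; monitor_wait
    suspends the caller until the interval elapses or another thread
    stores to the monitored location, whichever happens first."""

    def __init__(self):
        self.mem = {}
        self.lock = threading.Lock()
        self.monitors = {}   # addr -> Event armed by load_monitor

    def load_monitor(self, addr):
        with self.lock:                    # load + arm as one atomic step
            self.monitors[addr] = threading.Event()
            return self.mem.get(addr)

    def store(self, addr, value):
        with self.lock:
            self.mem[addr] = value
            event = self.monitors.pop(addr, None)
        if event:
            event.set()                    # wake the suspended thread early

    def monitor_wait(self, addr, interval):
        event = self.monitors.get(addr)
        if event:
            event.wait(timeout=interval)   # resume on write or at interval end
```
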
  • Publication number: 20130297910
    Abstract: Systems and methods for efficient thread arbitration in a threaded processor with dynamic resource allocation. A processor includes a resource shared by multiple threads. The resource includes entries which may be allocated for use by any thread. Control logic detects long latency instructions. Long latency instructions have a latency greater than a given threshold. One example is a load instruction that has a read-after-write (RAW) data dependency on a store instruction that misses a last-level data cache. The long latency instruction or an immediately younger instruction is selected for replay for an associated thread. A pipeline flush and replay for the associated thread begins with the selected instruction. Instructions younger than the long latency instruction are held at a given pipeline stage until the long latency instruction completes. During replay, this hold prevents resources from being allocated to the associated thread while the long latency instruction is being serviced.
    Type: Application
    Filed: May 3, 2012
    Publication date: November 7, 2013
    Inventors: Jared C. Smolens, Robert T. Golla, Mark A. Luttrell, Paul J. Jordan
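
The flush-replay-hold sequence in this entry can be sketched as below. This is a schematic Python model under assumptions of my own: the 20-cycle threshold is an arbitrary illustrative value, and the pipeline is just a list of (thread, pc) records.

```python
from collections import namedtuple

Insn = namedtuple("Insn", "tid pc")   # an in-flight instruction
LATENCY_THRESHOLD = 20                # cycles; illustrative, not from the patent

def flush_and_replay(pipeline, tid, replay_pc, held):
    """A long-latency instruction was detected for thread `tid` (e.g. a load
    with a RAW dependence on a store that missed the last-level cache).
    Flush the thread's instructions and hold the thread; replay restarts
    at the selected instruction once the miss is serviced."""
    pipeline[:] = [i for i in pipeline if i.tid != tid]   # flush this thread only
    held[tid] = replay_pc                                 # hold at this stage

def allocate(free_entries, requesters, held):
    """Grant shared-resource entries only to threads that are not held, so a
    stalled thread cannot tie up the dynamically allocated resource."""
    return [(tid, free_entries.pop())
            for tid in requesters if tid not in held and free_entries]

def long_latency_complete(tid, held, fetch):
    fetch(held.pop(tid))   # resume the thread at the replayed instruction
```
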
  • Publication number: 20130290675
    Abstract: Systems and methods for efficient thread arbitration in a threaded processor with dynamic resource allocation. A processor includes a resource shared by multiple threads. The resource includes an array with multiple entries, each of which may be allocated for use by any thread. Control logic detects a load miss to memory, wherein the miss is associated with a latency greater than a given threshold. The load instruction or an immediately younger instruction is selected for replay for an associated thread. A pipeline flush and replay for the associated thread begins with the selected instruction. Instructions younger than the load instruction are held at a given pipeline stage until the load instruction completes. During replay, this hold prevents resources from being allocated to the associated thread while the load instruction is being serviced.
    Type: Application
    Filed: April 26, 2012
    Publication date: October 31, 2013
    Inventors: Yuan C. Chou, Robert T. Golla, Mark A. Luttrell
  • Patent number: 8412911
    Abstract: A system and method for invalidating obsolete virtual/real address to physical address translations may employ translation lookaside buffers to cache translations. TLB entries may be invalidated in response to changes in the virtual memory space, and thus may need to be demapped. A non-cacheable unit (NCU) residing on a processor may be configured to receive and manage a global TLB demap request from a thread executing on a core residing on the processor. The NCU may send the request to local cores and/or to NCUs of external processors in a multiprocessor system using a hardware instruction to broadcast to all cores and/or processors or to multicast to designated cores and/or processors. The NCU may track completion of the demap operation across the cores and/or processors using one or more counters, and may send an acknowledgement to the initiator of the demap request when the global demap request has been satisfied.
    Type: Grant
    Filed: June 29, 2009
    Date of Patent: April 2, 2013
    Assignee: Oracle America, Inc.
    Inventors: Gregory F. Grohoski, Paul J. Jordan, Mark A. Luttrell, Zeid Hartuon Samoail
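
A compact model of the counting scheme in this patent follows. It is a sketch only: the fan-out interfaces (demap, acknowledge) are assumed names, and a real NCU would track multiple concurrent requests rather than one.

```python
class NCU:
    """Sketch of 8412911: the non-cacheable unit fans a TLB demap out to
    local cores and, for broadcast or multicast, to the NCUs of other
    processors, then counts completions and notifies the initiating
    thread once every target has acknowledged."""

    def __init__(self, local_cores, remote_ncus):
        self.local_cores = local_cores
        self.remote_ncus = remote_ncus
        self.outstanding = 0          # the completion counter
        self.initiator = None

    def global_demap(self, initiator, entry, targets=None):
        # targets=None models a broadcast; an explicit list models multicast.
        receivers = targets or (self.local_cores + self.remote_ncus)
        self.initiator = initiator
        self.outstanding = len(receivers)
        for r in receivers:
            r.demap(entry, reply_to=self)   # each target calls ack() when done

    def ack(self):
        self.outstanding -= 1
        if self.outstanding == 0:
            self.initiator.acknowledge()    # global demap request satisfied
```
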
  • Patent number: 8301865
    Abstract: A system and method for servicing translation lookaside buffer (TLB) misses may manage separate input and output pipelines within a memory management unit. A pending request queue (PRQ) in the input pipeline may include an instruction-related portion storing entries for instruction TLB (ITLB) misses and a data-related portion storing entries for potential or actual data TLB (DTLB) misses. A DTLB PRQ entry may be allocated to each load/store instruction selected from the pick queue. The system may select an ITLB- or DTLB-related entry for servicing dependent on prior PRQ entry selection(s). A corresponding entry may be held in a translation table entry return queue (TTERQ) in the output pipeline until a matching address translation is received from system memory. PRQ and/or TTERQ entries may be deallocated when a corresponding TLB miss is serviced. PRQ and/or TTERQ entries associated with a thread may be deallocated in response to a thread flush.
    Type: Grant
    Filed: June 29, 2009
    Date of Patent: October 30, 2012
    Assignee: Oracle America, Inc.
    Inventors: Gregory F. Grohoski, Paul J. Jordan, Mark A. Luttrell, Zeid Hartuon Samoail, Robert T. Golla
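
The queue structure in this patent can be sketched as two small queues plus a selection rule. The Python below is illustrative only: the alternation policy, the (tid, vaddr) request tuples, and all names are assumptions layered on the abstract.

```python
from collections import deque

class MMUPipelines:
    """Sketch of 8301865: the input pipeline's pending request queue (PRQ)
    has an instruction-related and a data-related portion; a selected
    request parks in the output pipeline's translation table entry return
    queue (TTERQ) until its translation returns from memory."""

    def __init__(self):
        self.prq = {"itlb": deque(), "dtlb": deque()}
        self.tterq = {}        # request -> slot awaiting a translation
        self.last = "dtlb"

    def record_miss(self, kind, request):
        self.prq[kind].append(request)           # kind is "itlb" or "dtlb"

    def select_for_service(self):
        # Selection depends on the prior pick: prefer the other portion.
        first = "itlb" if self.last == "dtlb" else "dtlb"
        second = "dtlb" if first == "itlb" else "itlb"
        for kind in (first, second):
            if self.prq[kind]:
                request = self.prq[kind].popleft()   # deallocate PRQ entry
                self.last = kind
                self.tterq[request] = None           # allocate TTERQ entry
                return request
        return None

    def translation_returned(self, request, translation):
        self.tterq.pop(request, None)                # TLB miss serviced
        return translation

    def thread_flush(self, tid):
        # Requests are assumed to be (tid, vaddr) tuples in this sketch.
        self.tterq = {r: v for r, v in self.tterq.items() if r[0] != tid}
```
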
  • Patent number: 8230177
    Abstract: Systems and methods for efficient handling of store misses. A processor comprises a store queue that stores data for committed store instructions. Coupled to the store queue is a cache responsible for ensuring consistent ordering of store operations for all consumers, which may be accomplished by maintaining a corresponding cache line in an exclusive state before executing a store operation. In response to a first committed store instruction missing in the cache, the store queue is configured to convey to the cache a second entry of the plurality of queue entries as a speculative prefetch instruction. This second entry corresponds to a committed store instruction that follows in program order the first committed store instruction of a given thread. If the prefetch instruction misses in the cache, the latency for acquiring a corresponding cache line overlaps with the latency of the first store instruction.
    Type: Grant
    Filed: May 28, 2009
    Date of Patent: July 24, 2012
    Assignee: Oracle America, Inc.
    Inventor: Mark A. Luttrell
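
The overlap trick in this patent is easy to show in miniature. In the sketch below the cache is a stand-in that tracks exclusive lines and elides acquisition latency; holds_exclusive and acquire_exclusive are invented names.

```python
class ToyCache:
    """Stand-in cache: remembers which lines are held exclusive. Real
    acquisition latency is elided; both requests complete immediately."""
    def __init__(self):
        self.exclusive = set()
    def holds_exclusive(self, line):
        return line in self.exclusive
    def acquire_exclusive(self, line):    # demand request or prefetch
        self.exclusive.add(line)

def drain_oldest_store(store_queue, cache):
    """Sketch of 8230177: if the oldest committed store misses the cache,
    also send the next committed store of the same thread (in program
    order) as a speculative prefetch, so the two line acquisitions
    overlap instead of serializing."""
    oldest = store_queue[0]
    if not cache.holds_exclusive(oldest["line"]):
        cache.acquire_exclusive(oldest["line"])        # the demand miss
        if len(store_queue) > 1:
            nxt = store_queue[1]                       # next in program order
            if not cache.holds_exclusive(nxt["line"]):
                cache.acquire_exclusive(nxt["line"])   # overlapped prefetch
```
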
  • Patent number: 8166251
    Abstract: In an embodiment, a processor includes a data cache and a prefetch unit coupled to the data cache. The prefetch unit is configured to identify a prefetch stream in cache misses from the data cache, and the prefetch unit is configured to issue prefetches predicted by the prefetch stream to prefetch data into the data cache. More particularly, the prefetch unit implements one or more stream engines that generate prefetches for respective prefetch streams. Each stream engine is configured to maintain limit data that indicates a number of prefetches that are permitted to be outstanding beyond a most recent demand access. The stream engine is configured to increase the limit responsive to the number of demand accesses that consume prefetched data at least equaling the limit.
    Type: Grant
    Filed: April 20, 2009
    Date of Patent: April 24, 2012
    Assignee: Oracle America, Inc.
    Inventor: Mark A. Luttrell
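
The adaptive limit reads naturally as a small state machine; a sketch follows. The starting limit, the ceiling, and the exact raise rule are illustrative assumptions; the abstract only requires that the limit grows when the count of consumed prefetches at least equals it.

```python
class StreamEngine:
    """Sketch of 8166251: at most `limit` prefetches may be outstanding
    beyond the most recent demand access; the limit is raised once the
    count of demand accesses that consumed prefetched data reaches it."""

    def __init__(self, limit=2, max_limit=8):
        self.limit = limit
        self.max_limit = max_limit
        self.outstanding = 0   # prefetches ahead of the last demand access
        self.consumed = 0      # demand accesses that hit prefetched data

    def demand_access(self, hit_prefetched):
        if self.outstanding:
            self.outstanding -= 1          # the stream advanced one step
        if hit_prefetched:
            self.consumed += 1
            if self.consumed >= self.limit and self.limit < self.max_limit:
                self.limit += 1            # stream keeps up: run further ahead
                self.consumed = 0

    def issue_prefetches(self, issue):
        while self.outstanding < self.limit:
            issue()                        # send the next predicted address
            self.outstanding += 1
```
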
  • Patent number: 8140769
    Abstract: In an embodiment, a processor includes a data cache and a prefetch unit coupled to the data cache. The prefetch unit is configured to detect one or more prefetch streams corresponding to load operations that miss the data cache, and includes a memory configured to store data corresponding to potential prefetch streams. The prefetch unit is configured to confirm a prefetch stream in response to N or more demand accesses to addresses in the prefetch stream, where N is a positive integer greater than one and is dependent on a prefetch pattern being detected. The prefetch unit comprises a plurality of stream engines, each stream engine configured to generate prefetches for a different prefetch stream assigned to that stream engine. The prefetch unit is configured to assign the confirmed prefetch stream to one of the plurality of stream engines.
    Type: Grant
    Filed: April 20, 2009
    Date of Patent: March 20, 2012
    Assignee: Oracle America, Inc.
    Inventor: Mark A. Luttrell
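
A sketch of the confirmation step follows. The N values used here (2 for unit stride, 3 otherwise) are placeholders for "N depends on the pattern", and engine.start is an assumed hand-off interface.

```python
class PrefetchUnit:
    """Sketch of 8140769: misses are matched against potential streams held
    in a small memory; a stream is confirmed after N matching demand
    accesses (N depends on the detected pattern) and is then assigned to
    a free stream engine."""

    def __init__(self, free_engines):
        self.free_engines = free_engines   # idle stream-engine objects
        self.potential = []   # each: {"next": addr, "stride": s, "hits": n}
        self.last_miss = None

    def miss(self, addr):
        for p in self.potential:
            if addr == p["next"]:                     # access fits this stream
                p["hits"] += 1
                p["next"] += p["stride"]
                n = 2 if abs(p["stride"]) == 1 else 3 # N varies with pattern
                if p["hits"] >= n and self.free_engines:
                    engine = self.free_engines.pop()
                    engine.start(addr, p["stride"])   # confirmed: hand off
                    self.potential.remove(p)
                return
        if self.last_miss is not None and addr != self.last_miss:
            stride = addr - self.last_miss            # record a new candidate
            self.potential.append(
                {"next": addr + stride, "stride": stride, "hits": 1})
        self.last_miss = addr
```
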
  • Patent number: 8099586
    Abstract: A system and method for reducing branch misprediction penalty. In response to detecting a mispredicted branch instruction, circuitry within a microprocessor identifies a predetermined condition prior to retirement of the branch instruction. Upon identifying this condition, the entire corresponding pipeline is flushed prior to retirement of the branch instruction, and instruction fetch is started at a corresponding address of an oldest instruction in the pipeline immediately prior to the flushing of the pipeline. The correct outcome is stored prior to the pipeline flush. In order to distinguish the mispredicted branch from other instructions, identification information may be stored alongside the correct outcome. One example of the predetermined condition being satisfied is in response to a timer reaching a predetermined threshold value, wherein the timer begins incrementing in response to the mispredicted branch detection and resets at retirement of the mispredicted branch.
    Type: Grant
    Filed: December 30, 2008
    Date of Patent: January 17, 2012
    Assignee: Oracle America, Inc.
    Inventors: Yuan C. Chou, Robert T. Golla, Mark A. Luttrell, Paul J. Jordan, Manish Shah
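
The timer-driven early flush can be sketched in a few lines. The 16-cycle threshold is an arbitrary stand-in for the patent's "predetermined threshold value", and the pipeline and refetch interfaces are assumptions.

```python
PENALTY_TIMER_LIMIT = 16   # cycles; illustrative threshold only

class EarlyFlush:
    """Sketch of 8099586: on detecting a misprediction, save the correct
    outcome tagged with the branch's identity and start a timer; if the
    branch has not retired when the timer hits the threshold, flush the
    entire pipeline and refetch from its oldest instruction."""

    def __init__(self):
        self.timer = None
        self.saved = None                 # (branch_tag, correct_outcome)

    def on_mispredict(self, branch_tag, correct_outcome):
        self.saved = (branch_tag, correct_outcome)   # stored before any flush
        self.timer = 0

    def cycle(self, pipeline, refetch):
        if self.timer is None:
            return
        self.timer += 1
        if self.timer >= PENALTY_TIMER_LIMIT:
            restart_pc = pipeline[0].pc   # oldest instruction in the pipeline
            pipeline.clear()              # flush prior to branch retirement
            refetch(restart_pc)           # saved outcome steers the branch
            self.timer = None

    def on_retire(self, branch_tag):
        if self.saved and self.saved[0] == branch_tag:
            self.saved = None             # branch retired normally
            self.timer = None
```
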
  • Patent number: 8099566
    Abstract: Systems and methods for efficient load-store ordering. A processor comprises a store buffer that includes an array. The store buffer dynamically allocates any entry of the array for an out-of-order (o-o-o) issued store instruction independent of a corresponding thread. Circuitry within the store buffer determines a first set of array entries that hold store instructions older in program order than a particular load instruction, wherein the store instructions have the same thread identifier and address as the load instruction. From the first set, the logic locates the single final match entry corresponding to the youngest store instruction of that set, which may be used for read-after-write (RAW) hazard detection.
    Type: Grant
    Filed: May 15, 2009
    Date of Patent: January 17, 2012
    Assignee: Oracle America, Inc.
    Inventor: Mark A. Luttrell
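
The matching rule in this patent amounts to a filtered maximum, shown below; the dictionary entry format and the convention that a larger `age` means younger are assumptions of the sketch.

```python
def find_raw_store(store_buffer, load):
    """Sketch of 8099566: take the buffer entries whose stores share the
    load's thread id and address and are older in program order, then
    return the single youngest of them; that entry is the one used for
    read-after-write (RAW) hazard detection."""
    older_matches = [e for e in store_buffer
                     if e["tid"] == load["tid"]
                     and e["addr"] == load["addr"]
                     and e["age"] < load["age"]]      # older than the load
    return max(older_matches, key=lambda e: e["age"]) if older_matches else None

buf = [{"tid": 0, "addr": 0x40, "age": 3},
       {"tid": 0, "addr": 0x40, "age": 7},
       {"tid": 1, "addr": 0x40, "age": 8}]
hit = find_raw_store(buf, {"tid": 0, "addr": 0x40, "age": 9})
assert hit["age"] == 7   # youngest same-thread, same-address older store
```
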
  • Patent number: 8006075
    Abstract: Systems and methods for storage of writes to memory corresponding to multiple threads. A processor comprises a store queue, wherein the queue dynamically allocates an entry for a committed store instruction, and entries of the array may be allocated out of program order. For a given thread, the store queue conveys store data to a memory in program order. The queue is further configured to identify the entry of the plurality of entries that corresponds to the oldest committed store instruction for a given thread and to determine the next entry of the array that corresponds to the next committed store instruction in program order, wherein the entry for the oldest store includes data identifying that next entry. The queue marks an entry as unfilled upon successfully conveying its store data to the memory.
    Type: Grant
    Filed: May 21, 2009
    Date of Patent: August 23, 2011
    Assignee: Oracle America, Inc.
    Inventor: Mark A. Luttrell
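
Read as a data structure, this patent describes per-thread linked lists threaded through a shared array; a sketch follows. The dict-based entries and method names are invented, and allocation assumes a free entry exists (real hardware gates allocation elsewhere).

```python
class StoreQueue:
    """Sketch of 8006075: entries are allocated out of program order, but
    each entry records which entry holds the same thread's next committed
    store, so draining to memory walks a per-thread list in program order;
    a drained entry is marked unfilled for reallocation."""

    def __init__(self, size):
        self.entries = [None] * size   # each: {"tid", "data", "next"}
        self.head = {}                 # tid -> index of oldest committed store
        self.tail = {}                 # tid -> index of youngest committed store

    def allocate(self, tid, data):
        idx = self.entries.index(None)            # any free entry will do
        self.entries[idx] = {"tid": tid, "data": data, "next": None}
        if tid in self.tail:
            self.entries[self.tail[tid]]["next"] = idx   # link program order
        else:
            self.head[tid] = idx
        self.tail[tid] = idx

    def drain_one(self, tid, memory):
        idx = self.head.get(tid)
        if idx is None:
            return
        entry = self.entries[idx]
        memory.append(entry["data"])   # convey to memory in program order
        self.entries[idx] = None       # mark unfilled
        if entry["next"] is None:
            del self.head[tid], self.tail[tid]
        else:
            self.head[tid] = entry["next"]
```
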
  • Patent number: 7996632
    Abstract: A multithreaded processor with a banked cache is provided. The instruction set includes at least one atomic operation which is executed in the L2 cache if the atomic memory address source data is aligned. The core executing the instruction determines whether the atomic memory address source data is aligned. If it is aligned, the atomic memory address is sent to the bank that contains the atomic memory address source data, and the operation is executed in the bank. In one embodiment, if the instruction is mis-aligned, the operation is executed in the core.
    Type: Grant
    Filed: December 22, 2006
    Date of Patent: August 9, 2011
    Assignee: Oracle America, Inc.
    Inventors: Greg F. Grohoski, Mark A. Luttrell, Manish Shah
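
The routing decision here is a one-liner in spirit; a sketch is below. The bank-selection interleave and the .atomic() interfaces are assumptions not found in the abstract.

```python
def issue_atomic(core, l2_banks, addr, op, width=8, line=64):
    """Sketch of 7996632: the issuing core checks alignment; an aligned
    atomic is routed to the L2 bank holding the target line and executed
    there, while a misaligned atomic falls back to execution in the core."""
    if addr % width == 0:                            # aligned: run in the bank
        bank = l2_banks[(addr // line) % len(l2_banks)]
        return bank.atomic(addr, op)                 # near-memory execution
    return core.atomic(addr, op)                     # misaligned: run in core
```
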
  • Publication number: 20100332787
    Abstract: A system and method for servicing translation lookaside buffer (TLB) misses may manage separate input and output pipelines within a memory management unit. A pending request queue (PRQ) in the input pipeline may include an instruction-related portion storing entries for instruction TLB (ITLB) misses and a data-related portion storing entries for potential or actual data TLB (DTLB) misses. A DTLB PRQ entry may be allocated to each load/store instruction selected from the pick queue. The system may select an ITLB- or DTLB-related entry for servicing dependent on prior PRQ entry selection(s). A corresponding entry may be held in a translation table entry return queue (TTERQ) in the output pipeline until a matching address translation is received from system memory. PRQ and/or TTERQ entries may be deallocated when a corresponding TLB miss is serviced. PRQ and/or TTERQ entries associated with a thread may be deallocated in response to a thread flush.
    Type: Application
    Filed: June 29, 2009
    Publication date: December 30, 2010
    Inventors: Gregory F. Grohoski, Paul J. Jordan, Mark A. Luttrell, Zeid Hartuon Samoail, Robert T. Golla
  • Publication number: 20100332786
    Abstract: A system and method for invalidating obsolete virtual/real address to physical address translations may employ translation lookaside buffers to cache translations. TLB entries may be invalidated in response to changes in the virtual memory space, and thus may need to be demapped. A non-cacheable unit (NCU) residing on a processor may be configured to receive and manage a global TLB demap request from a thread executing on a core residing on the processor. The NCU may send the request to local cores and/or to NCUs of external processors in a multiprocessor system using a hardware instruction to broadcast to all cores and/or processors or to multicast to designated cores and/or processors. The NCU may track completion of the demap operation across the cores and/or processors using one or more counters, and may send an acknowledgement to the initiator of the demap request when the global demap request has been satisfied.
    Type: Application
    Filed: June 29, 2009
    Publication date: December 30, 2010
    Inventors: Gregory F. Grohoski, Paul J. Jordan, Mark A. Luttrell, Zeid Hartuon Samoail
  • Publication number: 20100332804
    Abstract: Systems and methods for efficient picking of instructions for out-of-order issue and execution in a processor. In one embodiment, a processor comprises a unified pick queue that is dynamically allocated. Each entry is configured to store age and dependency information relative to other decoded instructions. Also, each entry stores a picked field, which when asserted indicates the decoded instruction has already been picked for out-of-order issue and execution. When asserted, a trigger field indicates a result of a corresponding decoded instruction will be available a predetermined number of clock cycles afterward. A younger instruction dependent on a result of an older instruction is ready to be picked before the result of the older instruction is available. In this case, the older instruction has asserted picked and trigger fields.
    Type: Application
    Filed: June 29, 2009
    Publication date: December 30, 2010
    Inventors: Robert T. Golla, Matthew B. Smittle, Mark A. Luttrell, Xiang Shan Li
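
The picked/trigger interplay can be sketched as a readiness scan. The dict entry layout, the fixed result latency, and the index-based `deps` field are all assumptions of this sketch.

```python
RESULT_LATENCY = 3   # cycles from pick to result; illustrative fixed value

def pick(pick_queue):
    """Sketch of 20100332804: an entry may be picked once every producer it
    depends on has either completed or has its picked and trigger fields
    asserted, meaning the producer's result arrives a known number of
    cycles after its own pick; `deps` holds producers' queue indexes."""
    picked_now = []
    for entry in pick_queue:
        if entry["picked"]:
            continue
        producers = [pick_queue[i] for i in entry["deps"]]
        if all(p["done"] or (p["picked"] and p["trigger"]) for p in producers):
            entry["picked"] = True    # issued out of order
            entry["trigger"] = True   # result due RESULT_LATENCY cycles out
            picked_now.append(entry)
    return picked_now
```
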
  • Publication number: 20100306477
    Abstract: Systems and methods for efficient handling of store misses. A processor comprises a store queue that stores data for committed store instructions. Coupled to the store queue is a cache responsible for ensuring consistent ordering of store operations for all consumers, which may be accomplished by maintaining a corresponding cache line in an exclusive state before executing a store operation. In response to a first committed store instruction missing in the cache, the store queue is configured to convey to the cache a second entry of the plurality of queue entries as a speculative prefetch instruction. This second entry corresponds to a committed store instruction that follows in program order the first committed store instruction of a given thread. If the prefetch instruction misses in the cache, the latency for acquiring a corresponding cache line overlaps with the latency of the first store instruction.
    Type: Application
    Filed: May 28, 2009
    Publication date: December 2, 2010
    Inventor: Mark A. Luttrell
  • Publication number: 20100299508
    Abstract: Systems and methods for storage of writes to memory corresponding to multiple threads. A processor comprises a store queue, wherein the queue dynamically allocates an entry for a committed store instruction, and entries of the array may be allocated out of program order. For a given thread, the store queue conveys store data to a memory in program order. The queue is further configured to identify the entry of the plurality of entries that corresponds to the oldest committed store instruction for a given thread and to determine the next entry of the array that corresponds to the next committed store instruction in program order, wherein the entry for the oldest store includes data identifying that next entry. The queue marks an entry as unfilled upon successfully conveying its store data to the memory.
    Type: Application
    Filed: May 21, 2009
    Publication date: November 25, 2010
    Inventor: Mark A. Luttrell
  • Publication number: 20100293347
    Abstract: Systems and methods for efficient load-store ordering. A processor comprises a store buffer that includes an array. The store buffer dynamically allocates any entry of the array for an out-of-order (o-o-o) issued store instruction independent of a corresponding thread. Circuitry within the store buffer determines a first set of array entries that hold store instructions older in program order than a particular load instruction, wherein the store instructions have the same thread identifier and address as the load instruction. From the first set, the logic locates the single final match entry corresponding to the youngest store instruction of that set, which may be used for read-after-write (RAW) hazard detection.
    Type: Application
    Filed: May 15, 2009
    Publication date: November 18, 2010
    Inventor: Mark A. Luttrell
  • Publication number: 20100268893
    Abstract: In an embodiment, a processor comprises a data cache and a prefetch unit coupled to the data cache. The prefetch unit is configured to identify a prefetch stream in cache misses from the data cache, and the prefetch unit is configured to issue prefetches predicted by the prefetch stream to prefetch data into the data cache. More particularly, the prefetch unit implements one or more stream engines that generate prefetches for respective prefetch streams. Each stream engine is configured to maintain limit data that indicates a number of prefetches that are permitted to be outstanding beyond a most recent demand access. The stream engine is configured to increase the limit responsive to the number of demand accesses that consume prefetched data at least equaling the limit.
    Type: Application
    Filed: April 20, 2009
    Publication date: October 21, 2010
    Inventor: Mark A. Luttrell