Scalable rundown protection for object lifetime management
A system and method for object rundown protection that scales with the number of processors in a shared-memory computer system is disclosed. In an embodiment of the present invention, prior to object rundown, a cache-aware reference count data structure is used to prevent cache-pinging that would otherwise result from data sharing across processors in a multiprocessor computer system. In this data structure, a counter of positive references and negative dereferences, aligned on a particular cache line, is maintained for each processor. When an object is to be destroyed, a rundown wait process is begun, during which new references on the object are prohibited, and the total number of outstanding references is added to an on-stack global counter. Destruction is delayed until the global reference count is reduced to zero. In an embodiment of the invention suited to implementation on non-uniform memory access multiprocessor machines, each processor's reference count is additionally allocated in a region of main memory that is physically close to that processor.
Latest Microsoft Patents:
The patent application is a divisional of U.S. patent application Ser. No. 10/461,755, entitled “Scalable Rundown Protection for Object Lifetime Management,” filed on Jun. 13, 2003 and assigned to the same assignee as this application. The aforementioned patent application is expressly incorporated herein by reference.
TECHNICAL FIELDThe present invention relates generally to computer systems, and more particularly to the management and control of access to and deletion of objects in shared-memory multiprocessor computer machines.
BACKGROUND OF THE INVENTIONThe desire to improve computer system performance has driven innovation in both computer architecture and operating systems design. In many contexts, an important goal is to maximize throughput (the number of tasks processed in a given unit of time). One persistent obstacle has been the so-called Von Neumann bottleneck, the gap between faster processors and slower memory. Two general architectural approaches to minimizing this gap and increasing throughput are the use of a multilevel memory hierarchy and the use of multiple processors. In a memory hierarchy, faster, smaller memories are placed closer to the processor, and efforts are made to keep recently-accessed items in the fastest memories. The fastest level of memory, apart from the processor's registers, is the cache, which can itself have one or more levels, and which is organized into cache lines of a fixed size. A cache hit occurs when the processor finds a requested data item in the cache. If the data is not found in the cache, a cache miss occurs, and a line-size block of data that includes the requested item is retrieved from the main memory and placed in a particular cache line.
Most modern multiprocessor machines belong to one of two groups, distinguished by how their memory is organized. The first group includes machines with a relatively small number of processors sharing a single centralized main memory connected to the processors by a bus. These machines are called UMAs (uniform memory access multiprocessors) because the time to access any main memory location is uniform for each processor. The second group comprises machines in which memory is physically distributed among the processors, allowing larger numbers of processors to be supported. Those machines in this group that feature a logically-shared main memory address space are known as NUMAs (non-uniform memory access multiprocessors). In NUMA machines, a processor's access time for a particular data word in main memory depends on the location of that word.
Modem operating systems maximize processor utilization and minimize memory latency through their support for multiprogramming, the ability of several concurrently-executing programs, referred to as processes or threads, to share computer resources. When a first process is being executed by a processor, and the operating system then permits a second process to execute, the operating system performs a context switch. A context switch involves saving the current state of the first process so that its execution can continue at a later time, and restoring the previous state of the program that is about to return to execution.
In systems supporting multiprogramming, there is potentially a high degree of interaction among concurrent processes. The effective coordination of multiple processes is a central problem in operating systems design. It is typical for several processes to require access to some object residing in shared memory, which may itself include methods or data made available by another process. At some point, it may be necessary for such objects to be “run down” (destroyed). For example, the shared object might be a loaded antivirus driver, which typically has a long lifetime (that is, it will be kept loaded while the operating system is running). Periodically, however, it will be necessary for the driver to be unloaded so that an updated version of the driver can be loaded in its place. In such situations, there is a danger that a process might attempt to access an object that has already been deleted or made unavailable, leading to erroneous and unpredictable program and system behavior. It is important, therefore, to guard against the premature destruction or removal of a shared object while references to the object are outstanding.
Solutions to this problem have generally made use of synchronization mechanisms provided by the operating system or the hardware. One conventional approach is to protect the object by placing it under a mutually exclusive lock that can only be acquired by one process at a time. If a second process requires access to the object, it must wait until the lock is released by the first process. This is undesirable for a number of reasons. The time spent on acquisitions and releases of locks may exceed the time needed for access to the object itself, causing a performance bottleneck. Moreover, the use of such locks may lead to deadlocked processes vying for access to the same object. Efforts to minimize the occurrence of deadlocks typically require the diversion of substantial computing resources. Locks also involve the consumption of considerable memory space.
A more finely-grained solution, called rundown protection, has been implemented in Microsoft Corporation's “WINDOWS XP” operating system. Under this approach, a global reference count associated with a particular object is used to ensure that destruction or removal of the object will be delayed until all of the accesses that have already been granted to the object have completed and been released. Access serialization is achieved using atomic interlocked hardware instructions rather than mutually exclusive software locks. Rundown protection is optimized for rapid and lightweight accesses and releases of object references. This is sensible, because typically protection is desired in situations involving a long-lived object which is referenced and dereferenced many times throughout its lifetime. In the loaded antivirus driver example, the references are I10 requests, with the dereferences occurring when the YO is completed, and rundown protection may be invoked by a kernel component, such as a file system filter manager, to guard against premature unloading of the driver.
Despite the advantages of rundown protection over previous solutions, its routines for acquiring and releasing references have been discovered to cause significantly degraded performance on multiprocessor machines when used by 110-coordinating file system filter managers to manage the unloading of file system filter drivers, the subject of a copending commonly-assigned U.S. patent application filed Jun. 13, 2003, bearing U.S. Ser. No. 10/461,078 and referenced by the attorney docket number 221227. Interlocked increments and decrements were found to cause cache pinging: following a reference count increment or decrement, corresponding cache lines on all the processors would update their caches with the new value, flushing and refreshing the cache in accordance with the machine's coherence protocols. The degradation worsens as more processors are included in the system. It would be desirable, therefore, to provide an improved form of rundown protection that would scale with the number of processors in a multiprocessor computer system.
SUMMARY OF THE INVENTIONThe invention provides a system and method for rundown protection on shared-memory multiprocessors that scales with the number of processors in the system while preserving the advantages of rundown protection over other approaches to guarding against premature destruction of shared objects. Separate reference counts are associated with each processor. The reference counts are stored in data structures designed to ensure that each per-processor reference count will be cached in a different cache line, eliminating the cache pinging problem associated with references and dereferences. In an embodiment of the invention intended to be used on NUMA machines, the per-processor reference counts are additionally or alternatively stored in regions of main memory that are physically proximate to their corresponding processors. A global reference count is used instead of per-processor counts when an object is finally to be run down.
BRIEF DESCRIPTION OF THE DRAWINGSWhile the appended claims set forth the features of the invention with particularity, the invention and its advantages are best understood by referring to the following detailed description taken in conjunction with the accompanying drawings, of which:
The present invention is directed to a system and method for protecting an object residing in shared main memory from being destroyed or rendered inaccessible while outstanding references on the object exist.
Prior to proceeding with a description of the invention, a description of representative computer systems in which the various embodiments of the invention may be practiced is provided. The invention will be described in the general context of computer-executable instructions, such as programs, being executed within a computer device. Generally, programs include routines, methods, procedures, functions, objects, data structures and the like that perform particular tasks or implement particular data types. The term “computer”, “computer device” or “machine” as used herein includes any device that electronically executes one or more programs.
Aspects of the invention described herein may be implemented on various kinds of computers, including uniprocessor computers, but the invention is especially intended for multiprocessor computers containing more than one processing unit sharing a main memory. In the description that follows, the invention will be described with reference to acts and symbolic representations of operations performed or executed by a computer. It is understood that these acts and operations include the manipulation by one or more processing units of electrical signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the memory system of the computer, which reconfigures or otherwise alters the operation of the computer in a manner well-understood by those skilled in the art. The data structures in which data is maintained are physical locations of the memory that have particular properties defined by the format of the data.
Each processing unit in a computer is capable of executing a set of machine language instructions in accordance with embodiments of the invention. A computer program executed by a processing unit incorporates one or more sequences of the set of machine language instructions recognized by that processing unit. A program, in the form of machine language object code, typically lies quiescent in storage until needed. When needed, the object code is incorporated into an active process, for example by being loaded into memory or using a virtual memory technique. A processing unit can then execute each machine language instruction in an active process. The machine language instructions can be executed in sequence unless program execution is redirected by an execution branch instruction, terminated by a process termination instruction, or the last instruction in the sequence is executed.
Referring to
Referring now to
The present invention achieves for multiprocessor systems, without sacrificing scalability, the objective of rundown protection: protecting objects from being prematurely removed or destroyed. Access to a protected object requires the taking of a rundown reference on that object. When the protected object must be deleted, a rundown wait routine is invoked. The rundown wait ensures that no additional references may be taken on the object, and it guarantees that the object will not be deleted while there are still outstanding references on it which have not yet completed.
To eliminate the problem of contention on corresponding cache lines, a cache-aware rundown reference data structure is used. Instead of a global reference count, a separate reference count is maintained for each processor in the system, aligned in such a way that each reference count will be cached in a distinct cache line, without overlap. Turning to
Each per-processor reference count 303a, 303b, 303c, 303d is itself contained in a data structure that includes a field to hold the actual count, along with padding to ensure that the reference count structure is the size of a cache line block. In an embodiment of the invention, for reasons that will become clear in the discussion below, each per-processor count is stored in an integer data field that is large enough to hold a main memory address pointer. It will therefore be the size of the basic addressing width of the processors in the machine. On a 32-bit ×86 machine, for example, a 32-bit data type would be used; similarly, on a 64-bit machine, a 64-bit data type would be used. The least significant bit of the count data field is depicted in the data structure 301 as a shaded square 305a, 305b, 305c, 305d at the right end of each per-processor reference count 303a, 303b, 303c, 303d. This bit, which will be referred to as the “w-bit”, is used to indicate whether a rundown wait is in progress, with 0 indicating that rundown wait has not yet begun and 1 indicating that a rundown wait is active. In order to preserve the value of the w-bit, the per-processor count is incremented in twos.
It is important to recognize that a reference on an object may be acquired by one processor in the system but may be released by a different processor (as a result of context switches and processor scheduling by the operating system, for example). A release of a reference executed by a processor is marked by a decrementing of that processor's reference count. For this reason, each per-processor count may be negative as well as positive or zero. Since the least significant bit of the count is reserved for use as the w-bit, and since a bit must be used as a sign bit because the count can go below zero, the effective number of outstanding references possible at any given time on an N-bit machine is 2N-2. This is not a significant limitation in practice, since even on a 32-bit machine this will be a very large number (230 is greater than one billion). It is reasonable to assume that having one billion operations outstanding on a given object at any instant is extremely unlikely. It can therefore be assumed that at any given moment of time, the sum of all the per-processor counts represents the number of outstanding references on the object, and that if the sum of the counts equals zero then all references have completed.
On an UMA machine, the set of per-processor reference counts 303a, 303b, 303c, 303d will typically be stored as the elements of an array. The entire rundown reference data structure 301 will include the array or a pointer to the array, and padding to ensure that the beginning of the array is aligned on a cache-line boundary, so that each per-processor reference count data structure is cache-line-aligned. The rundown reference data structure may include additional fields, such as a count of the total number of processors in the system.
An array of cache-aligned per-processor reference count structures is suitable for an UMA machine, but is not suitable for a NUMA machine such as the eight-node, one-processor-per-node machine 201 illustrated in
The flowchart of
In step 410 an attempt is made to atomically exchange the locally-stored new reference count value for the old value stored in the per-processor count by invoking an interlocked hardware synchronization primitive, so that the exchange will not succeed if the per-processor count value has changed since the local copy was made. In step 412, the result of using the primitive is examined to determine whether the atomic exchange succeeded. If it was successful, the algorithm returns in step 414 with an indication that the requested rundown references are obtained, and the algorithm ends in step 418. If the atomic exchange failed, the local copy of the reference count is updated to the current value in step 416, and the algorithm branches back to the test of the w-bit in step 404, since it is possible that a rundown wait has begun in the meantime.
The rundown dereferencing primitive can best be understood following an explanation of the rundown wait routine. Referring therefore to
In step 502, the new global rundown wait block data structure is allocated. Turning to
Step 506 marks the beginning of a loop over all the processors. The loop will mark each per-processor count, by setting the w-bit, to indicate to rundown reference acquire and release primitives that rundown is active, and it will extract each per-processor count so that they may be summed. The loop will also switch the non-w-bit portion of the per-processor count with a pointer to the on-stack rundown wait block. This is why the data type of the reference counts must be wide enough to hold a basic address. A local copy of the reference count of the processor currently being iterated over is saved in step 508. The address of the on-stack rundown wait block is locally stored in step 510, with the least significant bit set to 1. Then, in step 512, an attempt is made to atomically exchange the resulting value for the current processor's reference count value, using an interlocked synchronization primitive as in the algorithm for obtaining rundown references. In step 514, it is determined whether the atomic exchange was successful. If it failed, the local copy of the per-processor reference count is updated to the current value in step 516, and the algorithm branches back to step 510 to attempt an atomic exchange again. If the atomic exchange succeeded, the local copy of the per-processor reference count is added to a running total of all reference counts in step 518. Note that at this point no further references requested by this processor will succeed, because the w-bit has been set to 1. In addition, the per-processor count contains a pointer to the on-stack rundown wait block, apart from the set least significant bit. The loop counter is incremented in step 520, and the counter is tested in step 522. When the loop has completed iterating over all the processors, the algorithm proceeds to step 524.
The total of all atomically extracted per-processor reference counts must be no less than zero, because there cannot have been a greater number of dereferences than references. The most significant bit of the reference count tally (the sign bit) must therefore be zero. Step 524 determines whether the per-processor reference count total is zero. If so, there are no outstanding references, so the algorithm terminates immediately in step 536 without having to block. Otherwise, in step 526 the reference count total is shifted right once, halving the total, in order to obtain the actual number of references, since references have been incremented and decremented in twos. Step 528 initializes the synchronization object which, in an embodiment of the invention, is a field of the on-stack rundown wait block. In step 530 the sum of outstanding references is atomically added to the on-stack global count. Step 532 then determines whether the global count is now equal to zero. While the global count can never be incremented, it is possible that rundown releases have already been pointed to the global count and have decremented it below zero. If the result of adding the tallied reference counts to the global count is now zero, because dereferences for all the outstanding accesses have already come in, the algorithm simply terminates in step 536. Otherwise, it blocks on the synchronization event in step 534 until it is signaled that the global reference count has dropped to zero, at which point it ends the rundown wait in step 536.
The flowchart of
If it is determined at step 604 that the w-bit is 0, a rundown wait has not yet been initiated. A procedure similar to the one used in acquiring rundown references is then used to release the references. In step 616 an updated reference count value for the processor is determined by doubling the desired number of new references and adding the result to the locally-copied old value. The result is stored in a separate local copy. In step 618 an attempt is made to atomically exchange the locally-stored decremented reference count value for the old value stored in the per-processor count. In step 620, the result of using the primitive is examined to determine whether the atomic exchange succeeded. If it was successful, the algorithm terminates at step 614, with the references successfully released. If the atomic exchange failed, the local copy of the reference count is updated to the current value in step 622, and the algorithm branches back to the test of the w-bit in step 604, since it is possible that a rundown wait has begun in the meantime.
Referring now to
The terms “a”, “an”, “the” and similar referents used in the context of describing the invention, especially in the context of the following claims, are to be construed to cover both the singular and plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising”, “having”, “including”, and “containing” are to be construed as open-ended terms, unless otherwise noted. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language provided herein is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating that any non-claimed element is essential to the practice of the invention.
Preferred embodiments of this invention are described herein, including the best mode known to the inventors for carrying out the invention. Variations of those preferred embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the invention to be practiced otherwise than as specifically described herein. Accordingly, this invention includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the invention unless otherwise indicated herein or otherwise clearly contradicted by context.
Claims
1. A computer system comprising:
- a plurality of processors;
- a memory hierarchy including a main memory shared by the processors and at least one faster-access memory level comprising a plurality of units, each associated with one of the processors; and
- routines for preventing a shared object from being destroyed or otherwise rendered inaccessible while at least one reference on the object exists, the routines including a plurality of per-processor reference counts, maintained in the units of the memory.
2. The computer system of claim 1 wherein the at least one faster-access memory level includes a level of local cache memory units, wherein each local cache unit is private to the respective processor associated with the cache unit.
3. The computer system of claim 1 wherein the computer system is a non-uniform memory access (NUMA) multiprocessor machine, and the at least one faster-access memory level includes a level comprising subsets of the main memory, wherein each subset is physically close to the processor with which it is associated.
Type: Application
Filed: Apr 11, 2006
Publication Date: Sep 7, 2006
Applicant: Microsoft Corporation (Redmond, WA)
Inventors: Ravisankar Pudipeddi (Sammamish, WA), Neill Clift (Kirkland, WA), Neal Christiansen (Bellevue, WA)
Application Number: 11/402,493
International Classification: G06F 12/00 (20060101);