Dynamic Base Disk Mirroring for Linked Clones
Techniques for implementing dynamic base disk mirroring for linked clones are provided. In one set of embodiments, a first node in a distributed storage system can monitor a congestion level of a base disk residing on the first node, where the base disk is shared by a plurality of linked clones. Upon determining that the congestion level exceeds a threshold, the first node can send, to a second node, a request to create a mirror of the base disk on that second node. Upon receiving an acknowledgement from the second node that the mirror has been successfully created, the first node can update a mirror set associated with the base disk to include an entry identifying the mirror. The first node can then communicate the updated mirror set to one or more other nodes.
Unless otherwise indicated, the subject matter described in this section is not prior art to the claims of the present application and is not admitted as being prior art by inclusion in this section.
A linked clone is a virtual machine (VM) that is created from a point-in-time snapshot of another (i.e., parent) VM and shares with the parent VM—and with other linked clones created from the same snapshot—a base disk corresponding to the snapshotted state of the parent VM's virtual disk. Ongoing changes made by the parent VM to the virtual disk do not affect the linked clone because they are written to a separate delta disk that is specific to the parent VM. Similarly, changes made by the linked clone to the virtual disk do not affect the parent VM or other linked clones because they are written to a separate linked clone delta disk that is specific to that linked clone. Read requests issued by the parent VM or the linked clone to the virtual disk are first directed to their respective delta disks; if the read requests cannot be fulfilled there (which will be true for read requests for unmodified data), they are redirected to the shared base disk.
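To make the read-redirection behavior concrete, the following is a minimal Python sketch (not the disclosure's implementation) that models a linked clone's virtual disk as a private delta disk overlaid on a shared, read-only base disk. The class and method names (BaseDisk, DeltaDisk, LinkedCloneDisk) are illustrative assumptions.

```python
# Illustrative model only: a linked clone's reads hit its private delta disk
# first and fall back to the shared, read-only base disk for unmodified data.

class BaseDisk:
    """Read-only snapshot state shared by the parent VM and all linked clones."""
    def __init__(self, blocks):
        self._blocks = dict(blocks)           # block number -> data

    def read(self, block_no):
        return self._blocks.get(block_no)


class DeltaDisk:
    """Per-VM overlay holding only the blocks that VM has written."""
    def __init__(self):
        self._blocks = {}

    def write(self, block_no, data):
        self._blocks[block_no] = data

    def read(self, block_no):
        return self._blocks.get(block_no)     # None => block not modified here


class LinkedCloneDisk:
    """Virtual disk seen by one linked clone: a delta disk over a shared base disk."""
    def __init__(self, base, delta):
        self._base = base
        self._delta = delta

    def read(self, block_no):
        data = self._delta.read(block_no)
        if data is not None:                  # clone modified this block
            return data
        return self._base.read(block_no)      # redirect to the shared base disk

    def write(self, block_no, data):
        self._delta.write(block_no, data)     # writes never touch the base disk


# Example: two clones share one base disk; each sees only its own changes.
base = BaseDisk({0: b"golden"})
clone_a = LinkedCloneDisk(base, DeltaDisk())
clone_b = LinkedCloneDisk(base, DeltaDisk())
clone_a.write(0, b"a-local")
assert clone_a.read(0) == b"a-local" and clone_b.read(0) == b"golden"
```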
While linked cloning provides several benefits (e.g., high storage efficiency, fast clone creation, etc.) over alternative VM cloning mechanisms such as full cloning, creating a large number of linked clones from a single parent VM snapshot can impose a heavy I/O burden on the base disk due to the need to serve many, potentially concurrent read requests from the linked clones. This in turn can cause the base disk to become a bottleneck that limits the linked clones' performance. This issue is particularly problematic for virtual desktop deployments in which hundreds or thousands of virtual desktop linked clones may be created from a single “golden master” VM snapshot.
In the following description, for purposes of explanation, numerous examples and details are set forth in order to provide an understanding of various embodiments. It will be evident, however, to one skilled in the art that certain embodiments can be practiced without some of these details or can be practiced with modifications or equivalents thereof.
1. Overview

The present disclosure is directed to techniques that can be implemented in a distributed storage system for dynamically mirroring base disks that are shared by linked clones. As used herein, a “mirror” of a base disk (or “base disk mirror”) is a read-only copy of that base disk.
In one set of embodiments, a distributed storage system can monitor a congestion level of a base disk residing at a first node of the system and, upon determining that the congestion level has exceeded a threshold, create a mirror of the base disk at a second node different from the first node. The distributed storage system can then add the newly created mirror to a list of mirrors (i.e., mirror set) for the base disk and communicate the mirror set to all nodes (or all interested nodes) in the system. With an up-to-date mirror set for the base disk in place at each node, when a read request is received that needs to be redirected to the base disk, the distributed storage system can select, based on current disk congestion levels and/or other criteria, either the base disk or one of its mirrors as the actual target for serving the read request and send the request to the selected target. In this way, the distributed storage system can effectively load balance read I/O across the base disk and its mirrors, thereby reducing the base disk's congestion level and improving the storage performance of the linked clones sharing the base disk. In certain embodiments this mirror creation process can be repeated multiple times, resulting in the creation of multiple mirrors of the base disk up to, e.g., a user-configurable high watermark.
In a further set of embodiments, the distributed storage system can monitor the overall load (and/or other metrics) of the base disk and its mirrors created via the process above and, upon determining that the overall load has fallen below a threshold, delete one of the base disk mirrors. The distributed storage system can then remove the deleted mirror from the mirror set for the base disk and communicate the mirror set to all nodes (or all interested nodes) in the system. The result of these steps is that the deleted base disk mirror will no longer be used as a load balancing target for future read requests destined for the base disk. In addition, the storage space previously consumed by the deleted base disk mirror can be reused for other purposes. As with the mirror creation process, this mirror deletion process can be repeated multiple times until the total number of mirrors of the base disk drops to zero (or reaches a user-configurable low watermark).
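One plausible way to represent the per-base-disk mirror set described above is a small, versioned record that nodes exchange whenever a mirror is added or removed. The sketch below is an assumption about the bookkeeping, not the disclosure's actual data layout; field names such as node_address and the version counter are hypothetical.

```python
# Hedged sketch of a mirror set: the list of read-only copies of one base disk,
# plus a version number so receiving nodes can discard stale updates.
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class MirrorEntry:
    mirror_id: str        # identifier of the base disk mirror
    node_address: str     # network address of the node holding it


@dataclass
class MirrorSet:
    base_disk_id: str
    version: int = 0
    entries: List[MirrorEntry] = field(default_factory=list)

    def add_mirror(self, entry: MirrorEntry) -> None:
        self.entries.append(entry)
        self.version += 1

    def remove_mirror(self, mirror_id: str) -> None:
        self.entries = [e for e in self.entries if e.mirror_id != mirror_id]
        self.version += 1


# Example: the node holding the base disk records a newly created mirror before
# communicating the updated mirror set to other nodes.
ms = MirrorSet(base_disk_id="base-208")
ms.add_mirror(MirrorEntry("mirror-1", "node-102-3"))
```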
2. Solution Architecture

Generally speaking, storage agents 106(1)-(N) of nodes 102(1)-(N) are configured to manage the storage of persistent data in physical storage resources 108(1)-(N) and make these resources available as a storage backend to storage clients. It is assumed that the persistent data managed by storage agents 106(1)-(N) and maintained in physical storage resources 108(1)-(N) include virtual disks used by VMs, and more specifically base disks and delta disks used by linked clones. For example, in the scenario referenced throughout this section, a base disk 208 residing at node 102(2) is shared by linked clones 202 and 204, which write their changes to respective linked clone delta disks 210 and 212 and whose read requests are received by storage agents 106(1) and 106(3).
As noted in the Background section, one issue with linked cloning is that the base disk shared by a parent VM and its linked clones can become a performance bottleneck as the number of linked clones scales upward. For example, although only two linked clones are described in this scenario, a single base disk may in practice be shared by hundreds or thousands of linked clones, each of which issues read requests that must ultimately be served by that base disk.
A workaround for this problem is to employ full cloning rather than linked cloning. Unlike a linked clone, a full clone does not share a base disk with its parent VM or with other full clones; instead, each full clone is given its own, independent copy of the parent VM's virtual disk, which eliminates the shared base disk as a point of congestion. However, full cloning is not a practical solution for many environments because it suffers from poor storage efficiency and slow clone creation times.
To address the foregoing and other similar issues, each node 102 of distributed storage system 100 can be enhanced with a mirror manager 110 and an enhanced read I/O handler 112 (implemented, e.g., as part of its storage agent 106) that together carry out the dynamic base disk mirroring techniques of the present disclosure.
For example, with respect to the scenario described above, mirror manager 110(2) of node 102(2) (i.e., the node on which base disk 208 resides) can monitor the congestion level of base disk 208 and, upon determining that the congestion level has exceeded a threshold, cause one or more mirrors of base disk 208 to be created on other nodes of distributed storage system 100.
Further, at the time storage agents 106(1) and 106(3) receive read requests from linked clones 202 and 204 that cannot be fulfilled by their respective linked clone delta disks 210 and 212, enhanced read I/O handlers 112(1) and 112(3) of storage agents 106(1) and 106(3) can select, based on the current congestion levels of base disk 208 and its mirrors (and/or other criteria such as network locality), one of those disks as the target for serving the read requests. The enhanced read I/O handlers can then redirect the read requests to the selected target.
Yet further, concurrently with the above, mirror manager 110(2) (either alone or in cooperation with other mirror managers) can monitor the overall load and/or other metrics of base disk 208 and its mirrors. Upon determining that the overall load has fallen below a threshold, mirror manager 110(2) can cause one of the base disk mirrors to be deleted.
With the high-level approach described above, a number of advantages are achieved. First, during time periods in which base disk 208 is not congested, there is no change in how base disk 208 is accessed by linked clones 202 and 204 and no change in the amount of storage space consumed on distributed storage system 100. However, once base disk 208 becomes congested beyond a threshold, this approach dynamically trades off some storage efficiency for performance by creating one or more base disk mirrors, which allows the read I/O load on base disk 208 to be balanced across those mirrors and thus prevents base disk 208 from becoming a bottleneck. Conversely, once one or more of the base disk mirrors are no longer needed, this approach deletes those mirrors and reclaims the storage space they previously consumed. As a result, dynamic base disk mirroring is a flexible solution that significantly mitigates the performance problems of sharing a single base disk among many linked clones, while maintaining good storage efficiency when congestion levels are low (and significantly better average storage efficiency than full cloning). In certain embodiments, an administrator of distributed storage system 100 can tune this mechanism by configuring the congestion/load thresholds at which new mirrors are created and existing mirrors are deleted, as well as by setting high and low watermarks that indicate the maximum and minimum number of mirrors allowed for a given base disk (one possible shape for these tuning parameters is sketched below).
Second, dynamic base disk mirroring is transparent to the storage clients (i.e., linked clones) accessing the base disk; from the perspective of those storage clients, all read requests that cannot be served by their delta disks appear to be served by the base disk, even though the read requests may in fact be redirected to and served by a base disk mirror. Accordingly, there is no need for any storage client-side modifications with this approach.
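As one possible shape for the administrator-tunable parameters mentioned above, the following sketch groups the thresholds and watermarks into a single configuration object. The parameter names and default values are assumptions chosen for illustration, not values from the disclosure.

```python
# Hypothetical tuning knobs for dynamic base disk mirroring; defaults are
# placeholders only.
from dataclasses import dataclass


@dataclass
class MirroringConfig:
    congestion_threshold: float = 0.8   # create a new mirror above this congestion level
    load_threshold: float = 0.3         # delete a mirror when overall load drops below this
    high_watermark: int = 4             # maximum number of mirrors per base disk
    low_watermark: int = 0              # minimum number of mirrors per base disk

    def can_expand(self, current_mirrors: int) -> bool:
        """Mirror set may grow only while it is below the high watermark."""
        return current_mirrors < self.high_watermark

    def can_contract(self, current_mirrors: int) -> bool:
        """Mirror set may shrink only while it is above the low watermark."""
        return current_mirrors > self.low_watermark
```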
It should be appreciated that the system architecture and scenario described above are illustrative and not intended to limit embodiments of the present disclosure.
3. Mirror Set Expansion

Starting with block 402, mirror manager 110(i) can monitor the current congestion level of base disk B. This congestion level can be based on one or more statistics such as the number of concurrent I/O requests being serviced by base disk B, the status of base disk B's buffer, the average latency of base disk B over a moving time window, and so on.
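The disclosure leaves the exact congestion formula open; as a sketch, the snippet below combines the example statistics (concurrent I/O count, buffer occupancy, and average latency over a moving window) into a single normalized score. The weights, bounds, and class name are assumptions, and a real system might use any one of these statistics on its own.

```python
# One hypothetical way to reduce the listed statistics to a congestion score in [0, 1].
from collections import deque


class CongestionMonitor:
    def __init__(self, window: int = 100,
                 max_outstanding_io: int = 256, max_latency_ms: float = 50.0):
        self._latencies = deque(maxlen=window)   # moving window of recent I/O latencies
        self.outstanding_io = 0                  # concurrent I/O requests in flight
        self.buffer_occupancy = 0.0              # fraction of the disk's buffer in use
        self._max_io = max_outstanding_io
        self._max_latency = max_latency_ms

    def record_io(self, latency_ms: float) -> None:
        self._latencies.append(latency_ms)

    def congestion_level(self) -> float:
        avg_latency = (sum(self._latencies) / len(self._latencies)
                       if self._latencies else 0.0)
        # Normalize each statistic to [0, 1] and report the worst of the three.
        return max(min(self.outstanding_io / self._max_io, 1.0),
                   min(self.buffer_occupancy, 1.0),
                   min(avg_latency / self._max_latency, 1.0))
```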
At block 404, mirror manager 110(i) can check whether the current congestion level is greater than a congestion threshold. If the answer is no, mirror manager 110(i) can return to its monitoring at block 402.
However, if the answer at block 404 is yes, mirror manager 110(i) can further check whether the current number of base disk mirrors for base disk B is less than a high watermark (block 406). If the answer at block 406 is no, mirror manager 110(i) can return to its monitoring at block 402 (and/or generate a notification for the system administrator that the mirror set of base disk B cannot be expanded).
However, if the answer at block 406 is yes, mirror manager 110(i) can identify another node of distributed storage system 100 as a candidate for holding a new mirror of base disk B (block 408) and can transmit a message to the mirror manager of that node requesting creation of a base disk mirror there (block 410). In one set of embodiments, mirror manager 110(i) can identify this candidate based on the node topology of distributed storage system 100; for example, mirror manager 110(i) may select a node that is furthest away from node 102(i). In other embodiments, mirror manager 110(i) can identify this candidate based on criteria such as the locations of the linked clone delta disks associated with base disk B, the amount of free storage space on each node, etc.
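The candidate-selection criteria above (node topology, delta disk placement, free space) could be combined in many ways; the function below is a hedged sketch that simply picks the eligible node furthest from the base disk's node, breaking ties by free space. The NodeInfo fields are hypothetical and stand in for whatever topology and capacity data the system actually tracks.

```python
# Hypothetical candidate selection for a new base disk mirror.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeInfo:
    node_id: str
    distance: int            # topology distance from the base disk's node
    free_bytes: int          # free storage space on the node
    holds_mirror: bool       # node already holds a mirror of this base disk


def pick_mirror_candidate(nodes: List[NodeInfo],
                          mirror_size_bytes: int) -> Optional[NodeInfo]:
    eligible = [n for n in nodes
                if not n.holds_mirror and n.free_bytes >= mirror_size_bytes]
    if not eligible:
        return None
    # Prefer the node furthest from the base disk's node, then the most free space.
    return max(eligible, key=lambda n: (n.distance, n.free_bytes))
```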
At block 412, mirror manager 110(i) can receive an acknowledgement from the mirror manager of the candidate node that the new base disk mirror has been successfully created. In response, mirror manager 110(i) can update a mirror set for base disk B to include an entry for the newly created mirror (block 414). This entry can comprise, e.g., an identifier and network address of the mirror.
Finally, at block 416, mirror manager 110(i) can communicate the updated mirror set for base disk B to all of the mirror managers in distributed storage system 100, or alternatively to a subset of mirror managers that have an interest in base disk B. Such a subset may include mirror managers residing at nodes that maintain either a base disk mirror or a linked clone delta disk that is associated with base disk B.
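Putting blocks 402-416 together, the sketch below shows one way a single expansion pass could be structured. The injected helpers (pick_candidate, request_mirror_creation, broadcast_mirror_set) are placeholders for whatever messaging the system actually uses; this is a sketch of the described steps, not the disclosure's implementation.

```python
# Hedged end-to-end sketch of one mirror set expansion pass (blocks 402-416).
from typing import Callable, List, Optional, Tuple


def maybe_expand_mirror_set(
    congestion_level: float,
    congestion_threshold: float,
    mirror_set: List[Tuple[str, str]],                        # (mirror_id, node_address)
    high_watermark: int,
    pick_candidate: Callable[[], Optional[str]],              # returns a node address
    request_mirror_creation: Callable[[str], Optional[str]],  # returns mirror_id on ack
    broadcast_mirror_set: Callable[[List[Tuple[str, str]]], None],
) -> bool:
    """Returns True if a new mirror was created and the mirror set was broadcast."""
    if congestion_level <= congestion_threshold:       # block 404: not congested enough
        return False
    if len(mirror_set) >= high_watermark:               # block 406: cannot expand further
        return False
    candidate = pick_candidate()                         # block 408: choose a node
    if candidate is None:
        return False
    mirror_id = request_mirror_creation(candidate)       # blocks 410-412: request + ack
    if mirror_id is None:                                # creation failed / no ack
        return False
    mirror_set.append((mirror_id, candidate))            # block 414: update mirror set
    broadcast_mirror_set(mirror_set)                     # block 416: inform other nodes
    return True
```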
4. Read Request Handling

Starting with blocks 502 and 504, enhanced read I/O handler 112(j) can receive the read request from the linked clone and check whether the requested data is present in the linked clone's linked clone delta disk. If the answer is yes, enhanced read I/O handler 112(j) can forward the read request to the linked clone delta disk for servicing/fulfillment (block 506) and workflow 500 can end.
However, if the answer at block 504 is no, enhanced read I/O handler 112(j) can retrieve the current mirror set for base disk B (block 508) and select, using a congestion-based algorithm, one of the mirrors in the mirror set (or base disk B itself) as the best target for serving the read request (block 510). In one set of embodiments, this algorithm can involve checking the current congestion level of base disk B and each mirror and selecting the disk with the lowest congestion level. If there is a tie in congestion level, the algorithm can further determine the network locality of base disk B and each mirror with respect to the handler's node (i.e., node 102(j)) and select the disk whose node is closest to node 102(j). If there is a tie in network locality, the algorithm can use a round robin scheme to select the target.
Finally, at block 512, enhanced read I/O handler 112(j) can redirect/forward the read request to the selected target and workflow 500 can end.
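The target-selection algorithm of blocks 508-512 (lowest congestion, then closest network locality, then round robin) might look like the following sketch, in which the base disk itself is treated as just another candidate target. The Target fields and the shared round-robin counter are assumptions about bookkeeping the disclosure does not spell out.

```python
# Hedged sketch of congestion-based read target selection with locality and
# round-robin tie-breaking.
import itertools
from dataclasses import dataclass
from typing import List

_round_robin = itertools.count()   # shared counter for round-robin tie-breaks


@dataclass
class Target:
    disk_id: str          # base disk or one of its mirrors
    congestion: float     # current congestion level of that disk
    distance: int         # network distance from the requesting node


def select_read_target(targets: List[Target]) -> Target:
    lowest = min(t.congestion for t in targets)
    tied = [t for t in targets if t.congestion == lowest]     # step 1: congestion
    if len(tied) > 1:
        closest = min(t.distance for t in tied)
        tied = [t for t in tied if t.distance == closest]     # step 2: network locality
    return tied[next(_round_robin) % len(tied)]               # step 3: round robin


# Example: a less congested mirror wins over the heavily loaded base disk.
base = Target("base-208", congestion=0.9, distance=0)
mirror = Target("mirror-1", congestion=0.2, distance=2)
assert select_read_target([base, mirror]) is mirror
```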
5. Mirror Set Contraction

Starting with blocks 602 and 604, mirror manager 110(i) can monitor the current load (and/or other metrics) of base disk B and its mirrors and check whether the load is greater than a load threshold. If the answer is yes, mirror manager 110(i) can return to its monitoring at block 602.
However, if the answer at block 604 is no, mirror manager 110(i) can further check whether the current number of base disk mirrors for base disk B is greater than a low watermark (block 606). If the answer at block 606 is no, mirror manager 110(i) can return to its monitoring at block 602 (and/or generate a notification for the system administrator that the mirror set of base disk B cannot be contracted).
However, if the answer at block 606 is yes, mirror manager 110(i) can identify one of the mirrors of base disk B as a candidate for deletion (block 608) and can transmit a message to the mirror manager at the node of that mirror requesting that it be deleted (block 610). As with the mirror creation workflow, mirror manager 110(i) can identify this deletion candidate based on the node topology of distributed storage system 100 and/or other criteria (e.g., the locations of the linked clone delta disks associated with base disk B, the amount of free storage space on each node, etc.).
At block 612, mirror manager 110(i) can receive an acknowledgement that the base disk mirror has been successfully deleted. Finally, mirror manager 110(i) can update the mirror set for base disk B to remove the entry for the deleted mirror (block 614) and communicate the updated mirror set for base disk B to all of the mirror managers in distributed storage system 100 (or the subset of mirror managers interested in base disk B) (block 616).
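For symmetry with the expansion sketch above, one possible shape for a single contraction pass (blocks 602-616) is shown below; again, the injected helpers are placeholders and the composition of the described steps is an assumption.

```python
# Hedged sketch of one mirror set contraction pass (blocks 602-616).
from typing import Callable, List, Optional, Tuple


def maybe_contract_mirror_set(
    overall_load: float,
    load_threshold: float,
    mirror_set: List[Tuple[str, str]],            # (mirror_id, node_address) entries
    low_watermark: int,
    pick_deletion_candidate: Callable[[List[Tuple[str, str]]], Optional[Tuple[str, str]]],
    request_mirror_deletion: Callable[[str, str], bool],       # returns True on ack
    broadcast_mirror_set: Callable[[List[Tuple[str, str]]], None],
) -> bool:
    """Returns True if a mirror was deleted and the mirror set was broadcast."""
    if overall_load > load_threshold:             # block 604: still busy enough
        return False
    if len(mirror_set) <= low_watermark:          # block 606: cannot contract further
        return False
    victim = pick_deletion_candidate(mirror_set)  # block 608: choose a mirror
    if victim is None:
        return False
    mirror_id, node_address = victim
    if not request_mirror_deletion(mirror_id, node_address):   # blocks 610-612
        return False
    mirror_set.remove(victim)                     # block 614: update mirror set
    broadcast_mirror_set(mirror_set)              # block 616: inform other nodes
    return True
```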
Certain embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. For example, these operations can require physical manipulation of physical quantities—usually, though not necessarily, these quantities take the form of electrical or magnetic signals, where they (or representations of them) are capable of being stored, transferred, combined, compared, or otherwise manipulated. Such manipulations are often referred to in terms such as producing, identifying, determining, comparing, etc. Any operations described herein that form part of one or more embodiments can be useful machine operations.
Yet further, one or more embodiments can relate to a device or an apparatus for performing the foregoing operations. The apparatus can be specially constructed for specific required purposes, or it can be a general-purpose computer system selectively activated or configured by program code stored in the computer system. In particular, various general-purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations. The various embodiments described herein can be practiced with other computer system configurations including handheld devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
Yet further, one or more embodiments can be implemented as one or more computer programs or as one or more computer program modules embodied in one or more non-transitory computer readable storage media. The term non-transitory computer readable storage medium refers to any data storage device that can store data which can thereafter be input to a computer system. The non-transitory computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer system. Examples of non-transitory computer readable media include a hard drive, network attached storage (NAS), read-only memory, random-access memory, flash-based nonvolatile memory (e.g., a flash memory card or a solid-state disk), a CD (Compact Disc) (e.g., CD-ROM, CD-R, CD-RW, etc.), a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The non-transitory computer readable media can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion.
In addition, while certain virtualization methods referenced herein have generally assumed that virtual machines present interfaces consistent with a particular hardware system, persons of ordinary skill in the art will recognize that the methods referenced can be used in conjunction with virtualizations that do not correspond directly to any particular hardware system. Virtualization systems in accordance with the various embodiments, implemented as hosted embodiments, non-hosted embodiments or as embodiments that tend to blur distinctions between the two, are all envisioned. Furthermore, certain virtualization operations can be wholly or partially implemented in hardware.
Many variations, modifications, additions, and improvements are possible, regardless of the degree of virtualization. The virtualization software can therefore include components of a host, console, or guest operating system that perform virtualization functions. Plural instances can be provided for components, operations, or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations can be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component can be implemented as separate components.
As used in the description herein and throughout the claims that follow, “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The above description illustrates various embodiments along with examples of how aspects of particular embodiments may be implemented. These examples and embodiments should not be deemed to be the only embodiments and are presented to illustrate the flexibility and advantages of particular embodiments as defined by the following claims. Other arrangements, embodiments, implementations, and equivalents can be employed without departing from the scope hereof as defined by the claims.
Claims
1. A method comprising:
- monitoring, by a first node in a distributed storage system, a congestion level of a base disk residing on the first node, wherein the base disk is shared by a plurality of linked clones created from a parent virtual machine (VM) snapshot, and wherein each linked clone is associated with a linked clone delta disk that is specific to the linked clone and comprises changes made by the linked clone to data in the base disk;
- upon determining that the congestion level exceeds a first threshold and a number of existing mirrors for the base disk is less than a high watermark, sending, by the first node to a second node in the distributed storage system, a request to create a mirror of the base disk on the second node, the mirror being a read-only copy of the base disk;
- upon receiving an acknowledgement from the second node that the mirror has been successfully created on the second node, updating, by the first node, a mirror set associated with the base disk to include an entry identifying the mirror; and
- communicating, by the first node, the mirror set to one or more other nodes in the distributed storage system.
2. The method of claim 1 further comprising:
- receiving, by a third node in the distributed storage system, a read request from a first linked clone in the plurality of linked clones for reading data on the base disk;
- upon determining that the data is not present on a linked clone delta disk of the first linked clone: retrieving, by the third node, the mirror set communicated by the first node; selecting, by the third node, either the base disk or the mirror identified in the mirror set as a target for serving the read request; and redirecting, by the third node, the read request to the selected target.
3. The method of claim 2 wherein the selecting is based on current congestion levels of the base disk and the mirror.
4. The method of claim 3 wherein if there is a tie in current congestion levels of the base disk and the mirror, the selecting is further based on network locality between the first node and the third node and between the second node and the third node.
5. The method of claim 1 further comprising:
- monitoring a load metric of the base disk and the mirror;
- upon determining that the load metric is below a second threshold, sending, to the second node, a request to delete the mirror;
- upon receiving an acknowledgement from the second node that the mirror has been successfully deleted, updating the mirror set to remove the entry identifying the mirror; and
- communicating the mirror set to the one or more other nodes.
6. The method of claim 1 wherein the congestion level corresponds to a number of concurrent I/O requests being served by the base disk or a status of a buffer of the base disk.
7. The method of claim 1 wherein the first threshold and the high watermark are defined by an administrator of the distributed storage system.
8. A non-transitory computer readable storage medium having stored thereon program code executable by a first node of a distributed storage system, the program code embodying a method comprising:
- monitoring a congestion level of a base disk residing on the first node, wherein the base disk is shared by a plurality of linked clones created from a parent virtual machine (VM) snapshot, and wherein each linked clone is associated with a linked clone delta disk that is specific to the linked clone and comprises changes made by the linked clone to data in the base disk;
- upon determining that the congestion level exceeds a first threshold and a number of existing mirrors for the base disk is less than a high watermark, sending, to a second node in the distributed storage system, a request to create a mirror of the base disk on the second node, the mirror being a read-only copy of the base disk;
- upon receiving an acknowledgement from the second node that the mirror has been successfully created on the second node, updating a mirror set associated with the base disk to include an entry identifying the mirror; and
- communicating the mirror set to one or more other nodes in the distributed storage system.
9. The non-transitory computer readable storage medium of claim 8 wherein a third node in the distributed storage system:
- receives a read request from a first linked clone in the plurality of linked clones for reading data on the base disk;
- upon determining that the data is not present on a linked clone delta disk of the first linked clone: retrieves the mirror set communicated by the first node; selects either the base disk or the mirror identified in the mirror set as a target for serving the read request; and redirects the read request to the selected target.
10. The non-transitory computer readable storage medium of claim 9 wherein the selecting is based on current congestion levels of the base disk and the mirror.
11. The non-transitory computer readable storage medium of claim 10 wherein if there is a tie in current congestion levels of the base disk and the mirror, the selecting is further based on network locality between the first node and the third node and between the second node and the third node.
12. The non-transitory computer readable storage medium of claim 8 wherein the method further comprises:
- monitoring a load metric of the base disk and the mirror;
- upon determining that the load metric is below a second threshold, sending, to the second node, a request to delete the mirror;
- upon receiving an acknowledgement from the second node that the mirror has been successfully deleted, updating the mirror set to remove the entry identifying the mirror; and
- communicating the mirror set to the one or more other nodes.
13. The non-transitory computer readable storage medium of claim 8 wherein the congestion level corresponds to a number of concurrent I/O requests being served by the base disk or a status of a buffer of the base disk.
14. The non-transitory computer readable storage medium of claim 8 wherein the first threshold and the high watermark are defined by an administrator of the distributed storage system.
15. A node in a distributed storage system comprising:
- a processor; and
- a non-transitory computer readable medium having stored thereon program code that, when executed, causes the processor to: monitor a congestion level of a base disk residing on the node, wherein the base disk is shared by a plurality of linked clones created from a parent virtual machine (VM) snapshot, and wherein each linked clone is associated with a linked clone delta disk that is specific to the linked clone and comprises changes made by the linked clone to data in the base disk; upon determining that the congestion level exceeds a first threshold and a number of existing mirrors for the base disk is less than a high watermark, send, to another node in the distributed storage system, a request to create a mirror of the base disk on said another node, the mirror being a read-only copy of the base disk; upon receiving an acknowledgement from said another node that the mirror has been successfully created, update a mirror set associated with the base disk to include an entry identifying the mirror; and communicate the mirror set to one or more other nodes in the distributed storage system.
16. The node of claim 15 wherein yet another node in the distributed storage system:
- receives a read request from a first linked clone in the plurality of linked clones for reading data on the base disk;
- upon determining that the data is not present on a linked clone delta disk of the first linked clone: retrieves the mirror set communicated by the node; selects either the base disk or the mirror identified in the mirror set as a target for serving the read request; and redirects the read request to the selected target.
17. The node of claim 16 wherein the selecting is based on current congestion levels of the base disk and the mirror.
18. The node of claim 17 wherein if there is a tie in current congestion levels of the base disk and the mirror, the selecting is further based on network locality between the node and said yet another node and between said another node and said yet another node.
19. The node of claim 15 wherein the program code further causes the processor to:
- monitor a load metric of the base disk and the mirror;
- upon determining that the load metric is below a second threshold, send, to said another node, a request to delete the mirror;
- upon receiving an acknowledgement from said another node that the mirror has been successfully deleted, update the mirror set to remove the entry identifying the mirror; and
- communicate the mirror set to the one or more other nodes.
20. The node of claim 15 wherein the congestion level corresponds to a number of concurrent I/O requests being served by the base disk or a status of a buffer of the base disk.
21. The node of claim 15 wherein the first threshold and the high watermark are defined by an administrator of the distributed storage system.
Type: Application
Filed: Apr 5, 2021
Publication Date: Oct 6, 2022
Inventors: Jyothir Ramanan (Hyderabad), Matthew B. Amdur (Cambridge, MA), Wenguang Wang (Santa Clara, CA), Enning Xiang (San Jose, CA)
Application Number: 17/222,621