MULTI-PHASE SOFTWARE DELIVERY
A service manager component associated with a set of hosts receives an update to be implemented on the set of hosts. The service manager component can then determine a penalty model that approximates the likely impact associated with an error in the update. Based on the penalty model, the service manager component selects a first subset of hosts to receive and implement the update and an observation window to determine whether an error has emerged or has been detected. If no errors are detected during the observation window, the service manager component can select additional subsets and observation windows and repeat the process or, alternatively, implement the update in the remaining set of hosts and monitor the system until it receives the next update.
This application is a continuation of U.S. patent application Ser. No. 13/076,163, entitled MULTI-PHASE SOFTWARE DELIVERY, and filed on Mar. 30, 2011, the entirety of which is incorporated herein by reference.
BACKGROUND

Generally described, computing devices utilize a communication network, or a series of communication networks, to exchange data. Companies and organizations operate computer networks that interconnect a number of computing devices to support operations or provide services to third parties. The computing systems can be located in a single geographic location or in multiple, distinct geographic locations (e.g., interconnected via private or public communication networks). Specifically, data centers or data processing centers, herein generally referred to as a “data center,” may include a number of interconnected computing systems to provide computing resources to users of the data center. The data centers may be private data centers operated on behalf of an organization or public data centers operated on behalf of, or for the benefit of, the general public. To facilitate increased utilization of data center resources, virtualization technologies may allow a single physical computing device to host one or more instances of virtual machines that appear and operate as independent computing devices to users of a data center. With virtualization, a single physical computing device can create, maintain, delete, or otherwise manage virtual machines in a dynamic manner.
Regardless of whether virtualization technologies are utilized, users, via client computing devices, can transmit requests to computing devices, such as computing devices at data centers, to process data provided by, or on behalf of, the requesting client computing device, often referred to as a “Web service” or “service.” The client computing devices can typically request the processing of data (e.g., a “service request”) through the transmission of data that has been organized in accordance with a pre-established format, such as an Application Programming Interface (“API”). For example, a user can access various data processing services via a browser-based software application hosted on the client computing device. Based on the information included in a client computing device service request, the receiving computing device processes the request and can return responsive data or confirm that the requested data processing has been completed. From the perspective of the user at the client computing device, the utilization of such services can provide the impression that the requested services are implemented on the client computing device.
In a typical embodiment, one or more third party service providers maintain various computing devices in a data center that process the client computing device service requests, generally referred to as hosts. Periodically, the third party service provider or data center provider may wish to implement updates, upgrades or other modifications to the software maintained by the hosts and utilized in conjunction with the processing of service requests. Despite substantial testing of any proposed updates, upgrades, or modifications, any software modification to a service host may contain errors (e.g., “bugs”), some of which may be latent or not fully realized until the update, upgrade or modification has been implemented. However, the emergence of a software error once the software has been implemented across a fleet of hosts that provide a service can have a significant impact on the availability of the service and can cause damage to client computing devices or to data associated with client computing devices.
The foregoing aspects and many of the attendant advantages will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:
Generally described, the present disclosure relates to the implementation and management of software in a network-based environment. More specifically, the present disclosure relates to the management of software that is to be implemented on multiple computing devices in accordance with a multi-phased distribution. Illustratively, aspects of the present disclosure will be described with regard to the implementation, configuration, or incorporation of software updates, upgrades, or other modifications, which will be generally referred to as “updates.” Additionally, aspects of the present disclosure will be described with regard to the management of the updates on a set of client computing device hosts that correspond to service providers providing network-based services. Although utilized for purposes of illustrative examples, one skilled in the art will appreciate that such illustrative examples should not necessarily be construed as limiting.
In accordance with an illustrative embodiment, a service manager component associated with a set of hosts receives an update to be implemented on the set of hosts. The service manager component can then determine a penalty model that approximates the likely impact associated with an error in the update. Based on the penalty model, the service manager component selects a first subset of hosts to receive and implement the update and an observation window to determine whether an error has emerged or has been detected. Illustratively, the selection of the number of hosts in the first subset and the observation window are based on optimizations of corresponding variables in a penalty model. If no errors are detected during the observation window, the service manager component can select additional subsets and observation windows and repeat the process or, alternatively, implement the update in the remaining set of hosts.
In communication with the service clients 102 via the communication network 104 is a service provider network 106 that is associated with a set of hosts for providing network-based services responsive to service client requests.
Also in communication with the service provider network 106, via the communication network 104, are one or more service vendors or application providers 112. Illustratively, the service vendors 112 may include third party vendors that provide the software applications executed by the hosts to provide services. The service vendors may also correspond to the entity that provides the service provider network 106 and thus need not be a true third party. As will be described below, the service vendor 112 can provide the updates to be implemented on the hosts 110.
Upon receipt of the update, a host manager component 108 identifies the set of hosts 110 that will receive the update. In one example, the host manager component 108 can utilize information included in the transmission of the update that identifies one or more hosts 110 to receive the update or provides criteria for selecting the set of hosts 110. In another example, the host manager component 108 can identify the set of hosts 110 to receive the update based on an affiliation with the service vendor 112, an identification of software applications corresponding to the update, hosts associated with a particular entity, and the like.
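By way of illustration, the second example above amounts to a simple filter over host metadata. The following sketch is hypothetical: the field names and host records are assumptions introduced for illustration and are not part of the original disclosure.

```python
# Hypothetical sketch: select hosts by vendor affiliation and by the software
# application to which the update applies. Field names are illustrative only.
def select_hosts(hosts, vendor_id, application):
    return [h for h in hosts
            if h.get("vendor") == vendor_id
            and application in h.get("applications", [])]

fleet = [
    {"name": "host-1", "vendor": "vendor-a", "applications": ["catalog-service"]},
    {"name": "host-2", "vendor": "vendor-b", "applications": ["catalog-service"]},
    {"name": "host-3", "vendor": "vendor-a", "applications": ["billing-service"]},
]
print(select_hosts(fleet, "vendor-a", "catalog-service"))  # prints the host-1 record
```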
At block 306, the host manager component 108 determines a first subset of hosts to implement the update. At block 308, the host manager component 108 then determines an update observation window. As previously described, illustratively, the host manager component 108 can identify a penalty model that assesses a penalty based on a distribution of the time at which an error will first occur during execution of the deployed update and the likely time between the emergence of an error and the ability to remedy the error. The penalty models may be generic to all updates for the service provider network 106. Alternatively, the penalty model may be customized to specific updates, types of updates, types of service vendors, and the like. Based on the identified penalty model, the host manager component 108 selects values for the number of hosts in the first subset and the observation window to minimize the assessed impact, e.g., the penalty in the penalty model. A more detailed discussion of illustrative penalty models and optimizations is provided below.
At block 310, the host manager component 108 transmits the update to the selected hosts in the first subset of hosts. At decision block 312, a test is conducted to determine whether an error has occurred during the determined observation window. If so, at block 314, the host manager component 108 processes the error and routine 300 ends. Illustratively, the host manager component 108 can implement various mitigation techniques in the event an error is detected, including restoring previous versions of software applications, instantiating new virtual instances of a host (with previous versions of software applications), turning off features, and the like. One skilled in the relevant art will appreciate that decision block 312 may correspond to a loop that is executed throughout the observation window.
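The mitigation techniques listed above can be viewed as a dispatch on the type of detected error. The sketch below is purely illustrative; the error categories and actions are assumptions rather than details from the original disclosure.

```python
# Hypothetical sketch: map a detected error type to one of the mitigation
# techniques described above. Categories and actions are illustrative only.
def mitigate(error_type, hosts):
    actions = {
        "bad-binary": "restore the previous software version",
        "corrupted-instance": "instantiate replacement virtual instances",
        "feature-regression": "turn off the offending feature",
    }
    action = actions.get(error_type, "recall the update")
    return f"{action} on {len(hosts)} host(s)"

print(mitigate("feature-regression", ["host-1", "host-2"]))
```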
If no errors are detected during the observation window, at decision block 318, a test is conducted to determine whether the host manager component 108 should implement the update on additional subsets of hosts. Illustratively, the host manager component 108 may implement the update in a plurality of phases in which each phase may increase the number of hosts that receive the update. If additional subsets of hosts are to receive the update, at block 320, the host manager component 108 determines the next subset of hosts to receive the update and the routine 300 returns to block 308. Illustratively, the next subset of hosts can correspond to an incremental number of hosts added to the first subset of hosts that has already received and implemented the update. Alternatively, the next subset of hosts can correspond to an independently determined subset of hosts that may or may not have any overlap with the first subset of hosts. Still further, in another embodiment, the next subset of hosts can correspond to any remaining hosts in the set of hosts that have not yet deployed the update. In this embodiment, the host manager component 108 can set the observation window determined in block 308 to be equal to, or at least substantially close to, the time remaining for the deployment of the update in the set of hosts (e.g., observing the set of hosts for the remainder of the release cycle).
If no additional subsets of hosts are to receive the update, at block 322, the routine 300 terminates.
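For reference, the following is a minimal, self-contained sketch of the control flow of routine 300 (blocks 306 through 322) described above. The subset sizes, window lengths, and stubbed deploy/observe/mitigate helpers are illustrative assumptions, not elements of the original disclosure.

```python
import random

def deploy_update(update, subset):
    # Block 310: transmit the update to the selected hosts (stubbed).
    print(f"deploying {update} to {len(subset)} host(s)")

def observe(hosts, window_hours):
    # Decision block 312: watch the deployed hosts throughout the window (stubbed).
    print(f"observing {len(hosts)} host(s) for {window_hours} hours")
    return random.random() < 0.05  # pretend a 5% chance an error is detected

def mitigate(update, hosts):
    # Block 314: process a detected error, e.g., roll back the update (stubbed).
    print(f"rolling back {update} on {len(hosts)} host(s)")

def phased_deployment(update, phases, windows):
    deployed = []
    for subset, window in zip(phases, windows):  # blocks 306/308 and 318/320
        deploy_update(update, subset)
        deployed.extend(subset)
        if observe(deployed, window):
            mitigate(update, deployed)
            return False
    return True  # block 322: all phases completed without a detected error

hosts = [f"host-{i}" for i in range(100)]
phases = [hosts[:5], hosts[5:25], hosts[25:]]   # increasing subsets (illustrative)
windows = [24, 24, 672]                          # hours (illustrative)
ok = phased_deployment("update-1.2.3", phases, windows)
print("deployment completed" if ok else "deployment halted after an error")
```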
At block 404, the host manager component 108 determines a distribution of time to an occurrence of a first failure for the target update. Illustratively, the distribution of time to a first failure corresponds to an approximation of the time at which a failure will first occur during execution of a deployed update, and can be represented as an exponential distribution. The distribution of time to a first failure may be generic to all service providers or based, at least in part, on specific service providers, service vendors, types of updates, types of services, etc. The distribution of time to a first failure may also be based on historical data. At block 406, the host manager component 108 determines time parameters associated with the time between the emergence of an error and the mitigation of the error. As previously described, the host manager component 108 may cause the implementation of various mitigation techniques that may depend on the type of error that has emerged or the type of update involved. Depending on the complexity of the mitigation technique, there may be a specific amount of time in which an error is present (and perhaps known), but cannot yet be mitigated. During this time, the “penalty” is incurred by the service provider.
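As a point of reference (an inference from the exponential assumption above rather than language from the original filing): if each host's time to first failure is exponentially distributed with rate λ and hosts fail independently, then the time to the first failure across n hosts is itself exponential with rate nλ, which is the form that appears in the penalty model below:

$$\Pr[\text{first failure among } n \text{ hosts} \le t] \;=\; 1 - e^{-n\lambda t}, \qquad \mathbb{E}[\text{time to first failure}] \;=\; \frac{1}{n\lambda}.$$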
At block 408, the host manager component 108 calculates a penalty model based on the total number of hosts, the distribution of time to a first failure, and the time parameters. Illustratively, the penalty model can correspond to an assessment of the impact if an error is found during execution of the update on a first subset of hosts within the current observation window and if an error is found during execution of the update on a larger subset of hosts within the next observation window. The below equation corresponds to an illustrative penalty model for a two-phase deployment:
$$P(n,t) \;=\; n\,\lambda\,D\,\bigl(1 - e^{-n\lambda t}\bigr) \;+\; e^{-n\lambda t}\, N\,\lambda\,D\,\bigl(1 - e^{-N\lambda (C-t)}\bigr)$$

where $0 < n \le N$ and $0 < D \le t \le T \ll C$, and where:
- P: assessed penalty for errors in an update;
- n: number of hosts to be selected in the first subset of hosts;
- t: observation window/delay (e.g., a current observation window);
- λ: rate parameter of the exponential distribution modeling time to first failure; 1/λ is the expected time to first failure;
- D: time interval between error occurrence and error detection and removal;
- N: total number of hosts;
- T: upper bound for the current observation window;
- C: time until the next scheduled update (e.g., the release cycle).
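The following is a minimal sketch, not part of the original filing, that evaluates the two-phase penalty model above; the fleet size, failure rate, mitigation lag, and release cycle are hypothetical values chosen only to make the example runnable.

```python
import math

def penalty(n, t, N, lam, D, C):
    """Assessed penalty P(n, t) for a two-phase deployment (model above)."""
    first_phase = n * lam * D * (1 - math.exp(-n * lam * t))
    second_phase = (math.exp(-n * lam * t)
                    * N * lam * D * (1 - math.exp(-N * lam * (C - t))))
    return first_phase + second_phase

# Hypothetical fleet: 100 hosts, one expected failure per 500 hours per host,
# a 2-hour detection-to-mitigation lag, and a 720-hour (30-day) release cycle.
for n in (5, 20, 100):
    print(n, penalty(n=n, t=24.0, N=100, lam=1 / 500.0, D=2.0, C=720.0))
```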
At block 410, the host manager component determines values for the number of hosts to be selected in the first subset of hosts and the observation window based on the penalty model. As illustrated above, the illustrative penalty model factors in a penalty for an error occurring during execution of the update on the first subset, “n,” within the first observation window, namely, $n\lambda D(1 - e^{-n\lambda t})$. The illustrative penalty model also factors in a penalty for an error occurring during execution of the update on all the hosts within the second observation window (whose size is the release cycle minus the size of the first observation window), namely, $e^{-n\lambda t}\, N\lambda D(1 - e^{-N\lambda (C-t)})$. Accordingly, the host manager component attempts to optimize the penalty model such that the overall assessed penalty is minimized. In one embodiment, the optimization of the penalty model is achieved via selection of different values for the number of hosts, “n,” and the observation window, “t.”
Illustratively, the host manager component 108 can implement the optimization process by selecting a value for “n” that minimizes the penalty model for any given value of “t.” Specifically, the host manager component 108 can utilize the first and second partial derivatives of the penalty model equation with respect to “n” to model how the assessed penalty changes. Illustratively, based on a single root of the first partial derivative, the host manager component 108 can determine an optimal “n” having a value between “0” (e.g., indicative of an exclusive subset of the set of hosts) and “N” (e.g., indicative of an inclusive subset including all of the set of hosts). Additionally, “T” is always an optimal value of “t” for this model. Still further, when “N” is an optimal value of “n,” the two-phase deployment degenerates to a one-phase deployment, and the two observation windows merge into a single observation window whose size is “C,” the release cycle.
At block 412, the subroutine 400 terminates with the identification of the optimal values for n and t.
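As one possible (assumed) realization of blocks 408 and 410, the sketch below minimizes the penalty model by direct search over the subset size “n” and the observation window “t” rather than via the partial derivatives discussed above; all parameter values are hypothetical.

```python
import math

def penalty(n, t, N, lam, D, C):
    # Same two-phase penalty model as in the earlier sketch.
    return (n * lam * D * (1 - math.exp(-n * lam * t))
            + math.exp(-n * lam * t) * N * lam * D * (1 - math.exp(-N * lam * (C - t))))

def optimize(N, lam, D, T, C, t_steps=50):
    """Search 0 < n <= N and D <= t <= T for the (n, t) minimizing the penalty."""
    best = None
    for n in range(1, N + 1):
        for i in range(t_steps + 1):
            t = D + (T - D) * i / t_steps
            p = penalty(n, t, N, lam, D, C)
            if best is None or p < best[0]:
                best = (p, n, t)
    return best[1], best[2]

n_opt, t_opt = optimize(N=100, lam=1 / 500.0, D=2.0, T=48.0, C=720.0)
print(f"first subset size: {n_opt}, observation window: {t_opt:.1f} hours")
```

Consistent with the discussion above, the search should settle on t = T for this model; and when the optimal n equals N, the deployment effectively collapses to a single phase.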
It will be appreciated by those skilled in the art and others that all of the functions described in this disclosure may be embodied in software executed by one or more processors of the disclosed components and mobile communication devices. The software may be persistently stored in any type of non-volatile storage.
Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.
Any process descriptions, elements, or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process. Alternate implementations are included within the scope of the embodiments described herein in which elements or functions may be deleted, or executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art. It will further be appreciated that the data and/or components described above may be stored on a computer-readable medium and loaded into memory of the computing device using a drive mechanism associated with a computer-readable medium storing the computer executable components, such as a CD-ROM, DVD-ROM, or network interface. Further, the component and/or data can be included in a single device or distributed in any manner. Accordingly, general purpose computing devices may be configured to implement the processes, algorithms, and methodology of the present disclosure with the processing and/or execution of the various data and/or components described above.
It should be emphasized that many variations and modifications may be made to the above-described embodiments, the elements of which are to be understood as being among other acceptable examples. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.
Claims
1. A method for managing updates to a set of hosts associated with a service provider, comprising:
- obtaining an update to be implemented on a set of computing devices;
- obtaining a characterization of a numerical measure that quantifies an error occurring during deployment of the update in a first subset of the set of computing devices and an error occurring during deployment of the update in the set of computing devices;
- determining a size of the first subset based, at least in part, on a minimization of the numerical measure;
- causing a deployment of the update in the first subset of computing devices corresponding to the determined size of the first subset; and
- determining a timing for a deployment of the update in the set of computing devices based, at least in part, on an attribute associated with the deployment of the update in the first subset of computing devices;
- the method performed programmatically by one or more computing systems under control of executable program code.
2. The method of claim 1, wherein the update corresponds to at least one of an upgrade, modification, patch, or configuration.
3. The method of claim 1, wherein the numerical measure includes an amount of time in which at least one error associated with the deployment of the update is present and not yet mitigated.
4. The method of claim 1 further comprising determining a duration of time associated with the deployment of the update in the first subset.
5. The method of claim 4, wherein determining the duration of time comprises determining the duration of time based, at least in part, on a minimization of the numerical measure.
6. The method of claim 4, wherein determining the duration of time comprises determining the duration of time based, at least in part, on the size of the first subset.
7. The method of claim 1, wherein the characterization of the numerical measure includes at least one mathematical formula for the numerical measure.
8. The method of claim 7, wherein the at least one mathematical formula is based, at least in part, on a distribution of time before an initial error occurs during the deployment of the update.
9. The method of claim 7, wherein the minimization of the numerical measure is based, at least in part, on partial derivatives applicable to the at least one mathematical formula.
10. The method of claim 1, wherein the attribute associated with the deployment of the update in the first subset corresponds to a timing of detecting an error during the deployment of the update in the first subset.
11. A system for managing updates comprising:
- a set of hosts for providing services to service clients, the set of hosts each including a processor and maintaining software applications;
- a host manager component, implemented on a computing system including one or more processors and memory, the host manager component operative to: obtain an update to be deployed on the set of hosts; determine a size of a first subset of the set of hosts based, at least in part, on a minimization of a first quantification of an error occurring during deployment of the update in the first subset of hosts and a second quantification of an error occurring during deployment of the update in the set of hosts; cause the deployment of the update in the first subset of hosts corresponding to the determined size of the first subset; and determine a timing for a deployment of the update in the set of hosts based, at least in part, on an attribute associated with the deployment of the update in the first subset of hosts.
12. The system of claim 11, wherein the attribute associated with the deployment of the update in the first subset includes an amount of time in which no error is detected during the deployment of the update in the first subset.
13. The system of claim 11, wherein the host manager component is further operative to obtain at least one of a criterion for identifying the set of hosts, timing for completion of the update, or information for testing verification.
14. The system of claim 11, wherein the host manager component is further operative to detect an error during the deployment of the update in the first subset of hosts.
15. The system of claim 14, wherein the host manager component is further operative to cause mitigation of the detected error.
16. The system of claim 15, wherein mitigation of the detected error includes at least one of recalling the update, restoring a previous version of a software application, instantiating a new virtual instance of a host, turning off a feature, or reconfiguration of the first subset of hosts.
17. The system of claim 11, wherein the first and second quantifications are characterized in accordance with a penalty model.
18. A non-transitory computer readable storage medium storing computer executable instructions that instruct one or more processors to perform operations comprising:
- obtaining an update to be deployed on a set of computing devices;
- determining a size of a first subset of the set of computing devices based, at least in part, on a minimization of a numerical measure that quantifies an error occurring during deployment of the update in the first subset of computing devices and an error occurring during deployment of the update in the set of computing devices in accordance with a characterization of the numerical measure;
- causing a deployment of the update in the first subset of computing devices corresponding to the determined size of the first subset; and
- causing a deployment of the update in the set of computing devices based, at least in part, on a timing of error occurrence during the deployment of the update in the first subset of computing devices.
19. The non-transitory computer readable storage medium of claim 18, wherein the characterization of the numerical measure includes at least one of a duration of time associated with the deployment of the update in the first subset, a time interval between error occurrence and error mitigation, or a time for next scheduled update.
20. The non-transitory computer readable storage medium of claim 18, wherein the operations further comprise:
- determining a size of a second subset of the set of computing devices; and
- causing a deployment of the update in the second subset prior to the deployment of the update in the set of computing devices.
21. The non-transitory computer readable storage medium of claim 20, wherein the size of the second subset is larger than the size of the first subset.
Type: Application
Filed: Jun 10, 2015
Publication Date: Oct 1, 2015
Inventor: Fancong Zeng (Kenmore, WA)
Application Number: 14/736,067