Patents by Inventor Lu Xun

Lu Xun has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20230396682
    Abstract: The disclosure herein describes platform-level migration for deep learning training (DLT) jobs from a checkpointed state between a source node and a destination node. The checkpointing is performed through capturing GPU state (e.g., device state) and CPU state (e.g., host state). The GPU state includes GPU data (e.g., model parameters, optimizer state, etc.) that is located in the GPU and GPU context (e.g., the default stream in GPU, various handles created by libraries). Restoring the DLT job on the destination node involves resumption of processing of a destination GPU at the same checkpointed state.
    Type: Application
    Filed: June 7, 2023
    Publication date: December 7, 2023
    Inventors: Dharma Kiritkumar Shukla, Muthian Sivathanu, Lu Xun, Rimma Vladimirovna Nehme
  • Patent number: 11722573
    Abstract: The disclosure herein describes platform-level migration for deep learning training (DLT) jobs from a checkpointed state between a source node and a destination node. The checkpointing is performed through capturing GPU state (e.g., device state) and CPU state (e.g., host state). The GPU state includes GPU data (e.g., model parameters, optimizer state, etc.) that is located in the GPU and GPU context (e.g., the default stream in GPU, various handles created by libraries). Restoring the DLT job on the destination node involves resumption of processing of a destination GPU at the same checkpointed state.
    Type: Grant
    Filed: June 25, 2021
    Date of Patent: August 8, 2023
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Dharma Kiritkumar Shukla, Muthian Sivathanu, Lu Xun, Rimma Vladimirovna Nehme
  • Publication number: 20220311832
    Abstract: The disclosure herein describes platform-level migration for deep learning training (DLT) jobs from a checkpointed state between a source node and a destination node. The checkpointing is performed through capturing GPU state (e.g., device state) and CPU state (e.g., host state). The GPU state includes GPU data (e.g., model parameters, optimizer state, etc.) that is located in the GPU and GPU context (e.g., the default stream in GPU, various handles created by libraries). Restoring the DLT job on the destination node involves resumption of processing of a destination GPU at the same checkpointed state.
    Type: Application
    Filed: June 25, 2021
    Publication date: September 29, 2022
    Inventors: Dharma Kiritkumar Shukla, Muthian Sivathanu, Lu Xun, Rimma Vladimirovna Nehme
  • Patent number: 11150958
    Abstract: Systems and methods that restore a failed reconfiguration of nodes in a distributed environment. By analyzing reports from read/write quorums of nodes associated with a configuration, automatic recovery for data partitions can be facilitated. Moreover, a configuration manager component tracks current configurations for replication units and determines whether a reconfiguration is required (e.g., due to node failures, node recovery, replica additions/deletions, replica moves, or replica role changes). Reconfigurations of data activated as being replicated from an old configuration to being replicated on a new configuration may be performed in a transactionally consistent manner based on dynamic quorums associated with the new configuration and the old configuration.
    Type: Grant
    Filed: August 22, 2019
    Date of Patent: October 19, 2021
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Santeri Olavi Voutilainen, Gopala Krishna Reddy Kakivaya, Ajay Kalhan, Lu Xun
  • Patent number: 11099747
    Abstract: Techniques are described herein for storing instances of data among nodes of a distributed store based on hierarchical scale unit values. Hierarchical scale unit values are assigned to the respective nodes of the distributed store. A first instance (e.g., a primary instance) of a data module is stored in a first node having a first hierarchical scale unit value. A primary instance of the data module with respect to a data operation is an instance of the data module at which the data operation with respect to the data module is initiated or initially directed. A second instance (e.g., a primary or secondary instance) of the data module is stored in a second node having a second hierarchical scale unit value based on a magnitude of a difference between the first hierarchical scale unit value and the second hierarchical scale unit value. A secondary instance is essentially a “back-up” instance.
    Type: Grant
    Filed: April 23, 2016
    Date of Patent: August 24, 2021
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Muralidhar Krishnaprasad, Radhakrishnan Srikanth, Lu Xun
  • Publication number: 20200287961
    Abstract: In various implementations, methods and systems for resource balancing in a distributed computing environment are provided. A client-defined resource metric is received that represents a resource of nodes of the cloud computing platform. A placement plan for job instances of service applications is generated. The placement plan includes one or more movements that are executable to achieve a target placement of the job instances on the nodes. It is determined that the placement plan complies with placement rules. Each placement rule dictates whether a given job instance of the job instances is suitable for placement on a given node of the nodes. The placement plan is executed based on determining that the target placement of the job instances improves balance of resources across the nodes of the cloud computing platform based on the resource represented by the client-defined resource metric.
    Type: Application
    Filed: February 27, 2020
    Publication date: September 10, 2020
    Inventors: Matthew Snider, Anurag Gupta, Lu Xun, Yang Li, Gopal Kakivaya, Hua-Jun Zeng
  • Patent number: 10623481
    Abstract: In various implementations, methods and systems for resource balancing in a distributed computing environment are provided. A client-defined resource metric is received that represents a resource of nodes of the cloud computing platform. A placement plan for job instances of service applications is generated. The placement plan includes one or more movements that are executable to achieve a target placement of the job instances on the nodes. It is determined that the placement plan complies with placement rules. Each placement rule dictates whether a given job instance of the job instances is suitable for placement on a given node of the nodes. The placement plan is executed based on determining that the target placement of the job instances improves balance of resources across the nodes of the cloud computing platform based on the resource represented by the client-defined resource metric.
    Type: Grant
    Filed: September 11, 2015
    Date of Patent: April 14, 2020
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Matthew Snider, Anurag Gupta, Lu Xun, Yang Li, Gopal Kakivaya, Hua-Jun Zeng
  • Patent number: 10609137
    Abstract: Examples of the disclosure provide a global timestamp in a distributed system. A plurality of nodes are connected via a network. A logical clock is implemented on a node of the plurality of nodes. The logical clock sets a global time for the network, such that another node of the plurality of nodes labels a local event with a global timestamp interval based on the global time provided by the logical clock.
    Type: Grant
    Filed: December 8, 2015
    Date of Patent: March 31, 2020
    Assignee: Microsoft Technology Licensing, LLC
    Inventor: Lu Xun
  • Publication number: 20200050495
    Abstract: Systems and methods that restore a failed reconfiguration of nodes in a distributed environment. By analyzing reports from read/write quorums of nodes associated with a configuration, automatic recovery for data partitions can be facilitated. Moreover, a configuration manager component tracks current configurations for replication units and determines whether a reconfiguration is required (e.g., due to node failures, node recovery, replica additions/deletions, replica moves, or replica role changes). Reconfigurations of data activated as being replicated from an old configuration to being replicated on a new configuration may be performed in a transactionally consistent manner based on dynamic quorums associated with the new configuration and the old configuration.
    Type: Application
    Filed: August 22, 2019
    Publication date: February 13, 2020
    Inventors: Santeri Olavi Voutilainen, Gopala Krishna Reddy Kakivaya, Ajay Kalhan, Lu Xun
  • Patent number: 10423460
    Abstract: Systems and methods that restore failed reconfiguration of nodes in distributed systems. By analyzing reports from read/write quorums of nodes associated with a configuration, automatic recovery for data partitions can be facilitated. Moreover, a configuration manager component tracks current configurations for replication units and determines whether a reconfiguration is to be performed (e.g., due to node failures, node recovery, replica additions/deletions, replica moves, or replica role changes). Reconfigurations of data activated as being replicated from a first configuration to a second configuration may be performed in a transactionally consistent manner based on dynamic quorums associated with the second configuration and the first configuration.
    Type: Grant
    Filed: January 7, 2017
    Date of Patent: September 24, 2019
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Santeri Olavi Voutilainen, Gopala Krishna Reddy Kakivaya, Ajay Kalhan, Lu Xun
  • Patent number: 10069941
    Abstract: Embodiments are directed to a cluster of nodes configured to run one or more services. A gateway on each node is configured to provide communication between the node and clients. The gateways manage filters designated by the clients, wherein the filters are named based upon a naming format used to name the services. The gateways receive configuration updates broadcast by a failover manager. The configuration updates comprise service endpoint changes for services running on the nodes. The gateways provide notifications to clients when a filter name matches at least a portion of a service name in a configuration update. The configuration updates identify a configuration version, and the gateways may provide notifications with endpoint information from more than one configuration update version. Notifications may be configured to be sent only when the endpoint for a primary replica of a service has changed.
    Type: Grant
    Filed: September 23, 2015
    Date of Patent: September 4, 2018
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Alex Wun, Vipul A. Modi, Lu Xun, Mansoor Mohsin
  • Patent number: 9832104
    Abstract: Architecture that provides reliable communications of broadcast data (e.g., a message) in a collection of nodes. Each node in the collection is assigned a range of identifiers in a token. The union of the tokens for all nodes is the entire identifier range space. Each node that receives a reliable broadcast message from an originator node acknowledges receipt of the message using its token. One or more intermediate nodes forward the message from the originator node to other nodes with which the originator node has no direct communications (multi-level node structure). The indirect nodes each send acknowledgements back to the parent nodes (which can be an intermediate node) which combine the tokens to ensure the entire range space for the associated assigned token range is covered. The originator node ultimately receives tokens to compute if all nodes have received the message.
    Type: Grant
    Filed: February 11, 2010
    Date of Patent: November 28, 2017
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Rishi Rakesh Sinha, Lu Xun, Gopala Krishna Reddy Kakivaya
  • Patent number: 9806951
    Abstract: Determining if a new federation should be formed. A method includes a voter node determining that it has sufficient votes to form a new federation. The method further includes determining that no node existing in a previous federation already has a quorum of valid global tickets. Each global ticket indicates that a node agrees that a federation should continue to exist. As a result, the voter node forms a new federation.
    Type: Grant
    Filed: January 18, 2013
    Date of Patent: October 31, 2017
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Lu Xun, Alex Wun, Gopala Krishna R. Kakivaya, Rishi Rakesh Sinha
  • Patent number: 9727394
    Abstract: Techniques are described herein that are capable of establishing causality order of computer trace records. A computer trace record is information that indicates an event that occurs with regard to execution of a computer program. For instance, machines in a distributed computer system may generate such computer trace records as the machines execute the computer program. The computer trace records may be ordered in an aggregated trace to accurately reflect the causality order of the computer trace records. A causality order of computer trace records is a temporal sequence of the computer trace records in which each cause event is indicated to occur before each effect event that is caused by the cause event. A cause event is an event that causes an effect event. An effect event is an event that is caused by a cause event.
    Type: Grant
    Filed: April 27, 2015
    Date of Patent: August 8, 2017
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Lu Xun, Gopala Krishna R. Kakivaya, Mihail Gavril Tarta
  • Publication number: 20170201447
    Abstract: Maintaining ring consistency when a node leaves a ring includes a first adjacent node to a leaving node receiving an indication, from the leaving node, indicating its intent to leave the ring. The first adjacent node sends a first indication to a second adjacent node indicating i) acceptance of the leaving node's intent to leave id-space ownership for a portion of the id-space between the leaving node and the first adjacent node, and ii) establishment of a one-way monitoring relationship between the first adjacent node and the second adjacent node. The first adjacent node receives a second indication, from the second adjacent node, indicating i) acceptance of the first adjacent node's intent to assume id-space ownership for the portion of the id-space between the leaving node and the first adjacent node, and ii) establishment of the one-way monitoring relationship between the second adjacent node and the first adjacent node.
    Type: Application
    Filed: March 24, 2017
    Publication date: July 13, 2017
    Inventors: Richard L. Hasha, Lu Xun, Gopala Krishna R. Kakivaya, Dahlia Malkhi
  • Publication number: 20170132047
    Abstract: Systems and methods that restore a failed reconfiguration of nodes in a distributed environment. By analyzing reports from read/write quorums of nodes associated with a configuration, automatic recovery for data partitions can be facilitated. Moreover, a configuration manager component tracks current configurations for replication units and determines whether a reconfiguration is required (e.g., due to node failures, node recovery, replica additions/deletions, replica moves, or replica role changes). Reconfigurations of data activated as being replicated from an old configuration to being replicated on a new configuration may be performed in a transactionally consistent manner based on dynamic quorums associated with the new configuration and the old configuration.
    Type: Application
    Filed: January 7, 2017
    Publication date: May 11, 2017
    Inventors: Santeri Olavi Voutilainen, Gopala Krishna Reddy Kakivaya, Ajay Kalhan, Lu Xun
  • Patent number: 9647917
    Abstract: The present invention extends to methods, systems, and computer program products for a joining node to join a ring of nodes within a rendezvous federation. Embodiments of the invention include establishing a neighborhood of multiple nodes on the ring of nodes. The joining node indicates its intent to take id-space ownership for a portion of the id-space between the joining node and a selected immediately adjacent node. The joining node initiates a one-way monitoring relationship with the selected immediately adjacent node. The joining node receives an indication from the selected immediately adjacent node that indicates acceptance of the joining node's intent to take id-space ownership for a portion of the id-space between the joining node and the selected node and indicates establishment of a one-way monitoring relationship between the selected node and the joining node. The joining node agrees to participate in a one-way monitoring relationship with the selected node.
    Type: Grant
    Filed: April 8, 2015
    Date of Patent: May 9, 2017
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Richard L. Hasha, Lu Xun, Gopala Krishna R. Kakivaya, Dahlia Malkhi
  • Publication number: 20170115978
    Abstract: Examples of the disclosure provide for monitoring upgrades using health information. An upgrade domain includes a set of one or more nodes from a cluster of nodes. As the upgrade domain is upgraded, the health of the upgrade domain and applications hosted by nodes of the upgrade domain is monitored. Health information is received from the applications and the nodes of the upgrade domain, and is evaluated against health policies at a health check to determine if the upgrade is successful.
    Type: Application
    Filed: October 26, 2015
    Publication date: April 27, 2017
    Inventors: Vipul A. Modi, Chacko P. Daniel, Oana G. Platon, Daniel J. Mastrian, Jr., Todd F. Pfleiger, Alex Wun, Lu Xun
  • Publication number: 20170063985
    Abstract: Examples of the disclosure provide a global timestamp in a distributed system. A plurality of nodes are connected via a network. A logical clock is implemented on a node of the plurality of nodes. The logical clock sets a global time for the network, such that another node of the plurality of nodes labels a local event with a global timestamp interval based on the global time provided by the logical clock.
    Type: Application
    Filed: December 8, 2015
    Publication date: March 2, 2017
    Inventor: Lu Xun
  • Patent number: 9542465
    Abstract: Systems and methods that restore a failed reconfiguration of nodes in a distributed environment. By analyzing reports from read/write quorums of nodes associated with a configuration, automatic recovery for data partitions can be facilitated. Moreover, a configuration manager component tracks current configurations for replication units and determines whether a reconfiguration is required (e.g., due to node failures, node recovery, replica additions/deletions, replica moves, or replica role changes). Reconfigurations of data activated as being replicated from an old configuration to being replicated on a new configuration may be performed in a transactionally consistent manner based on dynamic quorums associated with the new configuration and the old configuration.
    Type: Grant
    Filed: April 12, 2013
    Date of Patent: January 10, 2017
    Assignee: Microsoft Technology Licensing, LLC
    Inventors: Santeri Olavi Voutilainen, Gopala Krishna Reddy Kakivaya, Ajay Kalhan, Lu Xun
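
Illustrative note on the reconfiguration abstracts above (e.g., Patent 9542465): the abstracts say a reconfiguration may be made transactionally consistent "based on dynamic quorums associated with the new configuration and the old configuration" but do not spell out the quorum rule. The Python sketch below shows one common way such a check could look, under the assumption (not taken from the patents) that a write counts as committed during a reconfiguration only when a simple majority of replicas in both the old and the new configuration have acknowledged it. The function names and the majority rule are hypothetical and are not the patented method.

```python
from typing import Set


def has_majority(acks: Set[str], replicas: Set[str]) -> bool:
    """Return True when more than half of `replicas` appear in `acks`.

    Assumption: a quorum is a simple majority; the abstracts do not say this.
    """
    return len(acks & replicas) > len(replicas) // 2


def committed_during_reconfiguration(
    acks: Set[str], old_config: Set[str], new_config: Set[str]
) -> bool:
    """Hypothetical rule: require a majority in BOTH configurations."""
    return has_majority(acks, old_config) and has_majority(acks, new_config)


# Replica "a" is being replaced by replica "d".
old_config = {"a", "b", "c"}
new_config = {"b", "c", "d"}

# "b" and "c" form a majority of both configurations.
both_ok = committed_during_reconfiguration({"b", "c"}, old_config, new_config)

# "a" and "b" are a majority of the old configuration only.
old_only = committed_during_reconfiguration({"a", "b"}, old_config, new_config)
```

Requiring overlap with both configurations is what would keep a write from being acknowledged by replicas that only one of the two configurations recognizes; other quorum definitions are equally possible.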