HYPERSCALE SERVER ARCHITECTURE

Info

Publication number: 20200057679
Type: Application
Filed: Jan 27, 2017
Publication Date: Feb 20, 2020
Applicant: Kaleao Limited (Cambridge, Cambridgeshire)
Inventors: John GOODACRE (Cambridge), Giampietro TECCHIOLLI (Cambridge)
Application Number: 16/340,073

Abstract

In a switch fabric-based infrastructure a flexible scalable server is obtained by physical disaggregation of converged resources to obtain pools of a plurality of operationally independent resource element types such as storage, computing, networking and more. A plurality of computing facilities can be created either dynamically or statically by resource element managers by composing instances of resources from such pools of a plurality of resource element types expressed across a single disaggregated logical resource plane.

Description

TECHNICAL SECTOR

The present invention relates to a scalable server architecture and a method for implementing a scalable server.

BACKGROUND OF THE INVENTION

Traditionally, a compute node has a processor (or CPU) with a defined processing capability, a local memory allocated to the compute node, and its IO interfaces.

This compute node creates an independently capable server with the compute/memory/networking resources that are generally enough to be able to manage the most complex tasks.

The CPU is the master of the node, with sole ownership of the attached memory and of the I/O operations that provide its interface to the external world. The processor I/O is the link the processor has with the other system resources, such as persistent storage (HDD/SSD) and networking (Ethernet NIC).

This architecture was made popular when the desktop PC became commodity, and hasn't fundamentally changed since, even when adopted into the server market. However, to scale to the computing demands of today's applications, servers must scale beyond the resources that can be supplied by a single compute node.

Existing solutions to these requirements can be summarized as follows.

- More processing performance: Use higher performance CPU, along with multicore processing, and NUMA-processing to create a larger more capable processing node.
- Increase the number of CPU memory channels to local memory, so as to increase the total available memory, and the bandwidth to memory.
- Increase the number and speed of the I/O interfaces to support fast/larger storage and network interfaces.
- Duplicate independent compute nodes and use clustering software (and, more recently, hyper-converged software above a virtualization layer) to try to manage the multiple compute nodes as a single datacentre/cluster.

There are various restrictions on the future applicability of these approaches, and fundamental physical constraints mean that they are reaching the end of their usefulness.

- Faster processors have reached the economic limits of fabrication technology, bringing the end to the benefits of silicon scaling, and the consequential limit to power/density to further increase CPU performance. Likewise the memory capability is limited by physical silicon size (limited by fabrication and thermal issues) and the number of pins to connect to memory, along with the physical distance memory can be placed away from the processor element.
- To scale beyond a multicore, NUMA-processing enables a small number of compute nodes to share their common view of the memory and IO of the multi-socket server. However, to maintain this illusion of unity, significant complexity is required, with any returns through such scaling becoming negligible after 4 to 6 compute nodes.
- Virtualization software balances a system's computing resources, which must be averaged across multiple applications, but at the cost of lost performance and of management that is complex to get right. This is why many cloud providers can show only 10% system utilization on their servers: the balance of resources is not appropriate.

Several solutions have been developed to try to overcome the above limitations of present approaches.

US 2011/0271014 presents a system and a method for scaling memory capacity by identifying a memory page that is accessible via a common physical address, providing direct access to an I/O device by a virtual machine with memory managed using memory disaggregation. In this solution the process is controlled by a single processor, which manages the mapping of physical addresses.

US 2016/0216982 presents a forward fabric platform system to scale IO resources, comprising a plurality of nodes, an interconnect backplane coupled between the plurality of nodes and owned by the CPU, and a Forward Fabric Manager (FFM). The fabric computing system has an embedded software-defined network whose frontend is managed by a security manager that physically resides in a node. In this solution everything is controlled by, and dependent on, the host CPU.

US2012017037 presents a distributed storage system to scale available storage, comprising a plurality of compute nodes executing one or more application processes capable of accessing a persistent shared memory implemented by solid state devices physically maintained on the nodes, with the application processes communicating with a shared data fabric (SDF) to access the memory. In this solution each persistent memory is controlled by a controller on the CPU internal to the node.

US 2014/0122560 presents a flexible scalable server comprising a plurality of tiled compute nodes, each node comprising a plurality of cores formed of a processor and a switching circuitry. The switching circuitry couples the processor to a network among the cores, and the cores implement networking functions within the compute node. In this solution, the inter-node routing is done in software on the compute node, so the inter-node routing is processed by CPUs in the node.

All the above solutions are limited by the need for a CPU in the node that manages the access to the resource elements.

Since the processing element/CPU is the master of the node, interactions between different nodes, and the resources of a node must be controlled and managed by the CPU, creating inefficiencies due to the software processing of I/O transactions, and a limit to the capabilities of any given storage or networking resource. For example, no existing software on a system can manage the bandwidth of a 100 Gb/s Ethernet connection.
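
To give a rough sense of scale for the 100 Gb/s example, the following back-of-the-envelope calculation (not part of the original disclosure) estimates the time that one frame occupies on the wire at line rate. It assumes standard Ethernet framing, namely a minimum 64-byte frame plus 8 bytes of preamble and start delimiter and a 12-byte inter-frame gap; the function name is a hypothetical helper chosen for illustration.

```python
# Illustrative arithmetic only. The framing overheads are the standard
# Ethernet values; per_frame_budget_ns is a hypothetical helper name.

PREAMBLE_AND_SFD_BYTES = 8   # preamble plus start-of-frame delimiter
INTER_FRAME_GAP_BYTES = 12   # minimum idle gap between frames, in byte times


def per_frame_budget_ns(link_bits_per_s: float, frame_bytes: int = 64) -> float:
    """Return the time, in nanoseconds, that one frame occupies on the wire."""
    wire_bytes = frame_bytes + PREAMBLE_AND_SFD_BYTES + INTER_FRAME_GAP_BYTES
    return wire_bytes * 8 / link_bits_per_s * 1e9


if __name__ == "__main__":
    budget = per_frame_budget_ns(100e9)
    print(f"{budget:.2f} ns per minimum-size frame")
    print(f"{1e3 / budget:.1f} million frames per second")
```

Under these assumptions a minimum-size frame arrives roughly every 6.72 ns (about 148.8 million frames per second), which leaves only a few tens of clock cycles per frame on a multi-GHz core and illustrates why a purely software I/O path struggles at this rate.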

In addition, there is no flexibility in the system architecture other than what the CPU enables. For example, if a given processing load needs twice as much IO networking bandwidth for a given compute level, this can only be addressed by a completely different system designed with twice the networking bandwidth interfacing with the processing element. This IO bottleneck is well understood, and affects, for example, GPU accelerators as well as high-speed network interfaces, which today must connect either to an external host CPU through its PCIe IO interface or, within a SoC, to the internal host CPU through its comparable IO interface.

SUMMARY OF THE INVENTION

It is an object of the present invention to propose a scalable server architecture able to overcome the above discussed limits of existing solutions.

According to a first aspect of the present invention, the above and further objects are attained by a compute node comprising a plurality of physical resource elements defined across a physically converged substrate, and a switch fabric used to couple the physical resource elements to each other by using a processor aware addressing scheme to physically disaggregate various types of resource elements so that they form pools of a plurality of operationally independent resource element types expressed within a single plane of disaggregated logical resources. The switch fabric is also bridged to an external IO resource through an instance of a resource element type.

According to another aspect of the present invention, the above and further objects are attained by a scalable server comprising plural compute nodes, each compute node comprising a plurality of physical resource elements defined across a physically converged substrate, and a switch fabric used to couple the physical resource elements to each other by using a processor aware addressing scheme to physically disaggregate resource elements so that they form pools of a plurality of operationally independent resource element types, and wherein the switch fabric is used to couple said compute nodes to each other for extending the physically converged substrates into a global physical converged substrate, wherein said pools of a plurality of operationally independent resource elements of each compute node are expressed together within a single plane of disaggregated logical resources. The switch fabric is also bridged to an external IO resource through an instance of a resource element type.

According to another aspect of the present invention, the above and further objects are attained by a method of implementing a scalable server machine comprising one or more physically converged substrates, across each physical converged substrate being defined a plurality of physical resource elements, and fabric switches for connecting the physical resource elements across the physical converged substrates using processor native addressing, wherein the method comprises: physically disaggregating the physical resource elements; expressing the disaggregated physical resource elements as pools of a plurality of operationally independent logical resource element types within a single plane of disaggregated logical resources; and abstracting a computing facility from said pools of logical resource element types by collecting instances of logical resource elements from said pools of logical resource element types.

The method defined above can adopt and use the most capable processor devices, along with their physical memory interface capability, to implement the processing element. This element only requires the CPU functionality along with its memory interface, plus at least one link to the global resource switch fabric. This permits a system according to the invention to use the best processors in a system that does not need costly and market-limiting integration of the other system resources.

In addition, since each element of the system can be selected and integrated in different configurations, the solution can address any market with a high return on investment.

Furthermore, since resources are locally attached, the highest performance and lowest cost can be achieved through integration and resource locality. However, since each compute node also exposes its share (i.e. everything it can share) to the global resource pool, all resource elements can arbitrate remote access, thus creating disaggregated pools of a resource element type.

Finally, since each element can instantiate its physical interfaces anywhere in the global resource substrate, the capability of any element can be accessed as if that resource were physically attached within the other resource elements. For example, physical storage of a remotely defined resource element can be exposed directly along with the storage of a resource element local to a node. The IO buffers of a device resource element can be placed directly within the memory of any resource element. Such physicalization can remove the physical limitations of attaching a resource to any single processing element, such as pin count, distance, thermal and fabrication constraints.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better comprehension of advantages and features, an embodiment of the invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which:

FIG. 1 shows a block diagram of physical resource elements arranged on a physical converged substrate in a compute node according to the invention;

FIG. 2 shows a block diagram of a set of physical converged substrates arranged in a physical server machine in a scalable server according to the invention;

FIG. 3 shows a block diagram of a plane of disaggregated logical resources according to the invention;

FIG. 4 shows a block diagram of the main steps of a method according to the invention.

DESCRIPTION OF PREFERRED EMBODIMENTS

With reference to FIG. 1, a compute node in a fabric-switched infrastructure is indicated as a whole with 10; it physically consists of a plurality of physical resource elements defined across a physically converged substrate, 20, which is a printed circuit board. The physical resource elements are hardware, reconfigurable hardware or other such computing technology, and in the present specific, but not limiting, embodiment of the invention comprise: processing elements, 21, local memory (DRAM) elements, 22, NV cache (flash) elements, 23, a reconfigurable network fabric element, 24, a reconfigurable storage fabric element, 25, an SSD storage element, 26, and a power management element, 27. A processing element 21, a local memory element 22 and an NV cache element 23 together form a computing unit, 28. The physically converged substrate 20 comprises four computing units 28. The reconfigurable network fabric element 24 is configured to provide the functionalities of a network resource element for the compute node and is adapted to bridge to an external physical network through the network port 29. A switch fabric is composed of one or more fabric switches 30 aggregated or distributed in one or more of the physical resource elements 21, 22, 23, 24, 25, and is adapted to couple the plurality of other physical resource elements to each other by using a processor native addressing or processor aware addressing scheme. In particular, in FIG. 1 the physical connections of the fabric switches 30 through the reconfigurable storage fabric 25 and the reconfigurable network fabric 24, with each processing element 21, and with the network port 29 of the physically converged substrate 20 are shown with arrows. The SSD storage element 26 is coupled to the physically converged substrate 20 through a specific slot for a physical SSD resource.

Alternatively, a switch fabric composed of one or more fabric switches is defined independently within the compute node 10 and interfaces to one or more of the physical resource elements 21, 22, 23, 24, 25, 26. Even with this last physical arrangement, the switch fabric is adapted to couple the plurality of physical resource elements 21, 22, 23, 24, 25, 26 of the compute node 10 to each other by using a processor native addressing or processor aware addressing scheme.

Advantageously, the compute node 10 is designed for providing convergence of processing, memory, storage and networking system resources with a physical balanced ratio of capabilities.

In different embodiments of the invention, more or different physical resource elements can be defined across the physically converged substrate 20, such as accelerators and any other resource element type used within a computing facility, acting either as a master or as a slave to another physical resource element.

In any case, a plurality of resource elements selected among processing, storage, networking, accelerator, memory and any other such element types are defined across a physically converged substrate 20 according to the invention.

According to the present invention, a compute node 10 is the disaggregation of the physical resource elements defined across the physically converged substrate 20, the result of which is a plurality of operationally independent resource element types, each resource element type being composed of a pool of resource elements.
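
As an informal illustration of this disaggregation (a minimal sketch, not the disclosed implementation), the physical resource elements of the FIG. 1 embodiment can be modelled as records and grouped into one pool per element type. The class name, the type labels and the function names are hypothetical; only the reference numerals and the count of four computing units come from the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ResourceElement:
    kind: str                   # hypothetical type label, e.g. "processing"
    numeral: int                # reference numeral used in FIG. 1
    unit: Optional[int] = None  # index of the computing unit 28, if any


def fig1_elements() -> List[ResourceElement]:
    """List the physical resource elements of the FIG. 1 embodiment."""
    elements: List[ResourceElement] = []
    for unit in range(4):  # the substrate 20 carries four computing units 28
        elements.append(ResourceElement("processing", 21, unit))
        elements.append(ResourceElement("memory", 22, unit))
        elements.append(ResourceElement("nv_cache", 23, unit))
    elements.append(ResourceElement("network_fabric", 24))
    elements.append(ResourceElement("storage_fabric", 25))
    elements.append(ResourceElement("ssd_storage", 26))
    elements.append(ResourceElement("power_management", 27))
    return elements


def disaggregate(elements: List[ResourceElement]) -> Dict[str, List[ResourceElement]]:
    """Group elements into one pool per type, ignoring which unit holds them."""
    pools: Dict[str, List[ResourceElement]] = defaultdict(list)
    for element in elements:
        pools[element.kind].append(element)
    return dict(pools)


if __name__ == "__main__":
    for kind, pool in disaggregate(fig1_elements()).items():
        print(f"{kind}: {len(pool)} element(s)")
```

The grouping deliberately discards the computing-unit boundary: once pooled, a processing element is simply one of four interchangeable members of its type.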

With reference to FIG. 2, an embodiment of a scalable server machine according to the invention comprises a set of physically converged substrates 20 mounted in a frame, 40, which also provides an additional fabric switch, 50, for instance an embedded multiport Ethernet switch, that forms part of the switch fabric, along with a system and power management facility, 60.

The additional fabric switch 50 operates as a server mounted network resource element that further extends the switch fabrics of the compute nodes 10 into a single common switch fabric, thereby extending the physically converged substrates 20 into a global physical converged substrate, and bridges it to an external network.

In FIG. 2, four compute nodes 10 are connected to a server mounted network resource element, the additional fabric switch 50, which creates a bridge between the switch fabric exposed by each network resource element 24 for the compute node to create a single common switch fabric between all compute nodes 10, and further creating a bridge to an external network.

In the embodiment of FIG. 2, four identical compute nodes 10 are mounted in the frame 40, but different compute nodes 10 according to the invention can equally be mounted in the same frame.

Alternatively, the additional fabric switch 50, i.e. the server mounted network resource element, is not present and the network resource elements 24 for the compute nodes are connected directly to an external network. Accordingly, the work of the above additional fabric switch 50 is performed directly by the fabric switches 30 of the compute nodes 10.

In any case, thanks to networking resource elements, a single common switch fabric is created which extends the switch fabrics of the compute nodes 10. With reference to FIG. 3, the plurality of pools of resource element types of the various compute nodes 10 can then be expressed in a single disaggregated logical resource plane, 71. With reference to the embodiment of FIG. 2, the single disaggregated logical resource plane 71 expressing the set of compute nodes 10 provides the following resource element types: a processor element type, 72, a memory element type, 73, a persistent storage element type, 74, and a networking element type, 75.

Obviously, where the physical resource elements defined across any physically converged substrate 20 of a scalable server according to the invention contain other types of resources, such as accelerators, the disaggregated logical resource plane, 71, also contains such resource element types.
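
The expression of per-node pools in a single plane can be sketched as a merge keyed by resource element type. This is only an illustration under stated assumptions: the type labels, element names and function names are hypothetical, the four-node layout loosely follows FIG. 2, and the accelerator on one node is an invented example of an additional resource type.

```python
from typing import Dict, List, Tuple

Pools = Dict[str, List[str]]          # element type -> element names on one node
Plane = Dict[str, List[Tuple[int, str]]]  # element type -> (node id, element name)


def build_logical_plane(nodes: Dict[int, Pools]) -> Plane:
    """Merge the pools of every compute node into one pool per element type."""
    plane: Plane = {}
    for node_id in sorted(nodes):
        for kind, elements in nodes[node_id].items():
            plane.setdefault(kind, []).extend((node_id, name) for name in elements)
    return plane


def example_nodes() -> Dict[int, Pools]:
    """Four compute nodes; node 3 additionally carries a hypothetical accelerator."""
    nodes: Dict[int, Pools] = {}
    for node_id in range(4):
        nodes[node_id] = {
            "processor": [f"cpu{i}" for i in range(4)],
            "memory": [f"dram{i}" for i in range(4)],
            "persistent_storage": ["ssd0"],
            "networking": ["net0"],
        }
    nodes[3]["accelerator"] = ["acc0"]
    return nodes


if __name__ == "__main__":
    for kind, pool in build_logical_plane(example_nodes()).items():
        print(f"{kind}: {len(pool)} element(s)")
```

A resource type that exists on only one node still appears as its own pool in the plane, which mirrors the statement that the plane contains whatever element types the substrates provide.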

With reference to FIG. 4, it has to be stressed that each resource element in a pool of resource elements of a resource element type in the disaggregated logical resource plane 71 operates independently of any other resource element, so that a plurality of them can be encapsulated, one or more instances for each resource element type, by a disaggregated resource element manager, 81, to create a computing facility, 82. Accordingly, a computing facility 82 is created by selecting any number of instances of any type of logical resource element from said pools of logical resource element types of the disaggregated logical resource plane. The disaggregation of the physical resource elements and the subsequent encapsulation of instances of logical resource elements performed by the disaggregated resource element manager, 81, are made possible by the processor native or processor aware addressing scheme that is adopted, as opposed to conventional networking aware schemes of physical disaggregation.

More specifically, a computing facility 82 can be created, dynamically or statically, by a) physicalization of resource elements through a common physical address space or b) virtualization of resource elements over any form of abstracted communication or c) any combination thereof.
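
The composition step can be sketched as a manager that selects instances from the pools and records which of the three options was used. This is a minimal sketch under assumptions that the description does not state: selected instances are removed from the free pools (exclusive allocation), a request is expressed as a count per element type, and all class, method and label names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Composition(Enum):
    PHYSICALIZATION = "common physical address space"  # option a)
    VIRTUALIZATION = "abstracted communication"        # option b)
    COMBINED = "combination of both"                   # option c)


@dataclass
class ComputingFacility:
    instances: Dict[str, List[str]]
    composition: Composition


@dataclass
class ResourceElementManager:
    pools: Dict[str, List[str]]  # free instances per logical element type

    def compose(self, request: Dict[str, int], composition: Composition) -> ComputingFacility:
        """Select the requested number of instances of each type from the pools."""
        # Validate the whole request first so a failure leaves the pools untouched.
        for kind, count in request.items():
            if len(self.pools.get(kind, [])) < count:
                raise ValueError(f"not enough free {kind!r} elements")
        taken: Dict[str, List[str]] = {}
        for kind, count in request.items():
            free = self.pools.get(kind, [])
            taken[kind] = free[:count]
            self.pools[kind] = free[count:]  # assumption: allocation is exclusive
        return ComputingFacility(taken, composition)


if __name__ == "__main__":
    manager = ResourceElementManager(
        {"processor": ["p0", "p1"], "memory": ["m0"], "networking": ["n0"]}
    )
    facility = manager.compose(
        {"processor": 1, "networking": 1}, Composition.PHYSICALIZATION
    )
    print(facility.instances, facility.composition.name)
```

Because a request names only types and counts, the same manager can build a facility with no processor at all, for example storage served directly by networking.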

In the logical view of the method of the invention, each logical resource element type (72, 73, 74, 75) becomes a logical pool of resources built internally using traditional processor SoC addressing schemes into a global pool of resources. For example, no single resource element is the master of the compute facility 82, and, as such, networking can serve storage without processor element involvement. It also means the capabilities of each compute node 10 can be independently defined and instantiated without the traditional cost of building a new SoC with different IO resource capabilities.

In the physical view of the method of the invention, each compute facility 82 is created with the convergence of processing, memory, storage and networking system resources, using the physical balanced ratio of capabilities required to create the compute node 10. A single compute facility 82 can therefore include any number of processing elements 21, storage elements 25, 26 or network resource elements 24. No physical resource element (21, 22, 23, 24, 25, 26) can exist independently; each exists only when connected with one or more of the other physical resource element types. The resource elements are arranged using a processor aware addressing scheme of physical disaggregation and therefore also need to interconnect to become a meaningful system.

There is no fixed CPU allocation and no memory dedicated to a given processing unit; instead, a pool of memories distributed across the compute elements can be used by the different processing units, and the different processing units can be connected together to adapt the processing capability to the requirements of the specific tasks. Likewise, the global resource pool addressing scheme allows the physical IO resources placed anywhere in the system to be attached to a processing element as if the resource were physically attached to the local address bus of the processor.
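
One way to picture a global resource address is as a single integer that packs the location of a resource together with an offset inside it. The description does not specify any address layout, so the field widths below (8 bits for the node, 8 bits for the element, 40 bits for the offset) and the function names are purely hypothetical assumptions made for illustration.

```python
from typing import Tuple

# Hypothetical field widths; the description does not define an address format.
NODE_BITS = 8
ELEMENT_BITS = 8
OFFSET_BITS = 40


def encode(node: int, element: int, offset: int) -> int:
    """Pack (node, element, offset) into one global address."""
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("node id out of range")
    if not 0 <= element < (1 << ELEMENT_BITS):
        raise ValueError("element id out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (node << (ELEMENT_BITS + OFFSET_BITS)) | (element << OFFSET_BITS) | offset


def decode(address: int) -> Tuple[int, int, int]:
    """Split a global address back into (node, element, offset)."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    element = (address >> OFFSET_BITS) & ((1 << ELEMENT_BITS) - 1)
    node = address >> (ELEMENT_BITS + OFFSET_BITS)
    return node, element, offset


if __name__ == "__main__":
    address = encode(node=2, element=5, offset=0x1000)
    print(hex(address), decode(address))
```

With such a layout, a processing element on one node could address a buffer on another node with an ordinary load or store address, and the switch fabric could route on the upper bits.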

All logical resources are therefore considered at the same level of importance in the system.

Additionally, it is not necessary to access the CPU to ‘speak’ with the memory or the resources physically associated with a compute facility 82; access is possible directly through the global resource address without management by any other element of the system (assuming the appropriate security and access privileges).

It is understood that what is described above is a pure and non-limiting example; therefore, possible detail variants that could be necessary for technical and/or functional reasons are considered from now on to be within the protective scope defined by the claims below.

Claims

1. A compute node comprising a plurality of physical resource elements defined across a physically converged substrate, and a switch fabric configured to couple the physical resource elements each other by using a processor native addressing or processor aware addressing scheme to physically disaggregate various types of resource elements so that they form pools of a plurality of operationally independent resource element types expressed within a single plane of disaggregated logical resources, the switch fabric being also bridged through an external physical network through networking resource elements.

2. The compute node according to claim 1 characterized in that said switch fabric is composed of one or more fabric switches aggregated or distributed in one or more of said physical resource elements.

3. The compute node according to claim 1 characterized in that said switch fabric is defined independently and interfaces to one or more of said physical resource elements.

4. The compute node according to claim 1 characterized in that said physical resource elements defined across said physically converged substrate comprise at least a processing element, a storage element, a memory element or a network resource element.

5. A scalable server comprising a plurality of compute nodes, each compute node comprising a plurality of physical resource elements defined across a physically converged substrate, and a switch fabric configured to couple the physical resource elements to each other by using a processor native addressing or processor aware addressing scheme to physically disaggregate various types of resource elements so that they form pools of a plurality of operationally independent resource element types, the switch fabrics of each compute node being also bridged through an external or embedded physical network through networking resource elements, a single common switch fabric being created from said switch fabrics adapted to couple said compute nodes each other for extending the physically converged substrates into a global physical converged substrate, wherein said pools of a plurality of operationally independent resource element types of each compute node are expressed together within a single plane of disaggregated logical resources.

6. The scalable server according to claim 5 characterized in that said switch fabric is composed of one or more fabric switches aggregated or distributed in one or more of said physical resource elements.

7. The scalable server according to claim 5 characterized in that said switch fabric is composed of one or more independent fabric switches connecting one or more of said physical resource elements.

8. The scalable server according to claim 5 characterized in that said networking resource elements comprise network resource elements for the compute nodes and an additional fabric switch that operates as a server mounted network resource element to further extend the switch fabrics of the compute nodes in a single common switch fabric.

9. The scalable server according to claim 8 characterized in that a plurality of compute nodes are connected to a server mounted network resource element which creates a bridge between the switch fabrics exposed by each network resource element for the compute node to create a single common switch fabric between all compute nodes, and further creating a bridge to an external physical network.

10. The scalable server according to claim 5 characterized in that said networking resource elements comprise network resource elements for the compute nodes connected directly to an external network, said fabric switches of the compute nodes being adapted to create a bridge between the switch fabrics exposed by each network resource element for the compute node to create a single common switch fabric between all compute nodes, and further to create a bridge to an external physical network.

11. A method of implementing a scalable server comprising one or more physically converged substrates, across each physical converged substrate being defined a plurality of physical resource elements, fabric switches for connecting the physical resource elements each other across the physical converged substrates using a processor native addressing scheme, wherein the method comprises:

physically disaggregating the physical resource elements;

expressing the disaggregated physical resource elements as pools of a plurality of operationally independent logical resource element types within a single plane of disaggregated logical resources; and

abstracting a computing facility from said pools of logical resource element types by selecting instances of logical resource elements from said pools of logical resource element types.

12. The method of implementing a scalable server according to claim 11 characterized in that said physical resource elements defined across said physically converged substrate, comprise at least a processing element, a storage element, a memory element and a network resource element.

13. The method of implementing a scalable server according to claim 11 characterized in that each disaggregated physical resource element in a pool of disaggregated physical resource elements of an operationally independent logical resource element type in the disaggregated logical resource plane operates independently of any other disaggregated physical resource element so that a plurality of them can be encapsulated, one or more instances for each logical resource element type, by a disaggregated resource element manager to create said computing facility.

14. The method of implementing a scalable server according to claim 11 characterized in that said computing facility is created, dynamically or statically, by physicalization of resource elements through a common physical address space.

15. The method of implementing a scalable server according to claim 11 characterized in that said computing facility is created, dynamically or statically, by virtualization of resource elements over any form of abstracted communication.

16. The method of implementing a scalable server according to claim 11 characterized in that said computing facility is created, dynamically or statically, by a combination of a physicalization of resource elements through a common physical address space and a virtualization of resource elements over any form of abstracted communication.