Shared memory in high performance computing (HPC) messaging systems
A first process executing on a first HPC compute node of an HPC cluster sends a write command, including a global identifier for a memory window that is globally accessible to processes executing in the HPC cluster, to an HPC memory node of the HPC cluster instructing the HPC memory node to write first data to the memory window allocated at the HPC memory node. The first data is sent by the first process to a memory window allocated on the HPC memory node. The first data is written by the HPC memory node to the memory window that includes randomly accessed, addressable memory locations.
A high performance computing (HPC) cluster can be described in general terms as a collection of compute nodes that respectively include one or more local processors and local memory, and are interconnected by a dedicated high-bandwidth low-latency network. HPC clusters aggregate the computational power of multiple compute nodes to perform large-scale workloads. HPC clusters provide flexibility and scalability of HPC resources so that computing power can be well matched to current and evolving workload needs. HPC clusters can be flexibly configured to handle task parallelization, data distribution, parallel execution, cluster monitoring and control, and may combine the output of parallelized computations. Applications can execute on an HPC cluster in a local or distributed manner, such as on a single HPC compute node or on multiple HPC compute nodes.
However, large HPC clusters, including HPC clusters operating as supercomputers, typically are not single shared-memory systems, and thus, architectures based on shared memory parallel execution might not be particularly suited for execution on HPC clusters, which typically have distributed memory at each HPC compute node.
Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures.
Corresponding numerals and symbols in the different figures generally refer to corresponding parts unless otherwise indicated. The figures are drawn to clearly illustrate the relevant aspects of the disclosure and are not necessarily drawn to scale.
The following disclosure provides many different examples for implementing different features. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting.
Throughout this disclosure, a hyphenated form of a reference numeral refers to a specific instance of an element and the un-hyphenated form of the reference numeral refers to the element generically or collectively. Thus, as an example (not shown in the drawings), device “12-1” refers to an instance of a device class, which may be referred to collectively as devices “12” and any one of which may be referred to generically as a device “12.” In the figures and the description, like numerals are intended to represent like elements.
In HPC, many use cases exist for a first process to stream a very large data set, while a second process takes some action on the results of the stream processing (see also
HPC relates to the field of aggregating computing power to provide significantly higher performance than an individual computer can provide. As just one example, HPC may be beneficial when workloads exceed the capacity of individual computer systems and users are faced with resource constraints for execution of such workloads, such as time, processor capacity, workload complexity, and management of computing infrastructure.
HPC includes systems that process data and execute calculations at a rate far exceeding other computers. The aggregate computing power of HPC serves different science, business, and engineering organizations to solve large problems that would otherwise be unapproachable or intractable on smaller computing systems. For example, HPC systems can perform quadrillions of calculations per second, while average computers can perform billions of calculations per second, which is about 10,000 times slower in some examples.
HPC has important applications for many types of workloads due to the speed and efficiency of computation provided. The application areas of HPC include scientific research, product design and development, data analysis, personalized medicine, drug discovery, molecular modeling, among other research and engineering domains, even expanding to industrial processes, disaster response, and atmospheric research.
One implementation of HPC is a supercomputer, which can be generalized as a large computer that is made up of many individual subsidiary computing elements, such as computer systems, processors, and memory. Supercomputers concentrate the resources of multiple computer systems working in parallel to execute workloads involving massively complex, data-intensive processing.
A disadvantage of early supercomputers was the initial high cost and resource complexity involved with developing customized platforms for aggregating computing power. Such customized platforms were also not typically flexible or expandable to changing workload demands and a rapidly evolving technology sector. In more recent designs, a cluster architecture may be used for HPC. However, large HPC clusters operating as supercomputers typically are not single shared-memory systems, and thus, software architectures based on shared memory parallel execution might not be particularly suited for execution on some HPC clusters, which have distributed memory at each HPC compute node. Instead, software architectures based on message passing have become a preferred programming choice for HPC clusters and are used for parallel programming of HPC clusters having distributed memory at HPC compute nodes. Message passing refers to exchanging messages between applications, processes, data objects, subroutines, or other objects and may result in the starting or stopping of processes on various HPC nodes.
Executing workloads in parallel, either with or without concurrency for related computations, may be referred to as distributed computing or grid computing, in which individual tasks may be disparate in nature, leading to the use of interprocess communication for execution and workload management. Some aspects of programming parallel tasks in HPC clusters include balancing workloads among compute nodes, as well as forking and joining different computational threads or processes, to execute in a parallel manner with concurrency among HPC compute nodes, to gain performance.
Message passing is a form of interprocess communication that may be used for communication among applications and their processes executing in a parallel concurrent manner among the nodes of an HPC cluster. In a parallel and concurrent execution of a workload on an HPC cluster, each HPC node may perform tasks related to a portion of the overall workload, while message passing can be used to synchronize tasks at each HPC node, exchange data between HPC nodes, and coordinate and control the overall workload to proper completion.
Message passing can be thought of as a method for executing code using an object model for data and processes. Instead of the conventional execution method of calling the name of a program for execution, message passing involves sending a message to a receiver object and having the receiver object execute code based on the content of the message. In this manner message passing combines features of encapsulation and distribution that support parallel concurrent execution of workloads among HPC compute nodes in an HPC cluster.
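The receiver-side dispatch described above can be illustrated with a minimal Python sketch. It is not taken from any particular message passing system: the `Receiver` class, its message kinds ("sum", "scale", "stop"), and `run_demo` are hypothetical names chosen only for illustration. The sender never calls a handler by name; it places a message in the receiver's inbox, and the receiver decides which code to run based on the message content.

```python
import queue
import threading


class Receiver:
    """Hypothetical receiver object that selects code to run from message content."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.results = []
        # Handlers are internal to the receiver; senders only know message kinds.
        self._handlers = {"sum": self._handle_sum, "scale": self._handle_scale}

    def _handle_sum(self, payload):
        return sum(payload)

    def _handle_scale(self, payload):
        factor, values = payload
        return [factor * v for v in values]

    def serve(self):
        # Process messages in arrival order until a "stop" message is received.
        while True:
            kind, payload = self.inbox.get()
            if kind == "stop":
                break
            self.results.append(self._handlers[kind](payload))


def run_demo():
    receiver = Receiver()
    worker = threading.Thread(target=receiver.serve)
    worker.start()
    # The sender only sends messages; it does not invoke handlers directly.
    receiver.inbox.put(("sum", [1, 2, 3]))
    receiver.inbox.put(("scale", (2, [1, 2])))
    receiver.inbox.put(("stop", None))
    worker.join()
    return receiver.results
```

In this sketch the handler table plays the role of the encapsulated service: a sender needs to know only the message format, not how the receiver implements each service.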
Encapsulation is an attribute arising from object-oriented programming that allows software objects to invoke the execution of services at other objects without having to know internal details of the services, but rather, allows the executing object to determine the suitable service based on attributes of the message received. The executing object for the service can encapsulate the details and implementations of the service, and can include logic or decision-making ability related to execution of the service that the executing object can independently perform.
Distributed message passing allows services to be run on various HPC nodes of an HPC cluster without having to program the distribution of the services among the HPC nodes. Instead, the message passing system itself can handle identifying distributed processes, queueing distributed processes, returning distributed process results, and attending to distributed process transactional issues. Furthermore, message passing systems can support local objects or distributed objects, while the distributed objects can execute on disparate environments in different contexts.
Example message passing systems include local interprocess communication (LPC), remote procedure call (RPC), remote method invocation (RMI), common object request broker architecture (CORBA), component object model (COM), distributed component object model (DCOM), data distribution service (DDS), Microsoft® message queuing (MSMQ), Microsoft Windows Mailslots, QNX operating system, Microsoft® .NET, Emerald programming language, simple object access protocol (SOAP), Convergent Technologies operating system (CTOS), OpenBinder, and D-Bus, among others.
Certain message passing code libraries have been developed as toolkits for developers to implement message passing for parallel concurrent execution of workloads on HPC clusters, with a particular example being the message passing interface (MPI), which may be used for HPC clusters of any size, including very large HPC clusters having thousands of HPC nodes. In such large scale HPC clusters, message passing in the distributed memory environment using MPI may provide near linear performance scaling in some applications. However, the lack of a shared memory in a distributed memory message passing system, such as MPI, can be a disadvantage for applications that would otherwise benefit from shared memory.
Furthermore, a dedicated high-bandwidth low-latency network used to interconnect nodes in the HPC cluster, also referred to as “fabric,” may provide shared or common memory among HPC compute nodes. Specifically, fabric-attached memory (FAM) may provide an interface to HPC memory nodes, providing very high speed storage that can be used as byte-addressable memory to augment local memory of processors at HPC compute nodes. FAM may allow applications running on HPC clusters to process huge data sets that exceed the capabilities of local memory at any HPC compute node. FAM can utilize persistent storage at HPC memory nodes to provide shared random access memory (RAM) at speeds suitable for large workloads on HPC clusters, but may not fit into existing software paradigms for memory access. That is, existing application programming interfaces (APIs) might provide inadequate support for FAM access by applications. Some APIs assume that memory is either persistent storage or local RAM for computations by processors. Other APIs allow applications to access FAM, such as the OpenFAM API specification for programming FAM.
As will be described in further detail, a method and solution for shared high speed memory in HPC messaging systems is disclosed. Certain implementations can be used to access shared memory at an HPC compute node using a global identifier that is accessible to processes executing on the HPC cluster. Certain implementations support simplified parallel concurrent execution of two or more applications that access a shared high speed memory that is byte-addressable and can be randomly accessed. Certain implementations support using FAM-based memory resources. Certain implementations support multi-tiered distributed applications that can pipeline large volumes of data in a computationally tractable manner. Certain implementations provide a high speed low latency memory that is accessible over a high speed network used for interconnecting HPC nodes in an HPC cluster. Certain implementations provide a byte-addressable shared memory space on HPC memory nodes that is not limited by the memory constraints of HPC compute nodes in an HPC cluster. Certain implementations allow applications and processes to access a global shared memory on HPC memory nodes using a remote memory access protocol that is a one-sided protocol. Certain implementations support simplified application access to a shared memory with minimal or no adaptation of application code.
Referring now to the drawings,
As shown,
Although shown uniformly in architecture 100, each instance of application 106 and process 110 can represent different executable code that can be loaded into memory of a computer system for execution. In some implementations, multiple applications 106 can execute the same executable code in parallel, or can execute related executable code in parallel, such as portions of a single distributed application 106. Similarly, multiple processes 110 can execute the same executable code in parallel, or can execute related executable code in parallel. In certain implementations, at least some portions of application 106 and process 110 can represent functionality implemented by logic circuits, such as in an integrated circuit (IC), a field-programmable gate array (FPGA), or other type of electronic circuit.
In
As shown in
In operation of an example implementation, one or multiple processes 110 may join or be included in a communication group that has access to memory window 103. The processes 110 in the communication group can be associated with application 106-1 or with multiple or different applications 106. In some implementations, the processes 110 in the communication group can be executed on different compute nodes executing in an HPC cluster (see, e.g., compute nodes 402 executing in HPC cluster 400 in
In
In
In data processing system 200 as shown in
In data processing system 200 shown in
Thus, as shown in data processing system 200 of
As described above, data processing system 200 is an example of using a memory window (e.g., memory window 103 of
Referring now to
In FAM API protocol 300 shown in
At the beginning (or top) of FAM API protocol 300, certain conditions may be assumed, namely that process 301-21 and 301-22 are already part of a communication group that can access memory window 303, but that process 301-11 is not yet part of the communication group. Further, it can be assumed that processes 301-11, 301-21, 301-22 can pass messages to each other outside of FAM API protocol 300, for example. In particular implementations of FAM API protocol 300, process 301-11 may be executed by a first application at a first HPC compute node, while processes 301-21, 301-22 may be executed by a second application at a second HPC compute node. Furthermore, it can be assumed that memory window 303 has not yet been created for reading or writing data at the start of FAM API protocol 300 (e.g., prior to sending a message 304). In some implementations, certain messages described below with respect to FAM API protocol 300 can be a function call, a response to a function call, or can include a parameter associated with a function call, such as a FAM function call.
In
In FAM API protocol 300 shown in
In particular implementations of FAM API protocol 300, the message passing system used is MPI, while the FAM API protocol 300 supports the OpenFAM standard protocol. OpenFAM is an API facilitating access to FAM, modelled closely around interfaces provided by one-sided partitioned global address space (PGAS) libraries such as OpenSHMEM, with additional interfaces for managing FAM data beyond the lifetime of a single program. The standard implementation of OpenFAM can provide access to FAM through memory nodes (e.g., a memory server node), accessible to compute nodes via remote memory access (RMA) using a Libfabric library. Libfabric, also known as Open Fabrics Interfaces (OFI), defines a communication API for high-performance parallel and distributed applications. Libfabric is a low-level communication library that abstracts diverse networking technologies. Libfabric is developed by the OFI Working Group (OFIWG), which is a subgroup of the OpenFabrics Alliance (OFA). OpenFAM and Libfabric can be used with various fabric interconnection standards or technologies, such as Slingshot or InfiniBand. OpenFAM enforces security on memory window 303 based on a user identifier (UID) and a group identifier (GID) of processes, such as processes 301.
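As a rough illustration of access control keyed on a UID and a GID, the following Python sketch applies Unix-style permission bits to a memory window. The description states only that security is based on the UID and GID of processes; the owner/group/other mode-bit scheme, the `WindowPolicy` class, and the `may_access` function are assumptions made for this example and are not taken from the OpenFAM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowPolicy:
    """Hypothetical access policy attached to a memory window."""
    owner_uid: int
    owner_gid: int
    mode: int  # Unix-style permission bits, e.g. 0o640 (assumed scheme)


def may_access(policy, uid, gid, want_write):
    """Return True if a process with (uid, gid) may read or write the window."""
    if uid == policy.owner_uid:
        shift = 6  # owner bits
    elif gid == policy.owner_gid:
        shift = 3  # group bits
    else:
        shift = 0  # other bits
    bits = (policy.mode >> shift) & 0o7
    needed = 0o2 if want_write else 0o4
    return (bits & needed) != 0


# Example: owner may read and write, group members may only read.
policy = WindowPolicy(owner_uid=1000, owner_gid=100, mode=0o640)
owner_can_write = may_access(policy, uid=1000, gid=100, want_write=True)
group_can_write = may_access(policy, uid=2000, gid=100, want_write=False)
```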
Furthermore, certain concepts and constructs from MPI may be incorporated into FAM API protocol 300. A communication group of processes 301 that support MPI can collectively create memory window 303, which can be an MPI window that is remotely accessible memory, to which all processes 301 that are members of the communication group can read from/write to without explicitly synchronizing with a process 301. In this regard, FAM API protocol 300 is an example of a one-sided protocol.
Applications 106 or processes 301 that communicate with each other can form or join an MPI communication group, whose members have access to globally shared memory window 303 having a pre-defined window name, also referred to as a window handle or a handle. In some implementations, a modified MPI interface called MPI_Win_allocate_named(<window name>, . . . ) (see step 305) is used that accepts the name of memory window 303 as an input parameter. MPI_Win_allocate_named( ) can instantiate memory window 303 as a named global shared window (e.g., global asymmetric memory), residing on FAM 102 and accessible uniformly to all processes 301 within the MPI communication group. Further, MPI_Win_allocate_named( ) can return a corresponding MPI_Win( ) handle to the allocated memory.
In particular implementations, a call to MPI_Win_allocate_named( ) can result in a call to fam_allocate( ) that is part of the OpenFAM API. The MPI_Win( ) handle can be modified to contain the FAM descriptor (e.g., the window name) returned by fam_allocate( ) that indicates whether or not the corresponding MPI Window is a global asymmetric memory (i.e., residing on FAM 102) or a conventional MPI window, to differentiate between conventional MPI Windows created by existing MPI interfaces such as MPI_Win_allocate( ). An additional customized interface MPI_Win_lookup(<window name>) may be used to locate an existing global shared memory window 303 for communication. MPI_Win_lookup( ) can call fam_lookup( ) that is part of the OpenFAM API to locate memory window 303 in FAM 102 and return the corresponding handle MPI_Win( ). Furthermore, modified MPI interfaces such as MPI_Get and MPI_Put can be used to detect if the target window is a global asymmetric memory (e.g., memory window 303) using the FAM descriptor in the MPI_Win( ) handle. If the target window is memory window 303, the calls to MPI_Get and MPI_Put can be rerouted, such as by FAM manager 302, to corresponding interfaces like fam_get and fam_put that are part of the OpenFAM API.
FAM manager 302 may ignore an existing target_rank parameter in MPI_Get and MPI_Put when global asymmetric memory window 303 is configured for use, because processes 301 are predefined within the communication group having access to memory window 303 by using non-standard customized MPI_Win_allocate_named or MPI_Win_lookup. Thereby, the MPI_Win handle returned by the new interfaces can be seamlessly used by conventional MPI interfaces such as MPI_Get and MPI_Put to share data across processes 301 and applications 106.
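The rerouting described above can be sketched in Python as follows. This is a simplified model rather than real MPI or OpenFAM code: `WinHandle`, `put`, `fam_put_stub`, and `conventional_put_stub` are hypothetical stand-ins, and the `calls` list exists only to record which path handled each operation. The sketch shows a put operation inspecting the handle for a FAM descriptor, routing to the FAM path when one is present, and ignoring the target rank in that case.

```python
from dataclasses import dataclass
from typing import Optional

calls = []  # records which backend handled each operation (for illustration)


@dataclass
class WinHandle:
    """Hypothetical window handle; a set descriptor marks a FAM-resident window."""
    fam_descriptor: Optional[str] = None


def fam_put_stub(descriptor, offset, data):
    # Stand-in for a FAM write; a real system would call into the FAM API here.
    calls.append(("fam_put", descriptor, offset, len(data)))


def conventional_put_stub(target_rank, offset, data):
    # Stand-in for a conventional rank-targeted one-sided write.
    calls.append(("rank_put", target_rank, offset, len(data)))


def put(win, data, target_rank, offset):
    """Route a put based on whether the handle carries a FAM descriptor."""
    if win.fam_descriptor is not None:
        # target_rank is deliberately ignored for a FAM-resident window.
        fam_put_stub(win.fam_descriptor, offset, data)
    else:
        conventional_put_stub(target_rank, offset, data)


put(WinHandle(fam_descriptor="results"), b"abcd", target_rank=3, offset=0)
put(WinHandle(), b"xy", target_rank=3, offset=8)
```

Keeping the branch inside `put` means calling code is the same for both kinds of window, which mirrors the stated goal of requiring minimal or no adaptation of application code.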
In a particular implementation, a writer process can call MPI_Win_allocate_named( ) with a specific window name. As a result, the specific window name is sent to an HPC memory node that hosts FAM, which then allocates physical memory in FAM to instantiate the memory window, and returns MPI_WIN to the writer process. Then, MPI_WIN may store access information for accessing the allocated window in FAM. Next, to write data to the memory window, the writer process can call MPI_Put( ). Since MPI_WIN has the access information, the data is written to the memory window. Furthermore, a reader process (different from the writer process and in a different MPI group) may first call MPI_Win_lookup( ) with the window name to get the access information about the memory window. This results in returning MPI_WIN to the reader process with the access information for the memory window in FAM. Then the reader process can call MPI_Get( ) to read data from the memory window. Since the memory window includes information for FAM, respective FAM APIs are called in both the writer process case and the reader process case.
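The writer and reader sequence above can be modeled with a short, self-contained Python sketch. The `MemoryNode` class and its methods are hypothetical stand-ins for the memory node and for the allocate, lookup, put, and get interfaces named in the description; they are not bindings to MPI or OpenFAM, and the window name doubles as the handle purely to keep the example small.

```python
class MemoryNode:
    """Hypothetical in-process stand-in for an HPC memory node hosting named windows."""

    def __init__(self):
        self._windows = {}

    def allocate_named(self, name, size):
        # Writer side: instantiate a named, byte-addressable window.
        if name in self._windows:
            raise ValueError(f"window {name!r} already exists")
        self._windows[name] = bytearray(size)
        return name  # the name doubles as the handle in this sketch

    def lookup(self, name):
        # Reader side: locate an existing window by name.
        if name not in self._windows:
            raise KeyError(name)
        return name

    def put(self, handle, offset, data):
        window = self._windows[handle]
        if offset < 0 or offset + len(data) > len(window):
            raise IndexError("write outside window")
        window[offset:offset + len(data)] = data

    def get(self, handle, offset, length):
        window = self._windows[handle]
        if offset < 0 or offset + length > len(window):
            raise IndexError("read outside window")
        return bytes(window[offset:offset + length])


node = MemoryNode()

# Writer process: allocate the named window, then write at an arbitrary offset.
writer_handle = node.allocate_named("results", 16)
node.put(writer_handle, 4, b"abcd")

# Reader process: look the window up by name, then read the same bytes back.
reader_handle = node.lookup("results")
data = node.get(reader_handle, 4, 4)
```

The reader never coordinates with the writer directly; both interact only with the memory node, which is the one-sided pattern the description relies on.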
In various implementations, the performance of FAM for read and, in particular, for write operations can represent a substantial improvement in comparison to read and write operations using file I/O. For example, considering write block sizes in gigabytes (GB), such as 4, 8, 16, 32, and 64 GB, it has been experimentally observed that write operation timings can be substantially faster using FAM than using file I/O, while the disparity to file I/O can become greater as the write block size increases. Accordingly, FAM can provide suitable performance for use as remote memory for applications and processes executing workloads in HPC clusters, as described herein.
HPC cluster 400 can be described in general terms as a collection of computing nodes 402 that respectively include a local processor and local memory, and are interconnected by a dedicated high-bandwidth low-latency network, shown as high-speed local network 422 in
As shown in
As shown in
As shown in
In
Also in
In
In compute node 500, I/O subsystem 540 may include a system, device, or apparatus generally operable to receive and transmit data to or from or internally within compute node 500. In different implementations, I/O subsystem 540 may be used to support various peripheral devices, such as a touch panel, a display adapter, a keyboard, a touch pad, or a camera, among other examples. I/O subsystem 540 may represent, for example, a variety of communication interfaces, graphics interfaces, video interfaces, user input interfaces, and peripheral interfaces. For example, I/O subsystem 540 may support various output or display devices, such as a screen, a monitor, a general display device, a liquid crystal display (LCD), a plasma display, a touchscreen, a projector, a printer, an external storage device, or another output device. In some instances, I/O subsystem 540 can support multimodal systems that allow a user to provide multiple types of I/O to communicate with compute node 500.
In
Further, in
As shown in
At least certain portions of compute node 500 may be implemented in circuitry. For example, the components of compute node 500 can include electronic circuits or other electronic hardware, which can include a programmable electronic circuit, a microprocessor, a graphics processing unit (GPU), a digital signal processor (DSP), a central processing unit (CPU), along with other suitable electronic circuits. Certain functionality incorporated into compute node 500 may be provided using executable code that is accessible to an electronic circuit, as described above, including computer software, firmware, program code, or various combinations thereof, to perform the methods and operations described herein. When specified, non-transitory media expressly exclude transitory media such as energy, carrier signals, light beams, and electromagnetic waves.
Method 600 begins at step 602 by sending, from a first process executing on a first HPC compute node of an HPC cluster, a third command to an HPC memory node of the HPC cluster, the third command instructing the HPC memory node to allocate a memory window that is globally accessible to processes executing in the HPC cluster. In certain implementations, the HPC memory node can be accessible using a FAM API, such as FAM API 108 in
At step 604, a write command including a global identifier for the memory window is sent from the first process to the HPC memory node, the command instructing the HPC memory node to write first data to the memory window. In certain implementations, the write command can be compliant with the message passing system executing on the HPC cluster, the message passing system supporting distributed memory access. The message passing system may be an MPI system.
At step 606, the first data is sent from the first process to the memory window, including causing the first data to be written to the memory window, where the memory window includes randomly accessed, addressable memory locations. In particular implementations, the HPC memory node receives the first data and writes the first data to the memory window.
At step 608, a read command is sent from the first process to the HPC memory node instructing the HPC memory node to send second data from the memory window to the first process, the read command including the global identifier. In certain implementations, at least some of the first data and at least some of the second data can be transmitted concurrently via a high speed network exclusive to the HPC cluster. In certain implementations, the read command can be compliant with the message passing system executing on the HPC cluster. At step 610, the second data is received by the first process from the HPC memory node. The HPC memory node can retrieve (or read) the second data from the memory window.
At step 612, the global identifier is received by the first process from the HPC memory node, where the global identifier is included in a global memory list globally accessible to processes executing in the HPC cluster. In certain implementations, the first process and a second process executing on a second HPC compute node can be included in a communication group configured in the message passing system. In certain implementations, access to the memory window can be limited to processes executing on HPC compute nodes included in the communication group, including the first process and the second process.
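The commands of method 600 can be summarized in a small Python model. All names here (`MemoryNodeModel`, the dictionary-shaped commands, and the integer global identifiers) are assumptions made for illustration and do not reflect an actual wire format or API. The sketch covers the allocate command that returns a global identifier recorded in a global memory list, write and read commands that carry that identifier, and access limited to members of a communication group.

```python
import itertools


class MemoryNodeModel:
    """Hypothetical model of an HPC memory node handling allocate/write/read commands."""

    def __init__(self):
        self._next_id = itertools.count(1)
        # Global memory list: global identifier -> (window bytes, group members).
        self.global_memory_list = {}

    def handle(self, sender, command):
        op = command["op"]
        if op == "allocate":
            global_id = next(self._next_id)
            window = bytearray(command["size"])
            self.global_memory_list[global_id] = (window, set(command["group"]))
            return {"global_id": global_id}

        window, group = self.global_memory_list[command["global_id"]]
        if sender not in group:
            raise PermissionError(f"{sender} is not in the communication group")

        offset = command["offset"]
        if op == "write":
            data = command["data"]
            if offset < 0 or offset + len(data) > len(window):
                raise IndexError("write outside window")
            window[offset:offset + len(data)] = data
            return {"written": len(data)}
        if op == "read":
            length = command["length"]
            if offset < 0 or offset + length > len(window):
                raise IndexError("read outside window")
            return {"data": bytes(window[offset:offset + length])}
        raise ValueError(f"unknown command {op!r}")


memory_node = MemoryNodeModel()
reply = memory_node.handle(
    "p1", {"op": "allocate", "size": 8, "group": ["p1", "p2"]}
)
global_id = reply["global_id"]
memory_node.handle(
    "p1", {"op": "write", "global_id": global_id, "offset": 2, "data": b"hi"}
)
read_back = memory_node.handle(
    "p2", {"op": "read", "global_id": global_id, "offset": 0, "length": 4}
)["data"]
```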
As described herein, a first process executing on a first HPC compute node of an HPC cluster sends a write command, including a global identifier for a memory window that is globally accessible to processes executing in the HPC cluster, to an HPC memory node of the HPC cluster instructing the HPC memory node to write first data to the memory window allocated at the HPC memory node. The first data is sent by the first process to a memory window allocated on the HPC memory node. The first data is written by the HPC memory node to the memory window that includes randomly accessed, addressable memory locations.
Various implementations of this disclosure are summarized here. Other implementations can also be understood from the entirety of the specification as well as the claims filed herein.
In one aspect, a first method includes sending, from a first process executing on a first HPC compute node of an HPC cluster, a write command including a global identifier for a memory window that is globally accessible to processes executing in the HPC cluster, to the memory window at an HPC memory node of the HPC cluster, the write command instructing the HPC memory node to write first data to the memory window. The first method can include sending, from the first process, the first data to the memory window, including causing the first data to be written to the memory window, the memory window including randomly accessed, addressable memory locations.
In any of the disclosed implementations, the first method can include sending, from the first process, a read command to the HPC memory node instructing the HPC memory node to send second data from the memory window to the first process, the read command including the global identifier, and responsive to sending the read command, receiving the second data by the first process from the HPC memory node.
In any of the disclosed implementations of the first method, the first process and a second process executing on a second HPC compute node can be included in a communication group configured in a message passing system.
In any of the disclosed implementations of the first method, access to the memory window can be limited to processes executing on HPC compute nodes included in the communication group, including the first process and the second process.
In any of the disclosed implementations of the first method, the HPC memory node, the first HPC compute node, and the second HPC compute node can communicate via a high speed network exclusive to the HPC cluster, and the HPC memory node can be a memory server accessible via the high speed network.
In any of the disclosed implementations of the first method, at least some of the first data and at least some of the second data can be transmitted concurrently via the high speed network.
In any of the disclosed implementations, the first method can include sending, from the first process, a third command to the HPC memory node prior to sending the write command, the third command instructing the HPC memory node to allocate the memory window, and in response to sending the third command, receiving, by the first process from the HPC memory node, the global identifier. In the first method, the global identifier can be included in a global memory list globally accessible to processes executing in the HPC cluster.
In any of the disclosed implementations of the first method, the message passing system can support distributed memory access. In any of the disclosed implementations of the first method, the message passing system can be a message passing interface (MPI) system.
In any of the disclosed implementations of the first method, the HPC memory node can be accessible using a FAM API.
In another aspect, a first HPC compute node includes one or more processors and one or more non-transitory computer-readable storage media storing programming for execution by the one or more processors. In the first HPC compute node, the programming can include instructions to execute a first process and to send, from the first process, a read command, including a global identifier for a memory window that is globally accessible to processes executing in an HPC cluster that includes the first HPC compute node, to an HPC memory node of the HPC cluster, the read command instructing the HPC memory node to send first data stored in the memory window configured on the HPC memory node to the first process. In the first HPC compute node, the programming can also include instructions to receive, by the first process in response to sending the read command, the first data stored in the memory window, the memory window including randomly accessed, addressable memory locations.
In any of the disclosed implementations of the first HPC compute node, the first HPC compute node and the HPC memory node can communicate via a high speed network exclusive to the HPC cluster.
In any of the disclosed implementations of the first HPC compute node, the HPC memory node can be a memory server accessible via the high speed network.
In any of the disclosed implementations of the first HPC compute node, the programming can also include instructions to receive, by the first process prior to sending the read command to the HPC memory node, a second message from a second process executing on a second HPC compute node different from the first HPC compute node, the second message instructing the first process to read the first data.
In any of the disclosed implementations of the first HPC compute node, the programming can further include instructions to send, from the first process prior to sending the read command, a third command to the HPC memory node requesting the global identifier, and, responsive to sending the third command, instructions to receive, by the first process, the global identifier from the HPC memory node.
In yet another aspect, an HPC cluster includes a first HPC compute node executing an application including executing a first process, and a second HPC compute node executing a second application including executing a second process. In the HPC cluster, the first process can include instructions to send, from the first process, a write command, including a global identifier for a memory window globally accessible to processes executing in the HPC cluster, to an HPC memory node, the write command instructing the HPC memory node to write first data to the memory window, and instructions to send, from the first process, the first data for the memory window to the HPC memory node. In the HPC cluster, the second process can include instructions to send, from the second process, a read command including the global identifier for the memory window, to the HPC memory node instructing the HPC memory node to send second data stored in the memory window, and, responsive to sending the read command, instructions to receive, by the second process, the second data from the HPC memory node. The memory window includes randomly accessed, addressable memory locations.
In any of the disclosed implementations of the HPC cluster, the first process and the second process can be simultaneously executed.
In any of the disclosed implementations of the HPC cluster, the second data can be generated using the first data.
In any of the disclosed implementations of the HPC cluster, the first HPC compute node, the second HPC compute node, and the HPC memory node can communicate via a high-speed network exclusive to the HPC cluster.
In any of the disclosed implementations of the HPC cluster, the first data and the second data can be simultaneously transmitted over the high-speed network.
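As a non-limiting sketch of the cluster aspect, the following Python example runs a writer and a reader concurrently against one shared window. Threads stand in for processes on separate HPC compute nodes, and a `queue.Queue` stands in for a message exchanged between the two processes. The class name, the method names, and the identifier value `7` are hypothetical. In this sketch the second data are simply the bytes that the first process stored, which is the simplest case of second data generated using the first data.

```python
import queue
import threading


class MemoryNode:
    """Hypothetical memory node holding windows keyed by global identifier."""

    def __init__(self):
        self._windows = {}
        self._lock = threading.Lock()  # serializes access from concurrent processes

    def allocate(self, gid, size):
        with self._lock:
            self._windows[gid] = bytearray(size)

    def write(self, gid, offset, data):
        with self._lock:
            window = self._windows[gid]
            if offset < 0 or offset + len(data) > len(window):
                raise IndexError("write falls outside the memory window")
            window[offset:offset + len(data)] = data

    def read(self, gid, offset, length):
        with self._lock:
            window = self._windows[gid]
            if offset < 0 or offset + length > len(window):
                raise IndexError("read falls outside the memory window")
            return bytes(window[offset:offset + length])


def run_cluster():
    """Run one writer and one reader concurrently and return what the reader saw."""
    node = MemoryNode()
    gid = 7  # assumed global identifier, known to both processes
    node.allocate(gid, 16)
    notices = queue.Queue()  # assumed stand-in for a message between processes
    received = []

    def first_process():
        payload = b"first data"
        node.write(gid, 0, payload)      # write command plus the first data
        notices.put((0, len(payload)))   # tell the reader where the data landed

    def second_process():
        offset, length = notices.get(timeout=5)         # wait for the notice
        received.append(node.read(gid, offset, length))  # read command

    threads = [
        threading.Thread(target=second_process),
        threading.Thread(target=first_process),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received[0]


print(run_cluster())
```

The reader thread is deliberately started first to show that ordering comes from the notice, not from start order. A real cluster would rely on the synchronization facilities of its message passing system in place of the queue.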
The foregoing outlines features of several examples so that those skilled in the art may better understand the aspects of the present disclosure. Various modifications and combinations of the illustrative examples, as well as other examples, will be apparent to persons skilled in the art upon reference to the description. It is therefore intended that the appended claims encompass any such modifications.
Claims
1. A method, comprising:
- sending, from a first process executing on a first high performance computing (HPC) compute node of an HPC cluster, a write command including a global identifier for a memory window that is globally accessible to processes executing in the HPC cluster, to the memory window at an HPC memory node of the HPC cluster, the write command instructing the HPC memory node to write first data to the memory window; and
- sending, from the first process, the first data to the memory window, including causing the first data to be written to the memory window, the memory window including randomly accessed, addressable memory locations;
- sending, from the first process, a third command to the HPC memory node prior to sending the write command, the third command instructing the HPC memory node to allocate the memory window; and
- in response to sending the third command, receiving, by the first process from the HPC memory node, the global identifier, wherein the global identifier is included in a global memory list globally accessible to processes executing in the HPC cluster.
2. The method of claim 1, further comprising:
- sending, from the first process, a read command to the HPC memory node instructing the HPC memory node to send second data from the memory window to the first process, the read command including the global identifier; and
- responsive to sending the read command, receiving the second data by the first process from the HPC memory node.
3. The method of claim 2, wherein the first process and a second process executing on a second HPC compute node are included in a communication group configured in a message passing system.
4. The method of claim 3, wherein access to the memory window is limited to processes executing on HPC compute nodes included in the communication group, including the first process and the second process.
5. The method of claim 3, wherein the HPC memory node, the first HPC compute node, and the second HPC compute node communicate via a high speed network exclusive to the HPC cluster, and wherein the HPC memory node is a memory server accessible via the high speed network.
6. The method of claim 5, wherein at least some of the first data and at least some of the second data are transmitted concurrently via the high speed network.
7. The method of claim 3, wherein the message passing system supports distributed memory access.
8. The method of claim 7, wherein the message passing system is a message passing interface (MPI) system.
9. The method of claim 1, wherein the HPC memory node is accessible using a fabric-attached memory (FAM) application programming interface (API).
10. A first high performance computing (HPC) compute node, comprising:
- one or more processors;
- one or more non-transitory computer-readable storage media storing programming for execution by the one or more processors, the programming comprising instructions to: execute a first process; send, from the first process, a read command, including a global identifier for a memory window that is globally accessible to processes executing in an HPC cluster that includes the first HPC compute node, to an HPC memory node of the HPC cluster, the read command instructing the HPC memory node to send first data stored in the memory window configured on the HPC memory node to the first process; prior to sending the read command, receive, by the first process, a read message from a second process executing on a second HPC compute node different from the first HPC compute node, the read message instructing the first process to read the first data; and receive, by the first process in response to sending the read command, the first data stored in the memory window, the memory window including randomly accessed, addressable memory locations.
11. The first HPC compute node of claim 10, wherein the first HPC compute node and the HPC memory node communicate via a high speed network exclusive to the HPC cluster.
12. The first HPC compute node of claim 11, wherein the HPC memory node is a memory server accessible via the high speed network.
13. The first HPC compute node of claim 10, wherein the programming further comprises instructions to:
- prior to sending the read command, send, from the first process, a second command to the HPC memory node requesting the global identifier; and
- responsive to sending the second command, receive, by the first process, the global identifier from the HPC memory node.
14. A high performance computing (HPC) cluster, comprising:
- a first HPC compute node executing an application including executing a first process, the first process comprising instructions to: send, from the first process, a write command, including a global identifier for a memory window globally accessible to processes executing in the HPC cluster, to an HPC memory node, the write command instructing the HPC memory node to write first data to the memory window; and send, from the first process, first data for the memory window to the HPC memory node; and
- a second HPC compute node executing a second application including executing a second process, the second process comprising instructions to: send, from the second process, a read command including the global identifier for the memory window, to the HPC memory node instructing the HPC memory node to send second data stored in the memory window, wherein the second data are generated using the first data; and responsive to sending the read command, receive, by the second process, the second data from the HPC memory node, the memory window including randomly accessed, addressable memory locations.
15. The HPC cluster of claim 14, wherein the first process and the second process are simultaneously executed.
16. The HPC cluster of claim 14, wherein the first HPC compute node, the second HPC compute node, and the HPC memory node communicate via a high-speed network exclusive to the HPC cluster.
17. The HPC cluster of claim 16, wherein the first data and the second data are simultaneously transmitted over the high-speed network.
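For illustration only, the allocation step of claim 1 and the access limit of claim 4 can be pictured with the following Python sketch. It is not the claimed implementation. The `MemoryNode` class, the integer ranks used to identify processes, and the `PermissionError` raised for processes outside the communication group are assumptions made for the example.

```python
import itertools


class MemoryNode:
    """Hypothetical memory node; names and data layout are illustrative only."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._windows = {}            # global identifier -> (bytearray, allowed ranks)
        self.global_memory_list = []  # identifiers visible to every process

    def allocate(self, size, group):
        """Allocate a window, publish its identifier, and return the identifier."""
        gid = next(self._ids)
        self._windows[gid] = (bytearray(size), frozenset(group))
        self.global_memory_list.append(gid)
        return gid

    def _window_for(self, rank, gid):
        data, group = self._windows[gid]
        if rank not in group:
            raise PermissionError(f"rank {rank} is outside the communication group")
        return data

    def write(self, rank, gid, offset, payload):
        data = self._window_for(rank, gid)
        if offset < 0 or offset + len(payload) > len(data):
            raise IndexError("write falls outside the memory window")
        data[offset:offset + len(payload)] = payload

    def read(self, rank, gid, offset, length):
        data = self._window_for(rank, gid)
        if offset < 0 or offset + length > len(data):
            raise IndexError("read falls outside the memory window")
        return bytes(data[offset:offset + length])


node = MemoryNode()
gid = node.allocate(32, group={0, 1})  # allocation request returns the identifier
node.write(0, gid, 8, b"claim-1")      # rank 0 writes at an arbitrary offset
print(node.read(1, gid, 8, 7))         # rank 1, in the same group, reads it back
```

Here rank 0 and rank 1 form the communication group, so a read issued by any other rank is rejected even though the identifier itself appears in the globally visible list.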
Type: Grant
Filed: Feb 14, 2024
Date of Patent: Oct 14, 2025
Patent Publication Number: 20250258767
Assignee: Hewlett Packard Enterprise Development LP (Spring, TX)
Inventors: Soumitra Chatterjee (Karnataka), Chinmay Ghosh (Karnataka), Mashood Abdulla Kodavanji (Karnataka), Sharad Singhal (Belmont, CA)
Primary Examiner: Than Nguyen
Application Number: 18/441,831
International Classification: G06F 12/02 (20060101)