LOW LATENCY REMOTING TO ACCELERATORS
A method of offloading performance of a workload includes receiving, on a first computing system acting as an initiator, a first function call from a caller, the first function call to be executed by an accelerator on a second computing system acting as a target, the first computing system coupled to the second computing system by a network; determining a type of the first function call; and generating a list of parameter values of the first function call.
This application claims the benefit of priority from and is a continuation of U.S. patent application Ser. No. 17/304,820 filed on Jun. 25, 2021, the full disclosure of which is incorporated herein by reference.
COPYRIGHT NOTICE/PERMISSION
Portions of the disclosure of this patent document may contain material that is subject to copyright protection. The copyright owner has no objection to the reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The copyright notice applies to all data as described below, and in the accompanying drawings hereto, as well as to any software described below: Copyright © 2021, Intel Corporation. All Rights Reserved.
BACKGROUND
In some cloud and high-volume data analytics computing environments, compute intensive workloads are often offloaded from processors to accelerators to achieve higher performance. In one scenario, at least part of the workload is offloaded to an accelerator in the same computing system as a processor executing other parts of the workload. In another scenario, at least part of the workload is offloaded to an accelerator in another computing system (sometimes called a disaggregated accelerator) coupled via a network to the computing system including the processor executing the other parts of the workload. In this scenario, the latency involved in offloading the workload across the network may negatively affect overall system performance.
The concepts described herein are illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. Where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements.
The technology described herein reduces the effect of network latency associated with running a workload on a remote accelerator or other computing device by remoting function calls (e.g., calls received via application programming interfaces (APIs)) to improve overall system performance. The technology enables asynchronous execution of remoted functions so that function execution overlaps with the network transmission of messages between computing systems. This makes offloading workloads to disaggregated accelerators more efficient.
While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described herein in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.
References in the specification to “one embodiment,” “an embodiment,” “an illustrative embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to implement such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Referring now to
Second computing system 142 includes processor 160 to execute instructions (Instr) 153 stored in memory 152. Instructions 153 comprise at least callee application 144 and remoting manager 146. Callee application 144 comprises an application program having at least one workload to be processed. In some processing scenarios, callee application 144 accepts the offload of one or more workloads to accelerator 170 in second computing system 142 to be performed more efficiently than performing the one or more workloads on processor 108. In at least one embodiment, accelerator 170 is implemented as a field programmable gate array (FPGA). This offloading requires that caller 104 being executed by processor 108 in first computing system 102 communicate (via remoting manager 106 and remoting manager 146) with callee 144 being executed by processor 160 in second computing system 142 over network 120. Since the communication between processor 108 and accelerator 170 is from first computing system 102 (the initiator) to second computing system 142 (the target) over network 120, this communication is performed with a second latency. This second latency is greater than the first latency, resulting in degradation of overall system performance of computing environment 100. The technology described herein employs remoting manager 106 being executed by processor 108 in first computing system 102 interacting with remoting manager 146 being executed by processor 160 in second computing system 142 to improve overall system performance of computing environment 100 by lessening the second latency.
In various computing environments, there may be any number of processors 108 and accelerators 120 on first computing system 102, any number of processors 160 and accelerators 170 on second computing system 142, and any number of first computing systems coupled to any number of second computing systems. In some large-scale cloud computing environments, the number of caller applications 104, callee applications 144, first computing systems 102, second computing systems 142, and associated accelerators 120, 170 may be large (e.g., tens of systems, hundreds of systems, thousands of systems, tens of thousands of systems, thousands or millions of callers and callees, etc.). Thus, any lessening of the second latency can have a significant impact on the overall performance of computing environment 100.
First computing system 102 and second computing system 142 can be embodied as any type of device capable of performing the functions described herein. For example, computing systems 102, 142 can be implemented as, without limitation, a computer, a laptop computer, a tablet computer, a notebook computer, a mobile computing device, a smartphone, a wearable computing device, a multiprocessor system, a server, a disaggregated server, a workstation, and/or a consumer electronic device. As shown in
Processors 108, 160 can be implemented as any type of processor capable of performing the functions described herein. For example, processors 108, 160 can be embodied as a single or multi-core processor(s), digital signal processor, microcontroller, or other processor or processing/controlling circuit.
Memories 112, 152 can be implemented as any type of volatile or non-volatile memory or data storage capable of performing the functions described herein. In operation, the memories 112, 152 store various data and software used during operation of computing systems 102, 142 such as operating systems, applications, programs, libraries, and drivers. As shown, memories 112, 152 are communicatively coupled to the processors 108, 160 via I/O subsystems 110, 150 which are implemented as circuitry and/or components to facilitate input/output operations with processors 108, 160, memories 112, 152, and other components of the computing systems, respectively. For example, the I/O subsystems 110, 150 can be implemented as, or otherwise include, memory controller hubs, input/output control hubs, sensor hubs, host controllers, firmware devices, communication links (e.g., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.) and/or other components and subsystems to facilitate the input/output operations. In some embodiments, memories 112, 152 are directly coupled to processors 108, 160, respectively, for example via an integrated memory controller hub. Additionally, in some embodiments, I/O subsystems 110, 150 form a portion of a system-on-a-chip (SoC) and are incorporated, along with processors 108, 160, memories 112, 152, accelerators 120, 170, respectively, and/or other components of the computing systems, on a single integrated circuit chip. Additionally or alternatively, in some embodiments processors 108, 160 include an integrated memory controller and a system agent, which may be embodied as a logic block in which data traffic from processor cores and I/O devices converges before being sent to memories 112, 152.
Data storage devices 114, 154 can be implemented as any type of device or devices configured for short-term or long-term storage of data such as, for example, memory devices and circuits, memory cards, hard disk drives, solid-state drives, non-volatile flash memory, or other data storage devices. Computing systems 102, 142 can also include communications subsystems 116, 156 which can be implemented as any communication circuit, device, or collection thereof, capable of enabling communications between computing systems 102, 142 over a network 120. Communications subsystems 116, 156 can be configured to use any one or more communication technologies (e.g., wired or wireless communications) and associated protocols (e.g., Ethernet, Bluetooth®, WiMAX, 3G, 4G LTE, etc.) to implement such communication.
Accelerators 120, 170 can be implemented as an FPGA, an application-specific integrated circuit (ASIC), a coprocessor, or other digital logic device capable of performing accelerated functions (e.g., accelerated application functions, accelerated network functions, or other accelerated functions). Illustratively, accelerators 120, 170 are FPGAs, which are implemented as an integrated circuit including programmable digital logic resources that may be configured after manufacture. The FPGAs include, for example, a configurable array of logic blocks in communication over a configurable data interchange. Accelerators 120, 170 are coupled to the processors 108, 160 via high-speed connection interfaces such as peripheral buses (e.g., peripheral component interconnect (PCI) Express buses) or inter-processor interconnects (e.g., in-die interconnects (IDIs) or QuickPath Interconnects (QPIs)), or via any other appropriate interconnects. Accelerators 120, 170 receive data and/or commands for processing from the processors and return results data to the processors.
Computing systems 102, 142 further include one or more peripheral devices 118, 158. Peripheral devices 118, 158 include any number of additional input/output devices, interface devices, hardware accelerators, and/or other peripheral devices. For example, in some embodiments, peripheral devices 118, 158 include a touch screen, graphics circuitry, a graphical processing unit (GPU) and/or processor graphics, an audio device, a microphone, a camera, a keyboard, a mouse, a network interface, and/or other input/output devices, interface devices, and/or peripheral devices.
It is to be appreciated that lesser or more equipped computing systems than the examples described above may be preferred for certain implementations. Therefore, the configuration of computing systems 102, 142 can vary from implementation to implementation depending upon numerous factors, such as price constraints, performance requirements, technological improvements, or other circumstances. Examples of computing systems 102, 142 include (without limitation) a mobile device, a personal digital assistant, a mobile computing device, a smartphone, a cellular telephone, a handset, a one-way pager, a two-way pager, a messaging device, a computer, a personal computer (PC), a desktop computer, a laptop computer, a notebook computer, a handheld computer, a tablet computer, a server, a disaggregated server, a server array or server farm, a web server, a network server, an Internet server, a work station, a mini-computer, a main frame computer, a supercomputer, a network appliance, a web appliance, a distributed computing system, multiprocessor systems, processor-based systems, consumer electronics, programmable consumer electronics, television, digital television, set top box, wireless access point, base station, subscriber station, mobile subscriber center, radio network controller, router, hub, gateway, bridge, switch, machine, or combinations thereof.
The technology described herein can be implemented as any or a combination of one or more microchips or integrated circuits interconnected using a parent board, hardwired logic, software stored by a memory device and executed by a microprocessor, firmware, an application specific integrated circuit (ASIC), and/or an FPGA. The term “logic” includes, by way of example, software or hardware and/or combinations of software and hardware.
Remoting is the technique of executing a task by sending commands and data to a compute device or accelerator over a network. For example, an application running on one machine (such as caller application 104 on first computing system 102) might want to accelerate a task (for example, a machine learning (ML) inference) by running the task on an accelerator (such as accelerator 170) on a remote computing system (such as second computing system 142). The application or software libraries supporting the application use an API to communicate with a software device driver interfacing with the accelerator. If the accelerator was attached locally in the computing system running the application (e.g., first computing system 102), the communication between the application/library and the software device driver for the accelerator would occur in the form of local procedure calls. If, however, the accelerator was connected to a remote computing system (e.g., second computing system 142), the API function calls would have to be “remoted” to the software device driver running on the computing system connected to the accelerator over a network 120. In the technology described herein, this remoting is implemented by remoting managers on the two sides (e.g., remoting manager 106, remoting manager 146). First computing system 102 can be called an initiator (where the caller application 104 is located) and second computing system 142 can be called a target (where the callee application 144 and accelerator 170 are located). If the remoting is implemented naively, every API call that involved blocking (described below) would be made serially over the network 120 and incur the substantial overhead of network latency (for example, approximately 100 microseconds for a roundtrip on current datacenter networks).
Delay due to network latency is shown in the shaded blocks in
To reduce the effects of network latency, the technology described herein proposes overlapping code execution on the initiator and the target side with network transfers. The technology described herein makes the API function calls asynchronous, whenever possible, so that the initiator (e.g., caller application 104 and associated remoting manager 106) does not have to block and wait on every call. This requires exploitation of certain characteristics of the function being remoted. Analysis of functions typically used for remoting use cases reveals that not all function calls need to be synchronous in their operation. Some function calls do not return values that are consumed by the initiator before forward progress can be made.
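The benefit of not blocking on every call can be illustrated with a back-of-the-envelope model. The sketch below is an idealization, not a measurement: it assumes the approximately 100 microsecond roundtrip mentioned above, ignores execution time on the target and the cost of sending a message, and assumes asynchronous calls overlap completely with network transfers. The function names are hypothetical.

```python
# Assumed network roundtrip in microseconds (the approximate figure cited above).
ROUND_TRIP_US = 100.0


def naive_latency_us(num_calls, round_trip_us=ROUND_TRIP_US):
    """Network wait when every remoted call blocks for one full roundtrip."""
    return num_calls * round_trip_us


def overlapped_latency_us(num_calls, num_blocking, round_trip_us=ROUND_TRIP_US):
    """Network wait when only the blocking calls wait for a roundtrip.

    Idealization: asynchronous calls are assumed to cost the caller nothing.
    """
    if not 0 <= num_blocking <= num_calls:
        raise ValueError("num_blocking must be between 0 and num_calls")
    return num_blocking * round_trip_us


if __name__ == "__main__":
    # Hypothetical sequence: 10 API calls, of which 2 must block the caller.
    print(naive_latency_us(10))
    print(overlapped_latency_us(10, 2))
```

Under these assumptions the caller's network wait scales with the number of blocking calls rather than with the total number of calls.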
Three types of function calls are identified: Type 0=asynchronously callable function without output dependencies, Type 1=asynchronously callable function with substitutable output parameters, and Type 2=synchronous function. A Type 0 function does not return a value that is used by the caller 104. While the function must be executed correctly, without errors, the caller does not expect the function to return a value that is used in any future computation. An example of such a function is a function that initializes a library. Assuming the function executes correctly on the target, it is possible to return to the caller 104 immediately, without blocking. A Type 1 function returns a value to the caller that it (the caller) may pass back as input to another function executed by the callee in the future. The value returned is not consumed by the caller in any other way. As an example, consider the creation of a command list to hold commands that will be submitted to the accelerator 170. When the list is created on the target by the invocation of, for example, a CreateCommandList( ) function by the caller, the callee returns an opaque handle that references the list. In the future, the caller would pass the handle of the command list back to the callee when it sends commands that must be appended to that command list. In this example, the CreateCommandList( ) function is a Type 1 function. A Type 1 function call can also return to the caller immediately with a pseudo-output value(s), but the remoting manager 146 and the target side (e.g., callee application 144) must keep track of the pseudo-return value(s) to identify the pseudo-return value later and substitute it with the real value (as described below). A Type 2 function returns a value to the caller that it (the caller) uses in its computation or is one that results in the transfer of some data from caller 104 to callee 144. 
For example, a function that submits a batch of commands to the accelerator 170 is a Type 2 function because the caller 104 might need the result of the computation or require the freeing of a resource (e.g., memory buffer) used in an earlier function call before it can proceed with its execution. Therefore, a Type 2 function call must always block the caller 104.
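One way to record the three types is a static lookup table keyed by function name. The following is a minimal sketch under stated assumptions: only CreateCommandList( ) is named in this description; InitLibrary and SubmitCommands are hypothetical names standing in for the library-initialization and command-submission examples, and defaulting unknown functions to Type 2 is a conservative choice made here, not something the description specifies.

```python
from enum import IntEnum


class CallType(IntEnum):
    ASYNC_NO_OUTPUT = 0      # Type 0: no return value used by the caller
    ASYNC_SUBSTITUTABLE = 1  # Type 1: output only passed back to the callee later
    SYNCHRONOUS = 2          # Type 2: caller consumes the result, so it must block


# Hypothetical API table; only CreateCommandList appears in the description.
CALL_TYPES = {
    "InitLibrary": CallType.ASYNC_NO_OUTPUT,
    "CreateCommandList": CallType.ASYNC_SUBSTITUTABLE,
    "SubmitCommands": CallType.SYNCHRONOUS,
}


def classify(function_name):
    """Return the type of a function, treating unknown functions as Type 2."""
    return CALL_TYPES.get(function_name, CallType.SYNCHRONOUS)


def blocks_caller(function_name):
    """Only Type 2 calls block the caller."""
    return classify(function_name) is CallType.SYNCHRONOUS


if __name__ == "__main__":
    for name in ("InitLibrary", "CreateCommandList", "SubmitCommands"):
        print(name, int(classify(name)), blocks_caller(name))
```

Treating an unclassified function as synchronous trades latency for safety: a misclassified Type 2 call would return to the caller before its result exists.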
When the remoting manager 106 on the first computing system 102 receives a function call 202 from caller 104, the remoting manager determines the type of the function (Types 0-2). In one implementation of this technology, every function in the API can be assigned one of the three types a priori by analyzing their inputs, outputs and execution semantics. The name of the function and its arguments are entered into the linked list data structure 502. There are three cases to consider, based on the type of the function.
Case 1 (Type 0 function): The remoting manager 106 immediately returns to the caller 104 with a “SUCCESS” status (note: the execution of the function is yet to take place on the target side (e.g., by accelerator 170), but there is no need to block the caller 104). The remoting manager 106 on the caller side relays the function call 202 to the callee 144 side.
Case 2 (Type 1 function): The remoting manager 106 notes that the function has one or more output parameters. In the case of F1 404, for example, there is one output parameter, A1′. In the case of a normal blocking function call, F1 404 would have executed on the target side (by accelerator 170) and returned the value for A1′. Recall that this value is necessary to the caller 104 only to the extent that it (the caller) might need to pass this value back to the callee 144 in the future. The exact value returned to the caller does not matter, as long as the same value is used consistently by the caller, and the callee knows how to substitute the value with the real value. Therefore, the remoting manager 106 on the caller side returns a pseudo-output value to the caller 104 immediately. Remoting manager 106 also records this value in an entry of symbol table 504 under the Pseudo Output Parameter Symbol (POPS) field 506. In this example, the pseudo-output value is #1 for A1′. Function F2 406 is handled similarly with the creation of another pseudo-output parameter value, #2, for A2″. Note that the linked list nodes for the output arguments point to the corresponding entries in symbol table 504. In addition to the name of the function and its argument values, the remoting manager 106 also sends part of the linked list data structure and symbol table entries corresponding to the Type 1 function to the callee 144. For example, in the case of F1, the remoting manager 106 sends the linked list for F1 (and its arguments), as well as the first entry 510 in the symbol table 504 corresponding to parameter A1′. In the case of F2 406, remoting manager 106 sends the linked list of function call parameters for F2 406 and the second entry 512 in the symbol table 504.
Case 3 (Type 2 function): The remoting manager 106 blocks the caller 104 in this case. Furthermore, if any of the input parameter values match pseudo-output parameter values from earlier functions, the remoting manager adds pointers from the linked list to the corresponding entries in symbol table 504. For example, in the case of F3 408, the values for the two input parameters, A3′ and A3″ (from the caller 104), match the pseudo-output parameter symbol values #1 and #2, respectively (from earlier function calls F1 404 and F2 406). As in the case of the earlier Type 1 calls, the remoting manager 106 sends the list of function arguments and the corresponding symbol table entries to the callee 144.
This example shows how the technology described herein can efficiently perform remoting of function calls 202 across the network 120 by exploiting the semantic characteristics (Types 0-2) of the APIs to overlap network transfers with code execution on the initiator (caller 104 on first computing system 102) and target side (callee 144 on second computing system 142).
An example of a process for implementing remoting manager 106 (initiator side) in pseudo-code in the first computing system 102 is shown below in Table 1.
Lines 1 and 2 of Table 1 initialize two lists: (1) args, which will contain information about the function arguments; and (2) symbols, which will contain the pseudo-output parameter symbols (POPS) representing output parameters in Type 1 functions. Each element in the args list is a structure (struct) containing three fields: type (e.g., int, float, etc.); val (value); and symidx (index into Symbol table 504 (SYMTAB) if the input value for the argument matches a pseudo-output parameter value created earlier).
The for-loop (lines 3-18) builds up the args and symbols list. Each parameter of the function is considered in one pass of the loop. If the parameter is an INPUT parameter (Lines 7-11), the symbol table 504 (SYMTAB) is looked up to find its value. If an entry is found with the parameter's value, it means that the value of the argument from the caller is a pseudo-output parameter symbol corresponding to an OUTPUT parameter of another function executed earlier. If the parameter is an OUTPUT parameter of a Type 1 function (Lines 12-17), a new pseudo-output parameter value is created and added to SYMTAB. The symidx field of the argument struct is set to index of the new symbol. This index will be used by the remoting manager 146 on the second computing system 142 to bind a real output parameter value (ROPV) with the POPS symbol after the function has finished execution. In Line 17, the pseudo-output value is copied to the memory location of the output parameter in preparation for the return to the caller 104.
The function, its arguments and any new symbols added to SYMTAB are packaged as a message for the remoting manager in the second computing system (Line 20). Lines 22-33 handle the transmission of the message. Type 0 and 1 function call request messages are sent asynchronously (async_send), without blocking the caller. Type 2 function calls block the caller (sync_send) and only return when the reply message is received from the target.
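The initiator-side steps described above can be sketched in Python as follows. This is an illustrative sketch, not a reproduction of Table 1, and its line numbers do not correspond to those cited above. The names args, symbols, type, val, symidx, SYMTAB, async_send, and sync_send follow the description; the Initiator class, the call method, the message dictionary layout, and the "#n" format of the pseudo-output values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional

TYPE0, TYPE1, TYPE2 = 0, 1, 2
INPUT, OUTPUT = "in", "out"


@dataclass
class Arg:
    type: str                     # parameter type, e.g. "int" or "handle"
    val: Any                      # value supplied by the caller (or pseudo value)
    symidx: Optional[int] = None  # index into SYMTAB when a symbol is involved


class Initiator:
    """Hypothetical initiator-side remoting manager."""

    def __init__(self, async_send, sync_send):
        self.symtab = []              # SYMTAB: list position is the symbol index
        self.async_send = async_send  # assumed transport: returns immediately
        self.sync_send = sync_send    # assumed transport: returns the reply

    def call(self, name, ftype, params):
        """params is a list of (direction, type, value) tuples."""
        args, symbols, outputs = [], [], []
        for direction, ptype, value in params:
            arg = Arg(ptype, value)
            if direction == INPUT:
                # An input equal to an earlier pseudo-output value is linked
                # to its symbol so the target can substitute the real value.
                if value in self.symtab:
                    arg.symidx = self.symtab.index(value)
            elif ftype == TYPE1:
                # Output of a Type 1 function: create a new pseudo value.
                pops = f"#{len(self.symtab) + 1}"
                self.symtab.append(pops)
                arg.symidx = len(self.symtab) - 1
                arg.val = pops
                symbols.append((arg.symidx, pops))
                outputs.append(pops)
            args.append(arg)

        message = {"function": name, "type": ftype,
                   "args": args, "symbols": symbols}
        if ftype == TYPE2:
            return self.sync_send(message)  # blocks until the target replies
        self.async_send(message)            # Type 0 and Type 1 do not block
        return "SUCCESS", outputs


if __name__ == "__main__":
    sent = []
    ini = Initiator(sent.append, lambda m: ("SUCCESS", ["done"]))
    print(ini.call("F1", TYPE1, [(OUTPUT, "handle", None)]))
    print(ini.call("F3", TYPE2, [(INPUT, "handle", "#1")]))
```

In this sketch the caller receives the pseudo value as an ordinary return value; an implementation working with output pointers would instead write it to the memory location of the output parameter, as the description states.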
When the remoting manager 106 on the initiator side (e.g., caller 104 of first computing system 102) receives a response to a function call, it can either be success or an error. Since some functions (Type 0 and 1) are executed asynchronously, it is possible to get an error for an earlier function that already returned to the caller 104 (this is not possible for Type 2 functions, which are synchronous). The (unlikely) error from an earlier asynchronous function call can be propagated as an exception to the caller. Errors from Type 2 function calls are handled normally. Since Type 2 functions can have output parameters, it is necessary to make the value of the output parameters (in memory) consistent between the target and initiator side before returning to the caller.
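A minimal sketch of this response handling is shown below. The description does not specify when a deferred error reaches the caller; raising it at the start of the caller's next API call is an assumption of this sketch, as are the class names, the reply dictionary layout, and the use of one-element lists to stand in for output parameter memory.

```python
class RemoteCallError(Exception):
    """Raised on the initiator when the target reports a failed call."""


class ReplyHandler:
    """Hypothetical initiator-side handling of replies from the target."""

    def __init__(self):
        self.deferred_error = None  # first error from an asynchronous call

    def on_async_reply(self, reply):
        # The Type 0 or Type 1 call already returned to the caller, so an
        # error can only be recorded here and surfaced later.
        if reply["status"] != "SUCCESS" and self.deferred_error is None:
            self.deferred_error = RemoteCallError(
                f'{reply["function"]}: {reply["status"]}')

    def check_deferred(self):
        # Assumed to run at the start of the caller's next API call.
        if self.deferred_error is not None:
            error, self.deferred_error = self.deferred_error, None
            raise error

    def on_sync_reply(self, reply, output_slots):
        # Type 2 errors are reported to the blocked caller directly.
        if reply["status"] != "SUCCESS":
            raise RemoteCallError(f'{reply["function"]}: {reply["status"]}')
        # Copy output values so initiator memory matches the target.
        for slot, value in zip(output_slots, reply["outputs"]):
            slot[0] = value


if __name__ == "__main__":
    handler = ReplyHandler()
    handler.on_async_reply({"function": "F1", "status": "ERROR", "outputs": []})
    try:
        handler.check_deferred()
    except RemoteCallError as exc:
        print("deferred error:", exc)
```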
In response to receiving a message from the initiator (e.g., remoting manager 106) with a function call request, the target (e.g., remoting manager 146) performs the example process as shown below in Table 2 after unpacking the message containing the function, parameter types, values and symbol table entries.
Lines 1-2 add new symbols to symbol table 504 (SYMTAB) on the target side (remoting manager 146 on second computing system 142). The for loop in lines 4-8 handles input parameters before the function is executed. Since some of the input parameter values could be pseudo parameter values from earlier function invocations, each INPUT parameter value must be looked up in SYMTAB (line 7) using the symidx index into the table. The real output parameter value (ROPV) corresponding to the symbol is substituted for the pseudo value in line 8. Line 10 executes the function (using accelerator 170) with its correct argument values. The for loop in lines 12-14 handles output parameter values for Type 1 functions. In line 14, the symbol corresponding to the output parameter is bound to the actual output value (from executing the function in line 10). Finally, in lines 16-17, a message with the return arguments is prepared and sent asynchronously to the initiator (e.g., remoting manager 106).
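The target-side steps described above can be sketched in Python as follows. This is an illustrative sketch, not a reproduction of Table 2, and its line numbers do not correspond to those cited above. SYMTAB, symidx, POPS, and ROPV follow the description; the Target class, the registry that maps a function name to a callable and its parameter directions, the message layout, and the convention that each callable returns a tuple of output values are assumptions. Error handling is omitted.

```python
TYPE1 = 1
INPUT, OUTPUT = "in", "out"


class Target:
    """Hypothetical target-side remoting manager."""

    def __init__(self, registry, async_send):
        # registry: name -> (callable standing in for the accelerator call,
        #                    list of parameter directions)
        self.registry = registry
        self.symtab = {}  # SYMTAB: symidx -> {"pops": symbol, "ropv": value}
        self.async_send = async_send

    def handle(self, message):
        # Add the new symbols sent by the initiator.
        for symidx, pops in message["symbols"]:
            self.symtab[symidx] = {"pops": pops, "ropv": None}

        func, directions = self.registry[message["function"]]
        args = message["args"]  # dicts with "type", "val", and "symidx"

        # Substitute real values for pseudo values among the inputs.
        inputs = []
        for direction, arg in zip(directions, args):
            if direction != INPUT:
                continue
            if arg["symidx"] is not None:
                inputs.append(self.symtab[arg["symidx"]]["ropv"])
            else:
                inputs.append(arg["val"])

        # Execute the function with its real argument values.
        outputs = list(func(*inputs))

        # Bind each Type 1 output symbol to the real output value.
        if message["type"] == TYPE1:
            out_args = [a for d, a in zip(directions, args) if d == OUTPUT]
            for arg, real in zip(out_args, outputs):
                self.symtab[arg["symidx"]]["ropv"] = real

        # Reply to the initiator without blocking.
        self.async_send({"function": message["function"],
                         "status": "SUCCESS", "outputs": outputs})


if __name__ == "__main__":
    replies = []
    target = Target({"F1": (lambda: ("real_handle",), [OUTPUT])}, replies.append)
    target.handle({"function": "F1", "type": TYPE1, "symbols": [(0, "#1")],
                   "args": [{"type": "handle", "val": "#1", "symidx": 0}]})
    print(target.symtab[0], replies[0])
```

Because the substitution relies on the real value already being bound, this sketch assumes messages are processed in the order the initiator sent them.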
Computing device 900 may additionally include one or more of the following: cache 962, a graphical processing unit (GPU) 912 (which may be hardware accelerator 120/170 in some implementations), a wireless input/output (I/O) interface 920, a wired I/O interface 930, memory circuitry 940, power management circuitry 950, non-transitory storage device 960, and a network interface 970 for connection to a network 120. The following discussion provides a brief, general description of the components forming the illustrative computing device 900. Example, non-limiting computing devices 900 may include a desktop computing device, blade server device, workstation, laptop computer, mobile phone, tablet computer, personal digital assistant, or similar device or system.
In embodiments, the processor cores 918 are capable of executing machine-readable instruction sets 914, reading data and/or instruction sets 914 from one or more storage devices 960 and writing data to the one or more storage devices 960. Those skilled in the relevant art will appreciate that the illustrated embodiments as well as other embodiments may be practiced with other processor-based device configurations, including portable electronic or handheld electronic devices, for instance smartphones, portable computers, wearable computers, consumer electronics, personal computers (“PCs”), network PCs, minicomputers, server blades, mainframe computers, FPGAs, Internet of Things (IOT) devices, and the like. For example, machine-readable instruction sets 914 may include instructions to implement remoting processing, as provided in
The processor cores 918 may include any number of hardwired or configurable circuits, some or all of which may include programmable and/or configurable combinations of electronic components, semiconductor devices, and/or logic elements that are disposed partially or wholly in a PC, server, mobile phone, tablet computer, or other computing system capable of executing processor-readable instructions.
The computing device 900 includes a bus or similar communications link 916 that communicably couples and facilitates the exchange of information and/or data between various system components including the processor cores 918, the cache 962, the graphics processor circuitry 912, one or more wireless I/O interfaces 920, one or more wired I/O interfaces 930, one or more storage devices 960, and/or one or more network interfaces 970. The computing device 900 may be referred to in the singular herein, but this is not intended to limit the embodiments to a single computing device 900, since in certain embodiments, there may be more than one computing device 900 that incorporates, includes, or contains any number of communicably coupled, collocated, or remote networked circuits or devices.
The processor cores 918 may include any number, type, or combination of currently available or future developed devices capable of executing machine-readable instruction sets.
The processor cores 918 may include (or be coupled to) but are not limited to any current or future developed single-core or multi-core processor or microprocessor, such as: one or more systems on a chip (SOCs); central processing units (CPUs); digital signal processors (DSPs); graphics processing units (GPUs); application-specific integrated circuits (ASICs); programmable logic units; field programmable gate arrays (FPGAs); and the like. Unless described otherwise, the construction and operation of the various blocks shown in
The system memory 940 may include read-only memory (“ROM”) 942 and random-access memory (“RAM”) 946. A portion of the ROM 942 may be used to store or otherwise retain a basic input/output system (“BIOS”) 944. The BIOS 944 provides basic functionality to the computing device 900, for example by causing the processor cores 918 to load and/or execute one or more machine-readable instruction sets 914. In embodiments, at least some of the one or more machine-readable instruction sets 914 cause at least a portion of the processor cores 918 to provide, create, produce, transition, and/or function as a dedicated, specific, and particular machine, for example a word processing machine, a digital image acquisition machine, a media playing machine, a gaming system, a communications device, a smartphone, a neural network, a machine learning model, or similar devices.
The computing device 900 may include at least one wireless input/output (I/O) interface 920. The at least one wireless I/O interface 920 may be communicably coupled to one or more physical output devices 922 (tactile devices, video displays, audio output devices, hardcopy output devices, etc.). The at least one wireless I/O interface 920 may communicably couple to one or more physical input devices 924 (pointing devices, touchscreens, keyboards, tactile devices, etc.). The at least one wireless I/O interface 920 may include any currently available or future developed wireless I/O interface. Example wireless I/O interfaces include, but are not limited to: BLUETOOTH®, near field communication (NFC), and similar.
The computing device 900 may include one or more wired input/output (I/O) interfaces 930. The at least one wired I/O interface 930 may be communicably coupled to one or more physical output devices 922 (tactile devices, video displays, audio output devices, hardcopy output devices, etc.). The at least one wired I/O interface 930 may be communicably coupled to one or more physical input devices 924 (pointing devices, touchscreens, keyboards, tactile devices, etc.). The wired I/O interface 930 may include any currently available or future developed I/O interface. Example wired I/O interfaces include but are not limited to universal serial bus (USB), IEEE 1394 (“FireWire”), and similar.
The computing device 900 may include one or more communicably coupled, non-transitory, data storage devices 960. The data storage devices 960 may include one or more hard disk drives (HDDs) and/or one or more solid-state storage devices (SSDs). The one or more data storage devices 960 may include any current or future developed storage appliances, network storage devices, and/or systems. Non-limiting examples of such data storage devices 960 include any current or future developed non-transitory machine-readable storage media, storage appliances or devices, such as one or more magnetic storage devices, one or more optical storage devices, one or more electro-resistive storage devices, one or more molecular storage devices, one or more quantum storage devices, or various combinations thereof. In some implementations, the one or more data storage devices 960 may include one or more removable storage devices, such as one or more flash drives, flash memories, flash storage units, or similar appliances or devices capable of communicable coupling to and decoupling from the computing device 900.
The one or more data storage devices 960 may include interfaces or controllers (not shown) communicatively coupling the respective storage device or system to the bus 916. The one or more data storage devices 960 may store, retain, or otherwise contain machine-readable instruction sets, data structures, program modules, data stores, databases, logical structures, and/or other data useful to the processor cores 918 and/or graphics processor circuitry 912 and/or one or more applications executed on or by the processor cores 918 and/or graphics processor circuitry 912. In some instances, one or more data storage devices 960 may be communicably coupled to the processor cores 918, for example via the bus 916 or via one or more wired communications interfaces 930 (e.g., Universal Serial Bus or USB); one or more wireless communications interfaces 920 (e.g., Bluetooth®, Near Field Communication or NFC); and/or one or more network interfaces 970 (IEEE 802.3 or Ethernet, IEEE 802.11, or Wi-Fi®, etc.).
Processor-readable instruction sets 914 and other programs, applications, logic sets, and/or modules may be stored in whole or in part in the system memory 940. Such instruction sets 914 may be transferred, in whole or in part, from the one or more data storage devices 960. The instruction sets 914 may be loaded, stored, or otherwise retained in system memory 940, in whole or in part, during execution by the processor cores 918 and/or graphics processor circuitry 912.
The computing device 900 may include power management circuitry 950 that controls one or more operational aspects of the energy storage device 952. In embodiments, the energy storage device 952 may include one or more primary (i.e., non-rechargeable) or secondary (i.e., rechargeable) batteries or similar energy storage devices. In embodiments, the energy storage device 952 may include one or more supercapacitors or ultracapacitors. In embodiments, the power management circuitry 950 may alter, adjust, or control the flow of energy from an external power source 954 to the energy storage device 952 and/or to the computing device 900. The power source 954 may include, but is not limited to, a solar power system, a commercial electric grid, a portable generator, an external energy storage device, or any combination thereof.
For convenience, the processor cores 918, the graphics processor circuitry 912, the wireless I/O interface 920, the wired I/O interface 930, the storage device 960, and the network interface 970 are illustrated as communicatively coupled to each other via the bus 916, thereby providing connectivity between the above-described components. In alternative embodiments, the above-described components may be communicatively coupled in a different manner than illustrated in
Flowcharts representative of example hardware logic, non-tangible machine-readable instructions, hardware implemented state machines, and/or any combination thereof for implementing computing device 900, for example, are shown in
The machine-readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data (e.g., portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine-readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers). The machine-readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc. in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine-readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and stored on separate computing devices, wherein the parts when decrypted, decompressed, and combined form a set of executable instructions that implement a program such as that described herein.
In another example, the machine-readable instructions may be stored in a state in which they may be read by a computer system, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc., in order to execute the instructions on a particular computing device or other device. In another example, the machine-readable instructions may be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine-readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, the disclosed machine-readable instructions and/or corresponding program(s) are intended to encompass such machine-readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.
The machine-readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine-readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example processes of
“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the terms “comprising” and “including” are open ended.
The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, and (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.
As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” entity, as used herein, refers to one or more of that entity. The terms “a” (or “an”), “one or more”, and “at least one” can be used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method actions may be implemented by, e.g., a single unit or processor. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.
Descriptors “first,” “second,” “third,” etc. are used herein when identifying multiple elements or components which may be referred to separately. Unless otherwise specified or understood based on their context of use, such descriptors are not intended to impute any meaning of priority, physical order or arrangement in a list, or ordering in time but are merely used as labels for referring to multiple elements or components separately for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for ease of referencing multiple elements or components.
The following examples pertain to further embodiments. Example 1 is an apparatus including a processor; and a memory device coupled to the processor, the memory device having instructions stored thereon that, in response to execution by the processor, cause the processor to: receive, on a first computing system acting as an initiator, a first function call from a caller, the first function call to be executed by an accelerator on a second computing system acting as a target, the first computing system coupled to the second computing system by a network; determine a type of the first function call; generate a list of parameter values of the first function call; send a first message including a name of the first function call, the list of parameter values of the first function call, and one or more new entries for a symbol table, the one or more new entries representing pseudo-output parameter values, to the second computing system; and when the type of the first function call is an asynchronously callable function without output dependencies, return to the caller; when the type of the first function call is an asynchronously callable function with substitutable output parameters, assign newly created symbols to output parameters and return to the caller; and when the type of the first function call is a synchronous function, block the caller until a response to the first message is received from the second computing system.
In Example 2, the subject matter of Example 1 can optionally include instructions that, in response to execution by the processor, cause the processor to unblock the caller when the type of the first function call is a synchronous function and the response to the first message is received from the second computing system.
In Example 3, the subject matter of Example 1 can optionally include wherein the first function call is a request to offload performance of a workload from the first computing system to the accelerator on the second computing system.
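The initiator-side behavior of Examples 1 through 3 can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names `CallType`, `Symbol`, `Message`, and `Initiator`, the `send(message, on_response)` transport callback, and the `n_outputs` argument are all hypothetical, and the sketch simplifies by sending an empty list of new symbol table entries when no output parameters are substituted.

```python
import itertools
import threading
from dataclasses import dataclass, field
from enum import Enum, auto


class CallType(Enum):
    ASYNC_NO_OUTPUT = auto()      # asynchronously callable, no output dependencies
    ASYNC_SUBSTITUTABLE = auto()  # asynchronously callable, substitutable outputs
    SYNC = auto()                 # synchronous: caller waits for the response


@dataclass(frozen=True)
class Symbol:
    """Pseudo-output parameter value: a placeholder for a not-yet-known output."""
    index: int


@dataclass
class Message:
    name: str                     # name of the function call
    params: list                  # list of parameter values (may contain Symbols)
    new_symbols: list = field(default_factory=list)  # new symbol table entries


class Initiator:
    def __init__(self, send, call_types):
        self._send = send              # assumed transport: send(message, on_response)
        self._call_types = call_types  # assumed lookup: function name -> CallType
        self._next_index = itertools.count()

    def call(self, name, inputs, n_outputs=0):
        call_type = self._call_types[name]

        # For substitutable outputs, create fresh symbols the caller can pass
        # to later calls without waiting for the target to respond.
        symbols = []
        if call_type is CallType.ASYNC_SUBSTITUTABLE:
            symbols = [Symbol(next(self._next_index)) for _ in range(n_outputs)]
        message = Message(name, list(inputs) + symbols, symbols)

        if call_type is CallType.SYNC:
            # Block the caller until the response to this message arrives.
            done = threading.Event()
            reply = []

            def on_response(values):
                reply.extend(values)
                done.set()  # unblocks the caller

            self._send(message, on_response)
            done.wait()
            return reply

        # Both asynchronous types return to the caller immediately.
        self._send(message, None)
        return symbols
```

In this sketch only the synchronous path pays a network round trip before the caller resumes; the two asynchronous paths return as soon as the message has been handed to the transport.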
In Example 4, the subject matter of Example 1 can optionally include instructions that, in response to execution by the processor, cause the processor to: receive, on the first computing system acting as a target, a second message including a name of a second function call, a list of parameter values of the second function call, and one or more new entries for the symbol table, the one or more new entries representing pseudo-output parameter values, from the second computing system acting as an initiator; add the one or more new entries from the second message into the symbol table; for each input parameter value in the list of parameter values of the second function call, if there is a corresponding symbol table index, substitute a symbol table entry associated with the corresponding symbol table index for a pseudo-output parameter value; execute the function by an accelerator on the first computing system using the input parameter values; when the type of the second function call is the asynchronously callable function with substitutable output parameters, for each output parameter in the list of parameter values, map a pseudo-output parameter value from the list of parameter values of the second function call to a corresponding output value in the symbol table; and send a third message including the name of the second function call and a list of output parameter values to the second computing system.
In Example 5, the subject matter of Example 4 can optionally include wherein the second function call is a request to offload performance of a workload from the second computing system to the accelerator on the first computing system.
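The target-side steps of Examples 4 and 5 (add symbol table entries, substitute inputs, execute, map outputs, respond) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the disclosed implementation: `Symbol`, `Target`, `handle`, and the `functions` lookup are hypothetical names, input parameters and output symbols are passed as separate lists for brevity, and a plain Python callable stands in for execution on the accelerator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Pseudo-output parameter value, identified by its symbol table index."""
    index: int


class Target:
    def __init__(self, functions):
        # Assumed lookup: function name -> callable returning a tuple of outputs.
        self._functions = functions
        # Symbol table: index -> real value (None until the value is produced).
        self._symbols = {}

    def handle(self, name, inputs, output_symbols, substitutable):
        # 1. Add the new entries carried by the message to the symbol table.
        for sym in output_symbols:
            self._symbols[sym.index] = None

        # 2. Replace each pseudo-output value among the inputs with the real
        #    value recorded for its symbol table index by an earlier call.
        real_inputs = [
            self._symbols[p.index] if isinstance(p, Symbol) else p
            for p in inputs
        ]

        # 3. Execute the function (a stand-in for the accelerator).
        outputs = self._functions[name](*real_inputs)

        # 4. For substitutable outputs, record the real output values so that
        #    later calls referring to these symbols can be resolved.
        if substitutable:
            for sym, value in zip(output_symbols, outputs):
                self._symbols[sym.index] = value

        # 5. Respond with the function name and the list of output values.
        return name, list(outputs)
```

A short usage example, with hypothetical `alloc` and `fill` functions, shows a second call consuming the output of the first through its symbol:

```python
target = Target({
    "alloc": lambda size: (f"buf{size}",),
    "fill": lambda buf, value: (f"{buf}={value}",),
})
target.handle("alloc", [16], [Symbol(0)], True)
target.handle("fill", [Symbol(0), 7], [Symbol(1)], True)
```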
Example 6 is a method including receiving, on a first computing system acting as an initiator, a first function call from a caller, the first function call to be executed by an accelerator on a second computing system acting as a target, the first computing system coupled to the second computing system by a network; determining a type of the first function call; generating a list of parameter values of the first function call; sending a first message including a name of the first function call, the list of parameter values of the first function call, and one or more new entries for a symbol table, the one or more new entries representing pseudo-output parameter values, to the second computing system; and when the type of the first function call is an asynchronously callable function without output dependencies, returning to the caller; when the type of the first function call is an asynchronously callable function with substitutable output parameters, assigning newly created symbols to output parameters and returning to the caller; and when the type of the first function call is a synchronous function, blocking the caller until a response to the first message is received from the second computing system.
In Example 7, the subject matter of Example 6 can optionally include unblocking the caller when the type of the first function call is a synchronous function and the response to the first message is received from the second computing system.
In Example 8, the subject matter of Example 6 can optionally include wherein the first function call is a request to offload performance of a workload from the first computing system to the accelerator on the second computing system.
In Example 9, the subject matter of Example 6 can optionally include receiving, on the first computing system acting as a target, a second message including a name of a second function call, a list of parameter values of the second function call, and one or more new entries for the symbol table, the one or more new entries representing pseudo-output parameter values, from the second computing system acting as an initiator; adding the one or more new entries from the second message into the symbol table; for each input parameter value in the list of parameter values of the second function call, if there is a corresponding symbol table index, substituting the symbol table entry associated with the corresponding symbol table index for a pseudo-output parameter value; executing the function by an accelerator on the first computing system using the input parameter values; when the type of the second function call is the asynchronously callable function with substitutable output parameters, for each output parameter in the list of parameter values, mapping a pseudo-output parameter value from the list of parameter values of the second function call to a corresponding output value in the symbol table; and sending a third message including the name of the second function call and a list of output parameter values to the second computing system.
In Example 10, the subject matter of Example 9 can optionally include wherein the second function call is a request to offload performance of a workload from the second computing system to the accelerator on the first computing system.
Example 11 is at least one non-transitory machine-readable storage medium comprising instructions that, when executed, cause at least one processor to at least: receive, on a first computing system acting as an initiator, a first function call from a caller, the first function call to be executed by an accelerator on a second computing system acting as a target, the first computing system coupled to the second computing system by a network; determine a type of the first function call; generate a list of parameter values of the first function call; send a first message including a name of the first function call, the list of parameter values of the first function call, and one or more new entries for a symbol table, the one or more new entries representing pseudo-output parameter values, to the second computing system; and when the type of the first function call is an asynchronously callable function without output dependencies, return to the caller; when the type of the first function call is an asynchronously callable function with substitutable output parameters, assign newly created symbols to output parameters and return to the caller; and when the type of the first function call is a synchronous function, block the caller until a response to the first message is received from the second computing system.
In Example 12, the subject matter of Example 11 can optionally include instructions that, when executed, further cause the at least one processor to unblock the caller when the type of the first function call is a synchronous function and the response to the first message is received from the second computing system.
In Example 13, the subject matter of Example 11 can optionally include wherein the first function call is a request to offload performance of a workload from the first computing system to the accelerator on the second computing system.
In Example 14, the subject matter of Example 11 can optionally include instructions that, when executed, cause at least one processor to receive, on the first computing system acting as a target, a second message including a name of a second function call, a list of parameter values of the second function call, and one or more new entries for the symbol table, the one or more new entries representing pseudo-output parameter values, from the second computing system acting as an initiator; add the one or more new entries from the second message into the symbol table; for each input parameter value in the list of parameter values of the second function call, if there is a corresponding symbol table index, substitute a symbol table entry associated with the corresponding symbol table index for a pseudo-output parameter value; execute the function by an accelerator on the first computing system using the input parameter values; when the type of the second function call is the asynchronously callable function with substitutable output parameters, for each output parameter in the list of parameter values, map a pseudo-output parameter value from the list of parameter values of the second function call to a corresponding output value in the symbol table; and send a third message including the name of the second function call and a list of output parameter values to the second computing system.
In Example 15, the subject matter of Example 14 can optionally include wherein the second function call is a request to offload performance of a workload from the second computing system to the accelerator on the first computing system.
Example 16 is a system including a first computing system acting as an initiator; and a second computing system acting as a target, the second computing system coupled to the first computing system by a network, the second computing system including an accelerator; wherein the first computing system is to receive a function call from a caller; determine a type of the function call; generate a list of parameter values of the function call; send a first message including a name of the function call, the list of parameter values of the function call, and one or more new entries for a symbol table, the one or more new entries representing pseudo-output parameter values, to the second computing system; and when the type of the function call is an asynchronously callable function without output dependencies, return to the caller; when the type of the function call is an asynchronously callable function with substitutable output parameters, assign newly created symbols to output parameters and return to the caller; and when the type of the function call is a synchronous function, block the caller until a response to the first message is received from the second computing system; and wherein the second computing system is to receive the first message; add the one or more new entries from the first message into the symbol table; for each input parameter value in the list of parameter values of the function call, if there is a corresponding symbol table index, substitute a symbol table entry associated with the corresponding symbol table index for a pseudo-output parameter value; execute the function by the accelerator using the input parameter values; when the type of the function call is the asynchronously callable function with substitutable output parameters, for each output parameter in the list of parameter values, map a pseudo-output parameter value from the list of parameter values of the function call to a corresponding output value in the symbol table; and send a second message including the name of the function call and a list of output parameter values to the first computing system.
In Example 17, the subject matter of Example 16 can optionally include wherein the first computing system is to unblock the caller when the type of the function call is a synchronous function and the response to the first message is received from the second computing system.
In Example 18, the subject matter of Example 16 can optionally include wherein the function call is a request to offload performance of a workload from the first computing system to the accelerator on the second computing system.
Claims
1.-18. (canceled)
19. An apparatus comprising:
- a processor; and
- a memory device coupled to the processor, the memory device having instructions stored thereon that, in response to execution by the processor, cause the processor to:
- receive, on a first computing system acting as an initiator, a first function call from a caller, the first function call to be executed by a graphical processing unit (GPU) on a second computing system acting as a target, the first computing system coupled to the second computing system by a network;
- determine a type of the first function call;
- generate a list of parameter values of the first function call;
- send a first message including a name of the first function call, the list of parameter values of the first function call, and one or more new entries for a symbol table, the one or more new entries representing pseudo-output parameter values, to the second computing system; and
- when the type of the first function call is an asynchronously callable function without output dependencies, return control to the caller; when the type of the first function call is an asynchronously callable function with substitutable output parameters, assign newly created symbols to output parameters and return control to the caller; and when the type of the first function call is a synchronous function, block the caller until a response to the first message is received from the second computing system.
20. The apparatus of claim 19, the memory device having instructions stored thereon that, in response to execution by the processor, cause the processor to:
- unblock the caller when the response to the first message is received from the second computing system when the type of the first function call is a synchronous function.
21. The apparatus of claim 19, wherein the first function call is a request to offload performance of a workload from the first computing system to the GPU on the second computing system.
22. The apparatus of claim 19, the memory device having instructions stored thereon that, in response to execution by the processor, cause the processor to:
- receive, on the first computing system, a second message including a name of a second function call, a list of parameter values of the second function call, and one or more new entries for the symbol table, the one or more new entries representing pseudo-output parameter values, from the second computing system;
- add the one or more new entries from the second message into the symbol table;
- for each input parameter value in the list of parameter values of the second function call, if there is a corresponding symbol table index, substitute a symbol table entry associated with the corresponding symbol table index for a pseudo-output parameter value;
- execute the function by a GPU on the first computing system using the input parameter values;
- when the type of the second function call is the asynchronously callable function with substitutable output parameters, for each output parameter in the list of parameter values of the second function call, map a pseudo-output parameter value from the list of parameter values of the second function call to a corresponding output value in the symbol table; and
- send a third message including the name of the second function call and a list of output parameter values to the second computing system.
23. The apparatus of claim 22, wherein the second function call is a request to offload performance of a workload from the second computing system to the GPU on the first computing system.
24. A method comprising:
- receiving, on a first computing system acting as an initiator, a first function call from a caller, the first function call to be executed by a graphical processing unit (GPU) on a second computing system acting as a target, the first computing system coupled to the second computing system by a network;
- determining a type of the first function call;
- generating a list of parameter values of the first function call;
- sending a first message including a name of the first function call, the list of parameter values of the first function call, and one or more new entries for a symbol table, the one or more new entries representing pseudo-output parameter values, to the second computing system; and
- when the type of the first function call is an asynchronously callable function without output dependencies, returning control to the caller; when the type of the first function call is an asynchronously callable function with substitutable output parameters, assigning newly created symbols to output parameters and returning control to the caller; and when the type of the first function call is a synchronous function, blocking the caller until a response to the first message is received from the second computing system.
25. The method of claim 24, comprising unblocking the caller when the response to the first message is received from the second computing system when the type of the first function call is a synchronous function.
26. The method of claim 24, wherein the first function call is a request to offload performance of a workload from the first computing system to the GPU on the second computing system.
27. The method of claim 24, comprising:
- receiving, on the first computing system, a second message including a name of a second function call, a list of parameter values of the second function call, and one or more new entries for the symbol table, the one or more new entries representing pseudo-output parameter values, from the second computing system;
- adding the one or more new entries from the second message into the symbol table;
- for each input parameter value in the list of parameter values of the second function call, if there is a corresponding symbol table index, substituting the symbol table entry associated with the corresponding symbol table index for a pseudo-output parameter value;
- executing the function by a GPU on the first computing system using the input parameter values;
- when the type of the second function call is the asynchronously callable function with substitutable output parameters, for each output parameter in the list of parameter values of the second function call, mapping a pseudo-output parameter value from the list of parameter values of the second function call to a corresponding output value in the symbol table; and
- sending a third message including the name of the second function call and a list of output parameter values to the second computing system.
28. The method of claim 27, wherein the second function call is a request to offload performance of a workload from the second computing system to the GPU on the first computing system.
29. At least one non-transitory machine-readable storage medium comprising instructions that, when executed, cause at least one processor to:
- receive, on a first computing system acting as an initiator, a first function call from a caller, the first function call to be executed by a graphical processing unit (GPU) on a second computing system acting as a target, the first computing system coupled to the second computing system by a network;
- determine a type of the first function call;
- generate a list of parameter values of the first function call;
- send a first message including a name of the first function call, the list of parameter values of the first function call, and one or more new entries for a symbol table, the one or more new entries representing pseudo-output parameter values, to the second computing system; and
- when the type of the first function call is an asynchronously callable function without output dependencies, return control to the caller; when the type of the first function call is an asynchronously callable function with substitutable output parameters, assign newly created symbols to output parameters and return control to the caller; and when the type of the first function call is a synchronous function, block the caller until a response to the first message is received from the second computing system.
30. The at least one non-transitory machine-readable storage medium of claim 29, wherein the instructions, when executed, further cause the at least one processor to unblock the caller when the response to the first message is received from the second computing system when the type of the first function call is a synchronous function.
31. The at least one non-transitory machine-readable storage medium of claim 29, wherein the first function call is a request to offload performance of a workload from the first computing system to the GPU on the second computing system.
32. The at least one non-transitory machine-readable storage medium of claim 29 comprising instructions that, when executed, cause at least one processor to:
- receive, on the first computing system, a second message including a name of a second function call, a list of parameter values of the second function call, and one or more new entries for the symbol table, the one or more new entries representing pseudo-output parameter values, from the second computing system;
- add the one or more new entries from the second message into the symbol table;
- for each input parameter value in the list of parameter values of the second function call, if there is a corresponding symbol table index, substitute a symbol table entry associated with the corresponding symbol table index for a pseudo-output parameter value;
- execute the function by a GPU on the first computing system using the input parameter values;
- when the type of the second function call is the asynchronously callable function with substitutable output parameters, for each output parameter in the list of parameter values of the second function call, map a pseudo-output parameter value from the list of parameter values of the second function call to a corresponding output value in the symbol table; and
- send a third message including the name of the second function call and a list of output parameter values to the second computing system.
33. The at least one non-transitory machine-readable storage medium of claim 32, wherein the second function call is a request to offload performance of a workload from the second computing system to the GPU on the first computing system.
Type: Application
Filed: Oct 18, 2022
Publication Date: Feb 16, 2023
Inventors: Pradeep Pappachan (Tualatin, OR), Sujoy Sen (Beaverton, OR), Joseph Grecco (Saddle Brook, NJ), Mukesh Gangadhar Bhavani Venkatesan (Bangalore), Reshma Lal (Hillsboro, OR)
Application Number: 17/968,589