Method and system for deferred command issuing in a computer system


A method and system are disclosed for employing deferred command issuing in a computer system with multiple peripheral processors operating with a peripheral device driver embedded in a multi-threaded central processor. After the peripheral device driver issues a first command with a first event tag, it generates a second command for a first peripheral processor. The second command is stored while awaiting the return of the first event tag, and is issued when the first event tag is returned if the first and second commands need to be synchronized.

Description
PRIORITY DATA

This application claims the benefit of U.S. Patent Application Ser. No. 60/727,668, which was filed on Oct. 18, 2005 and entitled “Smart CPU Sync Technology for MultiGPU Solution.”

CROSS REFERENCE

This application also relates to U.S. patent application entitled “TRANSPARENT MULTI-BUFFERING IN MULTI-GPU GRAPHICS SUBSYSTEM”, U.S. patent application entitled “EVENT MEMORY ASSISTED SYNCHRONIZATION IN MULTI-GPU GRAPHICS SUBSYSTEM” and U.S. patent application entitled “METHOD AND SYSTEM FOR SYNCHRONIZING PARALLEL ENGINES IN A GRAPHICS PROCESSING UNIT”, all of which were filed on the same day as this application and are incorporated herein by reference in their entirety.

BACKGROUND

The present invention relates generally to the synchronization between a computer's central processing units (CPUs) and peripheral processing units, and, more particularly, to the timing of command issuing.

In a modern computer system, each peripheral functional module, such as audio or video, has its own dedicated processing subsystem, and the operations of these subsystems typically require direct control by the computer's central processing unit (CPU). In addition, communication and synchronization among components of the subsystems are typically achieved through hardware connections. In an advanced graphics processing subsystem with two or more graphics processing units (GPUs), for instance, the CPU has to frequently evaluate the state of the GPUs, and a next rendering command can only be issued when a previous or current command has finished. In other cases, when the CPU is calculating something for the GPUs using multi-threading, the GPUs may have to wait for the CPU to complete the calculation before executing commands that need the result. When one GPU requests data from another GPU, the transfer must be made through a direct hardware link or the bus, and must be controlled by the CPU, which then has to wait for the data transfer to complete before executing subsequent commands. Whether the CPU waits for a GPU or vice versa, the wait time is wasted and lowers the computer's overall performance.

It is therefore desirable for a computer system to detach hard waiting from the CPU's operations as much as possible.

SUMMARY

In view of the foregoing, this invention provides a method and system to remove some of the wait time incurred by the CPU, as well as some idle time in peripheral processing units. In other words, it increases parallelism between processors.

A method and system are disclosed for employing deferred command issuing in a computer system with multiple peripheral processors operating with a peripheral device driver embedded in one or more central processors. After the peripheral device driver issues a first command with a first event tag, it generates a second command for a first peripheral processor. The second command is stored while awaiting the return of the first event tag, and is issued when the first event tag is returned if the first and second commands need to be synchronized.

The construction and method of operation of the invention, however, together with additional objects and advantages thereof will be best understood from the following description of specific embodiments when read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a part of a traditional computer system.

FIG. 2 is a block diagram of a part of a computer system according to one embodiment of the invention.

FIG. 3 illustrates commands and event-tag flowing according to one embodiment of the invention.

FIG. 4A is a flow chart showing a command block generating and a synchronization mechanism according to one embodiment of the present invention.

FIGS. 4B and 4C are flow charts illustrating two different driver subroutines within each command block and their execution according to one embodiment of the present invention.

FIGS. 5A and 5B are command timing diagrams showing the time-saving effects of deferred-command-issuing according to one embodiment of the present invention.

DESCRIPTION

Detailed information with regard to the operation of the GPU in the computer system is further described in U.S. patent application entitled “TRANSPARENT MULTI-BUFFERING IN MULTI-GPU GRAPHICS SUBSYSTEM”, U.S. patent application entitled “EVENT MEMORY ASSISTED SYNCHRONIZATION IN MULTI-GPU GRAPHICS SUBSYSTEM” and U.S. patent application entitled “METHOD AND SYSTEM FOR SYNCHRONIZING PARALLEL ENGINES IN A GRAPHICS PROCESSING UNIT”, all of which were filed on the same day as this application and are incorporated herein by reference in their entirety.

FIG. 1 illustrates a part of a traditional computer system 100. In such a system, a peripheral device driver 110 is just a program, functioning essentially like an instruction manual that provides the operating system with information on how to control and communicate with the special processors 120 and 130 of a peripheral subsystem 140. The driver 110 itself has no control function; control is instead carried out by one or more central processors (CPU) 150. Communications between the special processors 120 and 130 take place through a hardware connection 160 or through the bus 170.

As an embodiment of the present invention, FIG. 2 illustrates a part of a multi-processor computer system 200 with a driver 210 embedded in one or more central processors 220. Here, ‘embedded’ means that the driver actually runs on the CPU and employs some of the CPU's processing capability, so that the driver can generate commands to be stored in the buffer, assign event-tags when synchronization with other commands is needed, issue the commands and monitor the return of the event-tags, all without any CPU hard wait. Such a driver implementation does not require extensive hardware support, so it is also cost-effective.

The computer system 200 also employs a command buffer 230, which stores immediate commands sent by the driver 210. The command buffer 230 can be simply a memory space in a main memory 290 or in another memory located anywhere, and can be dynamically allocated by the driver 210. With the processing power of the central processors 220, the driver 210 directs the buffering of commands into, and their subsequent issuing from, the command buffer 230, as well as synchronization among the special processors 240 and 250 and the central processors 220. The special processors can be processors dedicated to graphics operations, known as graphics processing units (GPUs).

FIG. 3 is a diagram showing the flow of commands among the CPU, the buffers and the special processors according to one embodiment of the present invention. For illustration purposes, it provides more details on command buffering. An embedded driver 320 generates commands along with event-tags, and then sends them selectively to command buffers 330 and 340. Commands and event-tags for special processor1 350 are sent to command buffer1 330, and commands and event-tags for special processor2 360 are sent to command buffer2 340, so that the commands for different special processors can be issued independently and simultaneously. When a current command needs to synchronize with the execution of another command, the driver 320 generates an event-tag alongside the current command. The processors, either the peripheral special processors 350 and 360 or the central processor(s) 300, execute their corresponding commands and return event-tags, if present, upon completion of the execution. Various control mechanisms are installed among them through these communications. For example, the central processor(s) 300 can control both buffers in its operation.
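The per-processor buffering just described can be sketched in a few lines of code. The following Python sketch is illustrative only; the class and method names (`DeferredDriver`, `send`, `on_tag_returned`) are invented for this example and do not appear in the disclosure.

```python
from collections import deque

class DeferredDriver:
    """Illustrative sketch of the embedded driver of FIG. 3: one command
    buffer per special processor, with optional event-tags attached."""

    def __init__(self, processor_ids):
        # One FIFO command buffer per special processor (buffer1, buffer2, ...).
        self.buffers = {pid: deque() for pid in processor_ids}
        self.outstanding_tags = set()   # tags issued but not yet returned

    def send(self, pid, command, event_tag=None):
        # Queue a command (with its tag, when synchronization is needed)
        # into the buffer of the target processor.
        self.buffers[pid].append((command, event_tag))
        if event_tag is not None:
            self.outstanding_tags.add(event_tag)

    def on_tag_returned(self, event_tag):
        # A processor finished a tagged command; the tag is no longer outstanding.
        self.outstanding_tags.discard(event_tag)

driver = DeferredDriver(["gpu1", "gpu2"])
driver.send("gpu1", "render_frame", event_tag="tag_i")
driver.send("gpu2", "copy_texture")
```

Because each processor has its own buffer, commands destined for different processors never block one another, which matches the independent and simultaneous issuing described above.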

FIG. 4A presents a flow chart detailing how the graphics driver 320 synchronizes command issuing with the GPUs and the CPU. Here, the driver 320 generates command blocks in steps 410A through 470A continuously, without any delay on the CPU side. Some of these commands are stored in command buffers before being issued to the GPUs for execution. For example, command block[n−1] 410A has a command to a first GPU with a request to return an event-tag[i]. The first GPU returns the event-tag[i] upon completion of command block[n−1] 410A. Upon detecting the event-tag[i], another command that needs to synchronize with command block[n−1] 410A can then be issued from the command buffer by the driver 320. In this way, the CPU's hard wait for a synchronization event is eliminated. The term “deferred command issuing” refers generally to this command buffering process.

FIG. 4A also shows a command block[n+m] 440A that needs to synchronize with another CPU thread, as well as a command block[n+m+k] 470A that needs to synchronize with a second GPU. In both cases, the driver 320's operations of storing commands, checking event-tags and issuing commands are the same as in the first GPU case above.

Within each command block, the driver 320 executes certain subroutines, such as generating a new command and an associated event-tag if needed, checking for returned event-tags, buffering the new command, and issuing a buffered command or directly issuing the new command if there is no outstanding event-tag. These subroutines can be executed in various sequences. FIGS. 4B and 4C are two examples of conducting such subroutines.

Referring to FIG. 3 and FIG. 4B, the driver 320 first generates a current command in step 410B, and then checks for any returned event-tag in step 420B. If there is a returned event-tag, and if a related command is in a buffer, then the driver 320 issues the buffered command along with its own event-tag, if present, as shown in steps 430B and 440B. Here, ‘related’ means that there is a synchronization need between the buffered command and the previous command that returned the event-tag. If the related command is not in the buffer, then the driver 320 checks in step 450B whether the current command is related to the returned event-tag. If so, it issues the current command (step 470B); if not, it buffers the current command (step 480B).

On the other hand, if there is no returned event-tag, the driver 320 then checks for any outstanding event-tag in step 460B. If there is an outstanding event-tag on which the current command depends or to which it is related, the driver 320 buffers the current command (step 480B). If there is no outstanding related event-tag, the driver 320 directly issues the current command. Note that in all cases of command buffering or issuing, the associated event-tag, if present, is also buffered or issued along with the command.
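As a minimal illustration of the FIG. 4B ordering, the sketch below generates the current command first and then decides whether to issue or buffer it. The helper names are hypothetical; the `related(cmd, tag)` predicate stands in for the driver's synchronization bookkeeping.

```python
def process_command_4b(current, returned_tag, buffer, outstanding_tags, related):
    """Illustrative sketch of the FIG. 4B subroutine order: the current
    command is generated first, then issued or buffered."""
    issued = []
    if returned_tag is not None:
        # Steps 430B/440B: issue any buffered command related to the returned tag.
        for cmd in list(buffer):
            if related(cmd, returned_tag):
                buffer.remove(cmd)
                issued.append(cmd)
        # Step 450B: is the current command itself related to the returned tag?
        if related(current, returned_tag):
            issued.append(current)          # step 470B: issue it
        else:
            buffer.append(current)          # step 480B: buffer it
    elif any(related(current, t) for t in outstanding_tags):
        buffer.append(current)              # step 480B: wait for the tag
    else:
        issued.append(current)              # no sync need: issue directly
    return issued
```

In this simplified model a command string encodes its synchronization dependency, so `related` can be an arbitrary caller-supplied predicate; a real driver would consult its event-tag records instead.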

FIG. 4C shows another subroutine according to another embodiment of the present invention, in which the driver 320 first checks for any returned event-tag in step 410C. If there is a returned event-tag, and if a related command is in the buffer, then the driver 320 issues the buffered command (step 430C). If there is no returned event-tag (step 410C) or no related command in the buffer (step 420C), then the driver 320 generates a current command (step 445C). If the current command is related to a returned event-tag (step 450C), the driver issues the current command (step 480C). If the current command is not related to any returned event-tag, the driver then checks whether there is any outstanding event-tag that relates to the current command (step 460C). If there is an outstanding related event-tag, the driver 320 buffers the current command along with its event-tag, if present (step 470C); otherwise, the driver 320 issues the current command along with its event-tag, if present (step 480C). The aforementioned event-tag checking can be limited to only those processors to which commands with event-tags have previously been sent.
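The alternative ordering of FIG. 4C can be sketched similarly, with the returned event-tag checked before the current command is generated. Again the names are invented for illustration; `generate` is a callable producing the current command, mirroring step 445C.

```python
def process_command_4c(generate, returned_tag, buffer, outstanding_tags, related):
    """Illustrative sketch of the FIG. 4C ordering: check the returned
    event-tag first, then generate the current command."""
    issued = []
    if returned_tag is not None:
        # Steps 410C-430C: issue any buffered command related to the returned tag.
        for cmd in list(buffer):
            if related(cmd, returned_tag):
                buffer.remove(cmd)
                issued.append(cmd)
    current = generate()                        # step 445C
    if returned_tag is not None and related(current, returned_tag):
        issued.append(current)                  # step 480C
    elif any(related(current, t) for t in outstanding_tags):
        buffer.append(current)                  # step 470C: buffer and wait
    else:
        issued.append(current)                  # step 480C: issue directly
    return issued
```

The two orderings differ only in when the current command is generated relative to the event-tag check; both avoid any hard wait on the CPU side.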

In both cases shown in FIGS. 4B and 4C, and as an alternative, a current command can always be buffered if there is any outstanding event-tag. If the driver 320 checks only the event-tag buffer, then before an outstanding event-tag returns, the driver 320 has no way to know whether it is related to a newly generated current command. Therefore, the current command has to be buffered whenever there is any outstanding event-tag.

FIGS. 5A and 5B are timing diagrams illustrating how deferred-command-issuing reduces CPU wait time and GPU idle time. FIG. 5A represents the situation in which a deferred-command-issuing process is not employed. In this case, the CPU generates commands in time slots 500A, 510A and 520A for execution by a first GPU (GPU1) in time slots 550A, 560A and 570A, respectively. Commands generated in time slots 502A, 512A and 522A are executed by a second GPU (GPU2) in time slots 552A, 562A and 572A, respectively. Since no command buffering is employed, a subsequent command can only be generated and issued when a current GPU operation is completed. For example, time slot 510A can only be initiated after time slot 552A, and similarly, time slot 520A follows time slot 562A. The CPU has to wait while a previously issued command is executing. As shown, the time intervals between two adjacent time slots are either CPU wait time or GPU idle time. For instance, the interval between 510A and 502A is the CPU's wait time, and the interval between 560A and 550A is GPU1's idle time.

In contrast to FIG. 5A, FIG. 5B illustrates the timing relationships when the deferred-command-issuing process is employed, allowing the CPU to generate commands continuously into the command buffers without waiting for any GPU to complete a command execution. In this case, the CPU generates commands in time slots 500B, 510B and 520B for GPU1 in time slots 550B, 560B and 570B, respectively. Commands generated in time slots 502B, 512B and 522B are for GPU2 in time slots 552B, 562B and 572B, respectively. As shown, the CPU command-generating time slot 510B is moved up to follow the completion of time slot 502B, which is prior to the end of GPU2 time slot 552B. But the CPU's fifth command at time slot 520B still waits for time slot 552B to end, because there is a synchronization need between this particular command and the GPU2 execution; the same holds for the command at 530B and the GPU2 execution at 562B. In such a command processing system, especially a graphics system employing such a deferred-command-issuing process, benefits are obtained because a subsequent command is already generated and waiting in the command buffer for execution by the GPU. In turn, the GPUs do not have to wait for the CPU to generate commands, and can execute a subsequent command right after a current one finishes. This is further illustrated for GPU2 at time slots 562B and 572B, where GPU2 has no idle time. For the same reason, the GPU1 idle time between time slots 560B and 570B is also reduced.

To quantify the time saved by the deferred-command-issuing process, assume that the CPU command-generating time is ‘t’, and that the execution times of GPU1 and GPU2 are T1 and T2, respectively (with T1&lt;T2 to simplify the evaluation). As shown in FIG. 5A, a system without deferred-command-issuing takes (3*T2+3*t) to complete three command cycles. The system with the deferred-command-issuing of FIG. 5B shortens the three-cycle time to (3*T2+t). Thus the time saving for three cycles is 2*t. In general, the saving is (n−1)*t for n command cycles.

A comparison between FIGS. 5A and 5B also shows a saving in GPU idle time. In FIG. 5A, the GPU1 idle time between time slots 560A and 570A is T2−T1+t, and the GPU2 idle time is t. In FIG. 5B, the GPU1 idle time between the corresponding time slots becomes T2−T1, a saving of t. The GPU2 idle time is completely eliminated, also a saving of t.
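The savings above can be checked with a few lines of arithmetic. The function names below are illustrative only; the model simply follows the serialization shown in FIGS. 5A and 5B.

```python
def cycle_time(n, t, T2, deferred):
    """Total time for n command cycles, bounded by the slower GPU (T2),
    with CPU command-generation time t per command."""
    # Without deferral, each cycle serializes generation (t) and execution (T2);
    # with deferral, only the first generation stays on the critical path.
    return n * T2 + (t if deferred else n * t)

def gpu1_idle(T1, T2, t, deferred):
    """GPU1 idle time between consecutive executions (T1 < T2 assumed)."""
    return (T2 - T1) if deferred else (T2 - T1 + t)

# Three cycles with t = 2 and T2 = 10: the saving is (3 - 1) * 2 = 4.
saving = cycle_time(3, 2, 10, deferred=False) - cycle_time(3, 2, 10, deferred=True)
```

For any n, the difference between the two cases reduces algebraically to (n−1)*t, matching the general result stated above.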

This invention provides many different embodiments, or examples, for implementing different features of the invention. Specific examples of components and methods are described to help clarify the disclosure. These are, of course, merely examples and are not intended to limit the disclosure from that described in the claims.

Claims

1. A method for deferred command issuing in a computer system with one or multiple special purpose processors operating with a peripheral device driver running on one or multiple central processors, the method comprising:

issuing a first command with a first event tag by the peripheral device driver;
generating a second command for a first peripheral processor by the peripheral device driver following the issuing of the first command;
storing the second command while awaiting the return of the first event tag; and
issuing the second command when the first event tag is returned.

2. The method of claim 1, wherein storing the second command further includes storing the second command in a buffer associated with the first processor.

3. The method of claim 2, further comprising:

generating a third command for a second processor; and
storing the third command in a buffer associated therewith.

4. The method of claim 3, wherein the buffers associated with the first and second processors are different.

5. The method of claim 1, further comprising checking, before the second command is issued, whether the generated second command relates to the first command that requires the first event tag to return.

6. The method of claim 5, wherein checking further includes:

checking whether the first event tag has returned; and
checking whether the first event tag is outstanding if it is not yet returned and if it relates to the second command.

7. The method of claim 6, wherein checking whether the first event tag has returned is performed after the generating the second command.

8. The method of claim 6, wherein checking whether the first event tag has returned is performed prior to the generating the second command.

9. A method for deferred command issuing in a computer system with multiple graphics processors operating with a graphics driver embedded in a multi-threaded central processor, the method comprising:

issuing a first command with a first event tag by the graphics driver;
generating a second command to a first processor of the computer system by the graphics driver following the issuing of the first command;
storing the second command while awaiting the return of the first event tag; and
issuing the second command when the first event tag is returned.

10. The method of claim 9, wherein storing the second command further includes storing the second command in a buffer associated with the first processor.

11. The method of claim 10, further comprising:

generating a third command to a second processor; and
storing the third command in a buffer associated therewith.

12. The method of claim 11, wherein the buffers associated with the first and second processors are different.

13. The method of claim 9, further comprising checking whether the generated second command needs to wait for the first event tag to return.

14. The method of claim 13, wherein the checking further includes:

checking whether the first event tag has returned; and
checking whether the first event tag is outstanding if it has not returned.

15. The method of claim 14, wherein checking whether the first event tag has returned is performed after the generating the second command.

16. The method of claim 14, wherein checking whether the first event tag has returned is performed prior to the generating the second command.

17. A system for supporting deferred command issuing in an advanced computer system, the system comprising:

a multi-threaded central processing unit (CPU);
a graphics subsystem with multiple graphics processing units;
at least one command buffer for storing commands and associated event-tags; and
a graphics driver embedded in the CPU for generating commands to be stored in the command buffers, assigning event-tags when synchronizations are needed, controlling command issuing and monitoring event-tag returns.
Patent History
Publication number: 20070088856
Type: Application
Filed: Oct 17, 2006
Publication Date: Apr 19, 2007
Applicant:
Inventor: Guofeng Zhang (Shanghai)
Application Number: 11/581,975
Classifications
Current U.S. Class: 710/6.000
International Classification: G06F 3/00 (20060101);