EFFICIENT TASK ALLOCATION

- Arm Limited

A method and processor comprising a command processing unit to receive, from a host processor, a sequence of commands to be executed and to generate, based on the sequence of commands, a plurality of tasks. The processor also comprises a plurality of compute units, each having a first processing module for executing tasks of a first task type, a second processing module for executing tasks of a second task type, different from the first task type, and a local cache shared by at least the first processing module and the second processing module. The command processing unit issues the plurality of tasks to at least one of the plurality of compute units, and at least one of the plurality of compute units processes at least one of the plurality of tasks.

Description
BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to methods, processors, and non-transitory computer-readable storage media for handling the management of different task types such as neural network processing operations and graphics processing operations.

Description of the Related Technology

Neural networks can be used for processes such as machine learning, computer vision, and natural language processing operations. A neural network may operate upon suitable input data (e.g. an image or sound data) to ultimately provide a desired output (e.g. an identification of an object within an image, or a spoken word within a sound clip, or other useful output inferred from the input data). This process is usually known as “inferencing” or “classification”. In a graphics (image) processing context, neural network processing may also be used for image enhancement (“de-noising”), segmentation, “anti-aliasing”, supersampling, etc., in which case a suitable input image may be processed to provide a desired output image.

A neural network will typically process the input data (e.g. image or sound data) according to a network of operators, each operator performing a particular operation. The operations will generally be performed sequentially to produce desired output data (e.g. a classification based on the image or sound data). Each operation may be referred to as a “layer” of neural network processing. Hence, neural network processing may comprise a sequence of “layers” of processing, such that the output from each layer is used as an input to a next layer of processing.
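As a purely illustrative example of this layer-by-layer flow (not taken from the patent), the C++ sketch below chains two toy layer functions so that the output of each layer is used as the input to the next.

```cpp
// Illustrative only: a "network" as a sequence of layers, where each layer's
// output becomes the next layer's input.
#include <cstdio>
#include <functional>
#include <vector>

using Tensor = std::vector<float>;
using Layer  = std::function<Tensor(const Tensor&)>;

// Run the layers in sequence: the output of each layer is used as the input
// to the next layer of processing.
Tensor run_network(const std::vector<Layer>& layers, Tensor input) {
    for (const Layer& layer : layers) {
        input = layer(input);
    }
    return input;  // final output, e.g. classification scores
}

int main() {
    // Two toy "layers": double every value, then add one to every value.
    std::vector<Layer> layers = {
        [](const Tensor& t) { Tensor o; for (float v : t) o.push_back(v * 2.0f); return o; },
        [](const Tensor& t) { Tensor o; for (float v : t) o.push_back(v + 1.0f); return o; },
    };
    Tensor out = run_network(layers, Tensor{1.0f, 2.0f, 3.0f});
    std::printf("%.1f %.1f %.1f\n", out[0], out[1], out[2]);  // prints 3.0 5.0 7.0
    return 0;
}
```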

In some data processing systems a dedicated neural processing unit (NPU) is provided as a hardware accelerator that is operable to perform such machine learning processing as and when desired, e.g. in response to an application that is executing on a host processor (e.g. central processing unit (CPU)) requiring the machine learning processing. Similarly, a dedicated graphics processing unit (GPU) may be provided as a hardware accelerator that is operable to perform graphics processing. These dedicated hardware accelerators may be provided along the same interconnect (bus) alongside other components, such that the host processor is operable to request the hardware accelerators to perform a set of operations accordingly. The NPU and GPU are therefore dedicated hardware units for performing operations such as machine learning processing operations and graphics processing operations on request by the host processor.

SUMMARY

According to a first aspect, there is provided a processor comprising: a command processing unit to receive, from a host processor, a sequence of commands to be executed, and to generate, based on the sequence of commands, a plurality of tasks; and a plurality of compute units, wherein at least one of the plurality of compute units comprises: a first processing module for executing tasks of a first task type generated by the command processing unit; a second processing module for executing tasks of a second task type, different from the first task type, generated by the command processing unit; and a local cache shared by at least the first processing module and the second processing module; wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks. This enables the issuance of tasks to different processing modules which share a local cache. This improves the efficiency and resource usage of the processor and reduces component size, since the scheduling and job decomposition tasks are undertaken by the command processing unit. Furthermore, the command processing unit issues the tasks based on compute unit availability, such that tasks requiring the use of the same resources, for example where one task generates output data and that output data is input data for another task, can be scheduled in such a way that the shared local cache can be used. This reduces memory read/write operations to higher-level/external memories, reducing the amount of processing and therefore reducing processing times.
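Purely by way of illustration, the C++ sketch below models this arrangement; the type names (TaskType, Task, Command, ComputeUnit) and their fields are assumptions made for the example and are not taken from the patent.

```cpp
// Illustrative sketch of the structure described above: a command carries
// metadata describing its tasks, each task has a type, and each compute
// unit has two processing modules that share one local cache. Names and
// fields are assumptions for this example only.
#include <cstdint>
#include <vector>

enum class TaskType { Graphics, Neural };   // first / second task type

struct Task {
    uint32_t commandId;  // command the task was generated from
    uint32_t taskId;     // tasks sharing an identifier are co-located
    TaskType type;       // selects which processing module executes it
};

struct Command {
    uint32_t id;
    std::vector<TaskType> taskTypes;  // metadata: number and types of tasks
};

struct ComputeUnit {
    std::vector<Task> firstModuleWork;   // e.g. graphics processing module
    std::vector<Task> secondModuleWork;  // e.g. neural processing module
    // Both modules share a single local (e.g. L1) cache, omitted here.
};
```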

The command processing unit may issue tasks of the first task type to the first processing module of a given compute unit and issue tasks of the second task type to the second processing module of the given compute unit. This enables tasks of differing types to be issued to different processing modules of the compute units. This improves efficiency, as the scheduling and issuance of tasks to individual processing modules is undertaken by the command processing unit rather than each compute unit and/or the host processor.

The first task type is a task for undertaking at least a portion of a graphics processing operation forming one of a set of pre-defined graphics processing operations which collectively enable implementation of a graphics processing pipeline, and the second task type is a task for undertaking at least a portion of a neural processing operation. The graphics processing operation comprises at least one of a graphics compute shader task; a vertex shader task; a fragment shader task; a tessellation task; and a geometry shader task. This enables the tasks of a given command in the sequence of commands to be allocated to the most appropriate processing module based on the type of processing operation.

Each compute unit may be a shader core of a graphics processing unit. This enables commands which comprise tasks requiring both graphics processing and neural processing to be undertaken using a single piece of hardware, thereby reducing the number of memory transactions and the hardware size.

The first processing module may be a graphics processing module and the second processing module may be a neural processing module. This enables the efficient sharing of tasks within a single command that requires the use of both a neural processing unit and a graphics processing unit, thereby improving efficiency and resource usage.

The command processing unit may further comprise at least one dependency tracker to track dependencies between commands in the sequence of commands; and wherein the command processing unit is to use the at least one dependency tracker to wait for completion of processing of a given task of a first command in the sequence of commands before issuing an associated task of a second command in the sequence of commands for processing, where the associated task is dependent on the given task. This enables the command processing unit to issue the tasks of commands to given compute units, and in a given order, based on whether they use the output of a preceding command. This improves efficiency by enabling tasks of a given command to use/reuse data stored in the local cache of a given compute unit.
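A minimal sketch of this dependency-gated issuing, assuming completion is recorded per task identifier in a set, might look as follows; the class and method names are illustrative only and not taken from the patent.

```cpp
// Minimal dependency-tracking sketch: a dependent task of a later command is
// only issued once the task it depends on has completed. The set-based
// design is an assumption for illustration.
#include <cstdint>
#include <unordered_set>

class DependencyTracker {
public:
    void markCompleted(uint32_t taskId) { completed_.insert(taskId); }

    // True once the task this one depends on has finished, so the dependent
    // task of the later command may now be issued for processing.
    bool canIssue(uint32_t dependsOnTaskId) const {
        return completed_.count(dependsOnTaskId) != 0;
    }

private:
    std::unordered_set<uint32_t> completed_;  // identifiers of finished tasks
};
```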

An output of the given task may be stored in the local cache. This improves efficiency by enabling tasks of a given command to use/reuse data stored in the local cache of a given compute unit.

Each command in the sequence of commands may have metadata, wherein the metadata may comprise indications of at least a number of tasks in the command, and task types associated with each of the tasks. This ensures that the command processing unit can efficiently decompose a command into tasks, and indicate their task types such that it is capable of issuing the tasks to the desired compute unit and in the most efficient manner.

The command processing unit may allocate, to each command in the sequence of commands, a command identifier, and the dependency tracker tracks dependencies between commands in the sequence of commands based on the command identifier. Furthermore, when the given task of the first command is dependent on the associated task of the second command, the command processing unit allocates the given task and the associated task a same task identifier. Additionally, the tasks of each of the commands that have been allocated the same task identifier may be executed on the same compute unit of the plurality of compute units. This enables the efficient tracking of commands, tasks, and their dependencies, thereby improving the efficiency of the allocation to individual compute units.

A task allocated a first task identifier may be executed on a first compute unit of the plurality of compute units and a task allocated a second, different, task identifier may be executed on a second compute unit of the plurality of compute units. This enables tasks having different task identifiers to be allocated to different compute units since they are unrelated and do not require the use of the shared local cache to perform efficiently.

A task allocated a first task identifier, and of the first type, may be executed on the first processing module of a given compute unit of the plurality of compute units, and a task allocated a second, different, task identifier, and of the second task type, may be executed on the second processing module of the given compute unit of the plurality of compute units. This enables tasks with different identifiers and types to be issued to the same compute unit but executed on different processing modules, thereby improving efficiency and ensuring the maximization of the usage of the available resources.
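The identifier-based placement described in the preceding paragraphs could be sketched as below; the round-robin choice for previously unseen identifiers is an assumption made for the example, not a policy taken from the patent.

```cpp
// Hedged sketch of identifier-based placement: tasks sharing a task
// identifier are sent to the same compute unit (so dependent tasks can reuse
// its local cache), while tasks with different identifiers may be spread
// across compute units. Round-robin selection is an assumption.
#include <cstdint>
#include <unordered_map>

class TaskPlacer {
public:
    explicit TaskPlacer(uint32_t numComputeUnits) : numUnits_(numComputeUnits) {}

    uint32_t computeUnitFor(uint32_t taskId) {
        auto it = unitForId_.find(taskId);
        if (it != unitForId_.end())
            return it->second;                // same identifier -> same unit
        uint32_t unit = next_++ % numUnits_;  // new identifier -> next unit
        unitForId_[taskId] = unit;
        return unit;
    }

private:
    std::unordered_map<uint32_t, uint32_t> unitForId_;  // task id -> compute unit
    uint32_t numUnits_;
    uint32_t next_ = 0;
};
```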

Each of the plurality of compute units may further comprise at least one queue of tasks, wherein the queued tasks comprise at least a part of the sequence of commands. This enables the compute units to have queues of tasks in a given sequence generated by the command processing unit, thereby enabling tasks to be scheduled in accordance with the sequence and based on resource availability.

A given queue may be associated with at least one task type. This enables queues to be formed for multiple task types, thereby allowing the compute unit to handle different types of tasks, which improves the efficiency of scheduling the tasks.
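As an illustration, per-type queues within one compute unit might be arranged as in the following sketch; the two-queue layout and the names are assumptions for the example rather than the patent's design.

```cpp
// Illustrative per-type task queues within one compute unit: each queue is
// associated with a task type and feeds the matching processing module.
#include <deque>

enum class TaskType { Graphics, Neural };
struct Task { TaskType type; /* identifiers omitted for brevity */ };

struct QueuedComputeUnit {
    std::deque<Task> graphicsQueue;  // consumed by the graphics processing module
    std::deque<Task> neuralQueue;    // consumed by the neural processing module

    void enqueue(const Task& task) {
        // Append the task to the queue whose type matches the task's type.
        (task.type == TaskType::Graphics ? graphicsQueue : neuralQueue).push_back(task);
    }
};
```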

According to a second aspect, there is provided a method of allocating tasks associated with commands in a sequence of commands, comprising: receiving at a command processing unit, from a host processor, the sequence of commands to be executed; generating, at the command processing unit, based on the received sequence of commands, a plurality of tasks; allocating, to each of the plurality of tasks in a given command of the sequence of commands, an identifier based on metadata associated with each command; and issuing, by the command processing unit, each task to a compute unit of a plurality of compute units for execution, each compute unit comprising: a first processing module for executing tasks of a first task type; a second processing module for executing tasks of a second task type; and a local cache shared by at least the first processing module and the second processing module; wherein a task associated with a first command of the plurality of commands and a task associated with a second command of the plurality of commands, each being allocated the same identifier, are assigned to a given compute unit. This improves the efficiency and resource usage of the processor and reduces component size, since the scheduling and job decomposition tasks are undertaken by the command processing unit. Furthermore, the command processing unit issues the tasks based on compute unit availability, such that tasks requiring the use of the same resources, for example where one task generates output data and that output data is input data for another task, can be scheduled in such a way that the shared local cache can be used. This reduces memory read/write operations to higher-level/external memories, reducing the amount of processing and therefore reducing processing times.

The command processing unit may wait for the completion of processing of the tasks associated with the first command before issuing the tasks associated with the second command to the given compute unit, when the task associated with the second command is dependent on the task associated with the first command. This enables the command processing unit to issue the tasks of commands to given compute units, and in a given order, based on whether they use the output of a preceding command. This improves efficiency by enabling tasks of a given command to use/reuse data stored in the local cache of a given compute unit.

The metadata may comprise indications of at least a number of tasks in the given command, and task types associated with each of the plurality of tasks. This ensures that the command processing unit can efficiently decompose a command into tasks, and indicate their task types such that it is capable of issuing the tasks to the desired compute unit and in the most efficient manner.

According to a third aspect, there is provided a non-transitory computer-readable storage medium comprising a set of computer-readable instructions stored thereon which, when executed by at least one processor, are arranged to allocate tasks associated with commands in a sequence of commands, wherein the instructions, when executed, cause the at least one processor to: receive at a command processing unit, from a host processor, the sequence of commands to be executed; generate, at the command processing unit, based on the received sequence of commands, a plurality of tasks; allocate, to each of the plurality of tasks in a given command of the sequence of commands, an identifier based on metadata associated with each command; and issue, by the command processing unit, each task to a compute unit of a plurality of compute units for execution, each compute unit comprising: a first processing module for executing tasks of a first task type; a second processing module for executing tasks of a second task type; and a local cache shared by at least the first processing module and the second processing module; wherein a task associated with a first command of the plurality of commands and a task associated with a second command of the plurality of commands, each being allocated the same identifier, are assigned to a given compute unit. This improves the efficiency and resource usage of the processor and reduces component size, since the scheduling and job decomposition tasks are undertaken by the command processing unit. Furthermore, the command processing unit issues the tasks based on compute unit availability, such that tasks requiring the use of the same resources, for example where one task generates output data and that output data is input data for another task, can be scheduled in such a way that the shared local cache can be used. This reduces memory read/write operations to higher-level/external memories, reducing the amount of processing and therefore reducing processing times.

BRIEF DESCRIPTION OF THE DRAWINGS

Further features and advantages will become apparent from the following description of preferred embodiments, given by way of example only, which is made with reference to the accompanying drawings in which like reference numerals are used to denote like features.

FIG. 1 is a schematic diagram of a processor according to an embodiment;

FIG. 2 is a schematic diagram of a command processing unit according to an embodiment;

FIG. 3 is a schematic representation of the allocation of tasks to processing modules by a command processing unit, according to an embodiment;

FIG. 4 is a flowchart of a method for allocating tasks according to an embodiment; and

FIG. 5 is a schematic diagram of a system comprising features according to examples.

DETAILED DESCRIPTION OF CERTAIN INVENTIVE EMBODIMENTS

In some systems dedicated hardware units such as neural processing units (NPU) and graphics processing units (GPU) are provided as distinct hardware accelerators that are operable to perform relevant processing operations under the separate control of a host processor (such as a central processing unit (CPU)). For example, the NPU is operable to perform machine learning processing as and when desired, e.g. in response to an application that is executing on the host processor requiring the machine learning processing and issuing instructions to the NPU to execute. For instance, an NPU may be provided along the same interconnect (bus) as other hardware accelerators, such as a graphics processor (graphics processing unit, GPU), such that the host processor is operable to request the NPU to perform a set of machine learning processing operations accordingly, e.g. in a similar manner as the host processor is able to request the graphics processor to perform graphics processing operations. The NPU is thus a dedicated hardware unit for performing such machine learning processing operations on request by the host processor (CPU).

It has been recognized that, whilst not necessarily being designed or optimized for this purpose, a graphics processor (GPU) may also be used (or re-purposed) to perform machine learning processing tasks. For instance, convolutional neural network processing often involves a series of multiply-and-accumulate (MAC) operations for multiplying input feature values with the relevant feature weights of the kernel filters to determine the output feature values. Graphics processor shader cores may be well-suited for performing these types of arithmetic operations, as these operations are generally similar to the arithmetic operations that may be required when performing graphics processing work (but on different data). Also, graphics processors typically support high levels of concurrent processing (e.g. supporting large numbers of execution threads), and are optimized for data-plane (rather than control plane) processing, all of which means that graphics processors may be well-suited for performing machine learning processing.
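For illustration only, a single output feature value produced by such a MAC sequence can be written as a plain dot product of input feature values and kernel weights; this toy function is not taken from the patent.

```cpp
// Toy illustration of the multiply-and-accumulate (MAC) pattern: each input
// feature value is multiplied by the corresponding kernel weight and the
// products are accumulated into one output feature value.
#include <cstddef>
#include <cstdio>
#include <vector>

float mac(const std::vector<float>& inputs, const std::vector<float>& weights) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < inputs.size() && i < weights.size(); ++i)
        acc += inputs[i] * weights[i];  // multiply, then accumulate
    return acc;                         // one output feature value
}

int main() {
    std::printf("%f\n", mac({1.f, 2.f, 3.f}, {0.5f, 0.25f, 0.125f}));  // prints 1.375000
    return 0;
}
```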

Thus, a GPU may be operated to perform machine learning processing work. In that case, the GPU may be used to perform any suitable and desired machine learning processing tasks. The machine learning processing that is performed by the GPU may thus include general purpose training and inferencing jobs (that do not relate to graphics processing work as such). However, a GPU may also execute machine learning (e.g. inference) jobs for graphics processing operations, such as when performing “super sampling” techniques using deep learning, or when performing de-noising during a ray tracing process, for example.

However, using graphics processors to perform machine learning processing tasks can be a relatively inefficient use of the graphics processor's resources, as the graphics processor is not generally designed (or optimized) for such tasks, and can therefore result in lower performance, e.g. compared to using a dedicated machine learning processing unit (e.g. NPU). At least in the situation where the machine learning processing relates to a graphics processing (rendering) task, re-purposing some of the graphics processor's functional units to perform the desired machine learning processing operations also prevents those functional units from performing the graphics processing work that they are designed for, which can further reduce the performance of the overall (rendering) process.

Nonetheless, in some cases, it may still be desirable to perform machine learning processing tasks using a graphics processor, e.g. rather than using an external machine learning processing unit, such as an NPU. For instance, this may be desirable, e.g. in order to reduce silicon area, and reduce data movement, etc., especially in mobile devices where area and resource may be limited, and where it may therefore be particularly desirable to be able to use existing and available resources to perform the desired work, potentially avoiding the need for an NPU altogether. There are other examples where this may be desirable, especially where the machine learning processing itself relates to a graphics processing task, and wherein it may be particularly desirable to free up the execution unit and other functional units of the graphics processor to perform actual graphics processing operations.

FIG. 1 is a schematic diagram 100 of a processor 130 that provides dedicated circuitry that can be used to perform operations which would normally be undertaken by dedicated hardware accelerators, such as an NPU and a GPU. It will be appreciated that the types of hardware accelerator for which the processor 130 may provide dedicated circuitry are not limited to an NPU or GPU; dedicated circuitry may be provided for any type of hardware accelerator. As mentioned above, GPU shader cores may be well-suited for performing certain types of arithmetic operations, such as neural processing operations, as these operations are generally similar to the arithmetic operations that may be required when performing graphics processing work (but on different data). Furthermore, graphics processors typically support high levels of concurrent processing (e.g. supporting large numbers of execution threads), and are optimized for data-plane (rather than control plane) processing, all of which means that graphics processors may be well-suited for performing other types of operations.

That is, rather than using entirely separate hardware accelerators, such as a machine learning processing unit that is independent of the graphics processor, such as an NPU, or only being able to perform machine learning processing operations entirely using the hardware of the GPU, dedicated circuitry may be incorporated into the GPU itself.

This means that the hardware accelerator circuitry incorporated into the GPU is operable to utilize some of the GPU's existing resources (e.g. such that at least some functional units and resources of the GPU can effectively be shared between the different hardware accelerator circuitry, for instance), whilst still allowing an improved (more optimized) performance compared to performing all the processing with general purpose execution.

As such, in one embodiment, the processor 130 may be a GPU that is adapted to comprise a number of dedicated hardware resources, such as those which will be described below.

In some examples, this can be particularly beneficial when performing machine learning tasks that themselves relate to graphics processing work, as in that case all of the associated processing can be (and preferably is) performed locally to the graphics processor, thus improving data locality, and (e.g.) reducing the need for external communication along the interconnect with other hardware units (e.g. an NPU). In that case, at least some of the machine learning processing work can be offloaded to the machine learning processing circuit, thereby freeing the execution unit to perform actual graphics processing operations, as desired.

In other words, in some examples, providing a machine learning processing circuit within the graphics processor means that the machine learning processing circuit is preferably then operable to perform at least some machine learning processing operations whilst the other functional units of the graphics processor are simultaneously performing graphics processing operations. In the situation where the machine learning processing relates to part of an overall graphics processing task, this can therefore improve overall efficiency (in terms of energy efficiency, throughput, etc.) for the overall graphics processing task.

The processor 130 is arranged to receive a command stream 120 from a host processor 110, such as a CPU. The command stream, as will be described in further detail below with reference to FIG. 3, comprises at least one command in a given sequence, each command to be executed, and each command may be decomposed into a number of tasks. These tasks may be self-contained operations, such as a given machine learning operation or a graphics processing operation. It will be appreciated that there may be other types of tasks depending on the command.

The command stream 120 is sent by the host processor 110 and is received by a command processing unit 140 which is arranged to schedule the commands within the command stream 120 in accordance with their sequence. The command processing unit 140 is arranged to schedule the commands and decompose each command in the command stream 120 into at least one task. Once the command processing unit 140 has scheduled the commands in the command stream 120, and generated a plurality of tasks for the commands, the command processing unit issues each of the plurality of tasks to at least one compute unit 150a, 150b, each of which is configured to process at least one of the plurality of tasks.

The processor 130 comprises a plurality of compute units 150a, 150b. Each compute unit 150a, 150b may, as described above, be a shader core of a GPU specifically configured to undertake a number of different types of operations; however, it will be appreciated that other types of specifically configured processor may be used, such as a general-purpose processor configured with individual compute units, such as compute units 150a, 150b. Each compute unit 150a, 150b comprises a number of components, including at least a first processing module 152a, 152b for executing tasks of a first task type, and a second processing module 154a, 154b for executing tasks of a second task type, different from the first task type. In some examples, the first processing module 152a, 152b may be a processing module for processing neural processing operations, such as those which would normally be undertaken by a separate NPU as described above. Similarly, the second processing module 154a, 154b may be a processing module for processing graphics processing operations forming a set of pre-defined graphics processing operations which enable the implementation of a graphics processing pipeline. For example, such graphics processing operations include a graphics compute shader task, a vertex shader task, a fragment shader task, a tessellation shader task, and a geometry shader task. These graphics processing operations may all form part of a set of pre-defined operations as defined by an application programming interface, API. Examples of such APIs include Vulkan, Direct3D and Metal. Such tasks would normally be undertaken by a separate/external GPU. It will be appreciated that any number of other graphics processing operations may be capable of being processed by the second processing module.

As such, the command processing unit 140 issues tasks of a first task type to the first processing module 152a, 152b of a given compute unit 150a, 150b, and tasks of a second task type to the second processing module 154a, 154b of a given compute unit 150a, 150b. Continuing the example above, the command processing unit 140 would issue machine learning/neural processing tasks to the first processing module 152a, 152b of a given compute unit 150a, 150b, where the first processing module 152a, 152b is optimized to process neural network processing tasks, for example by comprising an efficient means of handling a large number of multiply-accumulate operations. Similarly, the command processing unit 140 would issue graphics processing tasks to the second processing module 154a, 154b of a given compute unit 150a, 150b, where the second processing module 154a, 154b is optimized to process such graphics processing tasks.

In addition to comprising a first processing module 152a, 152b and a second processing module 154a, 154b, each compute unit 150a, 150b also comprises a memory in the form of a local cache 156a, 156b for use by the respective processing modules 152a, 152b, 154a, 154b during the processing of tasks. An example of such a local cache 156a, 156b is an L1 cache. The local cache 156a, 156b may, for example, be a synchronous dynamic random-access memory (SDRAM). For example, the local cache 156a, 156b may comprise a double data rate synchronous dynamic random-access memory (DDR-SDRAM). It will be appreciated that the local cache 156a, 156b may comprise other types of memory.

The local cache 156a, 156b is used for storing data relating to the tasks which are being processed on a given compute unit 150a, 150b by the first processing module 152a, 152b and second processing module 154a, 154b. It may also be accessed by other processing modules (not shown) forming part of the compute unit 150a, 150b with which the local cache 156a, 156b is associated. However, in some examples it may be necessary to provide access to data associated with a given task executing on a processing module of a given compute unit 150a, 150b to a task being executed on a processing module of another compute unit (not shown) of the processor 130. In such examples, the processor 130 may also comprise a cache 160, such as an L2 cache, for providing access to data used for the processing of tasks being executed on different compute units 150a, 150b.

By providing a local cache 156a, 156b, tasks which have been issued to the same compute unit 150a, 150b may access data stored in the local cache 156a, 156b, regardless of whether they form part of the same command in the command stream 120. As will be described in further detail below, the command processing unit 140 is responsible for allocating tasks of commands to given compute units 150a, 150b such that they can most efficiently use the available resources, such as the local cache 156a, 156b, thereby reducing the number of read/write transactions required to memory external to the compute units 150a, 150b, such as the cache 160 (L2 cache) or higher-level memories. One such example is that a task of one command issued to a first processing module 152a of a given compute unit 150a may store its output in the local cache 156a such that it is accessible by a second task of a different (or the same) command issued to a given processing module 152a, 154a of the same compute unit 150a.
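To make the data-locality point concrete, the sketch below models the local cache as a simple key/value store that a producer task fills and a dependent task on the same compute unit reads back; this map-based model and the buffer identifier are assumptions for illustration, not the patent's cache design.

```cpp
// Illustrative data-locality sketch: a producer task leaves its output in
// the compute unit's local cache and a dependent task on the same unit
// consumes it, avoiding a round trip to the L2 cache or external memory.
#include <cstdint>
#include <unordered_map>
#include <vector>

using Buffer = std::vector<float>;

struct LocalCache {
    std::unordered_map<uint32_t, Buffer> lines;  // buffer id -> cached data

    void put(uint32_t id, Buffer data) { lines[id] = std::move(data); }

    const Buffer* get(uint32_t id) const {
        auto it = lines.find(id);
        return it == lines.end() ? nullptr : &it->second;  // a miss would go to L2
    }
};

// A producer task (e.g. issued to the first processing module) writes its
// output; a dependent task on the same compute unit (e.g. issued to the
// second processing module) reads it back directly from the local cache.
void producerTask(LocalCache& cache) { cache.put(7, Buffer{1.0f, 2.0f, 3.0f}); }
const Buffer* consumerTask(const LocalCache& cache) { return cache.get(7); }
```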

One or more of the command processing unit 140, the compute units 150a, 150b, and the cache 160 may be interconnected using a bus. This allows data to be transferred between the various components. The bus may be or include any suitable interface or bus. For example, an ARM® Advanced Microcontroller Bus Architecture (AMBA®) interface, such as the Advanced eXtensible Interface (AXI), may be used.

FIG. 2 is a schematic diagram 200 of a command processing unit 140 according to an embodiment. As described above, the command processing unit 140 forms part of a processor, such as processor 130, and receives a command stream 120 from a host processor, such as host processor 110. The command processing unit 140 comprises a host interface module 142 for receiving the command stream 120 from the host processor 110. The received command stream 120 is then parsed by a command stream parser module 144. As mentioned above, the command stream 120 comprises a sequence of commands in a given order. The command stream parser 144 parses the command stream 120 and decomposes it into separate commands, and decomposes each command in the command stream 120 into separate tasks 210, 220. The dependency tracker 146 then schedules the tasks 210, 220 and issues them to the relevant compute units, such as compute units 150a, 150b of the processor 130. Whilst the example 200 in FIG. 2 shows a command processing unit comprising a single dependency tracker, it will be appreciated that in some examples there may be more than one dependency tracker, such as a dependency tracker for each type of task.

In some examples, the dependency tracker 146 tracks the dependencies between the commands in the command stream 120 and schedules and issues the tasks associated with the commands such that the task 210, 220 operations are processed in the desired order. That is, where task 220 is dependent on task 210, the dependency tracker 146 will only issue task 220 once task 210 has been completed.

In order to facilitate the decomposition of commands in the command stream 120 into tasks, each command in the command stream 120 may comprise associated metadata. The metadata may comprise information such as the number of tasks in a given command and the types of those tasks. In some examples, the command stream parser 144 may allocate each command in the command stream 120 a command identifier. The command identifier may be used to indicate the order in which the commands of the command stream 120 are to be processed, such that the dependency tracker can track the dependencies between the commands and issue the tasks of said commands to the necessary compute units 150a, 150b in the required order. Furthermore, once each command of the command stream 120 has been decomposed into a plurality of tasks, such as tasks 210, 220, the dependency tracker 146 may allocate each task a given task identifier.

As shown in FIG. 2, task 210 has been given a task identifier ‘0’, and task 220 has been given a task identifier ‘1’. As task 210 and task 220 have different task identifiers, the command processing unit 140 may issue these tasks at the same time to different compute units 150a, 150b. More specifically, because task 210 and task 220 have different task types, task 210 having type ‘X’ and task 220 having type ‘Y’, they may be issued to different processing modules 152a, 152b, 154a, 154b, whereby the processing module they are issued to corresponds to the type of the task and the specific configuration of the processing module. For example, where the first processing module 152a, 152b of a given compute unit 150a, 150b is configured for processing machine learning operations, then where the task 210, 220 is a machine learning task, it may be issued to that processing module 152a, 152b. Similarly, where the second processing module 154a, 154b of a given compute unit 150a, 150b is configured to process graphics processing operations, then where the task 210, 220 is a graphics processing task, that task may be issued to that processing module 154a, 154b.

Alternatively, where the tasks 210, 220 are allocated the same task identifier, the dependency tracker 146 will issue the tasks to the same compute unit 150a, 150b. This enables the tasks to use the local cache 156a, 156b, thereby improving the efficiency and resource usage since there is no need to write data to external memory, such as cache 160 or other higher-level memories. Even if the task types are different, they can be executed by the corresponding processing modules 152a, 152b, 154a, 154b of the same compute unit 150a, 150b. In yet further examples, each compute unit 150a, 150b may comprise at least one queue of tasks for storing tasks representing at least part of a command of the sequence of commands. Each queue may be specific to a task type, and therefore correspond to one of the processing modules 152a, 152b, 154a, 154b.

FIG. 3 is a schematic representation 300 of the allocation of tasks 310a, 310b, 320a, 320b to compute units 150a, 150b by a command processing unit 140, in line with the example described above. Commands 310c, 320c are part of the command stream 120 received from the host processor 110 at the processor 130. Each command 310c, 320c comprises two tasks 310a, 310b, 320a, 320b. The command processing unit 140 decomposes the commands 310c, 320c into each of these tasks and schedules them as described above. For example, where the command 320c is dependent on command 310c, the tasks 310a, 310b, 320a, 320b are to be allocated to the same compute unit where the output of one task 310a, 310b is the input of another task 320a, 320b. As shown in FIG. 3, tasks 310a, 320a are allocated to compute unit 150a, and tasks 310b, 320b are allocated to compute unit 150b. The tasks need not be of the same type; for example, task 310a may be a machine learning operation, such that it is allocated to a processing module 152a configured to perform machine learning operations. Task 320a may be a graphics processing operation, such that it is allocated to a processing module 154a configured to perform graphics processing operations.

Task 320a is dependent on task 310a, as indicated by the ‘*’, so the command processing unit 140 issues the task 320a to compute unit 150a once task 310a has been completed. By issuing task 320a following the completion of task 310a, any data required by task 320a generated as an output of task 310a may be stored in the local cache 156a of the compute unit 150a. This enables the dependent task 320a to quickly and efficiently access the required data from the local cache 156a without the need to request data from external memory, such as the cache 160 (the L2 cache) or higher-level memory.

Tasks 310b and 320b are not dependent on one another and, therefore, may be allocated to the same compute unit or to different compute units. In the example 300 of FIG. 3 both task 310b and task 320b are issued to the same compute unit 150b; however, it will be appreciated that they may have been issued to different compute units. Furthermore, where tasks 310b and 320b have different task types, they may each be issued to different processing modules 152b, 154b of the same compute unit 150b to be run substantially concurrently. Alternatively, they may be issued to different compute units 150a, 150b to be run substantially concurrently.

FIG. 4 is a flowchart 400 of a method for allocating tasks. At step 410 a sequence of commands, such as the command stream 120 described above, is received at the command processing unit 140 of the processor 130, from the host processor 110. As described above, the command stream 120 comprises a plurality of commands each comprising a plurality of tasks.

Following receipt of the command stream 120 at the command processing unit 140, the command processing unit 140 generates a plurality of tasks. As described above, the command processing unit 140 may generate the plurality of tasks based on metadata associated with each of the commands of the command stream 120. For example, each command in the command stream 120 may be allocated a command identifier used to indicate at least the dependency between the commands. Each task generated may also have associated metadata, such as a task identifier and task type.

At step 430, following the generation of the plurality of tasks, the tasks are issued, by the command processing unit 140, to a compute unit of a plurality of compute units, such as compute unit 150a, 150b described above. The tasks may be allocated based on the task identifier and task type and allocated to a given processing module 152a, 152b, 154a, 154b based on the type of task. For example, as described above, a machine learning task may be issued to a machine learning processing module and a graphics processing task may be issued to a graphics processing module of a given compute unit.

As described above, where a given task is dependent on the completion of another task, the given task may be issued to the same compute unit as the other task. This enables data generated by the other task and required by the given task, or data required by both tasks, to be stored in the local cache, such as local cache 156a, 156b. This reduces the number of external memory transactions required to be issued, thereby increasing efficiency and improving resource usage.

FIG. 5 shows schematically a system 500 for allocating tasks associated with commands in a sequence of commands.

The system 500 comprises a host processor 110, such as a central processing unit or any other type of general processing unit. The host processor 110 issues a command stream comprising a plurality of commands, each having a plurality of tasks associated therewith.

The system 500 also comprises at least one other processor 130, configured to perform different types of tasks efficiently as described above. The one or more other processors 130 may be any type of processor specifically configured as described above to comprise at least a plurality of compute units 150a, 150b and a command processing unit 140. Each compute unit may comprise a plurality of processing modules, each configured to perform at least one type of operation. The processor 130 and host processor 110 may be combined as a System on Chip (SoC) or onto multiple SoCs to form one or more application processors.

The system 500 may also comprise memory 520 for storing data generated by the tasks externally from the processor 130, such that other tasks operating on other processors may readily access the data. However, it will be appreciated that the external memory will be used sparingly, due to the allocation of tasks as described above, such that tasks requiring the use of data generated by other tasks, or requiring the same data as other tasks, will be allocated to the same compute unit 150a, 150b of a processor 130 so as to maximize the usage of the local cache 156a, 156b.

In some examples, the system 500 may comprise a memory controller (not shown), which may be a dynamic memory controller (DMC). The memory controller is coupled to the memory 520. The memory controller is configured to manage the flow of data going to and from the memory. The memory may comprise a main memory, otherwise referred to as a ‘primary memory’. The memory may be an external memory, in that the memory is external to the system 500. For example, the memory 520 may comprise ‘off-chip’ memory. The memory may have a greater storage capacity than the memory cache(s) of the processor(s) 130 and/or host processor 110. In some examples, the memory 520 is comprised in the system 500. For example, the memory 520 may comprise ‘on-chip’ memory. The memory 520 may, for example, comprise a magnetic or optical disk and disk drive or a solid-state drive (SSD). In some examples, the memory 520 comprises a synchronous dynamic random-access memory (SDRAM). For example, the memory 520 may comprise a double data rate synchronous dynamic random-access memory (DDR-SDRAM).

One or more of the host processor 110, the processor 130, and the memory 520 may be interconnected using the system bus 510. This allows data to be transferred between the various components. The system bus 510 may be or include any suitable interface or bus. For example, an ARM® Advanced Microcontroller Bus Architecture (AMBA®) interface, such as the Advanced eXtensible Interface (AXI), may be used.


The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Claims

1. A processor comprising:

a command processing unit to: receive, from a host processor, a sequence of commands to be executed; and generate based on the sequence of commands a plurality of tasks; and
a plurality of compute units, wherein at least one of the plurality of compute units comprises: a first processing module for executing tasks of a first task type generated by the command processing unit; a second processing module for executing tasks of a second task type, different from the first task type, generated by the command processing unit; a local cache shared by at least the first processing module and the second processing module;
wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks.

2. The processor of claim 1, wherein the command processing unit is to issue tasks of the first task type to the first processing module of a given compute unit and to issue tasks of the second task type to the second processing module of the given compute unit.

3. The processor of claim 1, wherein the first task type is a task for undertaking at least a portion of a graphics processing operation forming one of a set of pre-defined graphics processing operations which collectively enable the implementation of a graphics processing pipeline, and wherein the second task type is a task for undertaking at least a portion of a neural processing operation.

4. The processor of claim 3, wherein the graphics processing operation comprises at least one of:

a graphics compute shader task;
a vertex shader task;
a fragment shader task;
a tessellation task; and
a geometry shader task.

5. The processor of claim 1, wherein each compute unit is a shader core in a graphics processing unit.

6. The processor of claim 1, wherein the first processing module is a graphics processing module and wherein the second processing module is a neural processing module.

7. The processor of claim 1, wherein the command processing unit further comprises at least one dependency tracker to track dependencies between commands in the sequence of commands; and wherein the command processing unit is to use the at least one dependency tracker to wait for completion of processing of a given task of a first command in the sequence of commands before issuing an associated task of a second command in the sequence of commands for processing, where the associated task is dependent on the given task.

8. The processor of claim 7, wherein an output of the given task is stored in the local cache.

9. The processor of claim 7, wherein each command in the sequence of commands has metadata, wherein the metadata comprises indications of at least a number of tasks in the command, and task types associated with each of the tasks.

10. The processor of claim 9, wherein the command processing unit allocates each command in the sequence of commands, a command identifier, and the dependency tracker tracks dependencies between commands in the sequence of commands based on the command identifier.

11. The processor of claim 10, wherein when the given task of the first command is dependent on the associated task of the second command, the command processing unit allocates the given task and the associated task a same task identifier.

12. The processor of claim 11, wherein tasks of each of the commands that have been allocated the same task identifier are executed on the same compute unit of the plurality of compute units.

13. The processor of claim 10, wherein a task allocated a first task identifier is executed on a first compute unit of the plurality of compute units and a task allocated a second, different, task identifier is executed on a second compute unit of the plurality of compute units.

14. The processor of claim 11, wherein a task allocated a first task identifier, and of the first type, is executed on the first processing module of a given compute unit of the plurality of compute units, and a task allocated a second, different, task identifier, and of the second task type, is executed on the second processing module of the given compute unit of the plurality of compute units.

15. The processor of claim 1, wherein each of the plurality of compute units further comprises at least one queue of tasks, wherein the queued tasks comprise at least a part of the sequence of commands.

16. The processor of claim 15, wherein a given queue is associated with at least one task type.

17. A method of allocating tasks associated with commands in a sequence of commands comprising:

receiving at a command processing unit, from a host processor, the sequence of commands to be executed;
generating, at the command processing unit, based on the received sequence of commands a plurality of tasks; and
issuing, by the command processing unit, each task to a compute unit of a plurality of compute units for execution, each compute unit comprising: a first processing module for executing tasks of a first task type; a second processing module for executing tasks of a second task type; and a local cache shared by at least the first processing module and the second processing module;
wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks.

18. The method according to claim 17, wherein the command processing unit waits for completion of processing of the tasks associated with the first command before issuing the tasks associated with the second command to the given compute unit, when the task associated with the second command is dependent on the task associated with the first command.

19. The method according to claim 17, wherein each command has associated metadata comprising indications of at least a number of tasks in the given command, and task types associated with each of the plurality of tasks.

20. A non-transitory computer-readable storage medium comprising a set of computer-readable instructions stored thereon which, when executed by at least one processor are arranged to allocate tasks associated with commands in a sequence of commands wherein the instructions, when executed cause the at least one processor to:

receive at a command processing unit, from a host processor, the sequence of commands to be executed;
generate, at the command processing unit, based on the received sequence of commands a plurality of tasks; and
issue, by the command processing unit, each task to a compute unit of a plurality of compute units for execution, each compute unit comprising: a first processing module for executing tasks of a first task type; a second processing module for executing tasks of a second task type; and a local cache shared by at least the first processing module and the second processing module;
wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks.
Patent History
Publication number: 20240036919
Type: Application
Filed: Jul 26, 2023
Publication Date: Feb 1, 2024
Applicant: Arm Limited (Cambridge)
Inventors: Alexander Eugene Chalfin (Mountain View, CA), John Wakefield Brothers, III (Calistoga, CA), Rune Holm (Oslo), Samuel James Edward Martin (Waterbeach)
Application Number: 18/358,995
Classifications
International Classification: G06F 9/48 (20060101); G06T 1/20 (20060101);