GRAPHICS PROCESSORS

- Arm Limited

Disclosed herein is a graphics processor that comprises a programmable execution unit operable to execute programs to perform graphics processing operations. The graphics processor further comprises a dedicated machine learning processing circuit operable to perform processing operations for machine learning processing tasks. The machine learning processing circuit is in communication with the programmable execution unit internally to the graphics processor. In this way, the graphics processor can be configured such that machine learning processing tasks can be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both, with the different units being able to message each other accordingly to control the processing.

Description
BACKGROUND

The technology described herein relates to graphics processors and in particular to performing machine learning processing, such as neural network processing, using graphics processors.

Neural networks can be used for processes such as machine learning, computer vision, and natural language processing operations. A neural network may operate upon suitable input data (e.g. such as an image or sound data) to ultimately provide a desired output (e.g. an identification of an object within an image, or a spoken word within a sound clip, or other useful output inferred from the input data). This process is usually known as “inferencing” or “classification”. In a graphics (image) processing context, neural network processing may also be used for image enhancement (“de-noising”), segmentation, “anti-aliasing”, supersampling, etc., in which case a suitable input image may be processed to provide a desired output image.

A neural network will typically process the input data (e.g. image or sound data) using a network of operators, each operator performing a particular operation. The operations will generally be performed sequentially to produce desired output data (e.g. a classification based on the image or sound data). Each operation may be referred to as a “layer” of neural network processing.

Hence, neural network processing may comprise a sequence of “layers” of processing, such that the output from each layer is used as an input to a next layer of processing. FIG. 1 shows an exemplary sequence of layers of neural network processing from an initial input layer 101 to a final output layer 107, between which are layers comprising various convolutional layers (C-layers) 102, 103, 104, and fully-connected layers (FC layers) 105, 106.

The input layer 101 may be configured to receive input data (e.g. image or sound data), and to provide that input data in a suitable form (e.g. as an array of data elements, otherwise known as a “feature map”) for use by subsequent neural network layers. The feature map will generally comprise a three-dimensional array of data elements, each data element having data associated therewith. The feature map may have a width (W), a height (H) and a depth (C), wherein the width (W) and height (H) may be defined as the number of data elements in the width and height direction respectively, and the depth (C) may correspond to a number of data channels. For example, in the case of input data comprising an image, the width and height of the array provided by the input layer may correspond to a number of data positions (e.g. pixels) along the width and height direction of the image respectively, whilst the channels may comprise the RGB channels of the image.
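
Purely by way of illustration, a minimal Python sketch of such a feature map is given below (the image size, shapes and names are illustrative assumptions only and do not form part of the technology described herein):

    import numpy as np

    # A hypothetical 640x480 RGB image: height H = 480, width W = 640, C = 3 channels.
    image = np.zeros((480, 640, 3), dtype=np.uint8)

    # The input layer presents the image as a feature map of data elements,
    # one element per (height, width, channel) position.
    feature_map = image.astype(np.float32)
    H, W, C = feature_map.shape
    print(H, W, C)  # 480 640 3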

After the input layer, there may be one or more other layers of neural network processing (e.g. including convolutional layers, fully-connected layers, pooling layers, deconvolution layers, or any other layers of neural network processing that may be present).

Generally, a layer of neural network processing will process an input feature map (IFM) in order to generate a corresponding output feature map (OFM) (e.g. in the case of a convolutional layer, deconvolution layer, or pooling layer), or output value (e.g. a probability in the case of a fully-connected layer). The output generated by a layer of neural network processing will be used as the input for a next layer of neural network processing in the sequence, and so on. This is illustrated in FIG. 2.

As used herein, the term “feature map” may refer to either an input feature map or an output feature map.

The operation performed by each layer of neural network processing may comprise any suitable operation which manipulates an input (feature map) to provide an output (feature map). The operation may require process parameters (e.g. such as weights for a filter or “kernel”) which may be specific to a particular layer of neural network processing. Hence, as shown in FIG. 2, suitable process parameters (e.g. weights and biases) may be read from working memory (e.g. a buffer) in order to perform each layer of neural network processing.
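
For illustration only, the following Python sketch models a sequence of layers in which each layer reads its own parameters (weights and biases) from a working-memory buffer and produces an output feature map that becomes the input to the next layer (all shapes and values are illustrative assumptions):

    import numpy as np

    def run_layer(ifm, params):
        # Placeholder for a layer operation that uses the layer-specific
        # weights and biases to turn an input feature map into an output feature map.
        return ifm @ params["weights"] + params["bias"]

    # Working-memory buffer holding per-layer parameters (illustrative values only).
    parameter_buffer = [
        {"weights": np.random.rand(8, 16).astype(np.float32), "bias": np.zeros(16, np.float32)},
        {"weights": np.random.rand(16, 4).astype(np.float32), "bias": np.zeros(4, np.float32)},
    ]

    fm = np.random.rand(1, 8).astype(np.float32)  # output of the input layer
    for params in parameter_buffer:
        fm = run_layer(fm, params)   # the OFM of one layer becomes the IFM of the next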

With reference to FIG. 1, the final layer of neural network processing in the sequence may comprise an output layer 107. The output layer may process an input feature map to generate useful output data (e.g. an inference or classification result, or in the case of image processing an output image).

Whilst FIG. 1 shows an example of a particular convolutional neural network, it will be appreciated that a neural network may have various other layer types, and/or network architectures (e.g. a recurrent neural network architecture).

Typically, in existing arrangements, data corresponding to an output feature map generated by a layer of neural network processing may be written to a suitable working memory (e.g. a buffer), as shown in FIG. 2. A next layer of neural network processing may then read that data from the buffer for use as an input feature map for said next layer of neural network processing.

In some data processing systems a dedicated neural processing unit (NPU) is provided as a hardware accelerator that is operable to perform such machine learning processing as and when desired, e.g. in response to an application that is executing on a host processor (e.g. central processing unit (CPU)) requiring the machine learning processing. For instance, an NPU may be provided along the same interconnect (bus) as other hardware accelerators, such as a graphics processor (graphics processing unit, GPU), such that the host processor (CPU) is operable to request the NPU to perform a set of machine learning processing operations accordingly, e.g. in a similar manner as the host processor is able to request the graphics processor to perform graphics processing operations. The NPU is thus a dedicated hardware unit for performing such machine learning processing operations on request by the host processor (CPU).

It has been recognised that, whilst not necessarily being designed or optimised for this purpose, a graphics processor (GPU) may also be used (or re-purposed) to perform machine learning processing tasks. For instance, neural network processing often involves a series of multiply-and-accumulate (MAC) operations for multiplying input feature values with the relevant feature weights of the kernel filters to determine the output feature values. Graphics processor shader cores may be well-suited for performing these types of arithmetic operations, as these operations are generally similar to the arithmetic operations that may be required when performing graphics processing work (but on different data). Also, graphics processors typically support high levels of concurrent processing (e.g. supporting large numbers of execution threads), and are optimised for data-plane (rather than control plane) processing, all of which means that graphics processors may be well-suited for performing machine learning processing.
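
As a purely illustrative sketch of the multiply-and-accumulate operation referred to above (the function and shapes are assumptions, not a description of any particular hardware), one output feature value of a convolution can be computed as a running sum of products over an input patch and the corresponding kernel weights:

    import numpy as np

    def mac_output_element(ifm_patch, kernel, bias=0.0):
        # One output feature value is produced by a series of multiply-and-accumulate
        # (MAC) operations over an input patch and the matching kernel weights.
        acc = bias
        for x, w in zip(ifm_patch.ravel(), kernel.ravel()):
            acc += x * w          # multiply-and-accumulate
        return acc

    patch  = np.random.rand(3, 3, 8).astype(np.float32)   # 3x3 window, 8 input channels
    kernel = np.random.rand(3, 3, 8).astype(np.float32)   # matching kernel weights
    value  = mac_output_element(patch, kernel)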

Thus, a graphics processor may be operated to perform machine learning processing work. In that case, the graphics processor (GPU) may be used to perform any suitable and desired machine learning processing tasks. The machine learning processing that is performed by the graphics processor (GPU) may thus include general purpose training and inferencing jobs (that do not relate to graphics processing work as such). However, a graphics processor (GPU) may also execute machine learning (e.g. inference) jobs for graphics processing operations, such as when performing “super sampling” techniques using deep learning, or when performing de-noising during a ray tracing process, for example.

The Applicants therefore believe that there is scope for improved, e.g. more efficient, approaches for performing machine learning processing using graphics processors.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the technology described herein will now be described by way of example only and with reference to the accompanying drawings, in which:

FIG. 1 shows an exemplary sequence of layers of neural network processing comprising an input layer and an output layer, between which are neural network layers comprising various convolutional layers (C-layers) and fully-connected layers (FC layers);

FIG. 2 illustrates a sequence of layers of neural network processing, wherein the output feature map from a layer of neural network processing may be written to a suitable buffer and then used as an input feature map for a next layer in the sequence, and wherein each layer of neural network processing may use processing parameters (e.g. such as weights) which are read from a suitable buffer;

FIG. 3 shows schematically an exemplary graphics processing system including a graphics processor according to an embodiment;

FIG. 4 shows schematically an embodiment of a graphics processor that can be operated in the manner of the technology described herein;

FIG. 5 shows schematically an embodiment of another graphics processor that can be operated in the manner of the technology described herein;

FIG. 6 shows schematically an example of how processing of a convolutional neural network may be performed according to an embodiment; and

FIGS. 7 and 8 illustrate a ray tracing de-noising operation that may be performed using a graphics processor according to an embodiment.

Like reference numerals are used for like features in the drawings (where appropriate).

DETAILED DESCRIPTION

A first embodiment of the technology described herein comprises a graphics processor comprising:

    • a programmable execution unit operable to execute programs to perform graphics processing operations; and
    • a machine learning processing circuit operable to perform processing operations for machine learning processing tasks and in communication with the programmable execution unit internally to the graphics processor,
    • the graphics processor configured such that machine learning processing tasks can be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both.

A second embodiment of the technology described herein comprises a method of operating a graphics processor, the graphics processor comprising:

    • a programmable execution unit operable to execute programs to perform graphics processing operations; and
    • a machine learning processing circuit operable to perform machine learning processing operations and in communication with the programmable execution unit internally to the graphics processor,
    • the graphics processor configured such that machine learning processing tasks can be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both;
    • the method comprising:
    • the graphics processor performing a machine learning task using a combination of the programmable execution unit and the machine learning processing circuit.

The technology described herein relates to a graphics processor (graphics processing unit, GPU) comprising a programmable execution unit that is operable to execute programs to perform graphics processing operations. The graphics processor in an embodiment acts as an accelerator, e.g., and in an embodiment, under control of a host processor, e.g. a central processing unit (CPU). Thus, when an application executing on the host processor requires graphics processing work, the host processor is operable to issue a suitable request for graphics processing work to be performed by the graphics processor.

A graphics processor can however also be used to perform other, more general purpose processing work. The technology described herein particularly relates to the situation where the graphics processor is operating to perform processing for a machine learning processing task, such as neural network processing.

In this respect, the Applicants have recognised that using graphics processors to perform machine learning processing tasks can be a relatively inefficient use of the graphics processor's resource, as the graphics processor is not generally designed (or optimised) for such tasks, and can therefore result in lower performance, e.g. compared to using a dedicated machine learning processing unit (e.g. NPU). At least in the situation where the machine learning processing relates to a graphics processing (rendering) task, re-purposing some of the graphics processor's functional units to perform the desired machine learning processing operations also prevents those functional units from performing the graphics processing work that they are designed for, which can further reduce the performance of the overall (rendering) process.

Nonetheless, in some cases, it may still be desirable to perform machine learning processing tasks using a graphics processor, e.g. rather than using an external machine learning processing unit (NPU). For instance, this may be desirable, e.g. in order to reduce silicon area, and reduce data movement, etc., especially in mobile devices where area and resource may be limited, and where it may therefore be particularly desirable to be able to use existing and available resources to perform the desired work, potentially avoiding the need for an NPU altogether. There are other examples where this may be desirable, especially where the machine learning processing itself relates to a graphics processing task, and wherein it may be particularly desirable to free up the execution unit and other functional units of the graphics processor to perform actual graphics processing operations.

To facilitate this, the technology described herein provides a dedicated machine learning processing circuit within the graphics processor that can thus be used to perform machine learning operations as desired. The machine learning circuit is provided (logically) inside the graphics processor, e.g., and in an embodiment, alongside the execution unit, with the machine learning circuit operable to communicate with the execution unit internally to the graphics processor. The machine learning circuit and execution unit can, and therefore in an embodiment do, share at least some of the graphics processor's resource, which can further improve overall efficiency (e.g. throughput, latency, energy efficiency) and/or reduce area, as will be explained further below.

In this way, by providing a machine learning processing circuit within the graphics processor, the machine learning processing circuit may allow for a more efficient (e.g. optimised) operation when performing at least some machine learning processing operations, e.g. compared to using the graphics processor's execution unit to do general purpose computations, whilst still allowing the machine learning processing to be performed locally to the graphics processor (e.g. rather than using a separate NPU accelerator that is also operable under control of the host processor independently of the graphics processor), which may be beneficial in some situations.

That is, rather than using an entirely separate machine learning processing unit that is independent of the graphics processor, such as an NPU, or only being able to perform machine learning processing operations entirely using the execution unit, the technology described herein proposes adding a dedicated machine learning processing circuit into the graphics processor itself.

This then means that the machine learning processing circuit is operable, e.g., to utilise some of the graphics processor's existing resource (e.g. such that at least some functional units and resource of the graphics processor can effectively be shared between the machine learning processing circuit and execution unit, for instance), whilst still allowing an improved (more optimised) performance compared to performing all the processing with general purpose execution in the execution unit.

Correspondingly, in embodiments, processing work can be split between the execution unit and the machine learning processing circuit in order to provide a more efficient use of the graphics processor's available processing resource.

For instance, the approach according to the technology described herein can be particularly beneficial when performing machine learning tasks that themselves relate to graphics processing work, as in that case all of the associated processing can be (and in an embodiment is) performed locally to the graphics processor, thus improving data locality, and (e.g.) reducing the need for external communication along the interconnect with other hardware units (e.g. an NPU). In that case, at least some of the machine learning processing work can be offloaded to the machine learning processing circuit, thereby freeing the execution unit to perform actual graphics processing operations, as desired.

In other words, providing a machine learning processing circuit within the graphics processor means that the machine learning processing circuit is, in an embodiment, operable to perform at least some machine learning processing operations whilst the other functional units of the graphics processor are simultaneously performing graphics processing operations. In the situation where the machine learning processing relates to part of an overall graphics processing task, this can therefore improve overall efficiency (in terms of energy efficiency, throughput, etc.) for the overall graphics processing task.

Various arrangements are possible in this regard, as will be explained further below.

Thus, even when an NPU is also provided, it may still be desirable to be able to perform at least some machine learning processing using the graphics processor, especially when the machine learning relates to a graphics processing operation.

That is, the Applicants have recognised that many graphics processing operations themselves involve machine learning processing, and in that case it may be particularly beneficial to perform the required machine learning processing locally to the graphics processor, even where a separate NPU is provided that could otherwise be used for performing the machine learning processing tasks. An example of this would be when performing so-called “super sampling” and/or other “anti-aliasing” techniques using deep learning processing. Another example might be for de-noising applications when performing a ray tracing process. Various other examples would be possible.

Thus, in some embodiments, the machine learning processing operations being performed by the graphics processor are part of an overall graphics processing job being performed by the graphics processor.

However, the machine learning processing work that is performed by the graphics processor may generally comprise any suitable and desired machine learning processing work, and does not have to relate to graphics processing work as such. In that case, the technology described herein may still provide various benefits compared to more conventional approaches, as will be explained further below.

In particular, providing a dedicated machine learning processing circuit within the graphics processor allows for a degree of optimisation for machine learning processing operations whilst still allowing benefits of (re-)using some of the graphics processor's local resource and area when performing machine learning processing using the graphics processor. For instance, in embodiments, by utilising the machine learning processing circuit, at least some machine learning processing operations can be performed using the graphics processor in a more efficient manner (e.g. compared to more conventional graphics processor arrangements where these calculations may be (and may only be) performed entirely using the execution unit), thus reducing the need for a separate NPU (and hence reducing overall area, although a separate NPU could still be provided, if that were desired).

The graphics processor according to the technology described herein may therefore provide various benefits compared to more conventional graphics processors when performing machine learning processing operations.

The graphics processor can comprise any suitable and desired graphics processor that includes a programmable execution unit (circuit).

The programmable execution unit can be any suitable and desired programmable execution unit (circuit) that a graphics processor may contain. It should be operable to execute graphics shading programs to perform graphics processing operations. Thus, the programmable execution unit will receive graphics threads to be executed, and execute appropriate graphics shading programs for those threads to generate the desired graphics output.

There may be a single or plural programmable execution units. In an embodiment there are plural execution units, which are in an embodiment arranged as respective “shader cores”. A “shader core” thus generally comprises an execution unit together with respective interfaces and one or more other functional units with which the execution unit may communicate, as described below. Where there are plural programmable execution units (shader cores), each execution unit can in an embodiment operate in the manner of the technology described herein.

The graphics processor may be operated to perform any desired processing work. This may be graphics processing work as such or may comprise more general purpose processing operations. The technology described herein however relates particularly to the situation where the processing work includes a set of machine learning processing operations, such as for neural network processing.

To facilitate this, the graphics processor of the technology described herein thus further comprises a machine learning processing circuit that is operable to perform (and is dedicated for performing) operations for machine learning processing tasks.

Thus, a machine learning processing task that is issued to the graphics processor may generally be performed entirely using the programmable execution unit (e.g. in compute shading), entirely using the machine learning processing circuit, or (in an embodiment) using a combination of both. Various examples would be possible in this regard, as will be explained further below.

The machine learning processing circuit of the graphics processor may be, and is in an embodiment, a (substantially) fixed-function hardware unit (circuit) that is configured to perform processing operations for machine learning processing tasks. The machine learning processing circuit should thus comprise an appropriate fixed function circuit or circuits to perform the required operations, although it may comprise and have some limited form of configurability, in use, e.g. if desired.

The machine learning processing circuit is in embodiments configured to perform arithmetic operations, such as, and in an embodiment, multiply-and-accumulate (MAC) operations. The machine learning processing circuit thus in an embodiment comprises one or more MAC circuits that are configured to perform such operations. Thus, the machine learning processing circuit may load in an input feature map, together with a set of weights, biases, etc., from respective buffers (in general, ‘storage’, which storage may be integral with the machine learning processing circuit or may be located elsewhere in the graphics processor (shader core) and accessed by the machine learning processing circuit, various arrangements being possible in this regard), perform the required arithmetic (e.g. MAC) operations to generate the corresponding output feature map, and then write the output feature map into a suitable buffer. Various arrangements would be possible in this regard.
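
By way of illustration only, the buffer-level dataflow described above might be sketched as follows in Python (the buffer representation, a 1x1 convolution as the example operation, and all shapes are illustrative assumptions rather than a description of the circuit itself):

    import numpy as np

    def ml_circuit_convolve(ifm_buffer, weight_buffer, bias_buffer, ofm_buffer):
        # Modelled dataflow: read the inputs from their buffers, apply MAC
        # operations (here a 1x1 convolution), then write the result back out.
        ifm = ifm_buffer["data"]                     # shape (H, W, C_in)
        weights = weight_buffer["data"]              # shape (C_in, C_out)
        biases = bias_buffer["data"]                 # shape (C_out,)
        ofm = np.tensordot(ifm, weights, axes=([2], [0])) + biases
        ofm_buffer["data"] = ofm.astype(np.float32)  # shape (H, W, C_out)

    ifm_buf = {"data": np.random.rand(16, 16, 8).astype(np.float32)}
    w_buf   = {"data": np.random.rand(8, 32).astype(np.float32)}
    b_buf   = {"data": np.zeros(32, np.float32)}
    ofm_buf = {}
    ml_circuit_convolve(ifm_buf, w_buf, b_buf, ofm_buf)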

The machine learning processing circuit thus in an embodiment also has access to one or more buffers for storing data that may be required for machine learning processing operations. These buffers may be integral to the machine learning processing circuit, or may be otherwise located within the graphics processor (shader core) but accessible by the machine learning processing circuit and available to store data for machine learning processing operations. For instance, machine learning processing typically involves input data in the form of an input feature map, output data in the form of an output feature map, the weights that are to be applied, as well as any other control information (data structures, programs, etc.) that determine the processing operations to be performed, and this data therefore needs to be loaded in and at least temporarily stored for use by the graphics processor when performing the machine learning task.

Thus, in embodiments, the machine learning processing circuit has an interface to a memory system of the graphics processor where such data resides. For instance, in embodiments the graphics processor is in communication with external, e.g. main, memory.

In embodiments, the graphics processor has one or more external memory access interfaces that are common for all types of data that may need to be transferred between the graphics processor and the external memory. That is, all memory requests (whether for graphics processing work or machine learning processing work) are in an embodiment made via the same, shared memory interface, in an embodiment via a shared cache system. For instance, the graphics processor in an embodiment comprises a cache (or arrangement of plural caches) that is local to the graphics processor, e.g. one or more level 2 (L2) cache, via which data can be transferred to/from external memory, and this cache can be (and in an embodiment is) also utilised by the machine learning processing circuit when fetching machine learning data from external memory. In other words, the cache system (e.g. the L2 cache or caches) is in an embodiment shared between the execution unit and the machine learning processing circuit.

In an embodiment, the machine learning processing circuit has at least some dedicated local storage (e.g. a buffer). For example, this may be used for storing the machine learning algorithm (e.g. neural network) itself.

The feature maps, weights, biases, etc., or portions thereof, could also be stored locally to the machine learning processing circuit, in dedicated respective buffers for this data. For example, a portion of the weights may be stored locally to the machine learning processing circuit, or at least locally to the graphics processor (shader core). The feature map may more typically be streamed from the cache and/or memory system but various arrangements would be possible in this regard.

However, it will be appreciated that machine learning processing may generate large amounts of data. For instance, when processing a neural network, the feature maps may typically be relatively large data structures. Likewise, the kernel weights need to be stored/retrieved accordingly when processing the different layers.

Thus, in embodiments, rather than adding dedicated storage for this purpose, the graphics processor is configured to allow other storage (buffers) that is already available to the graphics processor, and that can be re-purposed for storing data for machine learning processing when required, to be used for storing machine learning data.

Indeed, a benefit of the technology described herein is that a graphics processor typically (and in embodiments) already has relatively large (e.g. tile) buffers on chip, as well as a cache system for external memory access, as described above, that are already available for handling large amounts of graphics data and that can therefore also be (and in an embodiment are) utilised by the machine learning processing circuit for storing machine learning data.

Thus, in addition to any dedicated storage (buffers) that the machine learning processing circuit may have, the machine learning processing circuit in an embodiment also has access to various other storage (buffers) within the graphics processor that may be re-purposed for storing data that may be required for machine learning processing operations.

In embodiments this includes at least a set of one or more tile buffers that are also used when performing normal tile-based rendering but are re-purposed for storing machine learning data. Thus, in embodiments, the graphics processor is configured to perform tile-based rendering, in which graphics data is stored in one or more tile buffers, and wherein when performing a machine learning processing task at least some data for the machine learning processing task is stored using the tile buffers.
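
As a purely illustrative sketch of this re-purposing (the class, buffer size and data layouts below are assumptions for illustration, not a description of any particular tile buffer design), a fixed-size on-chip buffer can hold either a tile of rendered pixels or a slice of a feature map:

    import numpy as np

    class TileBuffer:
        """Illustrative on-chip tile buffer: a fixed-size block of storage that can
        hold either a tile of rendered pixels or, when re-purposed, machine
        learning data such as a feature-map slice."""
        def __init__(self, size_bytes=64 * 1024):
            self.storage = np.zeros(size_bytes, dtype=np.uint8)

        def store(self, array):
            flat = array.view(np.uint8).ravel()
            assert flat.size <= self.storage.size, "data does not fit in the tile buffer"
            self.storage[:flat.size] = flat

    tile_buffer = TileBuffer()
    # Normal rendering use: a 32x32 tile of RGBA8 pixels.
    tile_buffer.store(np.zeros((32, 32, 4), dtype=np.uint8))
    # Re-purposed use: a 16x16x64 slice of an FP16 feature map.
    tile_buffer.store(np.zeros((16, 16, 64), dtype=np.float16))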

In embodiments, the storage that is available to the machine learning processing circuit may also include, e.g., a load/store unit (cache) associated with the execution unit, and/or any other suitable storage (buffers) that can be re-purposed for storing machine learning data.

The machine learning processing circuit may thus, and in an embodiment does, have (direct) interfaces to at least some of these buffers (e.g. to the tile buffers). As mentioned above, the machine learning processing circuit in an embodiment also has access to the graphics processor's external memory interface, e.g., and in an embodiment, via an L2 cache.

Thus, when the graphics processor is performing a machine learning processing task, the graphics processor may request the required data (e.g. feature maps, weights, etc.) from memory. These can then be loaded in via the cache system and then provided to the graphics processor accordingly for use when performing the machine learning processing operations.

For instance, the data may be transferred appropriately from the L2 cache into the various buffers (e.g. the tile buffers) that are available to the machine learning processing circuit.

For example, it may be appropriate to store the weights in the load/store cache and/or within a tile buffer, and in embodiments this is therefore done. Thus, a shader core may request the weight data from memory (e.g. via the cache system). The weight data can then be read in via the cache and then transferred to an appropriate (e.g. tile) buffer associated with the shader core in question.

The feature maps are typically relatively larger data structures. In embodiments, the feature maps that are currently being used for a machine learning processing operation are stored in one or more buffers, e.g. the tile buffers, within the graphics processor. For instance, the input feature map may be transferred from the L2 cache into one of the graphics processor's tile buffers, ready for processing, with the output feature map resulting from the processing then being written into another one of the tile buffers.

Where there are plural execution units, arranged as respective shader cores, each shader core may have its own set of (e.g. tile) buffers. However, all the shader cores in an embodiment share the same cache system. In embodiments, a machine learning processing task may be distributed/partitioned between a plurality of shader cores, such that all of the shader cores perform part of the same processing task. In that case, in an embodiment, the machine learning data (feature maps, weights, etc.) is transferred from the cache to all of the shader cores that require the data at the same time, e.g. in a broadcast fashion. This helps reduce memory access bandwidth, and can improve data locality.
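
To illustrate one way such a partitioning might look (the partitioning by output channel, shapes and core count below are illustrative assumptions only), the same input feature map can be broadcast to every core while each core applies its own sub-set of the kernels:

    import numpy as np

    def process_on_core(ifm, kernels_subset, biases_subset):
        # Each shader core receives the same (broadcast) IFM but only a sub-set of
        # the kernels, and produces the corresponding slice of output channels.
        return np.tensordot(ifm, kernels_subset, axes=([2], [0])) + biases_subset

    ifm = np.random.rand(16, 16, 8).astype(np.float32)       # broadcast to all cores
    kernels = np.random.rand(8, 32).astype(np.float32)       # 32 output channels in total
    biases = np.zeros(32, np.float32)

    num_cores = 4
    per_core = kernels.shape[1] // num_cores
    partial_ofms = [
        process_on_core(ifm, kernels[:, i * per_core:(i + 1) * per_core],
                        biases[i * per_core:(i + 1) * per_core])
        for i in range(num_cores)
    ]
    ofm = np.concatenate(partial_ofms, axis=2)                # full (16, 16, 32) OFM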

Thus, when the feature maps (and potentially kernel weights/biases) are not currently being used, they are in an embodiment held in the L2 cache, provided that there is sufficient space in the L2 cache to do so. The feature maps (and potentially also the weights, biases, etc.) can then be transferred to any shader cores that require them for a particular processing operation. Of course, if the feature maps do not fit in the L2 cache, they may be written out to external memory, and read back in when needed, e.g. in the normal way for cache operations.

Because these are relatively large data structures, the feature maps, weights, etc., are in an embodiment stored in memory in compressed form, and then decompressed for use by the graphics processor.

The machine learning processing circuit and/or the shader core could have associated compression/decompression circuits for this purpose.

However, in embodiments, the compression/decompression of the machine learning data is performed as/when data is transferred to/from the external memory system. For instance, the graphics processor's cache system may comprise suitable compression/decompression circuits (which are already present for compressing graphics data) that can therefore be utilised for compressing the machine learning processing data.

Thus, in embodiments, the graphics processor further comprises compression and decompression circuits for compressing and decompressing data as it is transferred between the graphics processor (shader core) and the external memory.

The machine learning processing circuit thus in an embodiment also has access to these compression and/or decompression circuits so that the activation layers, weights, etc., can be transferred to/from the memory system in a compressed format. The compression and/or decompression units may be associated with the machine learning processing circuit itself, or in some embodiments may be associated with the cache system.

For instance, data may be compressed/decompressed as it is transferred from the graphics processor (shader core), where it is used in uncompressed format, to the cache system, e.g. such that the data is stored in the cache in compressed form. Alternatively, the data could be stored in the cache in uncompressed form, and then compressed/decompressed as it is transferred from the graphics processor's cache system to external memory. Thus, the compression/decompression may generally take place at any suitable location between the graphics processor shader core where it is used and the external memory. Various arrangements would be possible in this regard.

There may be a combined compression and decompression unit that is operable to perform both compression and decompression, or separate compression and decompression units may be provided. In an embodiment the compression and decompression circuits are configured to be able to compress all types of data that is to be transferred from the graphics processor to memory, e.g. including both graphics data and machine learning processing data. However, it would also be possible to use separate, respective compression and decompression circuits for different types of data.

In an embodiment the machine learning processing circuit also comprises appropriate local storage, such as a queue or cache, for buffering requests/data for the machine learning processing. For instance, the machine learning processing circuit may comprise a translation lookaside buffer for storing recently used translations of virtual to physical memory addresses (“VA/PA translations”) to speed up retrieval of data.

There may be a single or plural machine learning processing circuits, e.g. such that plural programmable execution units share a given (or a single) machine learning processing circuit, and/or such that a given programmable execution unit has access to and can communicate with and use plural different machine learning processing circuits. Where there are plural machine learning processing circuits, each such circuit can in an embodiment operate in the manner of the technology described herein.

The machine learning processing circuit may be configured to perform any suitable operations that may be desired for a machine learning process. For instance, in some embodiments, the machine learning processing circuit could be designed to be able to perform all of the required processing for a particular machine learning processing task, e.g. for processing a convolutional neural network. However, in other embodiments, the machine learning processing circuit is configured to perform some but not all of the required operations, with the machine learning processing work therefore being divided between the machine learning processing circuit and the execution unit.

For example, in an embodiment, in the case where the machine learning processing task relates to processing a convolutional neural network, the machine learning processing circuit is in an embodiment configured to at least perform the processing of the convolution layers. Thus, for a given convolution layer, the machine learning processing circuit may read in (e.g. from respective buffers) the relevant input feature map, together with the relevant kernel weights, biases, etc., perform (e.g. using its MAC circuit(s)) the required convolution operations, and then write out the output feature map to an appropriate buffer.

In addition to processing convolution layers, as above, the machine learning processing circuit may also perform at least some (e.g. relatively simpler) pooling operations and/or may perform an activation function. The machine learning processing circuit may also perform any other desired operations when processing a neural network but in some embodiments the processing of the fully-connected layers, in an embodiment as well as any other more complex pooling operations, etc., are passed to the execution unit and performed by executing an appropriate (compute) shader program. Some convolution operations may also be passed to the execution unit, as desired, e.g. where these correspond to non-standard convolution operations. That is, it might be better, e.g. more efficient, to configure the machine learning processing circuit to perform some but not all convolutions.
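
By way of illustration only, such a division of work might be expressed as a simple dispatch policy of the kind sketched below (the layer types, the policy itself and the unit names are illustrative assumptions, not a description of any particular implementation):

    def dispatch_layer(layer):
        # Illustrative policy: standard convolutions (and simple pooling /
        # activation operations) go to the machine learning processing circuit;
        # fully-connected layers and more complex operations are run as compute
        # shader programs on the programmable execution unit.
        if layer["type"] in ("convolution", "simple_pooling", "activation"):
            return "machine_learning_circuit"
        return "execution_unit"

    layers = [
        {"type": "convolution"},
        {"type": "activation"},
        {"type": "fully_connected"},
    ]
    for layer in layers:
        print(layer["type"], "->", dispatch_layer(layer))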

Various arrangements would be possible in this regard and a benefit of the technology described herein is that there is flexibility to distribute the processing between the various functional units of the graphics processor in this way.

Thus, in embodiments, when the graphics processor is performing a machine learning processing task, at least some, but in embodiments not all, of the processing operations are offloaded to the machine learning processing circuit.

Another benefit of providing a dedicated machine learning processing circuit within the graphics processor is that the graphics processor can then be designed to better handle different data formats. For instance, a graphics processor's execution unit is typically (and in embodiments) configured to perform floating point and fixed point calculations (only), e.g., and in an embodiment, is configured to support only some standard floating or fixed point data formats (e.g. standard 32-bit, 16-bit, 8-bit fixed or floating point data formats), as this is what is typically desired for graphics processing tasks. In that case, the machine learning processing circuit may be operable and arranged to perform processing on any (or all) floating point, fixed point or integer data formats, as desired. That is, providing a dedicated machine learning processing circuit within the graphics processor means that the machine learning processing circuit can then be configured to work on whatever data format is desired for the machine learning processing operations, whereas the execution unit may be configured for performing certain types of floating and fixed point calculations (only). For instance, machine learning processing tasks may use specialised (non-standard) floating or fixed point data formats, such as 12-bit, 9-bit, etc., data formats, that differ from those that are normally used for graphics processing tasks (and for which the execution unit is therefore in an embodiment configured). The machine learning processing circuit may thus be configured to process different data formats to the execution unit, e.g. depending on the machine learning processing operations that the machine learning processing circuit is designed to accelerate. This can further facilitate distribution of work between the two circuits. Various arrangements would of course be possible in this respect.
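
As a purely illustrative sketch of what use of such narrow, non-standard formats might involve (the fixed-point layout, bit widths and values below are assumptions for illustration only), floating point weights can be converted to a signed fixed-point representation of a chosen width:

    import numpy as np

    def quantize_fixed_point(values, total_bits, frac_bits):
        # Convert floating point values into a signed fixed-point representation
        # with the given total width and number of fractional bits.
        scale = 1 << frac_bits
        lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
        return np.clip(np.round(values * scale), lo, hi).astype(np.int32)

    weights = np.array([0.75, -1.25, 0.031], dtype=np.float32)
    q12 = quantize_fixed_point(weights, total_bits=12, frac_bits=8)   # a 12-bit format
    q9  = quantize_fixed_point(weights, total_bits=9,  frac_bits=6)   # a 9-bit format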

The graphics processor in an embodiment further comprises an (overall) job controller (interface) that is operable to schedule processing work for the graphics processor. For example, the job controller may be operable to receive tasks/jobs to be performed by the graphics processor, e.g. via an appropriate command stream that is provided to the graphics processor by a driver for the graphics processor. The job controller may then, e.g., schedule and distribute the processing of respective tasks/jobs to the graphics processor (and appropriate functional units of the graphics processor).

The (overall) job controller is in an embodiment common for all types of processing work and is thus able to schedule both graphics processing and machine learning processing work, as desired (although there may then be further, lower level job controllers that break such work down into sub-tasks, etc., for issuance to the different functional units, and these lower level job controllers may be dedicated for particular functional units/types of work).

As mentioned above, in embodiments there are plural execution units (which are in an embodiment arranged as respective shader cores). In embodiments, each shader core has its own respective machine learning processing circuit. However, it would also be possible for the machine learning processing circuit to be provided externally to a shader core, and/or for plural shader cores to share one or more machine learning processing circuit.

The job controller is thus in an embodiment arranged to schedule and distribute work accordingly to the different execution units (shader cores). For instance, in embodiments, where there are plural shader cores, a machine learning processing task may be distributed between the plurality of shader cores.

In that case, a plurality of shader cores may be arranged to process the same region at the same time. The input feature map may then be broadcast from the L2 cache (for example) to each of the plurality of shader cores that are to perform a respective part of the processing operation for that region. For example, each shader core may then process a respective sub-set of the kernels. This approach can work well to increase data locality and/or reduce external memory access, since all of the shader cores will generally need the data at the same time, and the graphics processor has the capability to distribute work in this way. Also, machine learning processing is often deterministic, such that the job controller is able to accurately allocate a respective number of shader cores for performing the processing work, and schedule the work accordingly.

The machine learning processing work may be distributed between the execution unit and the machine learning processing circuit within a respective shader core in various suitable ways. Various arrangements are contemplated for controlling the distribution of machine learning processing work between the execution unit and the machine learning processing circuit.

In embodiments, the job controller is operable to schedule processing work for the execution unit(s) (only). In that case, the operation of the machine learning processing circuit may be controlled (triggered) by the execution unit. Thus, in embodiments, the job controller schedules one or more processing tasks for the execution unit. A thread generator circuit then generates respective execution threads for the execution unit accordingly. The execution unit may thus be caused to execute a program and this program may include one or more instructions to cause the machine learning processing circuit to perform machine learning processing. Thus, when the execution unit encounters and executes such an instruction, the execution unit is in an embodiment then caused to message the machine learning processing circuit to cause the machine learning processing circuit to perform a set of one or more machine learning processing operations, as required. A result of the processing can then be returned to the execution unit accordingly.

The message to the machine learning processing circuit may thus include any suitable and required information relating to the machine learning processing operations to be performed. For example, the message may include indications of one or more of: the machine learning processing operations that are to be performed; the location of the input feature map; the location to which the output feature map should be written. Any other suitable and desired information relating to the machine learning processing may also be indicated in the message.
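
Purely as an illustrative sketch of what such a message might carry (the field names, addresses and operation identifier below are assumptions for illustration only, not a defined message format):

    from dataclasses import dataclass

    @dataclass
    class MLWorkMessage:
        # Illustrative contents of a message from the execution unit to the
        # machine learning processing circuit (field names are assumptions).
        operation: str          # which machine learning operation(s) to perform
        ifm_address: int        # location of the input feature map
        ofm_address: int        # location to write the output feature map
        weights_address: int    # location of the kernel weights / biases

    msg = MLWorkMessage(operation="convolution_3x3",
                        ifm_address=0x1000, ofm_address=0x2000, weights_address=0x3000)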

Thus, in an embodiment, the graphics processor is configured such that (and the method correspondingly involves steps of) when the execution unit is executing a program including an instruction that relates to a set of machine learning operations to be performed by the machine learning processing circuit: in response to the execution unit executing the instruction, the programmable execution unit is caused to message the machine learning processing circuit to cause the machine learning processing circuit to perform the set of machine learning processing operations.

In this case, the machine learning processing task is effectively performed under the control of the execution unit, with the execution unit offloading at least some (but in an embodiment not all) of the machine learning processing operations to the machine learning processing circuit, and the result of these processing operations then being returned to the execution unit. For instance, as mentioned above, the execution unit may offload at least the processing of convolution layers to the machine learning processing circuit. However, more complex pooling and/or processing of fully-connected layers may still be performed by the execution unit, appropriately. Various arrangements would be possible in this regard.

Alternatively, in some embodiments, the execution unit triggers the machine learning processing circuit to perform a machine learning processing task, but the machine learning processing task is then managed by the machine learning processing circuit.

In that case, the machine learning processing circuit could perform all of the processing work itself, or may pass some operations back to the execution unit (e.g. by triggering the generation of a thread, as will be explained below). The execution unit may thus perform some processing work for the machine learning processing task that is being performed by the machine learning processing circuit, with the result of that processing work thus being returned from the execution unit to the machine learning processing circuit.

The overall result of the machine learning processing task (i.e. task completed) may then be returned to the execution unit accordingly, at least where the execution unit triggered the operation.

As mentioned above, in embodiments, the machine learning processing circuit may be configured to perform some but not all of the processing for a given machine learning processing task. In that case, the machine learning processing circuit may be operable to cause the execution unit to perform one or more operations (sub-tasks) as part of the overall machine learning processing task.

For instance, in an embodiment, the machine learning processing circuit is operable to trigger the generation of threads for (sub) programs to be performed by the execution unit that when executed cause the execution unit to perform a set of one or more processing operations for the machine learning process. In an embodiment, the machine learning processing circuit is configured to message a thread generation circuit for the execution unit (e.g. a compute shader endpoint) to trigger the generation of such threads. That is, in an embodiment, the machine learning processing circuit has an interface to a thread generation circuit that is also used to generate other, e.g. compute, threads. However, the machine learning processing circuit could have its own thread generation circuit that is dedicated for generating machine learning threads.

In this case, the machine learning processing task is effectively managed by the machine learning processing circuit, with the execution unit acting as an accelerator to which the machine learning processing circuit can offload some of the processing, as desired, e.g. by generating suitable threads.

In other embodiments, the job controller may be configured to schedule processing work directly for both the execution unit(s) (e.g. in the normal way) and the machine learning processing circuit(s), i.e. such that the job controller can issue work to a machine learning processing circuit independently of its execution unit. In that case, when the graphics processor is desired to perform machine learning processing, the job controller can schedule one or more tasks to be performed by the machine learning processing circuit accordingly, to thereby directly trigger the machine learning processing circuit to perform a machine learning processing task (e.g. without the execution unit having to trigger this operation).

Various other arrangements would be possible. Thus, when a machine learning processing task is to be performed, the machine learning processing can be split in various suitable ways between the machine learning processing circuit and compute shading performed by the execution unit, with the internal communication between the two circuits facilitating this approach.

The communication between the machine learning processing circuit(s), etc., and the programmable execution unit can be facilitated as desired. There is in an embodiment an appropriate communication (messaging) network for passing messages between the various units. This communication (messaging) network can operate according to any desired communications protocol and standard, such as using a suitable interconnect/messaging protocol.

Subject to the requirements for operation in the manner of the technology described herein, the graphics processor can otherwise have any suitable and desired form or configuration of graphics processor, comprise any other suitable and desired processing elements, circuits, units and stages that a graphics processor may contain, and execute any suitable and desired form of graphics processing pipeline.

For instance, as well as the machine learning processing circuit, there may also be other accelerators (special purpose units) within the graphics processor that are able to communicate with the programmable execution unit, such as a load/store unit (circuit), an arithmetic unit or units (circuit(s)), a texture mapper, etc., if desired. In principle, any of these units may also be utilised by the machine learning processing circuit when performing machine learning processing tasks.

The graphics processor may also have any other suitable elements that a graphics processor may have. For instance, in some embodiments the graphics processor may be arranged to perform tile-based graphics processing, in which case the graphics processor may comprise a tiler circuit, one or more (and in an embodiment plural) tile buffers, and so on. The graphics processor may also comprise, for example, a graphics processing pipeline, including a primitive set-up circuit, a rasteriser, etc., and any other such functional units that a graphics processor may normally or desirably have.

The graphics processor may be arranged to perform any desired processing work. However, as explained above, the technology described herein relates particularly to situations where the graphics processor is being used to perform machine learning processing. The machine learning processing may be any suitable and desired machine learning processing work. For example, in embodiments, it may comprise neural network processing, e.g. for “inferencing” or “classification” purposes. As other examples, the machine learning processing may comprise image processing, such as de-noising, segmentation, etc. The machine learning processing may also relate to a training task.

The machine learning processing itself may thus be performed for any purpose. That is, the machine learning processing may in some embodiments relate to a general purpose machine learning processing task (i.e. that does not relate to graphics processing as such).

In some embodiments however the machine learning processing relates to part of an overall graphics processing task. Examples of machine learning processing relating to graphics processing may include deep learning “super sampling”, or de-noising for ray tracing processes. Other examples would be possible.

In these cases, the image that is to be processed may be an image that has been previously generated by the graphics processor itself. For instance, the image that is to be subject to the machine learning processing may currently be stored in a suitable buffer (e.g. a tile buffer) of the graphics processor. The machine learning processing circuit can then process the image in the (tile) buffer and output the result of the machine learning processing accordingly into another (tile) buffer.

For instance, in the case of a ray tracing de-noising process, the graphics processor may first be operated to perform a ray-tracing rendering (or hybrid ray-tracing rendering) process to generate an initial output frame, e.g. in the normal way for ray tracing (or hybrid ray-tracing) processes. That is, the graphics processor may first perform some actual graphics processing (ray tracing rendering) work to generate (render) an initial version of the output frame.

As part of the ray-tracing (or hybrid ray-tracing) rendering process, it may further be desired to perform “de-noising” on the initial output frames, e.g. in order to provide better (e.g. smoother) frames for display. For instance, ray tracing calculations are relatively complex, such that casting larger numbers of rays will require significant processing resource, which may not be practical for real-time rendering. This means that when generating the initial output frame, only a finite number of (relatively few) rays will have been cast, and the initially generated output frame may therefore be noisy.

In order to de-noise the initial frame, the initial frame may be processed using a suitable neural network, i.e. that has been trained to provide smoother images. In an embodiment, this de-noising is offloaded (at least in part) to the machine learning processing circuit of the technology described herein. Thus, the current (noisy) frame is loaded into a suitable buffer for input to the neural network, and the neural network processing is then performed accordingly to generate a de-noised output frame, which is then stored in another buffer. In embodiments one or more other (previous) frames, or in an embodiment an accumulation buffer that stores one or more previous frames, together with information regarding the frame motion (e.g. per-pixel motion vectors) are also provided as input to the de-noising algorithm to facilitate the de-noising (although this is not necessary).
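
By way of illustration only, the de-noising flow described above might be sketched as follows (the stand-in "network", the tile size and the channel layouts are illustrative assumptions, not a description of any particular de-noising network):

    import numpy as np

    def denoise_frame(noisy_frame, accumulation_buffer, motion_vectors, network):
        # Gather the inputs described above into one tensor and run the trained
        # de-noising network over it; the result is the de-noised output frame.
        inputs = np.concatenate([noisy_frame, accumulation_buffer, motion_vectors], axis=2)
        return network(inputs)

    noisy   = np.random.rand(16, 16, 3).astype(np.float32)   # current (noisy) tile
    history = np.random.rand(16, 16, 3).astype(np.float32)   # accumulated previous frames
    motion  = np.zeros((16, 16, 2), dtype=np.float32)        # per-pixel motion vectors

    # Stand-in "network": any callable mapping the stacked inputs to an RGB output.
    denoised = denoise_frame(noisy, history, motion, network=lambda x: x[..., :3])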

To facilitate this operation the graphics processor (shader core) in an embodiment has multiple tile buffers. Furthermore, the tile buffers are in an embodiment oversized to allow data (pixels) from adjacent tiles to be fetched and used at the same time, e.g. as the machine learning algorithms will typically require overlap from adjacent tiles. Thus adjacent tiles are in an embodiment processed as part of a quad to allow more efficient data access.
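
As a purely illustrative sketch of the overlap just described (the tile size, halo width and frame dimensions are illustrative assumptions only), the region to fetch for one tile can be extended by a halo of pixels from the adjacent tiles and clamped to the frame bounds:

    def tile_fetch_region(tile_x, tile_y, tile_size, halo, frame_w, frame_h):
        # Compute the region of the frame to fetch for one tile, including a 'halo'
        # of pixels from adjacent tiles so that kernels near the tile edge have the
        # overlap they need. The region is clamped to the frame bounds.
        x0 = max(tile_x * tile_size - halo, 0)
        y0 = max(tile_y * tile_size - halo, 0)
        x1 = min((tile_x + 1) * tile_size + halo, frame_w)
        y1 = min((tile_y + 1) * tile_size + halo, frame_h)
        return x0, y0, x1, y1

    # A 32x32 tile with a 1-pixel halo on each side (e.g. for a 3x3 kernel), in a 1920x1080 frame.
    print(tile_fetch_region(tile_x=2, tile_y=1, tile_size=32, halo=1, frame_w=1920, frame_h=1080))
    # -> (63, 31, 97, 65)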

In a similar fashion, when performing rasterisation-based rendering techniques, various super sampling/anti-aliasing techniques may be performed to try to improve the image quality. These may involve deep learning processes. For instance, when performing rasterisation-based rendering, it may again be desirable to try to reduce the amount of processing required, leading to lower quality images, but to then perform additional super sampling/anti-aliasing techniques to increase the image quality for output.

By performing the machine learning processing in the machine learning processing circuit, the other functional units in the shader core are freed to perform the processing that they are optimised for. That is, whilst the machine learning processing circuit is performing one or more machine learning processing tasks (deep learning super sampling, de-noising, etc.) on the image data currently in the tile buffers, the rest of the graphics processor can perform actual graphics processing in parallel, thereby maintaining graphics processing throughput. For instance, in the ray-tracing example given above, the graphics processor is free to cast further rays, e.g. to continue the ray tracing rendering process in parallel with the de-noising operation being performed for the current frame.

Thus a particular benefit of the technology described herein is that when the machine learning relates to a graphics processing job, the execution unit, texture mapper, etc., are free to perform the graphics processing that they are optimised for whilst the machine learning processing is performed in the machine learning processing circuit. Therefore, overall throughput and energy efficiency can be improved. This energy efficiency may be of particular importance for mobile devices such as smart phones or tablets, which are limited by their battery life, and wherein there may be a maximum power budget. Thus, in embodiments, the technology described herein is employed within a data processing system within a mobile device. However, the technology described herein may find utility within any suitable data processing systems that may include a graphics processor, and that may be used to perform machine learning processing.

The technology described herein can be used for all forms of output that a graphics processor may output. Thus, it may be used when generating frames for display, for render-to-texture outputs, etc. The output from the graphics processor is, in an embodiment, exported to external, e.g. main, memory, for storage and use.

In an embodiment, the graphics processor is part of an overall graphics (data) processing system that includes, e.g., and in an embodiment, a host processor (CPU) that, e.g., executes applications that require processing by the graphics processor. The host processor will send appropriate commands and data to the graphics processor to control it to perform graphics processing operations and to produce graphics processing output required by applications executing on the host processor. To facilitate this, the host processor should, and, in an embodiment does, also execute a driver for the graphics processor and a compiler or compilers for compiling programs to be executed by the programmable execution unit of the graphics processor.

The overall graphics processing system may, for example, include one or more of: a host processor (central processing unit (CPU)), the graphics processor (processing unit), a display processor, a video processor (codec), a system bus, and a memory controller.

The data processing system may further comprise a separate neural processing unit (NPU) that is also operable to perform operations under control of the host processor. For instance, the NPU may be connected to the host processor along the same interconnect as the graphics processor, but is otherwise independent of the graphics processor. However, an NPU is not essential, and a benefit of the technology described herein is that it may be possible to avoid the use of an NPU whilst still providing more efficient machine learning processing using the graphics processor.

Where the system does additionally comprise an NPU, a machine learning task could then be distributed between the host processor (central processing unit (CPU)), the graphics processor (processing unit) and the NPU, if that were desired.

The graphics processor and/or graphics processing system may also comprise, and/or be in communication with, one or more memories and/or memory devices that store the data described herein, and/or the output data generated by the graphics processor, and/or store software (e.g. (shader) programs) for performing the processes described herein. The graphics processor and/or graphics processing system may also be in communication with a display for displaying images based on the data generated by the graphics processor. For instance, the graphics processor may write out its frame buffer to memory, with a display processor then reading the frame buffer from memory for display. Various arrangements would be possible in this respect.

As will be appreciated from the above, in a graphics processing system that is operable in the manner of the technology described herein, in embodiments of the technology described herein at least, a compiler, e.g. executing on a host processor, will generate and issue to the graphics processor one or more shader programs that when executed will perform the required processing operations in accordance with the technology described herein, with the graphics processor (the programmable execution unit of the graphics processor) then executing the programs to perform the processing, and as part of that program execution exchanging the messages discussed above with the machine learning processing circuit of the graphics processor.

The technology described herein also extends to such an overall data processing system and the operation of that system.

Another embodiment of the technology described herein comprises a data processing system comprising:

    • a host processor; and
    • a graphics processor operable to perform operations under control of the host processor, wherein the graphics processor comprises:
      • a programmable execution unit operable to execute programs to perform graphics processing operations; and
      • a machine learning processing circuit operable to perform machine learning processing operations and in communication with the programmable execution unit internally to the graphics processor,
      • the graphics processor configured such that machine learning processing tasks can be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both.

A further embodiment of the technology described herein comprises a method of operating a data processing system, wherein the data processing system comprises:

    • a host processor; and
    • a graphics processor operable to perform operations under control of the host processor, wherein the graphics processor comprises:
      • a programmable execution unit operable to execute programs to perform graphics processing operations; and
      • a machine learning processing circuit operable to perform machine learning processing operations and in communication with the programmable execution unit internally to the graphics processor,
      • the graphics processor configured such that machine learning processing tasks can be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both;
    • the method comprising:
    • the host processor requesting the graphics processor to perform a machine learning processing task; and
    • the machine learning processing task being performed by the graphics processor using a combination of the programmable execution unit and the machine learning processing circuit.

As will be appreciated by those skilled in the art, these embodiments of the technology described herein can, and in an embodiment do, include any one or more or all of the features of the technology described herein.

The technology described herein can be implemented in any suitable system, such as a suitably configured micro-processor based system. In an embodiment, the technology described herein is implemented in a computer and/or micro-processor based system. The technology described herein is in an embodiment implemented in a portable device, such as, and in an embodiment, a mobile phone or tablet.

The various functions of the technology described herein can be carried out in any desired and suitable manner. For example, the functions of the technology described herein can be implemented in hardware or software, as desired. Thus, for example, unless otherwise indicated, the various functional elements, stages, units, and “means” of the technology described herein may comprise a suitable processor or processors, controller or controllers, functional units, circuitry, circuits, processing logic, microprocessor arrangements, etc., that are operable to perform the various functions, etc., such as appropriately dedicated hardware elements (processing circuitry/circuits), and/or programmable hardware elements (processing circuitry/circuits) that can be programmed to operate in the desired manner.

It should also be noted here that, as will be appreciated by those skilled in the art, the various functions, etc., of the technology described herein may be duplicated and/or carried out in parallel on a given processor. Equally, the various processing stages, etc., may share processing circuitry/circuits, etc., if desired.

The methods in accordance with the technology described herein may be implemented at least partially using software e.g. computer programs. It will thus be seen that when viewed from further embodiments the technology described herein provides computer software specifically adapted to carry out the methods herein described when installed on a data processor, a computer program element comprising computer software code portions for performing the methods herein described when the program element is run on a data processor, and a computer program comprising code adapted to perform all the steps of a method or of the methods herein described when the program is run on a data processing system. The data processor may be a microprocessor system, a programmable FPGA (field programmable gate array), etc.

The technology described herein also extends to a computer software carrier comprising such software which when used to operate a display processor, or microprocessor system comprising a data processor, causes in conjunction with said data processor said controller or system to carry out the steps of the methods of the technology described herein. Such a computer software carrier could be a physical storage medium such as a ROM chip, CD ROM, RAM, flash memory, or disk, or could be a signal such as an electronic signal over wires, an optical signal or a radio signal such as to a satellite or the like.

It will further be appreciated that not all steps of the methods of the technology described herein need be carried out by computer software and thus from a further broad embodiment the technology described herein provides computer software and such software installed on a computer software carrier for carrying out at least one of the steps of the methods set out herein.

The technology described herein may accordingly suitably be embodied as a computer program product for use with a computer system. Such an implementation may comprise a series of computer readable instructions either fixed on a tangible, non-transitory medium, such as a computer readable medium, for example, diskette, CD ROM, ROM, RAM, flash memory, or hard disk. It could also comprise a series of computer readable instructions transmittable to a computer system, via a modem or other interface device, over either a tangible medium, including but not limited to optical or analogue communications lines, or intangibly using wireless techniques, including but not limited to microwave, infrared or other transmission techniques. The series of computer readable instructions embodies all or part of the functionality previously described herein.

Those skilled in the art will appreciate that such computer readable instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including but not limited to, semiconductor, magnetic, or optical, or transmitted using any communications technology, present or future, including but not limited to optical, infrared, or microwave. It is contemplated that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation, for example, shrink wrapped software, preloaded with a computer system, for example, on a system ROM or fixed disk, or distributed from a server or electronic bulletin board over a network, for example, the Internet or World Wide Web.

FIG. 3 shows an exemplary system on-chip (SoC) graphics processing system 8 within which the technology described herein can be employed. As shown in FIG. 3, the graphics processing system 8 in the present embodiment comprises a host processor in the form of a central processing unit (CPU) 1, a graphics processor (GPU) 2, a display processor 3 and a memory controller 5.

As shown in FIG. 3, these units communicate via an interconnect 4 and have access to off-chip memory 6. In this system, the graphics processor 2 will render frames (images) to be displayed, and the display processor 3 will then provide the frames to a display panel 7 for display.

In use of this system, an application 13 such as a game, executing on the host processor (CPU) 1 will, for example, require the display of frames on the display panel 7. To do this, the application will submit appropriate commands and data to a driver 11 for the graphics processor 2 that is executing on the CPU 1. The driver 11 will then generate appropriate commands and data to cause the graphics processor 2 to render appropriate frames for display and to store those frames in appropriate frame buffers, e.g. in the main memory 6. The display processor 3 will then read those frames into a buffer for the display from where they are then read out and displayed on the display panel 7 of the display.

Other arrangements would of course be possible. For instance, rather than displaying the frames on a local display panel 7, the rendered frame may be transmitted over a network to a remote device for display.

Whilst its primary purpose within the graphics processing system 8 is to perform such graphics processing operations, the graphics processor (GPU) 2 may however also be used to perform more general purpose processing operations. That is, it has been recognised that a graphics processor may also find utility for various other types of processing that do not necessarily relate to graphics processing as such, but that involve operations similar to those performed during graphics processing, albeit on different data.

The present embodiments relate particularly to the operation of a graphics processor, e.g. in a graphics processing system as illustrated in FIG. 3, when the graphics processor is being used to perform machine learning processing, such as neural network processing. Neural network processing generally comprises plural layers of processing, wherein each layer performs an operation on an input feature map in order to generate an output feature map, as shown in FIG. 1 and FIG. 2, for example, and as described above. It will be appreciated that, whilst FIG. 1 and FIG. 2 show an example of a particular convolutional neural network for illustrative purposes, other examples would of course be possible, and the technology described herein can be applied to any suitable neural network processing and any suitable neural network architecture (e.g. comprising any suitable arrangement of layers, which may be arranged as a convolutional neural network, but could also be a recurrent neural network, etc., depending on the machine learning processing task in question). Also, whilst shown in FIG. 1 and FIG. 2 as a sequence of separate layers, it is also possible to combine the processing of a number of layers together (“layer fusion”).
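
By way of example only, the following sketch illustrates the general idea of layer fusion: a 1×1 convolution, bias and activation are applied in a single pass so that the intermediate (pre-activation) feature map never needs to be written out and read back. The choice of layers, the data layout and the use of NumPy are assumptions made purely for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_bias_relu_fused(ifm, weights, bias):
    """Fused 1x1 convolution + bias + ReLU over an (H, W, Cin) feature map.

    Fusing the three layers means the intermediate feature map is never stored,
    which is the point of layer fusion.  weights: (Cin, Cout), bias: (Cout,).
    """
    ofm = ifm @ weights + bias   # a 1x1 convolution is a per-position matrix multiply
    return relu(ofm)

ifm = np.random.rand(16, 16, 8).astype(np.float32)
w = np.random.rand(8, 4).astype(np.float32)
b = np.zeros(4, dtype=np.float32)
ofm = conv_bias_relu_fused(ifm, w, b)
assert ofm.shape == (16, 16, 4)
```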

Thus, the machine learning processing that is performed may generally relate to any suitable machine learning processing, e.g. using any suitable neural network.

The neural network processing operations described above could be performed by the graphics processor's shader core, e.g. entirely in compute shading. However, this can be inefficient as the graphics processor (GPU) 2 is not optimised for this work. Furthermore, this means that the graphics processor (GPU) 2 is prevented from performing the actual graphics processing operations that it is designed for.

Thus, according to the technology described herein, a dedicated machine learning processing circuit is provided within the graphics processor (GPU) 2, as will be explained further below.

FIG. 4 shows schematically the relevant elements and components of a graphics processor (GPU) 2 of the present embodiments.

As shown in FIG. 4, the graphics processor (GPU) 2 includes one or more shader (processing) cores 61, 62 together with a shared level 2 cache 64 which is operable to communicate with an off-chip memory system 6 (e.g. via an appropriate interconnect 4 and (dynamic) memory controller 5 as shown in FIG. 3). In the configuration shown in FIG. 4 a compression unit (compressor) 63 is provided that is operable to compress data as it is written back into the level 2 cache 64 (and conversely to decompress data as it is loaded from the level 2 cache 64 for use by the graphics processor (GPU) 2). In FIG. 4, the compression unit (compressor) 63 is thus a combined compression and decompression unit. However, there could be separate compression and decompression units, if that were desired. Other arrangements would however be possible. For instance, in FIG. 4, the compression (and decompression) unit 63 is associated with the level 2 cache 64. However, compression and decompression units could alternatively (or additionally) be provided within the shader (processing) cores 61, 62. As another example, the compression/decompression could take place between the level 2 cache 64 and the external memory system 6, e.g. such that data is stored in the cache in uncompressed form, but compressed as it is written from the cache 64 to external memory 6.
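
A very simplified, purely illustrative model of such a compression arrangement is sketched below: data is compressed as it is written to the store and decompressed as it is read back. The use of zlib and the line-based interface are assumptions for the example only and do not correspond to the actual compression scheme used.

```python
import zlib

class CompressedLineStore:
    """Toy model of a compressor sitting between producers and a cache/memory store.

    Data is compressed as it is written back and decompressed as it is loaded,
    loosely mirroring a compression unit associated with a level 2 cache.
    """
    def __init__(self):
        self._lines = {}

    def write_line(self, address, data: bytes):
        self._lines[address] = zlib.compress(data)

    def read_line(self, address) -> bytes:
        return zlib.decompress(self._lines[address])

store = CompressedLineStore()
store.write_line(0x1000, bytes(64))           # a 64-byte line of zeros compresses well
assert store.read_line(0x1000) == bytes(64)
```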

FIG. 4 shows schematically the relevant configuration of one shader core 61, but as will be appreciated by those skilled in the art, any further shader cores of the graphics processor (GPU) 2 will be configured in a corresponding manner.

The graphics processor (GPU) shader cores 61, 62 comprise programmable processing units (circuits) in the form of execution engine 65 that perform processing operations by running small programs (often referred to as “shader” programs) for each “item” in an output to be generated such as a render target, e.g. frame. (An “item” in this regard may be, e.g. a vertex, one or more sampling positions, etc.) The shader cores will process each “item” by means of one or more execution threads which will execute the instructions of the shader program(s) in question for the “item” in question. Typically, there will be multiple execution threads each executing at the same time (in parallel).

As will be appreciated by those skilled in the art there may be other elements of the graphics processor (GPU) 2 that are not illustrated in FIG. 4. It should also be noted here that FIG. 4 is only schematic, and that, for example, in practice the shown functional units may share significant hardware circuits, even though they are shown schematically as separate units in FIG. 4. It will also be appreciated that each of the elements and units, etc., of the graphics processor as shown in FIG. 4 may, unless otherwise indicated, be implemented as desired and will accordingly comprise, e.g., appropriate circuits (processing logic), etc., for performing the necessary operation and functions.

As shown in FIG. 4, each shader core of the graphics processor (GPU) 2 includes an appropriate programmable execution unit (execution engine) 65 that is operable to execute graphics shader programs for execution threads to perform graphics processing operations.

The shader core 61 also includes an instruction cache 66 that stores instructions to be executed by the programmable execution unit 65 to perform graphics processing operations.

The shader core 61 also includes an appropriate load/store unit 76 in communication with the programmable execution unit 65, that is operable, e.g., to load into an appropriate cache, data, etc., to be processed by the programmable execution unit 65, and to write data back to the memory system (via the level 2 cache 64) (for data loads and stores for programs executed in the programmable execution unit).

As shown in FIG. 4, the shader core 61 also includes a texture mapper unit in the form of texture mapping apparatus 74, which is in communication with the programmable execution unit 65, and which is operable to perform texturing operations. The texture mapping apparatus 74 includes suitable processing circuitry to follow texturing instructions. In the present embodiments, this processing circuitry is in the form of one or more dedicated hardware elements that are configured appropriately. The texture mapping apparatus 74 is in an embodiment also operable to fetch data from the memory system (although this is not shown in FIG. 4).

The graphics processor also includes local storage in the form of one or more tile buffers 75. For instance, the graphics processor, when performing (normal) tile-based graphics processing, is operable to write data into these tile buffers 75. The tile buffers 75 can also be re-purposed for storing machine learning data when the graphics processor is performing a machine learning processing task.

In order to perform graphics processing operations, the programmable execution unit 65 will execute graphics shader programs (sequences of instructions) for respective execution threads (e.g. corresponding to respective sampling positions of a frame to be rendered). Accordingly, as shown in FIG. 4, the shader core 61 further comprises a fragment thread creator (generator) 72 operable to generate execution threads for execution by the programmable execution unit 65 as desired.

A job controller (job control interface) 77 is also provided that receives requests for processing work to be performed by the graphics processor (GPU) 2 from the host processor (CPU) 1 and issues respective processing tasks to the shader cores 61, 62 accordingly. The job controller (job control interface) 77 is generally able to schedule any desired processing work for the graphics processor (GPU) 2, including normal graphics processing work, as well as compute and machine learning processing work.

To facilitate the performance of machine learning processing work using the graphics processor (GPU) 2, the shader cores of the graphics processor (GPU) 2 are each provided with a respective machine learning processing circuit (neural processing accelerator, “NPA”) 78 that is operable to communicate with the execution engine internally to the graphics processor. In this way, processing work can be distributed between the functional units, as desired. Various options are contemplated in this regard and in general the work may be distributed between the machine learning processing circuit (NPA) 78 and execution engine 65 in various suitable ways.

For instance, the machine learning processing work may be initially triggered by the job controller (job control interface) 77 issuing a suitable processing task to the graphics processor (GPU) 2. The execution engine 65 may then execute an appropriate program to perform the processing task which program includes one or more instructions relating to machine learning processing operations to be performed by the machine learning processing circuit (NPA) 78.

When the execution engine 65 encounters and executes such instructions, the execution engine 65 can then message the machine learning processing circuit (NPA) 78 appropriately to cause the machine learning processing circuit (NPA) 78 to perform the desired processing operations.
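
The following sketch models this message-based hand-off in Python, purely for illustration; the message fields, class names and the synchronous call are assumptions chosen for the example and do not correspond to any actual hardware interface.

```python
from dataclasses import dataclass

@dataclass
class MLMessage:
    """Hypothetical message from the execution engine to the ML processing circuit."""
    op: str                 # e.g. "conv2d"
    input_buffer: int       # tile buffer index holding the input feature map
    output_buffer: int      # tile buffer index to receive the output feature map
    descriptor_addr: int    # address of the weights / layer descriptor in memory

class MachineLearningCircuit:
    def handle(self, msg: MLMessage):
        # Perform the requested operation and signal completion back to the sender.
        print(f"NPA: performing {msg.op}, "
              f"tile buffer {msg.input_buffer} -> {msg.output_buffer}")
        return "done"

class ExecutionEngine:
    def __init__(self, npa: MachineLearningCircuit):
        self.npa = npa

    def execute_ml_instruction(self):
        # Encountering an ML instruction in the shader program triggers a message
        # to the machine learning circuit rather than doing the work in-shader.
        msg = MLMessage(op="conv2d", input_buffer=0, output_buffer=1,
                        descriptor_addr=0x8000_0000)
        return self.npa.handle(msg)

engine = ExecutionEngine(MachineLearningCircuit())
engine.execute_ml_instruction()
```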

As shown in FIG. 4, the machine learning processing circuit (NPA) 78 has an interface to the tile buffers 75 and also to the shader core interconnect, and hence the level 2 cache 64. The machine learning processing circuit (NPA) 78 is thus operable to utilise the graphics processor's resources to fetch machine learning data from memory via the level 2 cache 64 and to temporarily store it, e.g. in the tile buffers 75 and/or level 2 cache 64, when performing machine learning processing.

In the example shown in FIG. 4, the machine learning processing circuit (NPA) 78 is not able to perform all of the required machine learning processing work for the current machine learning processing task. The machine learning processing circuit (NPA) 78 is thus able to send messages to the shader cores' compute shader endpoint (CSE) 73 to spawn threads for the execution engine 65 to perform the work.

In this example, the machine learning processing may thus be triggered by the execution engine 65 but is then managed by the machine learning processing circuit (NPA) 78, with the machine learning processing circuit (NPA) 78 causing the execution engine 65 to perform some of the processing work as desired. Other arrangements would however be possible.

For instance, FIG. 5 shows another example where the machine learning processing circuit (NPA) 78 is able to perform more (e.g. all) of the machine learning processing. Therefore, the machine learning processing circuit (NPA) 78 in this example may not need to message the compute shader endpoint (CSE) 73 to spawn threads. The machine learning processing circuit (NPA) 78 could still be requested to perform work by the execution engine 65, as described above, or directly from the job controller (job control interface) 77, as illustrated in FIG. 5.

In other arrangements, the processing may be performed under the control of the execution engine 65, with the job controller 77 requesting work to be performed by the execution engine 65, and the execution engine 65 being able to message the machine learning processing circuit (NPA) 78 to perform processing work with the result then being written out and/or otherwise returned for use by the execution engine 65 accordingly.

Thus, in the present embodiments, the machine learning processing circuit (NPA) 78, is operable to communicate with the execution engine 65 internally to the graphics processor in order to distribute processing work between the machine learning processing circuit (NPA) 78 and the execution engine 65, as desired.

Various options would be possible in this regard and in general a graphics processor of the technology described herein may be operated in either of the manners described above, or according to some combination of these approaches, depending on the processing task in question.

For instance, the machine learning processing tasks being performed by the graphics processor may generally comprise any suitable and desired machine processing task. In embodiments, this involves processing of a convolutional neural network, as shown in FIGS. 1 and 2.

FIG. 6 shows schematically one approach for dividing the processing of a convolutional neural network between the machine learning processing circuit (NPA) 78 and the execution engine 65.

In FIG. 6, the processing of the convolutional layers is performed by the machine learning processing circuit (NPA) 78 and the machine learning processing circuit (NPA) 78 is accordingly configured and optimised for performing such convolutions. However, the pooling operations and processing of any fully-connected layers is in this example still performed by the execution engine 65.

Other examples would of course be possible. For instance, the machine learning processing circuit (NPA) 78 could also be configured to perform at least some of the pooling operations, with these only being offloaded to the execution engine 65 in particularly complex cases. Likewise, the machine learning processing circuit (NPA) 78 may be configured only for some types of convolutions (for example, 3×3×c convolutions), with other, e.g. more complex, convolutions (for example, non-3×3×c convolutions) being passed to the execution engine 65. Or, only part of the convolution operation may be performed by the machine learning processing circuit (NPA) 78, with other parts of the convolution operation performed by the execution engine 65. For example, the MAC operations may be performed using the machine learning processing circuit (NPA) 78, with the bias and activation functions being performed by the execution engine 65. Various examples would be possible in this regard. Typically, the processing of any fully-connected layers will be performed by the execution engine 65, but this is not necessary and this processing could also be offloaded to the machine learning processing circuit (NPA) 78, as desired.
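
One possible dispatch policy along these lines is sketched below, purely for illustration; the layer descriptions and the specific rule (3×3 convolutions to the machine learning processing circuit, everything else to shader programs on the execution engine) are assumptions chosen to illustrate the kind of split described above.

```python
def dispatch_layer(layer):
    """Decide where to run one layer of a network: the dedicated machine learning
    circuit ("NPA") or the programmable execution engine ("EE").

    One possible split: the NPA handles the common 3x3 convolutions, while pooling,
    fully-connected layers and less common convolution shapes fall back to the EE.
    """
    if layer.get("kind") == "conv" and layer.get("kernel") == (3, 3):
        return "NPA"
    return "EE"

network = [
    {"kind": "conv", "kernel": (3, 3), "channels": 32},
    {"kind": "pool", "window": (2, 2)},
    {"kind": "conv", "kernel": (5, 5), "channels": 64},
    {"kind": "fully_connected", "outputs": 10},
]
for layer in network:
    print(layer["kind"], "->", dispatch_layer(layer))
# conv -> NPA, pool -> EE, conv -> EE, fully_connected -> EE
```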

Thus, in general, a given machine learning task could be performed either entirely using the machine learning processing circuit (NPA) 78, entirely using the execution engine 65, or some combination of both.

The machine learning task may be any suitable and desired machine learning task. For instance, the task may relate to a generic training or inference job. However, the machine learning processing work may itself relate to a graphics processing operation. An example of this would be ray tracing de-noising, as illustrated schematically in FIGS. 7 and 8.

Ray tracing is a known rendering process which involves tracing the paths of rays of light from a viewpoint (sometimes referred to as a “camera”) back through sampling positions in an image plane into a scene, and simulating the effect of the interaction between the rays and objects in the scene. The output data value for, e.g., a sampling position in the image is determined based on the object(s) in the scene intersected by the ray passing through the sampling position, and the properties of the surfaces of those objects. The ray tracing calculation is complex, and involves determining, for each sampling position, a set of objects within the scene which a ray passing through the sampling position intersects.

Thus, after performing ray tracing using a first set of rays, the initial output frame may be relatively noisy. A neural network may thus be trained to transform noisy images into smoother frames, e.g. for output. This process is illustrated in FIG. 7. In FIG. 7, the de-noising is performed by analysing (only) the current frame. However, it is also possible to perform de-noising by analysing the current frame together with one or more previous (noisy or de-noised) frames, as shown in FIG. 8.

Thus, as shown in FIG. 7, when performing a ray tracing rendering process, the graphics processor may be operated to generate an initial (noisy) output frame. This processing will typically be performed or at least managed by the execution engine 65, in the normal way for a ray tracing process. This will involve casting a certain number of rays in order to generate an initial output frame (step 80). Because ray-tracing processing can be computationally expensive, it may only be possible to cast relatively few rays within the desired frame rate. This can therefore lead to noisy images. Thus, it may be desirable to perform “de-noising” to try to generate better frames for output.

Once an initial (noisy) output frame is generated (step 82), the execution engine 65 may then message the machine learning processing circuit (NPA) 78 to perform the desired de-noising operations (with the machine learning processing circuit (NPA) 78 either performing the de-noising entirely itself, or passing some of this work back to the execution engine 65, as explained above) (step 84). The de-noising process thus generates a final, smoother frame for output (step 86). The final frame can then be written out, e.g. to a framebuffer, ready for output, e.g. in the normal way.

The process in FIG. 8 is similar except that the de-noising algorithm (step 84) additionally takes as input information regarding one or more previous frames. For instance, a number of previous (de-noised) frames may be accumulated in a suitable accumulation buffer, and then used, together with respective per-pixel motion vectors indicating relative movement between those frames and the current frame, as part of the de-noising process to generate the final frame (step 86).
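
For illustration only, the sketch below shows one simple way the per-pixel motion vectors might be used to reproject the accumulated history before blending it with the current noisy frame; the blend factor, the clamping behaviour and the exponential-average blend are assumptions for the example, not part of the technology described herein.

```python
import numpy as np

def reproject_history(history, motion_vectors):
    """Fetch, for each pixel, the history sample its motion vector points back to.

    history        : (H, W, 3) accumulated previous frames
    motion_vectors : (H, W, 2) per-pixel (dx, dy) offsets from current to previous frame
    """
    h, w = history.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(xs + motion_vectors[..., 0], 0, w - 1).astype(int)
    src_y = np.clip(ys + motion_vectors[..., 1], 0, h - 1).astype(int)
    return history[src_y, src_x]

def accumulate(noisy_frame, history, motion_vectors, alpha=0.1):
    """Blend the reprojected history with the current noisy frame (exponential average)."""
    reprojected = reproject_history(history, motion_vectors)
    return alpha * noisy_frame + (1.0 - alpha) * reprojected

noisy = np.random.rand(64, 64, 3).astype(np.float32)
history = np.random.rand(64, 64, 3).astype(np.float32)
mv = np.zeros((64, 64, 2), dtype=np.float32)
blended = accumulate(noisy, history, mv)
assert blended.shape == (64, 64, 3)
```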

By offloading the de-noising processing to the machine learning processing circuit (NPA) 78, at least in part, this means that the execution engine 65 is then free to continue with the ray tracing process, e.g. by casting further rays, and so on. That is, the machine learning processing circuit (NPA) 78 can perform the de-noising process simultaneously with the other functional units performing graphics processing operations. This can therefore provide a particularly efficient approach for performing such machine learning processing within a graphics processing job.

It can be seen from the above that the technology described herein, in its embodiments at least, can provide a more efficient process for performing machine learning processing using a graphics processor. This is achieved, in the embodiments of the technology described herein at least, by using a dedicated machine learning processing circuit within the graphics processor to perform at least some processing operations for the machine learning processing task to be performed, but with other processing for the task in an embodiment being performed by executing an appropriate shader program or programs using a programmable execution unit of the graphics processor.

The foregoing detailed description has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the technology to the precise form disclosed. Many modifications and variations are possible in the light of the above teaching. The described embodiments were chosen in order to best explain the principles of the technology and its practical application, to thereby enable others skilled in the art to best utilise the technology in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope be defined by the claims appended hereto.

Claims

1. A graphics processor comprising:

a programmable execution unit operable to execute programs to perform graphics processing operations; and
a machine learning processing circuit operable to perform processing operations for machine learning processing tasks and in communication with the programmable execution unit internally to the graphics processor,
the graphics processor configured such that machine learning processing tasks can be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both.

2. The graphics processor of claim 1, being configured such that, when the execution unit is executing a program including an instruction that relates to a set of machine learning operations to be performed by the machine learning processing circuit: in response to the execution unit executing the instruction, the programmable execution unit is caused to message the machine learning processing circuit to cause the machine learning processing circuit to perform the set of machine learning processing operations.

3. The graphics processor of claim 2, wherein the machine learning processing circuit is configured to return a result of its processing to the execution unit for further processing.

4. The graphics processor of claim 1, wherein the machine learning processing circuit, when performing a machine learning processing task, is operable to cause the execution unit to perform one or more processing operations for the machine learning processing task being performed by the machine learning processing circuit.

5. The graphics processor of claim 4, wherein the machine learning processing circuit is operable to trigger the generation of threads for execution by the programmable execution unit to cause the execution unit to perform the one or more processing operations for the machine learning processing task being performed by the machine learning processing circuit.

6. The graphics processor of claim 1, wherein the machine learning processing circuit comprises one or more multiply-and-accumulate circuits.

7. The graphics processor of claim 1, wherein the graphics processor includes a cache system for transferring data to and from an external memory, and wherein the machine learning processing circuit has access to the graphics processor's cache system.

8. The graphics processor of claim 7, wherein when a machine learning processing task is to be performed using the graphics processor, the graphics processor is operable to fetch required input data for the machine learning processing task via the cache system, and write an output of the machine learning processing task to memory via the cache system.

9. The graphics processor of claim 7, further comprising compression and decompression circuits for compressing and decompressing data as it is transferred between the graphics processor and the external memory.

10. The graphics processor of claim 1, comprising a plurality of programmable execution units, arranged as respective shader cores, with each shader core having its own respective machine learning processing circuit, and wherein an overall job controller of the graphics processor is operable to distribute processing tasks between the different shader cores.

11. The graphics processor of claim 1, wherein the graphics processor is configured to perform tile-based rendering, in which graphics data is stored in one or more tile buffers, and wherein when performing a machine learning processing task at least some data for the machine learning processing task is stored using the tile buffers.

12. A method of operating a graphics processor, the graphics processor comprising:

a programmable execution unit operable to execute programs to perform graphics processing operations; and
a machine learning processing circuit operable to perform machine learning processing operations and in communication with the programmable execution unit internally to the graphics processor,
the graphics processor configured such that machine learning processing tasks can be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both;
the method comprising:
the graphics processor performing a machine learning task using a combination of the programmable execution unit and the machine learning processing circuit.

13. The method of claim 12, further comprising: when the execution unit is executing a program including an instruction that relates to a set of machine learning operations to be performed by the machine learning processing circuit: in response to the execution unit executing the instruction, the programmable execution unit messaging the machine learning processing circuit to cause the machine learning processing circuit to perform the set of machine learning processing operations.

14. The method of claim 12, further comprising: the machine learning processing circuit, when performing a machine learning processing task, causing the execution unit to perform one or more processing operations for the machine learning processing task being performed by the machine learning processing circuit.

15. The method of claim 14, wherein the machine learning processing circuit causes the execution unit to perform one or more processing operations by triggering the generation of an execution thread, which execution thread when executed by the execution unit causes the execution unit to perform the one or more processing operations for the machine learning processing task being performed by the machine learning processing circuit.

16. The method of claim 12, comprising the machine learning processing circuit returning a result of its processing to the execution unit for further processing.

17. The method of claim 12, wherein the graphics processor includes a cache system for transferring data to and from an external memory, and wherein the machine learning processing circuit has access to the graphics processor's cache system, and wherein when a machine learning processing task is to be performed using the graphics processor, the graphics processor fetches required input data for the machine learning processing task via the cache system, and writes an output of the machine learning processing task to memory via the cache system.

18. The method of claim 17, further comprising compressing data as it is written to memory and/or decompressing data as it is retrieved from memory.

19. The method of claim 12, comprising a plurality of programmable execution units, arranged as respective shader cores, with each shader core having its own respective machine learning processing circuit, and wherein an overall job controller of the graphics processor is operable to distribute processing tasks between the different shader cores.

20. The method of claim 12, wherein the graphics processor is configured to perform tile-based rendering, in which graphics data is stored in one or more tile buffers, and wherein the method comprises: when performing a machine learning processing task, storing at least some data for the machine learning processing task using the tile buffers.

Patent History
Publication number: 20240036932
Type: Application
Filed: Jul 26, 2023
Publication Date: Feb 1, 2024
Applicant: Arm Limited (Cambridge)
Inventors: Daren Croxford (Swaffham Prior), Sharjeel Saeed (Cambridge), Isidoros Sideris (Cambridge)
Application Number: 18/359,002
Classifications
International Classification: G06F 9/50 (20060101); G06T 15/00 (20060101);