METHOD AND PROCESSING UNIT FOR GENERATING AN OUTPUT FEATURE MAP

A method for generating output feature map data during operation of neural network processing by a processing unit comprising a plurality of computation resources. The method comprises obtaining first, real, data to be processed and loading the first data into a set of the plurality of computation resources, causing the set of computation resources to generate a computational result, in a first processing cycle of the processing unit. A lack of real data for processing in a second processing cycle of the processing unit, which is subsequent to the first processing cycle, is detected. The method comprises obtaining second, artificial, data, loading the second data into an artificially activated set, of the set of computation resources, in the second processing cycle, inhibiting the second data from affecting the output feature map data, and generating the output feature map data based at least in part on the computational result.

Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to United Kingdom Application No. GB 2207130.2, filed May 16, 2022, under 35 U.S.C. § 119(a). The above-referenced patent application is incorporated by reference in its entirety.

BACKGROUND

Technical Field

The present invention relates to a method and processing unit for generating an output feature map.

Background

Neural networks have emerged as powerful tools for image processing, inference, machine learning, and related tasks. Neural networks may include convolutional layers. In a convolutional layer, an output feature map (OFM) comprising OFM data is computed via convolutions between input feature map (IFM) data of an IFM, and a matrix of weights.

The convolutional computations account for a significant portion of the computational cost of performing inference or training for a neural network, both in terms of processing time and in terms of the power required to switch bits within registers. Since these computations are performed repeatedly during inference or training, specialised integrated circuits called hardware accelerators have been developed.

A neural processing unit (NPU) is a hardware accelerator which is specialised for processing data in accordance with neural networks, for example, convolutional neural networks (CNNs). An NPU includes an array of specialised convolution engines (CEs), which each contain multiply-accumulate (MAC) hardware to perform convolutional operations.

The power consumed by the MAC hardware varies widely between different neural network models and different IFMs. In some scenarios, the power consumed by the MAC hardware can be significantly higher than that consumed by the rest of the NPU. Processing of an input data array can lead to sudden and extreme changes in NPU power consumption in a single clock cycle, which may go beyond the limits of what the surrounding power supply system can handle. This in turn can lead to sudden increases or decreases in on-chip voltage, which may result in hold, set-up and clock skew violations, and may cause the NPU to crash.

It is desirable to reduce the amount of change in power consumption by the NPU between clock cycles, and/or to reduce the average time derivative of the power consumed by the NPU.

SUMMARY

According to a first aspect of the present invention, there is provided a method for generating output feature map data during operation of neural network processing by a processing unit. The processing unit comprises a plurality of computation resources. The method comprises obtaining first, real, data to be processed. The method comprises loading the first data into a set of the plurality of computation resources, causing the set of computation resources to generate a computational result, in a first processing cycle of the processing unit. The method comprises detecting a lack of real data for processing in a second processing cycle of the processing unit, which is subsequent to the first processing cycle. The method comprises obtaining second, artificial, data. The method comprises loading the second data into an artificially activated set, of the set of computation resources, in the second processing cycle. The method comprises inhibiting the second data from affecting the output feature map data. The method comprises generating the output feature map data based at least in part on the computational result.

By loading artificial data into the computation resources during a processing cycle in which less data, or no data at all, would otherwise be processed by the computation resources, more power may be consumed by the computation resources. This reduces the amount of change in power consumption by the processing unit and, in turn, may reduce the risk of hold, set-up and clock skew violations.

Optionally, the second processing cycle is immediately subsequent to the first processing cycle. This may mean that the power consumption of the processing unit stays substantially the same across adjacent processing cycles.

Optionally, the method comprises deriving the second data from real data. This may mean that the amount of power consumed by the processing unit during the second processing cycle is similar to the amount of power consumed by the processing unit during processing of real data.

Optionally, the method comprises obtaining preceding, real, data to be processed before the first data. The method may further comprise loading the preceding data into a preceding set, of the set of computation resources, causing the preceding set of computation resources to generate a preceding computational result, in a preceding processing cycle of the processing unit. The method may further comprise deriving the second data from the preceding data. The method may further comprise generating the output feature map data based at least in part on the preceding computational result. This may mean that a number of bits of the computation resources which are toggled during the first processing cycle is similar to a number of bits of the computation resources which are toggled during the second processing cycle.

Optionally, the method comprises maintaining, in the second processing cycle, a deactivated subset of the set of computation resources in an idle state. This may mean that the power consumption by the processing unit decreases during the second processing cycle.

Optionally, the method comprises obtaining third, artificial, data. The method may further comprise loading the third data into a further artificially activated set, of the set of computation resources, in a third processing cycle of the processing unit subsequent to the second processing cycle. The method may further comprise inhibiting the third data from affecting the output feature map data. The method may further comprise maintaining, in the third processing cycle, a further deactivated subset of the set of computation resources in the idle state. Further, the further deactivated subset may be larger than the deactivated subset. This may mean that the power consumption by the processing unit decreases gradually from the first processing cycle to the third processing cycle.

Optionally, the method comprises obtaining ramping, real, data. The method may further comprise loading the ramping data into a ramping activated subset, of the set of computation resources, causing the ramping activated subset of computation resources to generate a ramping computational result, in a ramping processing cycle of the processing unit prior to the first processing cycle. The method may further comprise maintaining, in the ramping processing cycle, a ramping deactivated subset of the set of computation resources in an idle state. The method may further comprise generating the output feature map data based at least in part on the ramping computational result. This may mean that the power consumption by the processing unit may increase gradually between a processing cycle in which it is processing no data, and the first processing cycle.

Optionally, the method comprises obtaining further ramping, real, data. The method may further comprise loading the further ramping data into a further ramping activated subset, of the set of computation resources, causing the further ramping activated subset of computation resources to generate a further ramping computational result, in a further ramping processing cycle of the processing unit prior to the ramping processing cycle. The method may further comprise maintaining, in the further ramping processing cycle, a further ramping deactivated subset of the set of computation resources in the idle state. The method may further comprise generating the output feature map data based at least in part on the further ramping computational result. Further, the ramping deactivated subset may comprise at least one computation resource of the further ramping activated subset of computation resources. This may mean that the buffering of data is reduced, while enabling data to be provided synchronously.

Optionally, the lack of real data comprises at least one of a lack of real input feature map data and a lack of real weights, and the second, artificial, data comprises artificial input feature map data and artificial weights. This may mean that the change in power consumption is reduced as compared with, for example, loading real weights and artificial input feature map data.

Optionally, each computation resource of the plurality of computation resources comprises at least one multiply-accumulate unit, each multiply-accumulate unit configured to multiply a portion of input feature map data by at least one weight.

Optionally, the method comprises causing the artificially activated set to generate an artificial computational result in the second processing cycle. Further, inhibiting the second data from affecting the output feature map data may comprise discarding the artificial computational result. This may mean that more power may be consumed by the computation resources in the second processing cycle.

Optionally, each of the set of computation resources is associated with a buffer. The method may further comprise synchronously providing the first data to the buffers associated with the set of computation resources. The method may further comprise synchronously providing the second data to the buffers associated with the artificially activated set of computation resources. Further, loading the first data into the set of computation resources may comprise, for each of the set of computation resources, loading the first data from the buffer associated with the computation resource into the computation resource. Still further, loading the second data into the artificially activated set may comprise, for each of the artificially activated set of computation resources, loading the second data from the buffer associated with the computation resource into the computation resource. This may mean that the input feature map data may be provided synchronously.

Optionally, the method comprises, after detecting the lack, lengthening a processing cycle duration of the processor so that a duration of the second processing cycle is greater than a duration of the first processing cycle. This may mean that the average time derivative of the power consumption by the processing unit may be reduced.

Optionally, the processing unit is a neural processing unit.

According to a second aspect of the present invention, there is provided a processing unit configured to perform a method according to the first aspect.

Further features and advantages of the invention will become apparent from the following description of preferred embodiments of the invention, given by way of example only, which is made with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example of a multiply-accumulate operation performed during calculation of OFM data based on IFM data.

FIG. 2 is a schematic diagram of a neural processing unit which falls outside of the scope of the claims.

FIG. 3a shows a timeline for approximate power consumption of a neural processing unit which falls outside of the scope of the claims.

FIG. 3b shows a timing diagram for processing of IFM data which falls outside of the scope of the claims.

FIG. 4 is a schematic diagram of a neural processing unit according to the invention.

FIG. 5a shows a timeline for approximate power consumption of a neural processing unit according to the invention.

FIG. 5b shows a timing diagram for processing of IFM data according to the invention.

FIG. 6 shows a method for generating output feature map data according to the invention.

DETAILED DESCRIPTION

Details of systems and methods according to examples will become apparent from the following description with reference to the figures. In this description, for the purposes of explanation, numerous specific details of certain examples are set forth. Reference in the specification to ‘an example’ or similar language means that a feature, structure, or characteristic described in connection with the example is included in at least that one example but not necessarily in other examples. It should be further noted that certain examples are described schematically with certain features omitted and/or necessarily simplified for the ease of explanation and understanding of the concepts underlying the examples.

Neural networks are typically constructed from three types of layers. An input layer is the initial data for the neural network. An output layer provides the results for given inputs. One or more hidden layers are provided between the input layer and the output layer. The hidden layers may include convolutional layers. Other layers such as pooling layers and deconvolution layers and other structures such as recurrent neural networks may be present. In a convolutional layer, output feature map (OFM) data is generated via convolutions between input feature map (IFM) data and a set of weights.

FIG. 1 shows an example of a multiply-accumulate operation 10 performed during calculation of OFM data using IFM data. The multiply-accumulate operation uses IFM data values (X1, X2 and X3), weights (W1, W2 and W3), and an activation function 11 to generate an OFM data value Y. Each IFM data value X1, X2, X3 is multiplied by a corresponding weight W1, W2, W3. The results of the multiplications of the IFM data values with their corresponding weights are added together in step 10 to generate an accumulated result. The generation of the sum from the IFM data values and the weights may be referred to as taking a dot product of an IFM vector comprising the IFM data values and a weight vector comprising the corresponding weights. The activation function 11 is applied to the accumulated result to generate the OFM data value Y. The activation function 11 may be, for example, a sigmoid function or a hyperbolic tangent function.

There may be more than one OFM data value calculated based on a given set of IFM data values from the IFM. In such a case, a dot product between the same IFM data values X1, X2 and X3 and a different set of weights corresponds to a different OFM data value.
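Purely as an illustrative sketch (in Python, which does not form part of the claimed subject matter), the operation of FIG. 1, and the reuse of the same IFM data values with different weight sets, may be expressed as follows; the sigmoid activation and the three-element vectors are assumptions chosen for the example.

    import math

    def mac(ifm_values, weights, activation=lambda a: 1.0 / (1.0 + math.exp(-a))):
        # Dot product of the IFM vector and the weight vector (step 10),
        # followed by the activation function 11 (here a sigmoid).
        accumulated = sum(x * w for x, w in zip(ifm_values, weights))
        return activation(accumulated)

    # IFM data values X1 to X3 and weights W1 to W3 give one OFM data value Y.
    y1 = mac([0.5, -1.0, 2.0], [0.1, 0.4, -0.2])

    # The same IFM data values with a different set of weights give a
    # different OFM data value.
    y2 = mac([0.5, -1.0, 2.0], [-0.3, 0.2, 0.5])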

FIG. 2 is a schematic diagram of a neural processing unit which falls outside of the scope of the claims. An NPU 200 is configured to accelerate the performance of calculations associated with neural networks by, amongst other things, efficiently performing multiply-accumulate operations described above in connection with FIG. 1 to generate OFM data.

The NPU 200 comprises IFM generators such as IFM generator 210. The IFM generators are arranged to provide IFM data to an array of MAC elements 220. During a given processing cycle, such as a clock cycle, an IFM generator may synchronously provide the same set of IFM data to every MAC element in the column of the array of MAC elements 220 corresponding to the IFM generator. The set of IFM data provided by one IFM generator is typically different from the IFM data provided by a different IFM generator. However, each row of MAC elements collectively receives the same set of data from the IFM generators. One row of MAC elements may be referred to as a computation resource or a convolution engine (CE) such as CE 221. The NPU 200 comprises a plurality of computation resources. The NPU 200 may comprise eight computation resources. The IFM generators may obtain the IFM data from a storage medium internal or external to the NPU 200 (not shown). IFM data that is to be used to generate OFM data may be referred to as real IFM data.

The NPU 200 comprises weight generators such as weight generator 230. The weight generators are arranged to provide weights to the array of MAC elements 220. One weight generator provides the same set of weights to every MAC element in the CE corresponding to the weight generator. The set of weights provided by one weight generator may be different from the set of weights provided by a different weight generator. In this way, each CE receives a different set of weights. The weight generators may receive the weights from a storage medium internal or external to the NPU 200 (not shown).

Each MAC element is configured to multiply at least one IFM data value by at least one weight. Each MAC element may be configured to multiply an IFM vector by a weight vector in accordance with the method shown in FIG. 1. Each CE 221 is associated with an accumulator buffer such as accumulator buffer 240. Each MAC element transfers the result of its multiplication to its associated accumulator buffer. Each accumulator buffer accumulates the results of the multiplications from each MAC element of its associated CE to produce an accumulated result. Each accumulator buffer transfers its accumulated result to an associated OFM channel such as OFM channel 250. The activation function 11 is applied to each accumulated result to generate OFM data values. The OFM is generated from the OFM data values.
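The dataflow just described may be made concrete with a minimal behavioural sketch, assuming for simplicity that each IFM generator supplies one data value per cycle and that each MAC element holds one weight; the 8 by 8 dimensions are illustrative only.

    NUM_CES = 8      # rows of the array: convolution engines
    NUM_COLS = 8     # columns of the array: one IFM generator per column

    def process_cycle(ifm_per_column, weights_per_ce):
        # One processing cycle of the MAC array of FIG. 2. Every MAC element
        # in a column receives the value from that column's IFM generator;
        # every MAC element in a CE (row) uses that CE's weights. Each CE's
        # products are summed into its associated accumulator buffer.
        accumulator_buffers = []
        for ce in range(NUM_CES):
            acc = 0.0
            for col in range(NUM_COLS):
                acc += ifm_per_column[col] * weights_per_ce[ce][col]
            accumulator_buffers.append(acc)
        return accumulator_buffers

    ifm = [1.0] * NUM_COLS
    weights = [[0.1 * ce] * NUM_COLS for ce in range(NUM_CES)]
    print(process_cycle(ifm, weights))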

FIG. 3a shows a timeline for approximate power consumption of a neural processing unit which falls outside of the scope of the claims. FIG. 3b shows a timing diagram for processing of IFM data which falls outside of the scope of the claims. The timeline of FIG. 3a pertains to the NPU 200 for which the timing of the processing of IFM data is shown in FIG. 3b.

Prior to and subsequent to processing of the IFM data, the CEs may be in an idle state. Furthermore, CEs may be idle during processing of an IFM, if there is a temporary lack of real IFM data or real weights for the CEs to process. In a given time period, the IFM generators may obtain and provide real IFM data if they have real IFM data, or may not provide real IFM data if there is a lack of real IFM data for them to provide. Similarly, the weight generators may provide weights if they have weights, or may not provide real weights if there is a lack of real weights for them to provide.

CEs 0 to 7 each comprise at least one MAC element. The total number of MAC elements in the array of MAC elements 220 may be 4096, and the array of MAC elements 220 may be evenly divided into CEs, such that each CE comprises 512 MAC elements.

An IFM is divided into IFM blocks 0 to 9. Each IFM block comprises a set of IFM data values. In the present example, each IFM block is obtained by the IFM generators and provided synchronously to each CE. IFM block 0 is obtained and provided first. In this example, IFM block 0 is loaded into each CE in a first clock cycle of the NPU 200. During the first clock cycle, each CE may multiply at least one IFM data value by at least one weight and transfer the result of its multiplication to its associated accumulator buffer, as described with reference to FIG. 2.

Loading IFM block 0 into each CE typically comprises toggling at least some of the bits of each CE. This may consume power. In addition, loading the weights into the CEs, performing the multiplications, performing number format conversions, and transferring the results to an accumulator buffer may also consume power. It should be noted that toggling the bits of each CE may be a major contributor to the amount of power consumed during a given clock cycle of the processing unit. The amount of toggling required may depend on the magnitude of the IFM data.

Prior to the loading of IFM block 0 into each CE, the CEs may have been in an idle state, where no IFM data is being loaded into any of the CEs. In such a case, the loading of IFM block 0 into each CE means that the voltage across and power consumed by the array of MAC elements 220 of the NPU 200 increases during the first clock cycle. The increase in power consumed by the NPU 200 may be up to 4 W, and this increase may occur over approximately 1 nanosecond. This sudden increase, or power transient, may result in hold, set-up and clock skew violations, which can sometimes cause the NPU to crash.

IFM block 1 is the next IFM block to be obtained by the IFM generators and provided to each of the CEs. IFM block 1 is loaded into each CE in a second clock cycle of the NPU 200 immediately subsequent to the first clock cycle.

Loading IFM block 1 into each CE typically comprises toggling at least some of the bits of each CE. The amount of toggling required may depend on the IFM data in IFM blocks 0 and 1. For example, if a bit of a CE comprises a data value of IFM block 0 in the first clock cycle, and the data value of IFM block 1 to be loaded during the second clock cycle is the same as the data value of IFM block 0 in the first clock cycle, the bit may not need to be toggled. However, if these two data values are different, then the bit may need to be toggled. The amount of power consumed in toggling these bits may be proportional to the number of bits that need to be toggled. The amount of power consumed may depend on the number of sign changes between the data values of IFM block 0 and the data values of IFM block 1.

IFM block 2 is the next IFM block to be obtained by the IFM generators and provided to each of the CEs. IFM block 2 is loaded into each CE in a third clock cycle of the NPU 200 immediately subsequent to the second clock cycle. As mentioned above, the amount of power consumed in toggling the bits of the CEs is proportional to the number of bits that need to be toggled, which depends on the IFM data in IFM blocks 1 and 2. Therefore, the amount of power consumed in toggling the bits during the second clock cycle may be different from the amount of power consumed in toggling the bits during the third clock cycle.
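The dependence of power on toggled bits described in the preceding paragraphs may be sketched as a simple XOR popcount; the 8-bit unsigned representation and the function name are assumptions made for illustration.

    def toggled_bits(previous_block, next_block, bits=8):
        # Number of register bits that change when next_block replaces
        # previous_block: a rough proxy for the dynamic power consumed.
        mask = (1 << bits) - 1
        return sum(bin((a ^ b) & mask).count("1")
                   for a, b in zip(previous_block, next_block))

    # Identical data values toggle no bits; maximally different values
    # toggle every bit.
    assert toggled_bits([5, 7], [5, 7]) == 0
    assert toggled_bits([0b00000000], [0b11111111]) == 8

    # XOR is symmetric, so reloading a previous block on top of the next
    # block toggles exactly the same bits; the artificial ramp-downs
    # described later exploit this by deriving artificial blocks from
    # preceding real blocks.
    assert toggled_bits([3, 9], [12, 6]) == toggled_bits([12, 6], [3, 9])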

The loading of IFM blocks 3 to 5 into the CEs is performed in successive clock cycles of the NPU 200 subsequent to the third clock cycle, in a similar manner to that described above.

After IFM block 5 is loaded into each of the CEs, there is a lack of data to be loaded into the CEs. This may result from the IFM generators 210 or the weight generators 230 being unable to provide data at the speed at which it is processed by the CEs. In any case, there is a lack of IFM data to be processed in a seventh clock cycle of the NPU 200.

As a result, during the seventh clock cycle, each of the CEs becomes idle. No IFM data is loaded into the CEs, and the CEs do not perform any multiplication of any IFM data with any weights. The bits of the CEs are not toggled to load new IFM data into the CEs.

In turn, the power consumed by the NPU 200 decreases substantially. In a similar fashion to the sudden increase in power caused by the loading of IFM block 0 into each CE, the sudden decrease in power consumed may result in hold, set-up and clock skew violations, which can sometimes cause the NPU to crash. This lack of IFM data to be processed lasts for one clock cycle. IFM blocks 6 and 7 are loaded into each CE in the eighth and ninth clock cycles of the NPU 200 respectively. This is followed by a second lack of IFM data which lasts for the tenth and eleventh clock cycles of the NPU 200.

In the twelfth and thirteenth clock cycles of the NPU 200, IFM blocks 8 and 9 respectively are loaded into each CE. This completes the loading of the IFM into the CEs. From the fourteenth clock cycle of the NPU 200 onwards, no IFM data or weights are loaded into the CEs. As a result, each of the CEs becomes idle. In turn, the power consumed by the NPU decreases substantially.

As mentioned previously, sudden decreases and increases in power consumed by the array of MAC elements 220 may result in hold, set-up and clock skew violations. It is therefore desirable to provide a method for computing an output feature map which has less steep or smaller decreases and increases in power. One example of such a method will be described with reference to FIGS. 4 and 5.

FIG. 4 is a schematic diagram of a neural processing unit according to the invention. The neural processing unit may be configured to process IFM data according to the timing diagram of FIG. 5b. In order to mitigate a power transient, an artificial data injector 410, an IFM buffer 420 for each MAC element, and/or a weight buffer 430 for each CE may be provided.

The NPU 400 may comprise IFM generators, an array of MAC elements 220, and/or weight generators, arranged to perform at least some of the operations described with reference to FIG. 2.

The IFM generators such as IFM generator 210 may provide real IFM data to IFM buffers such as IFM buffer 420. The IFM buffers may be configured to store IFM blocks. A MAC element may load IFM data from an IFM buffer associated with that MAC element. In this example, there is one IFM buffer associated with each MAC element.

While only one CE is shown, it is to be understood that the NPU comprises a plurality of CEs, for example 8 CEs. Each MAC element of each CE may be associated with an IFM buffer.

In addition to providing real IFM data to IFM buffers, the IFM generators may also provide real IFM data to the artificial data injector 410. The artificial data injector 410 may derive artificial IFM data from the real IFM data. Additionally, or alternatively, artificial IFM data may be pre-programmed into the artificial data injector 410.

In general, real data refers to data input to the NPU 400 for processing during normal operation. Real data is used to generate computational results, and output feature map data is generated based at least in part on such computational results. Put another way, real data refers to data that, when processed, contributes to the output feature map generated by the computation resources.

In contrast, artificial data refers to data which may be derived from real data by, or pre-programmed into, the artificial data injector 410, and which may be loaded into the CEs during clock cycles in which less data, or no data at all, would otherwise be processed by the CEs. In this way, more power is consumed by the computation resources. This reduces the amount of change in power consumption, or current drawn, by the processing unit between clock cycles in which real data is processed and clock cycles in which less or no real data is processed, which in turn may reduce the risk of hold, set-up and clock skew violations. The artificial data is inhibited from affecting the OFM.

The artificial data injector 410 may store concurrently up to two real IFM blocks such as two of IFM blocks 0 to 9 described with reference to FIG. 3b. The two IFM blocks may be provided to the artificial data injector 410 in succession by the IFM generators.

The weight generators such as weight generator 230 may provide weights to weight buffers such as weight buffer 430. The weight buffers may be configured to store weights. While only weight buffer 430 is shown, it is to be understood that there may be one weight buffer associated with each CE. In addition to providing weights to the weight buffers, the weight generators may also provide weights to the artificial data injector 410. The artificial data injector 410 may derive artificial weights from the weights. Additionally, or alternatively, artificial weights may be pre-programmed into the artificial data injector 410.

The artificial data injector 410 may receive a synchronization signal. The synchronization signal may indicate, 5 to 7 clock cycles in advance, the availability of real IFM data or real weights to be provided to the CEs. For example, in the sixth clock cycle in the example of FIG. 3b, IFM block 5 is being loaded into the CEs, while in the seventh clock cycle, no IFM data is being loaded into the CEs. In the method of FIG. 4, this lack of data to be loaded into the CEs during the seventh clock cycle may be detected 5 to 7 clock cycles earlier. This may be described as a falling edge on the synchronization signal.

In general, detecting a lack of real data for processing in a given processing cycle (e.g. clock cycle) refers to detecting that there will be a lack of real data to be loaded into at least one CE during a given processing cycle later than the current processing cycle, wherein during the current processing cycle the lack is detected. This means that at least one CE may be idle or under-utilised during the given processing cycle if no data is loaded into and/or processed by said at least one CE during the given processing cycle. Said at least one CE may be idle if there is no real IFM data or real weights to be loaded during the given processing cycle. Alternatively, said at least one CE may be under-utilised if there is a shortage of real data to be loaded during the given processing cycle. Said at least one CE may alternatively be under-utilised if the content of the real data to be loaded during the given processing cycle and the content of the data to be loaded during the processing cycle directly preceding the given processing cycle are such that the power consumed by the CE during the given processing cycle is different from the power consumed by the CE during the processing cycle directly preceding the given processing cycle.

Nevertheless, during the current processing cycle during which the lack is detected, real data may be loaded into and processed by said at least one CE.

The presence of IFM block 6 to be loaded into the CEs during the eighth clock cycle, subsequent to the lack of data to be loaded into the CEs during the seventh clock cycle, may be detected 5-7 clock cycles earlier. This may be described as a rising edge on the synchronization signal.

The artificial data injector 410 may take action upon detecting a falling edge or rising edge on the synchronization signal. When the artificial data injector 410 detects a falling edge on the synchronization signal, it may trigger an artificial ramp-down. The artificial data injector 410 may obtain artificial IFM data. It may provide the artificial IFM data to at least a subset of the IFM buffers. The CEs may load the artificial IFM data from the IFM buffers. The artificial data injector 410 may obtain artificial weights. It may provide the artificial weights to at least a subset of the weight buffers. The CEs may load the artificial weights from the weight buffers.
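The edge-driven behaviour just described might be modelled as below; the boolean-list representation of the synchronization signal and the function name are assumptions made for illustration, not the claimed hardware.

    def injector_events(sync_signal):
        # sync_signal holds one boolean per clock cycle: True when real data
        # will be available 5 to 7 cycles later, False when it will be lacking.
        events = []
        for cycle in range(1, len(sync_signal)):
            if sync_signal[cycle - 1] and not sync_signal[cycle]:
                events.append((cycle, "falling edge: trigger artificial ramp-down"))
            elif not sync_signal[cycle - 1] and sync_signal[cycle]:
                events.append((cycle, "rising edge: trigger artificial ramp-up"))
        return events

    # Availability of IFM blocks 0 to 9 as in FIG. 3b: a one-cycle gap after
    # block 5 and a two-cycle gap after block 7.
    availability = [True] * 6 + [False] + [True] * 2 + [False] * 2 + [True] * 2
    print(injector_events(availability))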

Each MAC element may multiply at least one artificial weight by at least one artificial IFM data value to generate at least part of an artificial computational result. The artificial computational result may be discarded. The artificial computational result may not be transferred to the accumulator buffers.

FIG. 5a shows a timeline for approximate power consumption of a neural processing unit according to the invention. FIG. 5b shows a timing diagram for processing of IFM data according to the invention. The timeline of FIG. 5a pertains to the NPU 400 for which the timing of the processing of IFM data is shown in FIG. 5b.

The artificial data injector 410 may receive the synchronization signal indicating the availability of IFM blocks 0 to 9, the synchronization signal shown by the timing diagram of FIG. 3b. However, the IFM blocks are not loaded into or processed by the CEs in accordance with the timing diagram of FIG. 3b. Instead, they are loaded in accordance with the timing diagram of FIG. 5b.

Prior to IFM block 0, the CEs may not process any IFM data and may be in the idle state. The artificial data injector 410 may detect a rising edge on the synchronization signal. This may trigger an initial artificial ramp-up.

The artificial data injector 410 may synchronously provide IFM block 0 to the IFM buffers. Alternatively, the IFM generators may synchronously provide IFM block 0 to the IFM buffers. When the IFM data provided to the IFM buffers is real, the IFM generators may synchronously provide the IFM data to the IFM buffers.

The CEs may each load IFM block 0 from their respective IFM buffers at different times. This process is referred to as “staggering”. Firstly, IFM block 0 may be loaded into CE0 and processed (i.e. used to perform a multiplication with a set of weights, with the result of the multiplication possibly being transferred to an accumulator buffer) in clock cycle 1. During clock cycle 1, the remaining CEs 1 to 7 may be in the idle state. It is to be understood that, in this example, during a clock cycle in which a CE is not loading an IFM block or an artificial IFM block, it is in the idle state. In clock cycle 2, CE1 and CE2 may load and process IFM block 0. In clock cycle 3, CEs 3 to 5 may load and process IFM block 0.

After CE3, CE4 and CE5 have loaded and processed IFM block 0, in clock cycle 4, CE6 and CE7 may load and process IFM block 0. This completes the processing of IFM block 0.

Prior to the loading and processing of IFM block 0 by CE6 and CE7, and subsequent to the provision of IFM block 0 to the IFM buffers, the artificial data injector 410 or the IFM generators may synchronously provide IFM block 1 to the IFM buffers. In general, the synchronous provision of IFM blocks may occur when there is sufficient free memory space in all of the IFM buffers. During clock cycle 4, CE0 and CE1 may load and process IFM block 1. The simultaneous processing of different IFM blocks by different CEs despite the synchronous provision of individual IFM blocks may be enabled by the IFM buffers.

In clock cycle 5, CEs 2 to 6 may load and process IFM block 1. Then, in clock cycle 6, CE7 may load and process IFM block 1, while CEs 0 to 4 load and process IFM block 2. In clock cycle 7, CEs 0 to 3 may load and process IFM block 3, and CEs 5 to 7 may load and process IFM block 2. In clock cycle 8, CEs 0 to 3 may load and process IFM block 4, and CEs 4 to 7 may load and process IFM block 3.

This completes the initial artificial ramp-up. In each clock cycle during the initial artificial ramp-up, the number of CEs loading and processing IFM blocks (also referred to as activated CEs) may increase by 1, and the number of CEs in the idle state (also referred to as deactivated CEs) may decrease by 1. The timeline for approximate power consumption of FIG. 5a may also indicate the number of active CEs during each clock cycle. By gradually raising the number of active CEs, compared with activating all CEs in one clock cycle, the amount of power consumed by the array of MAC elements 220 may change by a smaller amount per clock cycle, and furthermore, the average time derivative of the power consumed between the start and end of the initial artificial ramp-up may be reduced. These effects may reduce the risk of hold, set-up and clock skew violations, and hence the risk of the NPU 400 crashing.

Furthermore, a CE may be deactivated after it has been activated during the initial artificial ramp-up. For example, CE0 may be activated for one clock cycle when it loads and processes IFM block 0, and may be deactivated for two subsequent clock cycles before loading IFM block 1. By deactivating CE0, some of the other CEs can load and process IFM block 0 before CE0 loads IFM block 1. Compared with activating one CE at a time and not deactivating any CEs during the initial artificial ramp-up, this method may allow the sizes of the IFM buffers to be reduced while still enabling synchronous provision of IFM blocks.

In the method depicted by FIG. 5b, a “ceiling” may start at CE0 in clock cycle 1, with only one CE (namely CE0) activated. The ceiling may move up by two CEs in clock cycle 2, with only the two CEs below it activated (namely CE1 and CE2). In each subsequent clock cycle, the ceiling may move up by one more CE than in the previous clock cycle, with the number of activated CEs below the ceiling likewise increasing by 1. When the ceiling reaches the final CE, it may restart at the first CE and continue moving without stalling. The CEs may process the IFM blocks in the order in which they are provided, as shown in the sketch below.
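The following sketch reproduces the ceiling pattern under the assumption of eight CEs, matching the example of FIG. 5b; the schedule representation is an illustrative choice only.

    def ramp_up_schedule(num_ces=8, cycles=8):
        # In clock cycle k the ceiling advances by k CEs (wrapping past the
        # final CE back to CE0), and the k CEs it has just passed over are
        # the ones activated in that cycle.
        schedule = []
        ceiling = 0
        for k in range(1, cycles + 1):
            window = min(k, num_ces)
            active = sorted((ceiling + i) % num_ces for i in range(window))
            ceiling = (ceiling + window) % num_ces
            schedule.append(active)
        return schedule

    for cycle, active in enumerate(ramp_up_schedule(), start=1):
        print(f"clock cycle {cycle}: activated CEs {active}")
    # clock cycle 1: [0]; cycle 2: [1, 2]; cycle 3: [3, 4, 5];
    # cycle 4: [0, 1, 6, 7]; ...; cycle 8: all eight CEs.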

While there is no rising or falling edge on the synchronization signal, i.e. while the provision of IFM blocks by the artificial data injector 410 or the IFM generators is continuous, the CEs may continue to load and process IFM blocks in the order in which they are provided. CEs 4 to 7 may load and process IFM block 4 while CEs 0 to 3 load and process IFM block 5 in clock cycle 9.

The artificial data injector 410 may detect a falling edge on the synchronization signal. This may represent a lack of IFM data. Additionally, or alternatively, it may represent a lack of weights. This may trigger a first artificial ramp-down.

The artificial data injector 410, upon detecting the falling edge on the synchronization signal, may generate artificial IFM data and/or artificial weights. Typically, the artificial data injector 410 generates artificial IFM data and artificial weights, even if the falling edge on the synchronization signal represents only one of a lack of IFM data and a lack of weights. The artificial IFM data may be derived from real IFM data, such as the IFM data of any of IFM blocks 0 to 9.

CEs 0 to 3 may have already loaded and processed IFM block 5. The artificial data injector 410 may provide artificial IFM block 4 to the IFM buffers associated with CEs 0 to 2. In this example, artificial IFM block 4 may be derived from IFM block 4. Artificial IFM block 4 may comprise the same data as IFM block 4.

The artificial data injector 410 may provide artificial weights to the weight buffers associated with CEs 0 to 2. The artificial weights may be derived from weights previously used in processing IFM blocks 0 to 5. The artificial weights provided to weight buffers associated with different CEs may be different. The artificial weights provided to a given buffer may be the same as the weights that were provided to that buffer to process IFM block 4.

CEs 0 to 2 may load artificial IFM block 4 from their associated IFM buffers in clock cycle 10. Loading artificial IFM data may consume power, as compared with being in the idle state.

As described with reference to FIG. 3b, loading an IFM block, whether real or artificial, typically comprises toggling bits in the MAC elements of the CEs. In order to load IFM block 5 into one of CEs 0 to 2 when IFM block 4 was the previously loaded block in that CE, the bits representing data values that differ between IFM block 4 and IFM block 5 may be toggled. In the case that artificial IFM block 4 consists of the same data as IFM block 4, loading artificial IFM block 4 after IFM block 5 may toggle the same bits, because the bits that need to be toggled are again those representing data values that differ between IFM block 4 and IFM block 5. This means that the power consumed in toggling bits for any of CEs 0 to 2 may be the same when loading IFM block 5 as when loading artificial IFM block 4.

CEs 0 to 2 may load artificial weights from their associated weight buffers. CEs 0 to 2 may process artificial IFM block 4, i.e. use artificial IFM block 4 and/or the artificial weights to generate an artificial computational result. Processing artificial IFM block 4 may consume power, bringing the power consumed during this clock cycle closer to the power consumed in the immediately preceding clock cycle.

Artificial IFM data is inhibited from affecting the OFM. For example, CEs 0 to 2 may load artificial IFM block 4 from their associated IFM buffers, but they may not process artificial IFM block 4. Alternatively, if CEs 0 to 2 process artificial IFM block 4, they may discard the generated artificial computational result, and/or not transfer the artificial computational result to the accumulator buffers.

In clock cycle 10, CEs 4 to 7 may load and process IFM block 5.

One clock cycle after detecting the falling edge on the synchronization signal, the artificial data injector 410 may detect a rising edge on the synchronization signal. The artificial data injector 410 may begin a second artificial ramp-up. In this example, since only CE3 is in the idle state, only CE3 needs to be activated in the artificial ramp-up. All of the CEs may load and process IFM block 6 in clock cycle 11. This completes the second artificial ramp-up.

The CEs may each load and process IFM block 7 in clock cycle 12. This may be followed by a second artificial ramp-down due to the lack of IFM data lasting two clock cycles as shown by FIG. 3b. The second artificial ramp-down may additionally, or alternatively, be caused by a lack of weights. The second artificial ramp-down may proceed in a similar fashion to the first artificial ramp-down. During the first clock cycle of the second artificial ramp-down, clock cycle 13, CEs 1 to 7 may load artificial IFM block 6, and CE0 may be in the idle state. During the second clock cycle of the second artificial ramp-down, clock cycle 14, CEs 0 and 3 to 7 may load artificial IFM block 7, while CEs 1 and 2 may be in the idle state.

The artificial data injector 410 may detect a rising edge on the synchronization signal. This may trigger a third artificial ramp-up. CEs 1 to 7 may load and process IFM block 8, while CE0 is in the idle state, in clock cycle 15. Then, CEs 1 to 7 may load and process IFM block 9, while CE0 loads and processes IFM block 8, in clock cycle 16. This completes the third artificial ramp-up.

The artificial data injector 410 may detect a falling edge on the synchronization signal after IFM block 9. This may trigger a third artificial ramp-down. IFM block 9 may be the final IFM block of the IFM. While CE0 loads and processes IFM block 9 in clock cycle 17, CEs 2 to 7 may load artificial IFM block 8. In this example, the number of idle CEs increases by 1, and the number of activated CEs decreases by 1, in each clock cycle. The ramp-down may be defined by the above-mentioned ceiling, which defines the CEs which are activated, i.e. loading real or artificial IFM data, in a similar fashion to as described with reference to the initial ramp-up of FIG. 5b. During an artificial ramp-down, the ceiling may move identically as compared with the initial artificial ramp-up but in the opposite direction. During the clock cycles in which a given CE is activated, it may alternately process artificial IFM blocks 8 and 9. The third artificial ramp-down may be completed when the number of activated CEs is zero.

By loading artificial IFM data into the CEs during clock cycles where no data would otherwise be loaded into the CEs, more power may be consumed by the CEs. By gradually reducing the number of activated CEs, compared with reducing the number of activated CEs to zero more quickly, the risk of hold, set-up and clock skew violations may be reduced. As compared with reducing the number of the activated CEs while still loading and processing real IFM data, the IFM data may be processed earlier (in that the processing of real IFM data finishes earlier than the end of a ramp down), meaning that the impact on the performance of the NPU 400 may be reduced.

Storing and/or generating artificial IFM data and/or artificial weights in the artificial data injector 410 and loading the artificial IFM data and/or artificial weights into the CEs in response to detecting a lack of real IFM data and/or real weights for processing may enable a ramp-down to be performed without prior indication that there will be a falling edge on the synchronization signal, as compared to ramping down using real IFM data.

While not depicted in FIG. 5b, the clock cycle duration of the processing unit may be varied. For example, the clock cycle duration during an artificial ramp-down or an artificial ramp-up may be longer than the clock cycle duration when no artificial ramp-down or artificial ramp-up is in progress.

By lengthening the clock cycle duration during an artificial ramp-up or an artificial ramp-down, the average time derivative of the power consumed between the start and end of the artificial ramp-up or the artificial ramp-down may be reduced. By shortening the clock cycle duration outside of these times, the performance (speed with which IFM data is processed) of the NPU 400 may be improved.
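A minimal numeric illustration of this trade-off follows, using purely illustrative figures (the 4 W swing being the order of magnitude mentioned in connection with FIG. 3a):

    def average_power_derivative(power_delta_w, cycles, cycle_duration_ns):
        # Average time derivative of power across a ramp, in watts per
        # nanosecond: lengthening the cycle duration lowers the derivative.
        return power_delta_w / (cycles * cycle_duration_ns)

    print(average_power_derivative(4.0, 8, 1.0))  # normal cycles: 0.5 W/ns
    print(average_power_derivative(4.0, 8, 1.5))  # lengthened cycles: ~0.33 W/ns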

FIG. 6 shows a method, performed by a processing unit, for generating output feature map data during operation of neural network processing. The processing unit comprises a plurality of computation resources, such as the convolution engines CE0 to CE7 of FIG. 5b. Each computation resource of the plurality of computation resources may comprise at least one multiply-accumulate unit, each multiply-accumulate unit configured to multiply a portion of input feature map data by at least one weight. Each of the set of computation resources may be associated with a buffer such as IFM buffer 420 or weight buffer 430.

The processing unit may be the NPU 400. In such a case, the NPU 400 is specialised for processing data in accordance with neural networks. The NPU 400 may be configured to execute the same set of instructions for each IFM it receives.

The method may begin with step S1. In step S1, first, real, data to be processed is obtained. The first data may be obtained by the artificial data injector 410. Additionally, or alternatively, the first data may be obtained by the IFM generators and/or the weight generators. Any real input feature map data described below may be obtained by the artificial data injector 410 and/or the IFM generators. The first data may be IFM block 7.

In step S2, the first data is loaded into a set of a plurality of computation resources, causing the set of computation resources to generate a computational result, in a first processing cycle of a processing unit. Prior to this, the method may comprise synchronously providing the first data to the buffers associated with the set of computation resources. The set of the plurality of computation resources may be CEs 0 to 7. The first processing cycle may be clock cycle 12. Loading the first data into the set of computation resources may comprise, for each of the set of computation resources, loading the first data from the buffer associated with the computation resource into the computation resource.

The method may comprise obtaining ramping, real, data. The ramping data may be one or both of IFM blocks 2 and 3. The method may comprise loading the ramping data into a ramping activated subset, of the set of computation resources, causing the ramping activated subset of computation resources to generate a ramping computational result, in a ramping processing cycle of the processing unit prior to the first processing cycle. The ramping activated subset may be CEs 0 to 3, CEs 5 to 7, or CEs 0 to 3 and 5 to 7. The ramping processing cycle may be clock cycle 7. The method may comprise maintaining, in the ramping processing cycle, a ramping deactivated subset of the set of computation resources in an idle state. The ramping deactivated subset may be CE4.

The method may comprise obtaining further ramping, real, data. The further ramping data may be IFM block 1 and/or IFM block 2. The method may comprise loading the further ramping data into a further ramping activated subset, of the set of computation resources, causing the further ramping activated subset of computation resources to generate a further ramping computational result, in a further ramping processing cycle of the processing unit prior to the ramping processing cycle. The further ramping activated subset may be CEs 0 to 4 and/or CE7. The further ramping processing cycle may be clock cycle 6. The method may comprise maintaining, in the further ramping processing cycle, a further ramping deactivated subset of the set of computation resources in the idle state. The further ramping deactivated subset may be CE5 and CE6. The ramping deactivated subset may comprise at least one computation resource of the further ramping activated subset of computation resources, such as CE4.

In step S3, a lack of real data for processing in a second processing cycle of the processing unit, which is subsequent to the first processing cycle, is detected. The second processing cycle may be clock cycle 13. In any case, the second processing cycle may be immediately subsequent to the first processing cycle. The lack may be detected by the artificial data injector 410. The artificial data injector 410 may, for example, detect a falling edge on the synchronization signal. The lack of real data may comprise at least one of a lack of real input feature map data and a lack of real weights. The method may comprise, after detecting the lack, lengthening a processing cycle duration of the processor so that a duration of the second processing cycle is greater than a duration of the first processing cycle.

In step S4, second, artificial, data to be processed is obtained. The second data may be derived from real data. The second data may comprise artificial input feature map data. The second data may comprise artificial weights. The real data may be preceding, real, data to be processed before the first data. The preceding data may be IFM block 6.

The method may comprise obtaining the preceding data. The method may comprise loading the preceding data into a preceding set, of the set of computation resources, causing the preceding set of computation resources to generate a preceding computational result, in a preceding processing cycle of the processing unit. The preceding processing cycle may be clock cycle 11. The preceding set may be CEs 0 to 7.

In step S5, the second data is loaded into an artificially activated set, of the set of computation resources, in the second processing cycle. Prior to this, the method may comprise synchronously providing the second data to the buffers associated with the artificially activated set of computation resources. The artificially activated set, of the set of computation resources, may be CEs 1 to 7. The second processing cycle may be clock cycle 13. The method may comprise maintaining, in the second processing cycle, a deactivated subset of the set of computation resources in an idle state. The deactivated subset may be CE0. The method may comprise causing the artificially activated set to generate an artificial computational result in the second processing cycle. Loading the second data into the artificially activated set may comprise, for each of the artificially activated set of computation resources, loading the second data from the buffer associated with the computation resource into the computation resource.

The method may comprise obtaining third, artificial, data. The third data may be IFM block 7. Alternatively, the third data may be IFM blocks 6 and 7. The method may comprise loading the third data into a further artificially activated set, of the set of computation resources, in a third processing cycle of the processing unit subsequent to the second processing cycle. The further artificially activated set may be CEs 3 to 7. The further artificially activated set may alternatively be CEs 0 and 3 to 7. The third processing cycle may be clock cycle 14. The method may comprise maintaining, in the third processing cycle, a further deactivated subset of the set of computation resources in the idle state. The further deactivated subset of the set of computation resources may be CEs 1 and 2. The further deactivated subset may be larger than the deactivated subset.

In step S6, the second data is inhibited from affecting the output feature map data. The method may comprise inhibiting the third data from affecting the output feature map data.

In step S7, the output feature map data is generated based at least in part on the computational result. The output feature map data may be generated based at least in part on the preceding computational result, the ramping computational result and/or the further ramping computational result.
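The flow of steps S1 to S7 may be condensed into the following sketch; the stream representation, the compute stand-in and the derive_artificial callback are assumptions made for illustration only.

    def compute(block):
        # Stand-in for one processing cycle of the MAC array.
        return sum(block)

    def generate_ofm_data(block_stream, derive_artificial):
        # block_stream yields a real IFM block per cycle, or None when real
        # data is lacking; derive_artificial builds artificial data from the
        # preceding real block.
        results = []
        previous_real = None
        for block in block_stream:
            if block is not None:
                # S1/S2: load real data; keep its computational result.
                results.append(compute(block))
                previous_real = block
            elif previous_real is not None:
                # S3 to S6: lack of real data detected; artificial data is
                # loaded and processed, but the artificial result is
                # discarded so that it cannot affect the OFM.
                _ = compute(derive_artificial(previous_real))
        # S7: the output feature map data is generated only from the real
        # computational results.
        return results

    ofm = generate_ofm_data([[1, 2], [3, 4], None, [5, 6]], lambda b: list(b))
    print(ofm)  # [3, 7, 11] - the None cycle contributes nothing.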

The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. Instead of gradually raising or reducing the number of activated CEs during an artificial ramp-up or ramp-down, the content of the artificial data provided in successive clock cycles could be adjusted so that the power consumed by the NPU 400 increases or decreases gradually. Furthermore, while in the presently described examples the same IFM data is provided synchronously to the IFM buffers, instead, the same weights may be provided synchronously to the weight buffers. It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Claims

1. A method for generating output feature map data during operation of neural network processing by a processing unit, the processing unit comprising a plurality of computation resources, the method comprising:

obtaining first, real, data to be processed;
loading the first data into a set of the plurality of computation resources, causing the set of computation resources to generate a computational result, in a first processing cycle of the processing unit;
detecting a lack of real data for processing in a second processing cycle of the processing unit, which is subsequent to the first processing cycle;
obtaining second, artificial, data;
loading the second data into an artificially activated set, of the set of computation resources, in the second processing cycle;
inhibiting the second data from affecting the output feature map data; and
generating the output feature map data based at least in part on the computational result.

2. The method of claim 1, wherein the second processing cycle is immediately subsequent to the first processing cycle.

3. The method of claim 1, comprising deriving the second data from real data.

4. The method of claim 3, comprising:

obtaining preceding, real, data to be processed before the first data;
loading the preceding data into a preceding set, of the set of computation resources, causing the preceding set of computation resources to generate a preceding computational result, in a preceding processing cycle of the processing unit;
deriving the second data from the preceding data; and
generating the output feature map data based at least in part on the preceding computational result.

5. The method of claim 1, comprising maintaining, in the second processing cycle, a deactivated subset of the set of computation resources in an idle state.

6. The method of claim 5, comprising:

obtaining third, artificial, data;
loading the third data into a further artificially activated set, of the set of computation resources, in a third processing cycle of the processing unit subsequent to the second processing cycle;
inhibiting the third data from affecting the output feature map data; and
maintaining, in the third processing cycle, a further deactivated subset of the set of computation resources in the idle state,
wherein the further deactivated subset is larger than the deactivated subset.

7. The method of claim 1, comprising:

obtaining ramping, real, data;
loading the ramping data into a ramping activated subset, of the set of computation resources, causing the ramping activated subset of computation resources to generate a ramping computational result, in a ramping processing cycle of the processing unit prior to the first processing cycle;
maintaining, in the ramping processing cycle, a ramping deactivated subset of the set of computation resources in an idle state; and
generating the output feature map data based at least in part on the ramping computational result.

8. The method of claim 7, comprising:

obtaining further ramping, real, data;
loading the further ramping data into a further ramping activated subset, of the set of computation resources, causing the further ramping activated subset of computation resources to generate a further ramping computational result, in a further ramping processing cycle of the processing unit prior to the ramping processing cycle;
maintaining, in the further ramping processing cycle, a further ramping deactivated subset of the set of computation resources in the idle state; and
generating the output feature map data based at least in part on the further ramping computational result,
wherein the ramping deactivated subset comprises at least one computation resource of the further ramping activated subset of computation resources.

9. The method of claim 1, wherein the lack of real data comprises at least one of a lack of real input feature map data and a lack of real weights, and the second, artificial, data comprises artificial input feature map data and artificial weights.

10. The method of claim 1, wherein each computation resource of the plurality of computation resources comprises at least one multiply-accumulate unit, each multiply-accumulate unit configured to multiply a portion of input feature map data by at least one weight.

11. The method of claim 1, comprising:

causing the artificially activated set to generate an artificial computational result in the second processing cycle,
wherein inhibiting the second data from affecting the output feature map data comprises discarding the artificial computational result.

12. The method of claim 1, wherein:

each of the set of computation resources is associated with a buffer,
the method comprises: synchronously providing the first data to the buffers associated with the set of computation resources; and synchronously providing the second data to the buffers associated with the artificially activated set of computation resources,
loading the first data into the set of computation resources comprises, for each of the set of computation resources, loading the first data from the buffer associated with the computation resource into the computation resource, and
loading the second data into the artificially activated set comprises, for each of the artificially activated set of computation resources, loading the second data from the buffer associated with the computation resource into the computation resource.

13. The method of claim 1, comprising:

after detecting the lack, lengthening a processing cycle duration of the processor so that a duration of the second processing cycle is greater than a duration of the first processing cycle.

14. The method of claim 1, wherein the processing unit is a neural processing unit.

15. A processing unit configured to perform the method of claim 1.

Patent History
Publication number: 20230367991
Type: Application
Filed: May 5, 2023
Publication Date: Nov 16, 2023
Inventors: Anders Per SJÖ (Södra Sandby), Fredrik Peter STOLT (Lund), Stefan Johannes FRID (Södra Sandby)
Application Number: 18/312,868
Classifications
International Classification: G06N 3/02 (20060101);