ARTIFICIAL INTELLIGENCE SEMICONDUCTOR CHIP HAVING WEIGHTS OF VARIABLE COMPRESSION RATIO
An artificial intelligence (AI) semiconductor having an embedded convolution neural network (CNN) may include a first convolution layer and a second convolution layer, in which the weights of the first layer and the weights of the second layer are quantized in different bit-widths, thus at different compression ratios. In a VGG neural network, the weights of a first group of convolution layers may have a different compression ratio than the weights of a second group of convolution layers. The weights of the CNN may be obtained in a training system including convolution quantization and/or activation quantization. Depending on the compression ratio, the weights of a convolution layer may be trained with or without re-training. An AI task, such as image retrieval, may be implemented in the AI semiconductor having the CNN described above.
This application claims the filing benefit of U.S. Provisional Application No. 62/821,437, filed Mar. 20, 2019 and U.S. Provisional Application No. 62/830,269, filed Apr. 5, 2019. These applications are incorporated by reference herein in their entirety and for all purposes.
FIELD
This patent document relates generally to systems and methods for compressing weights in an artificial intelligence solution. Examples of compressing weights in an artificial intelligence semiconductor chip with variable compression ratio are provided.
BACKGROUND
Artificial intelligence solutions are emerging with the advancement of computing platforms and integrated circuit solutions. For example, an artificial intelligence (AI) integrated circuit (IC) may include an accelerator capable of performing AI tasks in embedded hardware. Hardware accelerators have recently emerged and can quickly and efficiently perform AI functions, such as voice or image recognition, at the cost of precision in the input image tensor as well as the weights of the AI models. For example, in a hardware-based solution, such as an AI chip having an embedded convolution neural network (CNN) model, the bit-width of weights and/or parameters of the AI chip may be limited. For example, the weights of a convolution layer in the CNN in an AI chip may be constrained to 1-bit, 3-bit, or 5-bit values. Further, the memory size for storing the input and output of the CNN in the AI chip may also be limited.
In a deep convolutional neural network, compressing the weights of a CNN model to a lower bit-width may be used in hardware implementations of the network to meet the required computation power and to reduce the model size stored in local memory. For example, whereas most trained models use a floating point format to represent model parameters such as filter coefficients or weights, in a hardware implementation a model inside an AI chip may use a fixed point format with low bits to reduce both logic and memory space and to accelerate processing. However, direct quantization of the weights of a CNN model from floating point values to low-bit fixed point values may cause a loss of model accuracy and result in performance degradation of the AI chip. The performance degradation is particularly challenging for quantization of weights to less than 8-bit fixed point format.
This document is directed to systems and methods for addressing the above issues and/or other issues.
The present solution will be described with reference to the following figures, in which like numerals represent like items throughout the figures.
As used in this document, the singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art. As used in this document, the term “comprising” means “including, but not limited to.”
Examples of an “artificial intelligence logic circuit” or “AI logic circuit” include a logic circuit that is configured to execute certain AI functions, such as a neural network in AI or machine learning tasks. An AI logic circuit can be a processor. An AI logic circuit can also be a logic circuit that is controlled by an external processor and executes certain AI functions.
Examples of “integrated circuit,” “semiconductor chip,” “chip,” or “semiconductor device” include an integrated circuit (IC) that contains electronic circuits on semiconductor materials, such as silicon, for performing certain functions. For example, an integrated circuit can be a microprocessor, a memory, a programmable array logic (PAL) device, an application-specific integrated circuit (ASIC), or others. An integrated circuit that contains an AI logic circuit is referred to as an AI integrated circuit.
Examples of an “AI chip” include hardware- or software-based device that is capable of performing functions of an AI logic circuit. An AI chip can be physical or virtual. For example, a physical AI chip may include an embedded cellular neural network, which may contain weights and/or parameters of a convolution neural network (CNN) model. A virtual AI chip may be software-based. For example, a virtual AI chip may include one or more processor simulators to implement functions of a desired AI logic circuit of a physical AI chip.
Examples of “AI model” include data that include one or more weights that, when loaded inside an AI chip, are used by the AI chip in executing an AI task. For example, an AI model for a given CNN may include the weights, biases, and other parameters for one or more convolutional layers of the CNN. Here, the terms “weights” and “parameters” of an AI model are used interchangeably.
Examples of an AI task may include image recognition, voice recognition, object recognition, data processing and analysis, or any recognition, classification, or processing task that employs artificial intelligence technologies.
In a non-limiting example, a layer in the CNN 100 may include multiple convolutional filters, and each filter may include multiple weights. For example, the weights of a CNN model may include a mask (kernel) and a scalar for a given layer of the CNN model. The CNN may include a filter-wise scalar value (e.g., an integer). The CNN may also include a layer-wise value for the exponent (e.g., an integer value implemented with a shift). In some examples, an output channel of a CNN layer may include one or more bias values that, when added to the output of the output channel, adjust the output values to a desired range. A kernel in a CNN layer may be represented by multiple values in lower precision, whereas a scalar may be in higher precision. The weights of a CNN layer may include the multiple values in the kernel multiplied by the scalar. In quantizing the floating-point coefficients, there is a trade-off between the compression ratio (range of the fixed point) and precision. In some examples, a compression scheme may quantize the elements (coefficients) in the filter masks with low bits. For example, the quantization bits for the coefficients may be 1, 2, 3, 4, or 5 bits, or another suitable bit-width.
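The decomposition described above may be summarized in a small data layout. The following Python sketch is purely illustrative; the ConvFilter and ConvLayer names and fields are hypothetical, chosen to mirror the mask/scalar/shift/bias structure described in this paragraph, and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ConvFilter:
    mask: np.ndarray   # k x k kernel values, quantized to low bits (e.g., 1-5)
    scalar: int        # filter-wise scalar, kept in higher precision

@dataclass
class ConvLayer:
    filters: List[ConvFilter]
    shift: int         # layer-wise exponent, implemented as a bit shift
    bias: np.ndarray   # per-output-channel bias values

def effective_weights(f: ConvFilter, shift: int) -> np.ndarray:
    # Reconstruct a filter's weights: mask values times the filter-wise
    # scalar, scaled by the layer-wise power-of-two exponent.
    return f.mask * f.scalar * (2.0 ** shift)
```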
In some examples, various compression schemes may be adapted to various hardware constraints and devices, such as mobile phones and smart cameras, as computation resources and memory consumption vary with the application. In some scenarios, the first few convolution layers of the CNN may tend to be more sensitive to quantization, in terms of model accuracy, than the deeper layers. In some examples, the weights for different layers in the CNN may therefore have different quantization bits, and thus different compression ratios.
In quantizing the weights of a CNN model, in some examples, the CNN kernel may be approximated with a quantized filter kernel and a scalar: $W_i = \alpha_i W_i^q$, where $W_i^q$ is the quantized filter mask for the $i$th filter, with its elements quantized to variable bits (e.g., 1-bit, 2-bit, 3-bit, or other suitable bits) for different layers, and $\alpha_i$ is the scalar for the $i$th filter, which may be quantized to higher bits, such as 8 bits. To accommodate the dynamic range of the filter coefficients, in some examples, a layer-wise shift value can be used. The shift value may be quantized to 4 bits, for example. The bias of the CNN may be represented with 12-bit data, or other suitable bits.
To illustrate variable-bit compression for different layers, in some examples, for a 3×3 kernel, the compressor may compress the weights to 1-bit masks, which would require 9×1 (mask) + 8 (scalar) = 17 bits for each filter. This is a compression ratio of about 17× compared to a 32-bit floating point model (9×32 = 288 bits per filter). Alternatively, and/or additionally, for a 3×3 kernel, the compressor may compress the weights to 5-bit masks, which would require 9×5 (mask) + 8 (scalar) = 53 bits per filter, resulting in a compression ratio of about 5.4×, roughly equivalent to 6-bit fixed point quantization.
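The arithmetic above can be reproduced directly. A minimal Python check of the per-filter bit counts and compression ratios (the function name is illustrative, not from the patent):

```python
def filter_bits(k: int, mask_bits: int, scalar_bits: int = 8) -> int:
    # Bits for one k x k filter: k*k mask values plus one scalar.
    return k * k * mask_bits + scalar_bits

float_bits = 3 * 3 * 32  # 32-bit floating point baseline for a 3x3 kernel
for mb in (1, 5):
    b = filter_bits(3, mb)
    print(f"{mb}-bit masks: {b} bits/filter, ~{float_bits / b:.1f}x compression")
# 1-bit masks: 17 bits/filter, ~16.9x compression
# 5-bit masks: 53 bits/filter, ~5.4x compression
```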
In a non-limiting example, the CNN may be a VGG (e.g., VGG-16) deep neural network, which may have five layer groups Conv1-5, where each layer group may have multiple convolution layers. In some scenarios, the weights of Conv1-3 may be quantized to 3 bits for the masks and the weights of Conv4-5 to 1 bit for the masks.
In some examples, using fewer bits for the weights (a higher compression ratio) in subsequent layers of the CNN may significantly reduce the size of the model without a significant sacrifice in the performance of the CNN. Whereas direct quantization of weights (e.g., from 32 bits to 3 bits) may affect the accuracy of the CNN due to loss of precision, a training system may be configured to re-train the AI model, as explained further in the present disclosure. In some scenarios where the compression ratio is relatively low (e.g., quantization from 32 bits to 8 bits), for which the loss of performance of the CNN due to quantization is minimal, re-training of weights may not be needed. This is explained in detail below.
Variable compression ratios for different convolution layers in a CNN may be configured in various ways. For example, a first set of weights contained in a first convolution layer may have a higher compression ratio (lower quantization bits) than a second set of weights contained in a second convolution layer that succeeds the first convolution layer. In another example, the weights of a first subset of convolution layers of a CNN comprising a sequence of layers may have a higher compression ratio (lower quantization bits) than the weights of a second subset of convolution layers that succeed the first subset of convolution layers. In the example of an implementation of VGG-16, the weights of the first group of layers Conv1-3 may have a lower compression ratio (e.g., higher quantization bits, such as 3), whereas the second group of layers Conv4-5 may have a higher compression ratio (e.g., lower quantization bits, such as 1). In some examples, other configurations of higher compression ratio layers and lower compression ratio layers may be possible. Correspondingly, the weights of convolution layers having a higher compression ratio may be re-trained, whereas the weights of convolution layers having a lower compression ratio may not need to be re-trained without significant loss of the performance of the CNN. Advantageously, when the weights of certain convolution layers are not re-trained, fewer computing resources and/or less training data may be needed.
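As a sketch of how such a configuration might be expressed in code, following the VGG-16 example above (the group names and the re-train rule below are illustrative assumptions, not taken from the patent):

```python
# Mask bit-widths per VGG-16 layer group: early groups keep more bits
# (lower compression), later groups are compressed more aggressively.
MASK_BITS = {"conv1": 3, "conv2": 3, "conv3": 3, "conv4": 1, "conv5": 1}

# Assumed rule: only the heavily compressed groups are re-trained.
RETRAIN = {group: bits <= 2 for group, bits in MASK_BITS.items()}
```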
In some examples, the process 400 may further include quantizing the trained weights at 404, determining the output of the AI model based on the quantized weights at 406, determining a change of weights at 408, and updating the weights at 410. In some examples, in quantizing the weights at 404, the number of quantization levels may correspond to the hardware constraints of the AI chip so that the quantized weights can be uploaded to the AI chip for execution. In a non-limiting example, the quantized weights may be of 1-bit (binary value), 2-bit, 3-bit, 5-bit, or other suitable bit-widths, such as 8-bit. For example, the AI chip may include a CNN model whose structure corresponds to that of the hardware in the AI chip. In the case of 1-bit, the number of quantization levels will be two. In some scenarios, quantizing the weights to 1-bit may include determining a threshold to properly separate the weights into two groups, one below the threshold and one above the threshold, where each group takes one value, such as {1, −1}.
In some examples, quantizing the weights at 404 may include a dynamic fixed point conversion, in which the quantized weights are determined as a function of $n_{bit}$, the bit-size of the weights in the physical AI chip. For example, $n_{bit}$ may be 8-bit, 12-bit, etc. Other values may be possible.
In some examples, quantizing the weights at 404 may include determining the quantized weights based on the interval in which the values of the weights fall, where the intervals are defined depending on the value of $n_{bit}$. In a non-limiting example, when $n_{bit}=1$, the weights of a CNN model may be quantized into two quantization levels. In other words, the weight values may be divided into two intervals: the first interval is $[0, \infty)$, and the second interval is $(-\infty, 0)$. When $W_k \ge 0$, $W_Q = (W_k)_Q = (W_{mean})_{\text{shift-quantized}}$, where $W_k$ represents the weights for a kernel in a convolution layer of the CNN model, $W_{mean} = \text{mean}(\text{abs}(W_k))$, and the shift-quantization of a weight $w$ is determined from $|W|_{max}$, the maximum of the absolute values of the weights. Similarly, when $W_k < 0$, $W_Q = -(W_{mean})_{\text{shift-quantized}}$. The mean and maximum values are computed per convolution layer in the CNN model.
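The exact shift-quantization formula is not reproduced above. The sketch below is a minimal illustration assuming a common variant, in which a layer-wise power-of-two scale is derived from $|W|_{max}$ and values are rounded onto the resulting fixed-point grid; the function names are illustrative.

```python
import numpy as np

def shift_quantize(w, w_max: float, nbit: int = 8):
    # Assumed scheme: pick a layer-wise exponent (shift) so that w_max
    # fits in nbit fixed-point values, then round w onto that grid.
    shift = int(np.ceil(np.log2(w_max)))
    step = 2.0 ** (shift - (nbit - 1))
    return np.round(np.asarray(w) / step) * step

def quantize_1bit(wk):
    # The 1-bit case described above: two intervals split at zero, each
    # mapped to +/- the shift-quantized mean of |W_k|.
    wk = np.asarray(wk)
    w_mean = np.mean(np.abs(wk))
    w_q = shift_quantize(w_mean, np.max(np.abs(wk)))
    return np.where(wk >= 0, w_q, -w_q)
```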
In a non-limiting example, when $n_{bit}=2$, the intervals may be defined by $(-\infty, -W_{mean}/4)$, $[-W_{mean}/4, W_{mean}/4]$, and $(W_{mean}/4, \infty)$. Thus, the weights may be quantized into:
$W_Q = 0$, when $|W_k| < W_{mean}/4$;
$W_Q = (W_{mean})_{\text{shift-quantized}}$, when $W_k > W_{mean}/4$;
$W_Q = -(W_{mean})_{\text{shift-quantized}}$, when $W_k < -W_{mean}/4$.
It is appreciated that other variations may also be possible. For example, $W_{max}$ may be used instead of $W_{mean}$. Denominators other than the value of 4 may also be used.
In another non-limiting example, when $n_{bit}=3$, the intervals may be defined as shown below, and the weights quantized into:
$W_Q = 0$, when $|W_k| < W'_{mean}/2$;
$W_Q = (W'_{mean})_{\text{shift-quantized}}$, when $W'_{mean}/2 < W_k < 3W'_{mean}/2$;
$W_Q = (2W'_{mean})_{\text{shift-quantized}}$, when $3W'_{mean}/2 < W_k < 3W'_{mean}$;
$W_Q = (4W'_{mean})_{\text{shift-quantized}}$, when $W_k > 3W'_{mean}$;
$W_Q = -(W'_{mean})_{\text{shift-quantized}}$, when $-3W'_{mean}/2 < W_k < -W'_{mean}/2$;
$W_Q = -(2W'_{mean})_{\text{shift-quantized}}$, when $-3W'_{mean} < W_k < -3W'_{mean}/2$;
$W_Q = -(4W'_{mean})_{\text{shift-quantized}}$, when $W_k < -3W'_{mean}$.
It is appreciated that other variations may also be possible. For example, $W_{max}$ may be used instead of $W'_{mean}$. Denominators other than the values of 4 or 2 may also be used.
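A sketch of the 3-bit interval scheme above, reusing the shift_quantize sketch from earlier. It assumes $W'_{mean}$ is computed as the mean of $|W_k|$, which the text does not define explicitly.

```python
import numpy as np

def quantize_3bit(wk, shift_quantize):
    # Map each weight onto {0, ±W', ±2W', ±4W'} according to the
    # intervals listed above, then shift-quantize the selected levels.
    wk = np.asarray(wk)
    wm = np.mean(np.abs(wk))                          # assumed W'_mean
    levels = np.array([-4, -2, -1, 0, 1, 2, 4], dtype=float) * wm
    edges = np.array([-3, -1.5, -0.5, 0.5, 1.5, 3]) * wm
    idx = np.searchsorted(edges, wk)                  # interval index per weight
    return shift_quantize(levels[idx], np.max(np.abs(wk)))
```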
Alternatively, and/or additionally, quantizing the weights at 404 may also include a compressed fixed point conversion, where a weight value may be separated into a scalar and a mask: $W = \text{scalar} \times \text{mask}$. Here, a mask may include a $k \times k$ kernel, and each value in the mask may have a bit-width such as 1-bit, 2-bit, 3-bit, 5-bit, 8-bit, or other bit sizes. In some examples, a quantized weight may be represented by the product of a mask and an associated scalar. The mask may be selected to maximize use of the bit size of the kernel, where the scalar may be a maximum common denominator among all of the weights. In a non-limiting example, when $n_{bit}=5$ or above, $\text{scalar} = \min(\text{abs}(w_k))$ over all weights in the $k$th kernel.
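The mask construction rule itself is not shown above. The following is a minimal sketch under the stated scalar choice; the rounding of $w/\text{scalar}$ into the signed mask range is an assumption, as is the zero guard.

```python
import numpy as np

def compress_kernel(wk, mask_bits: int = 5):
    # scalar = min(|w|) over the kernel, per the text; rounding w/scalar
    # into the signed mask range is an assumed reconstruction rule.
    wk = np.asarray(wk, dtype=float)
    scalar = max(np.min(np.abs(wk)), 1e-12)   # guard against zero weights
    lo, hi = -(2 ** (mask_bits - 1)), 2 ** (mask_bits - 1) - 1
    mask = np.clip(np.round(wk / scalar), lo, hi).astype(int)
    return scalar, mask                        # W ≈ scalar × mask
```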
In some examples, the process 400 may repeat updating the weights of the CNN model in one or more iterations. In some examples, blocks 406, 408, and 410 may be implemented using a gradient descent method, in which a suitable loss function may be used. In a non-limiting example, a loss function $H(\cdot)$ may be defined over the predictions of the network, where $y_i$ is the prediction, e.g., the output of the CNN based on the $i$th training instance. In a non-limiting example, if the CNN output includes two image labels (e.g., dog or cat), then $y_i$ may have the value of 0 or 1. Here, $N$ is the number of training instances in the training data set. The probability $p(y_i)$ of a training instance being $y_i$ may be determined from the training. In other words, the loss function $H(\cdot)$ may be defined based on a sum of loss values over a plurality of training instances in the training data set, where the loss value of each of the plurality of training instances is a difference between an output of the CNN model for the training instance and a ground truth of the training instance.
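The exact form of $H(\cdot)$ is not reproduced here. For two labels with $y_i \in \{0, 1\}$ and predicted probability $p(y_i)$, a binary cross-entropy averaged over the $N$ training instances is one loss consistent with this description; the sketch below assumes that form.

```python
import numpy as np

def cross_entropy_loss(y, p, eps: float = 1e-12):
    # Per-instance losses over N training instances, scaled by 1/N;
    # y holds ground-truth labels (0 or 1), p the predicted probabilities.
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```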
In a non-limiting example, the training data 409 may include a plurality of training input images. The ground truth data may include information about one or more objects in the image, or about whether the image contains a class of objects, such as a cat, a dog, a human face, or a given person's face. Inferring the AI model may include generating a recognition result indicating which class to which the input image belongs. In the training process, such as 400, the loss function may be determined based on the image labels in the ground truth and the recognition result generated from the AI chip based on the training input image.
In some examples, gradient descent may be used to determine a change of weights
$\Delta W = f(W_Q^t)$
by minimizing the loss function $H(\cdot)$, where $W_Q^t$ stands for the quantized weights at iteration $t$. The process may update the weights from a previous iteration based on the change of weights, e.g., $W^{t+1} = W^t + \Delta W$, where $W^t$ and $W^{t+1}$ stand for the weights in the preceding iteration and in the current iteration, respectively. In some examples, the weights (or updated weights) in each iteration, such as $W^t$ and $W^{t+1}$, may be stored in floating point. The quantized weights $W_Q^t$ at each iteration $t$ may be stored in fixed point. In some examples, the gradient descent may include known methods, such as the stochastic gradient descent method. Processes 408 and 410 are further explained in the context of a backward propagation.
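A toy end-to-end sketch of this update rule: floating-point master weights persist across iterations while the quantized copy drives the forward pass. The quadratic loss below stands in for $H(\cdot)$, and quantize_1bit is the 1-bit scheme sketched earlier; none of this is the patent's own code.

```python
import numpy as np

def quantize_1bit(w):
    wm = np.mean(np.abs(w))
    return np.where(w >= 0, wm, -wm)   # two levels at +/- W_mean

rng = np.random.default_rng(0)
w = rng.normal(size=8)                 # W^t, kept in floating point
target = rng.normal(size=8)            # toy regression target

lr = 0.05
for t in range(200):
    wq = quantize_1bit(w)              # W_Q^t, fixed point in hardware
    grad = 2 * (wq - target)           # gradient of the toy loss at W_Q^t
    w = w - lr * grad                  # W^{t+1} = W^t + ΔW
```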
In each iteration, the process 400 may determine whether a stopping criterion has been met at 414. If the stopping criterion has been met, the process may store the updated weights of the CNN model at the current iteration at 416 for use by another process (e.g., 206). Otherwise, the process may proceed with another iteration.
In some examples, the process 400 may be implemented entirely on a desktop using a CPU or a GPU. Alternatively, certain operations in the process 400 may be implemented in a physical AI chip, where the trained weights or updated weights are uploaded inside the AI chip.
In some examples, the process 400 may combine re-training with variable compression schemes. For example, for a given convolution layer in the CNN, if the quantization bits exceed a threshold (high quantization bits), the process 400 may skip updating the weights for that given layer. In the backward propagation training process described above, in determining the change of weights at 408 and updating the weights at 410 in a layer-by-layer fashion, the process 400 may not need to determine the change of weights or update the weights for the layers with high quantization bits (low compression ratio). In the example above, the convolution layers whose weights are quantized at higher bits (lower compression ratio) still participate in the re-training process, except that no weights for those layers are updated. This speeds up the training process. In some examples, if all of the convolution layers in a CNN are quantized to bit-widths exceeding a threshold (e.g., all convolution layers have high quantization bits), then the entire re-training process 400 may be skipped.
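A sketch of the skip rule just described; the threshold value and the layer representation are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LayerConfig:
    name: str
    mask_bits: int
    update_weights: bool = True

HIGH_BITS_THRESHOLD = 8   # assumed cutoff for "high quantization bits"

def apply_skip_rule(layers: List[LayerConfig]) -> List[LayerConfig]:
    # Layers quantized at or above the threshold (low compression) still
    # run in forward/backward passes, but their weights are not updated.
    for layer in layers:
        layer.update_weights = layer.mask_bits < HIGH_BITS_THRESHOLD
    return layers
```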
In some examples, the process 700 may include accessing the input of a first convolution layer at 702 and determining the output of the first convolution layer at 704. For example, the first convolution layer may be any of the convolution layers in a CNN model that corresponds to a convolution layer (e.g., 102) in an AI chip. The output of the convolution may be stored in floating point. Accessing the input of the first convolution layer at 702 may include accessing the input data, if the first convolution layer is the first layer after the input in the CNN, or accessing the output of the preceding layer, if the first convolution layer is an intermediate layer. Determining the output of the first convolution layer at 704 may include executing a CNN model to produce an output at the first convolution layer. In a training process, determining the output of the convolution layer may be performed outside of a chip, e.g., in a CPU/GPU environment. Alternatively, determining the output of the convolution layer may be performed in an AI chip.
At 708, the process 700 may quantize the output of the first convolution layer to a range [0, α] supported by the activation layer of the AI chip.
Here, a value in $[0, \alpha]$ may be represented by a maximum number of bits in the activation layer, e.g., 5-bit, 10-bit, or other values. If an output value is in the range $[0, \alpha]$, then the quantization becomes a linear transformation. If an output value is less than zero or greater than $\alpha$, then the quantization clips the value at zero or $\alpha$, respectively. Here, the quantization of the activation layer limits the value of the output to the same limit in the hardware. In a non-limiting example, if the bit-width of an activation layer in an AI chip is 5 bits, then $[0, \alpha]$ may be represented by 5 bits, and accordingly the quantized value will be represented by 5 bits.
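A sketch of the clip-then-linear-map behavior described above; the uniform grid over $[0, \alpha]$ with $2^{n_{bit}}$ levels is an assumption about the linear transformation.

```python
import numpy as np

def quantize_activation(y, alpha: float, nbit: int = 5):
    # Clip the layer output to [0, alpha], then map it linearly onto
    # the 2**nbit representable levels of the activation layer.
    y = np.clip(np.asarray(y, dtype=float), 0.0, alpha)
    step = alpha / (2 ** nbit - 1)
    return np.round(y / step) * step
```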
Blocks 710 and 712 may be performed in a similar fashion to blocks 704 and 706. Further, the process 700 may repeat blocks 708-712 for one or more additional layers at 714. In some examples, the process 700 may quantize the output for all convolution layers in a CNN in a layer-by-layer fashion. In some examples, the process 700 may quantize the output of some convolution layers in a CNN model. For example, the process 700 may quantize the output of the last few convolution layers in the CNN. In some examples, the process 700 may be implemented in the forward propagation network 300.
In the training process with activation quantization, a loss function may similarly be used in determining a change of weights, where $y_i$ is the prediction of the network, e.g., the output of the CNN based on the $i$th training instance. In a non-limiting example, if the CNN output includes two image labels (e.g., dog or cat), then $y_i$ may have the value of 0 or 1. $N$ is the number of training instances in the training data set. The probability $p(y_i)$ of a training instance being $y_i$ may be determined from the training. In other words, the loss function $H(\cdot)$ may be defined based on a sum of loss values over a plurality of training instances in the training data set, where the loss value of each of the plurality of training instances is a difference between an output of the CNN model for the training instance and a ground truth of the training instance.
In some examples, gradient descent may be used to determine a change of weights
$\Delta W = f(W_Q^t)$
by minimizing the loss function $H(\cdot)$, where $W_Q^t$ stands for the quantized weights at iteration $t$; in other words, $W_Q^t = Q(W^t)$. The process may update the weights from a previous iteration based on the change of weights, e.g., $W^{t+1} = W^t + \Delta W$, where $W^t$ and $W^{t+1}$ stand for the weights in the preceding iteration and in the current iteration, respectively. In some examples, the weights (or updated weights) in each iteration, such as $W^t$ and $W^{t+1}$, may be stored in floating point. The quantized weights $W_Q^t$ at each iteration $t$ may be stored in fixed point. In some examples, the gradient descent may include known methods, such as a stochastic gradient descent method.
It is appreciated that variations of these embodiments may exist. For example, the compression schemes may be applicable to other types or architectures of neural networks and are not limited to a particular type, e.g., CNN. In some examples, the representation of the compressed neural network may contain all information required for decoding the parameters and weights, without requiring external information for their interpretation. The reduced neural network resulting from the compression may be directly used for inference. In some examples, the compressed neural network may be encoded and reconstructed (decoded) in order to perform inference. The various compression schemes may require the original training data, such as via a re-training process, to improve the performance. Alternatively, a compression scheme may not require the original training data, while using a higher-bit quantization.
An optional display interface 1030 may permit information from the bus 1000 to be displayed on a display device 1035 in visual, graphic, or alphanumeric format. An audio interface and audio output (such as a speaker) may also be provided. Communication with external devices may occur using various communication ports 1040, such as a transmitter and/or receiver, an antenna, an RFID tag, and/or short-range or near-field communication circuitry. A communication port 1040 may be attached to a communications network, such as the Internet, a local area network, or a cellular telephone data network.
The hardware may also include a user interface sensor 1045 that allows for receipt of data from input devices 1050 such as a keyboard, a mouse, a joystick, a touchscreen, a remote control, a pointing device, a video input device, and/or an audio input device, such as a microphone. Digital image frames may also be received from an image capturing device 1055, such as a video camera or still camera, that can either be built into or external to the system. Other environmental sensors 1060, such as a GPS system and/or a temperature sensor, may be installed on the system and communicatively accessible by the processor 1005, either directly or via the communication ports 1040. The communication ports 1040 may also communicate with the AI chip to upload data to or retrieve data from the chip. For example, a processing device on the network may be configured to perform the processes described in this document.
Optionally, the hardware may not need to include a memory, but instead programming instructions are run on one or more virtual machines or one or more containers on a cloud. For example, the various methods illustrated above may be implemented by a server on a cloud that includes multiple virtual machines, each virtual machine having an operating system, a virtual disk, virtual network and applications, and the programming instructions for implementing various functions in the robotic system may be stored on one or more of those virtual machines on the cloud.
Various embodiments described above may be implemented and adapted to various applications. For example, the AI chip having a CNN architecture may reside in an electronic mobile device. The electronic mobile device may use the built-in AI chip to generate a feature descriptor. In some scenarios, the mobile device may also use the feature descriptor to perform an image retrieval task such as described above.
The various systems and methods disclosed in this patent document provide advantages over the prior art, whether implemented standalone or combined. For example, by using a compressor described in various embodiments herein, the weights of a CNN may be quantized at a high compression ratio without significant loss of performance. This may reduce the memory space required for an AI task and also speed up the execution of the AI task on an AI chip. When the variable compression scheme and training processes are implemented for a CNN in an AI chip, with proper re-training of the CNN model under the constraints of fixed-point weights, the model's precision can be very close to that of the floating-point model, with far fewer bits used for the model weights. For example, for the VGG-16 model, the accuracy loss for using 1-bit coefficients is estimated to be about 1%.
Other advantages can be apparent to those skilled in the art from the foregoing specification. Accordingly, it will be recognized by those skilled in the art that changes, modifications, or combinations may be made to the above-described embodiments without departing from the broad inventive concepts of the invention. It should therefore be understood that the present solution is not limited to the particular embodiments described herein, but is intended to include all changes, modifications, and all combinations of various embodiments that are within the scope and spirit of the invention as defined in the claims.
Claims
1. A semiconductor comprising:
- a memory; and
- an embedded convolution neural network (CNN) comprising: a first convolution layer including a first set of weights stored in the memory; and a second convolution layer including a second set of weights stored in the memory;
- wherein the first set of weights are stored in the memory in a first bit-width and the second set of weights are stored in the memory in a second bit-width different from the first bit-width.
2. The semiconductor of claim 1, wherein the second convolution layer succeeds the first convolution layer in the CNN, and wherein the first bit-width is higher than the second bit-width.
3. The semiconductor of claim 2, wherein the CNN is a VGG neural network, and wherein the first convolution layer comprises a first plurality of convolution layers in the VGG neural network, and wherein the second convolution layer comprises a second plurality of convolution layers in the VGG neural network.
4. The semiconductor of claim 1, wherein the CNN is configured to be executed to perform an AI task based on image data stored in the memory and at least the first set of weights and the second set of weights by propagating the image data from the first convolution layer to the second convolution layer, and to present output of the AI task on an output device.
5. The semiconductor of claim 4, wherein the CNN is configured to perform the AI task by:
- generating feature descriptors of the image data;
- comparing the feature descriptors of the image data with reference feature descriptors; and
- generating the output of the AI task based on the comparing.
6. A system comprising:
- a processor; and
- non-transitory computer media containing programming instructions that, when executed, cause the processor to: train weights of an artificial intelligence (AI) model based at least on a training data set, wherein the trained weights of the AI model are stored in floating point, and wherein the AI model comprises at least a first convolution layer and a second convolution layer; quantize the weights of the AI model to a respective number of quantization levels corresponding to a maximum value of a respective convolution layer of an AI chip, wherein the quantized weights are stored in fixed point and include at least a first set of weights for the first convolution layer and a second set of weights for the second convolution layer, and wherein a number of quantization levels for the first set of weights is different from a number of quantization levels for the second set of weights; and upload the quantized weights to an AI chip capable of executing an AI task.
7. The system of claim 6 further comprising programming instructions configured to update the quantized weights of the AI model so that output of the AI model based at least on the updated weights are within a range of ground truth of the training data set.
8. The system of claim 7, wherein the programming instructions for updating the quantized weights of the AI model further comprise programming instructions configured to repeat, in one or more iterations until a stopping criterion is met, operations comprising:
- determining second output of the AI model based on the quantized weights of the AI model and the training data set;
- quantizing the second output of the AI model;
- determining a change of weights based on the quantized output of the AI model; and
- updating the quantized weights of the AI model based on the change of weights.
9. The system of claim 8, wherein the operation of updating the quantized weights of the AI model comprises operations comprising updating the first set of weights and not updating the second set of weights, wherein the number of quantization levels for the first convolution layer is lower than the number of quantization levels for the second convolution layer.
10. The system of claim 9, wherein the programming instructions for determining the change of weights of the AI model further comprise programming instructions configured to use a gradient descent method, wherein a loss function in the gradient descent method is based on a sum of loss values over a plurality of training instances in the training data set, wherein the loss value of each of the plurality of training instances is a difference between the quantized output of the AI model for the training instance and a ground truth of the training instance.
11. The system of claim 6, wherein the AI chip is configured to:
- execute the AI task to generate output of the AI task, wherein the quantized weights of the AI model are uploaded into the AI chip; and
- present the output of the AI task on an output device.
12. The system of claim 6, wherein the programming instructions for quantizing the weights of the AI model further comprise programming instructions configured to perform, in one or more iterations until a stopping criterion is met, operations comprising:
- quantizing weights of one or more convolution layers of the AI model;
- determining output of the one or more convolution layers of the AI model based on the quantized weights of the AI model and the training data set;
- determining a change of weights based on the output of the one or more convolution layers of the AI model; and
- updating the weights of the one or more convolution layers of the AI model based on the change of weights, wherein updating the weights comprises at least updating the first set of weights and not updating the second set of weights, wherein the number of quantization levels for the first convolution layer is lower than the number of quantization levels for the second convolution layer.
13. A method for performing an artificial intelligence (AI) task, the method comprising:
- providing input data to an AI semiconductor device including an embedded convolution neural network (CNN) comprising at least: a first convolution layer including a first set of weights stored in the memory; and a second convolution layer including a second set of weights stored in the memory;
- causing the AI semiconductor device to perform the AI task based on the input data and at least the first set of weights and the second set of weights by propagating the input data from the first convolution layer to the second convolution layer; and
- presenting output of the AI task on an output device;
- wherein the first set of weights are stored in the memory in a first bit-width and the second set of weights are stored in the memory in a second bit-width different from the first bit-width.
14. The method of claim 13, wherein the second convolution layer succeeds the first convolution layer in the CNN, and wherein the first bit-width is higher than the second bit-width.
15. The method of claim 13, wherein the CNN is a VGG neural network.
16. The method of claim 15, wherein the first convolution layer comprises a first plurality of convolution layers in the VGG neural network, and wherein the second convolution layer comprises a second plurality of convolution layers in the VGG neural network.
17. The method of claim 13, wherein the input data is image data captured from an image capturing device, and wherein performing the AI task comprises:
- generating feature descriptors of the image data;
- comparing the feature descriptors of the image data with reference feature descriptors; and
- generating the output of the AI task based on the comparing.
18. The method of claim 13 further comprising:
- training weights of the CNN based at least on a training data set, wherein the trained weights of the CNN are stored in floating point;
- quantizing the trained weights of the CNN to a respective number of quantization levels corresponding to a maximum value of a convolution layer of an AI chip, wherein the quantized weights are stored in fixed point;
- updating the quantized weights of the CNN so that output of the CNN based on the updated weights is within a range of ground truth of the training data set; and
- uploading the updated weights of the CNN to the semiconductor device for performing the AI task.
19. A semiconductor comprising:
- a memory; and
- an embedded convolution neural network (CNN) comprising a plurality of weights in a plurality of convolution layers, the CNN is configured to: perform an artificial intelligence (AI) task based on input data and the plurality of weights in the plurality of convolution layers of the CNN; and provide output of the AI task;
- wherein at least a portion of the plurality of weights are obtained in a training system configured to: train weights of the CNN based at least on a training data set, wherein the trained weights of the CNN are stored in floating point; quantize the trained weights of the CNN to a respective number of quantization levels corresponding to a maximum value of a convolution layer of the CNN, wherein the quantized weights are stored in fixed point; update the quantized weights of the CNN so that output of the CNN based on the updated weights is within a range of ground truth of the training data set; and upload the updated weights to the plurality of convolution layers of the embedded CNN of the semiconductor.
20. The semiconductor of claim 19, wherein the embedded CNN is configured to perform the AI task by:
- generating feature descriptors of input image data based on the plurality of weights;
- comparing the feature descriptors of the image data with reference feature descriptors; and
- generating the output of the AI task based on the comparing.
Type: Application
Filed: Sep 27, 2019
Publication Date: Sep 24, 2020
Applicant: Gyrfalcon Technology Inc. (Milpitas, CA)
Inventors: Lin Yang (Milpitas, CA), Bin Yang (San Jose, CA), Hua Zhou (San Jose, CA), Xiaochun Li (San Ramon, CA), Wenhan Zhang (Mississauga), Qi Dong (San Jose, CA), Yequn Zhang (San Jose, CA), Yongxiong Ren (San Jose, CA), Patrick Dong (San Jose, CA)
Application Number: 16/586,500