In the world of deep learning, ‘Bigger is Better’ is the norm: bigger models tend to perform better. Earlier this year NVIDIA announced MegatronLM, a monster language model that packs 8.3 billion parameters. Stored as 32-bit (4-byte) floating-point numbers, the parameters occupy about 33 GB of disk space.

512 V100 GPUs ran continuously for more than 9 days to train the model. The energy expended on this training is 3X what an average American consumes in a year. The energy spent in training the BERT language model is equivalent to the fuel spent on a single transatlantic flight. A direct effect of building such massive models is the carbon footprint they generate.

There are probably a little over 100 million processors in the cloud today. However, there are close to 3 billion smartphones and several billion IoT devices out there. A standard mobile phone has enough processing power to carry out model inference. These are the ‘Edge’ devices, and most future computing will happen on them. However, these Edge devices cannot host such massive deep learning models to carry out local prediction.

There is also the question of security, privacy, and latency when hitting an API in the cloud to carry out a prediction as opposed to doing it locally on the device. Cost may also be a factor in the long run in choosing to run a model locally as opposed to the cloud.

For all the above reasons it is extremely important to come up with smaller models that can easily fit into Edge devices so that inferences can happen locally. Recent research has brought up innovative ideas to reduce the size of the models via compression and other techniques.

It is logical to assume that smaller models will have reduced performance, i.e. lower accuracy. This is no doubt true, but the reduction in accuracy may not be that significant.

The chart below shows the accuracy of image models versus the number of model parameters. The size of the bubble represents the model size. It is quite evident that the bigger models do not necessarily deliver higher accuracy.

For example, smaller models such as MobileNet (20 MB in size) do not perform worse than, say, VGG-16, even though they are around 25X smaller.

In this post we will focus on compression techniques that have been used of late to reduce model size. Compressing a model not only reduces its size so that it consumes less memory but also makes it faster. Compression reduces either the number of parameters (weights) or their precision.

The following three techniques are popular and have been used to compress large language and image models.

1. Quantization

Quantization means reducing the numerical precision of the model parameters or weights. Typically the weights are stored as 32-bit floating-point numbers. However, for many applications, this level of precision may not be necessary. Quantization maps each floating-point weight to a fixed-precision integer with fewer bits. Quantization can be done post-training or during training. Post-training quantization can result in 4X smaller models with < 2% accuracy loss. Quantization during training can result in 8X-16X smaller models. Quantization can also be applied to the activations of the model.
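To make the mapping concrete, here is a minimal sketch of post-training quantization of a handful of weights to 8-bit integers. It assumes a simple symmetric scheme with a single scale factor for all weights; the function names `quantize_int8` and `dequantize` and the sample values are hypothetical, and real toolkits use more elaborate schemes (per-channel scales, zero points, calibration data).

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale.

    Assumption: symmetric quantization, where the largest absolute weight
    maps to 127 and zero maps to zero.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Illustrative weights only, not taken from any real model.
weights = [0.42, -1.27, 0.03, 0.88, -0.56]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the per-weight error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs 8 bits instead of 32, which is where the 4X size reduction for post-training quantization comes from; the price is a rounding error of at most half the scale per weight.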

2. Pruning

Pruning involves removing the parameters or weights that contribute least to overall model accuracy. 

Weight pruning can be used to remove individual connection weights. Weight magnitude pruning removes the weights with the smallest magnitude, using magnitude as a measure of connection importance.
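Weight magnitude pruning can be sketched in a few lines. The example below assumes that "removing" a weight means setting it to zero, and that we prune a fixed fraction of the weights; the function name `magnitude_prune` and the `sparsity` parameter are hypothetical names chosen for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from least to most important, judged by absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Illustrative weights only.
weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
```

The zeroed weights only save space and time if the result is stored and executed in a sparse format, so in practice pruning is usually paired with such a format and with some fine-tuning to recover accuracy.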

Neuron pruning removes entire neurons. If we can rank the neurons in the network based on their contribution, then the lower contributing neurons can be removed.
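A rough sketch of neuron pruning follows. The ranking criterion is an assumption made for this example: each neuron is scored by the sum of the absolute values of its incoming weights (its L1 norm), which is only one of several possible contribution measures. The name `prune_neurons` and the sample layer are hypothetical.

```python
def prune_neurons(weight_rows, keep):
    """Keep the `keep` highest-scoring neurons of a layer.

    weight_rows[j] holds the incoming weights of neuron j.
    Assumption: a neuron's contribution is approximated by the L1 norm
    of its incoming weights.
    """
    scores = [sum(abs(w) for w in row) for row in weight_rows]
    ranked = sorted(range(len(weight_rows)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    return kept, [weight_rows[j] for j in kept]


# A layer of four neurons with three inputs each (illustrative values).
layer = [
    [0.5, -0.4, 0.3],
    [0.01, 0.02, -0.01],
    [-0.9, 0.1, 0.2],
    [0.05, 0.0, -0.03],
]
kept, smaller_layer = prune_neurons(layer, keep=2)
```

Unlike individual weight pruning, this shrinks the weight matrix itself. In a full network, the inputs of the next layer that correspond to the removed neurons would have to be dropped as well.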

3. Knowledge Distillation

Although knowledge distillation is an indirect way of compressing a model, it achieves the same result, i.e. a smaller model. In this technique, a larger, already trained model called the ‘teacher’ is used to train a smaller model called the ‘student’. The goal is for the student to reproduce the output distribution of the teacher. This is achieved by transferring knowledge from the teacher to the student: the student is trained by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model.
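The loss described above can be sketched as a cross-entropy between the teacher's and the student's class probabilities. The temperature parameter, which softens both distributions, is an assumption borrowed from common distillation practice and is not mentioned in this post; the function names and logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into class probabilities, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy with the teacher's probabilities as the target.

    Assumption: both distributions use the same temperature.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Illustrative logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # disagrees with the teacher

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A student whose predictions resemble the teacher's gets a lower loss, so minimizing this quantity during training pulls the student's distribution toward the teacher's. In practice this term is often combined with the usual loss on the true labels.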

Recent successes in compressing language models are evident from the availability of many smaller transformer models based on BERT: ALBERT (Google and Toyota), DistilBERT (Huggingface), TinyBERT (Huawei), MobileBERT and Q8BERT (Quantized 8Bit BERT).

Some of the popular tools used to achieve deep learning model compression are:

  1. TensorFlow Lite from Google
  2. Core ML from Apple
  3. Caffe2Go from Facebook
  4. Neural Network Distiller from Intel AI Lab

It is good to see the discussion and research on model compression techniques. This work has resulted in a push to build smaller models, especially to cater to Edge devices. It can give a much-needed fillip to Edge computing, which is the future of computing. It also has the added advantage of better resource utilization, since mobile phone processors will be put to use much more than they are currently.