Optimizers: Lion vs Adam (2024)


Deep Learning is a subfield of machine learning that lets machines process data in a way loosely inspired by the human brain. The backbone of Deep Learning is a network of nodes connected to one another in layers, and a stack of these layers forms a neural network. Input data passes through several layers and is progressively refined to make accurate predictions. Ideally, data (features) enters through an Input Layer and comes out of an Output Layer, but the bulk of the processing and fine-tuning happens between the two, in the Hidden Layers. In this post, I am focusing on the algorithms responsible for that fine-tuning during training. These algorithms are called Optimizers.

Optimizers are algorithms that adjust certain attributes of your neural network, such as weights, biases, and learning rates, in order to reduce losses. That sentence leans on four technical terms from the Deep Learning world, so let's try to understand what each of them means.

1. Weights: They control the strength of the connection between two consecutive nodes. They help decide how much one layer will affect the next layer. This helps us understand how the input layer has contributed to the results provided by the output layer.

2. Biases: These are learned offsets that shift the level at which an activation function is triggered; the activation function is what decides whether a neuron fires or not. A bias acts like the constant term in a linear equation: an additional parameter that adjusts the output.

3. Learning Rate: This is a hyperparameter that controls how large each update step is, i.e., how quickly the model's weights and biases change in response to new training data. Too large a learning rate and training overshoots; too small and it crawls.

4. Losses: A loss measures how far off our predicted value is from the target value. It helps us gauge how well the model is doing.

Optimizers are used to minimize these differences, the losses, by adjusting the parameters we discussed above.
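
To make these four terms concrete, here is a minimal, hypothetical sketch in plain Python: a single "neuron" with one weight and one bias, a squared-error loss, and plain gradient descent playing the role of the optimizer. All names (`w`, `b`, `lr`) are illustrative, not from any particular library.

```python
# Minimal sketch: one weight, one bias, squared-error loss, plain gradient descent.
x, target = 2.0, 10.0   # a single training example and its target value
w, b = 0.5, 0.0         # weight and bias: the parameters the optimizer adjusts
lr = 0.05               # learning rate: how large each update step is

for step in range(100):
    pred = w * x + b             # forward pass through one "neuron"
    loss = (pred - target) ** 2  # loss: how far the prediction is from the target

    # Gradients of the loss with respect to each parameter
    grad_w = 2 * (pred - target) * x
    grad_b = 2 * (pred - target)

    # The optimizer step: move each parameter against its gradient, scaled by lr
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b, loss)  # the loss shrinks toward zero as w and b are tuned
```

Every optimizer discussed below follows this same loop; they differ only in how they turn the gradients into the actual update.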

There are several types of Optimizers to pick from; some commonly used ones are the Gradient Descent Optimizer, the Adam Optimizer, and Stochastic Gradient Descent with Momentum. As a beginner, I always assumed that simply training for more epochs would yield better results, but that is not true. The optimizer needs to be picked with the goal of training in mind, and it should suit the amount of data we plan to feed the model (a minimal example of swapping optimizers appears below). We are going to discuss one such optimizer, "Adam".
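
As a quick illustration of what "picking an optimizer" looks like in practice, here is a sketch assuming PyTorch (which the post itself does not mention); the tiny model, dummy data, and hyperparameters are placeholders.

```python
# Sketch (assumes PyTorch): swapping optimizers on the same toy model.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)          # a tiny one-layer network
loss_fn = nn.MSELoss()

# Pick one; which works best depends on the data and the model.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

x = torch.randn(32, 10)           # a batch of dummy inputs
y = torch.randn(32, 1)            # dummy targets

for epoch in range(20):
    optimizer.zero_grad()         # clear gradients from the previous step
    loss = loss_fn(model(x), y)   # forward pass + loss
    loss.backward()               # backpropagate to compute gradients
    optimizer.step()              # let the optimizer adjust weights and biases
```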

Adam is an abbreviation for Adaptive Moment Estimation. It combines ideas from two other optimizers, Stochastic Gradient Descent with Momentum (SGD with momentum) and RMSprop. To understand what these two algorithms bring to the table, imagine a hill whose lowest point we are trying to reach; the lowest point is the goal because it corresponds to low losses, and plain Gradient Descent is what walks us downhill. On the way down we may have to cross bumps as well, and RMSprop decides how big our steps should be: neither too big nor too small. With SGD with momentum, we work on smaller portions of the hill at a time and use the built-up momentum to keep us pointed in the right direction. Adam combines RMSprop's per-parameter adaptation of the learning rate with momentum's ability to keep moving in the right direction; its primary focus is adjusting learning rates to improve a model's accuracy. While Adam does a good job on clean data, it does less well on noisy data: the adaptive learning rates can fluctuate sharply when the gradients are noisy. Researchers might have found just the right solution for that.
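
For the curious, here is a rough sketch of the standard Adam update for a single parameter array, written in NumPy. The function name and default values are illustrative, not taken from any particular library.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array (standard formulation)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients (the momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients (the RMSprop part)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the early steps (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter adaptive step: large squared gradients shrink the effective learning rate
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

Note that Adam keeps two state arrays per parameter (`m` and `v`), which is part of its memory cost; this is relevant to the comparison with Lion below.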

While Adam might seem like an ideal optimizer despite its drawbacks, researchers have recently come up with a new optimizer called Lion (EvoLved Sign Momentum) that addresses several of Adam's disadvantages. The algorithm was discovered by Google Brain along with the University of California, Los Angeles (UCLA). It has proven to be better than Adam in several ways. The Lion optimizer focuses on tracking momentum while leveraging the sign operation, which reduces each update to a fixed-magnitude step in the chosen direction, so noisy gradients cannot inflate the step size. Because it keeps only a single momentum buffer per parameter, its simplicity also makes it memory efficient. But don't let that simplicity make you question its accuracy: in several instances it has proven to perform better than Adam.
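
For comparison with the Adam sketch above, here is a rough sketch of the Lion update rule as described in the published algorithm; the function name and hyperparameter defaults are again illustrative.

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update for a single parameter array (sketch of the published rule)."""
    c = beta1 * m + (1 - beta1) * grad    # interpolate stored momentum with the current gradient
    # sign(): every coordinate moves by exactly lr, regardless of how noisy or large the gradient is
    param = param - lr * (np.sign(c) + wd * param)
    m = beta2 * m + (1 - beta2) * grad    # update the single momentum buffer (Lion's only state)
    return param, m
```

Compared with the Adam sketch, Lion stores one state array per parameter instead of two, which is where its memory savings come from.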

In conclusion, selecting the right Optimizer depends on more than one factor. Different datasets call for different Optimizers, and a good deal of trial and error helps us understand how they behave and make better choices.
