Why 0.9? Towards Better Momentum Strategies in Deep Learning.

Thoughts and Theory

How more sophisticated momentum strategies can make deep learning less painful.

10 min read · Feb 26, 2021

Momentum is a widely used strategy for accelerating the convergence of gradient-based optimization techniques. It was designed to speed up learning in directions of low curvature without becoming unstable in directions of high curvature. In deep learning, most practitioners set the momentum coefficient to 0.9 and never tune it further (it is the default value in many popular deep learning packages). However, there is no evidence that this choice is universally well-behaved.
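
As a quick refresher, the classical (heavy-ball) momentum update accumulates an exponential moving average of past gradients and steps along that smoothed direction. Below is a minimal NumPy sketch of this update; the function and variable names are illustrative, not taken from any particular library.

```python
import numpy as np

def sgd_momentum_step(params, grad, velocity, lr=0.01, beta=0.9):
    """One SGD-with-momentum (heavy-ball) update.

    v_t     = beta * v_{t-1} + g_t
    theta_t = theta_{t-1} - lr * v_t
    """
    velocity = beta * velocity + grad      # smooth gradients across steps
    params = params - lr * velocity        # move along the accumulated direction
    return params, velocity

# Toy usage on the quadratic loss 0.5 * ||theta||^2, whose gradient is theta itself
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(100):
    theta, v = sgd_momentum_step(theta, grad=theta, velocity=v)
```

The coefficient beta controls how much of the past gradient history is retained; beta = 0.9 is the ubiquitous default this post questions.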

Within this post, we overview recent research indicating that decaying the value of momentum throughout training can aid the optimization process. In particular, we recommend Demon (Decaying Momentum), a novel strategy for momentum decay. To support this recommendation, we conduct a large-scale analysis of momentum decay in comparison to other popular optimization strategies, showing that momentum decay with Demon is practically useful.
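
As a preview of the schedule discussed later, here is a minimal sketch of the Demon decay rule, which decays the quantity β/(1 − β) linearly from its initial value to zero over the course of training. The function and variable names below are illustrative, and the snippet assumes a fixed, known total number of training steps.

```python
def demon_momentum(step, total_steps, beta_init=0.9):
    """Demon schedule: decay beta_t / (1 - beta_t) linearly to zero.

    Setting beta_t / (1 - beta_t) = (1 - t/T) * beta_init / (1 - beta_init)
    and solving for beta_t gives the closed form below.
    """
    remaining = max(1.0 - step / total_steps, 0.0)   # fraction of training left
    return (beta_init * remaining) / ((1.0 - beta_init) + beta_init * remaining)

# Hypothetical training loop: beta starts at 0.9 and decays toward 0.0
T = 10_000
for t in range(T):
    beta_t = demon_momentum(t, T)
    # ... pass beta_t as the momentum (or Adam beta1) coefficient for this step ...
```

The appeal of this schedule is that it adds essentially no computational overhead and no new hyperparameters beyond the initial momentum value, which can be left at the usual 0.9.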

Overview

This post will begin with a summary of relevant background on optimization in deep learning, highlighting the current go-to techniques for training deep models. Following this introduction, the Demon momentum decay strategy will be introduced and motivated. Finally, we will conclude with an extensive empirical analysis of Demon in comparison to a wide range of popular optimization strategies. Overall, we aim to demonstrate that significant benefit can be gained by developing better strategies for handling the momentum parameter in deep learning.

For any deep learning practitioner, it is no surprise that training a model can be computationally expensive. When the hyperparameter tuning process is taken into account, the computational expense of model training is even further exacerbated. For example, some state-of-the-art language models can cost millions of dollars to train on public cloud resources when…
