# Optimizers

The various optimizers that you can use to tune your parameters.

struct dynet::SimpleSGDTrainer
#include <training.h>

This trainer performs stochastic gradient descent, the go-to optimization procedure for neural networks. In the standard setting, the learning rate at epoch $$t$$ is $$\eta_t=\frac{\eta_0}{1+\eta_{\mathrm{decay}}t}$$.

Reference : reference needed

Inherits from dynet::Trainer

Public Functions

dynet::SimpleSGDTrainer::SimpleSGDTrainer(ParameterCollection & m, real learning_rate = 0.1)

Constructor.

Parameters
• m: ParameterCollection to be trained
• learning_rate: Initial learning rate

struct dynet::CyclicalSGDTrainer
#include <training.h>

Cyclical learning rate SGD.

This trainer performs stochastic gradient descent with a cyclical learning rate as proposed in Smith, 2015.

This uses a triangular function with optional exponential decay.

More specifically, at each update, the learning rate $$\eta$$ is updated according to :

$$\begin{split} \text{cycle} &= \left\lfloor 1 + \frac{\texttt{it}}{2 \times\texttt{step_size}} \right\rfloor\\ x &= \left\vert \frac{\texttt{it}}{\texttt{step_size}} - 2 \times \text{cycle} + 1\right\vert\\ \eta &= \eta_{\text{min}} + (\eta_{\text{max}} - \eta_{\text{min}}) \times \max(0, 1 - x) \times \gamma^{\texttt{it}}\\ \end{split}$$
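As a rough illustration, the three lines of the schedule translate directly into a standalone C++ function. The name `cyclical_learning_rate` is hypothetical and the code mirrors the formula rather than DyNet's source.

```cpp
#include <algorithm>
#include <cmath>

// Cyclical (triangular) learning rate at iteration `it`, following the
// formula term by term. The function name is an assumption for illustration.
double cyclical_learning_rate(unsigned long it, double eta_min, double eta_max,
                              double step_size, double gamma) {
  const double ratio = static_cast<double>(it) / step_size;
  // cycle = floor(1 + it / (2 * step_size))
  const double cycle = std::floor(1.0 + ratio / 2.0);
  // x = |it / step_size - 2 * cycle + 1|
  const double x = std::fabs(ratio - 2.0 * cycle + 1.0);
  // eta = eta_min + (eta_max - eta_min) * max(0, 1 - x) * gamma^it
  return eta_min + (eta_max - eta_min) * std::max(0.0, 1.0 - x) *
                       std::pow(gamma, static_cast<double>(it));
}
```

With `gamma` equal to 1 the triangle is undecayed: the rate starts at `eta_min`, reaches `eta_max` after `step_size` iterations, and returns to `eta_min` after `2 * step_size` iterations.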

Inherits from dynet::Trainer

Public Functions

dynet::CyclicalSGDTrainer::CyclicalSGDTrainer(ParameterCollection & m, float learning_rate_min = 0.01, float learning_rate_max = 0.1, float step_size = 2000, float gamma = 0.0, float edecay = 0.0)

Constructor.

Parameters
• m: ParameterCollection to be trained
• learning_rate_min: Lower learning rate
• learning_rate_max: Upper learning rate
• step_size: Half-period of the triangular function in number of iterations (not epochs); per the formula above, a full cycle spans 2 x step_size iterations. According to the original paper, this should be set to around (2-8) x (training iterations in epoch)
• gamma: Learning rate upper bound decay parameter
• edecay: Learning rate decay parameter. Ideally you shouldn’t use this with cyclical learning rate since decay is already handled by $$\gamma$$

struct dynet::MomentumSGDTrainer
#include <training.h>

This is a modified version of the SGD algorithm with momentum to stabilize the gradient trajectory. The modified gradient is $$\theta_{t+1}=\mu\theta_{t}+\nabla_{t+1}$$ where $$\mu$$ is the momentum.

Reference : reference needed

Inherits from dynet::Trainer

Public Functions

dynet::MomentumSGDTrainer::MomentumSGDTrainer(ParameterCollection & m, real learning_rate = 0.01, real mom = 0.9)

Constructor.

Parameters
• m: ParameterCollection to be trained
• learning_rate: Initial learning rate
• mom: Momentum

struct dynet::AdagradTrainer
#include <training.h>

The Adagrad algorithm assigns a different learning rate to each parameter according to the following formula: $$\delta_\theta^{(t)}=-\frac{\eta_0}{\sqrt{\epsilon+\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}\nabla_\theta^{(t)}$$
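The per-parameter scaling can be sketched as follows in standalone C++, assuming the usual square-root form of the denominator. Note one deliberate assumption: the sketch adds the current squared gradient to the running sum before dividing (a common convention), which differs by one index from a sum that stops at $$t-1$$. The name `adagrad_step` is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One Adagrad step. `sq_grad_sum` keeps the running sum of squared
// gradients per parameter and must match `theta` and `grad` in length.
void adagrad_step(std::vector<double>& theta, std::vector<double>& sq_grad_sum,
                  const std::vector<double>& grad, double learning_rate,
                  double eps) {
  for (std::size_t i = 0; i < theta.size(); ++i) {
    // Assumption: include the current gradient in the accumulated sum.
    sq_grad_sum[i] += grad[i] * grad[i];
    // Each parameter gets its own effective rate eta_0 / sqrt(eps + sum).
    theta[i] -= learning_rate * grad[i] / std::sqrt(eps + sq_grad_sum[i]);
  }
}
```

Because the sum only grows, the effective learning rate of a frequently updated parameter shrinks over time.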

Reference : Duchi et al., 2011

Inherits from dynet::Trainer

Public Functions

dynet::AdagradTrainer::AdagradTrainer(ParameterCollection & m, real learning_rate = 0.1, real eps = 1e-20)

Constructor.

Parameters
• m: ParameterCollection to be trained
• learning_rate: Initial learning rate
• eps: Bias parameter $$\epsilon$$ in the adagrad formula

struct dynet::AdadeltaTrainer
#include <training.h>

The AdaDelta optimizer is a variant of Adagrad where $$\frac{\eta_0}{\sqrt{\epsilon+\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}$$ is replaced by $$\frac{\sqrt{\epsilon+\sum_{i=0}^{t-1}\rho^{t-i-1}(1-\rho)(\delta_\theta^{(i)})^2}}{\sqrt{\epsilon+\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}$$, hence eliminating the need for an initial learning rate.

Inherits from dynet::Trainer

Public Functions

dynet::AdadeltaTrainer::AdadeltaTrainer(ParameterCollection & m, real eps = 1e-6, real rho = 0.95)

Constructor.

Parameters
• m: ParameterCollection to be trained
• eps: Bias parameter $$\epsilon$$ in the adagrad formula
• rho: Update parameter for the moving average of updates in the numerator

struct dynet::RMSPropTrainer
#include <training.h>

RMSProp optimizer.

The RMSProp optimizer is a variant of Adagrad where the sum of squared previous gradients is replaced with a moving average with parameter $$\rho$$.
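One common way to write this moving average is sketched below in standalone C++. The exact recurrence (weighting the old average by $$\rho$$ and the new squared gradient by $$1-\rho$$) and the placement of $$\epsilon$$ inside the square root are assumptions for illustration and may differ from DyNet's implementation. The name `rmsprop_step` is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One RMSProp step. `sq_grad_avg` holds the moving average of squared
// gradients and must match `theta` and `grad` in length.
void rmsprop_step(std::vector<double>& theta, std::vector<double>& sq_grad_avg,
                  const std::vector<double>& grad, double learning_rate,
                  double eps, double rho) {
  for (std::size_t i = 0; i < theta.size(); ++i) {
    // Assumed recurrence: avg <- rho * avg + (1 - rho) * grad^2.
    sq_grad_avg[i] = rho * sq_grad_avg[i] + (1.0 - rho) * grad[i] * grad[i];
    // Scale the step by the root of the averaged squared gradient.
    theta[i] -= learning_rate * grad[i] / std::sqrt(eps + sq_grad_avg[i]);
  }
}
```

Unlike a plain running sum, the average forgets old gradients, so the effective learning rate does not shrink monotonically.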

Reference : reference needed

Inherits from dynet::Trainer

Public Functions

dynet::RMSPropTrainer::RMSPropTrainer(ParameterCollection & m, real learning_rate = 0.1, real eps = 1e-20, real rho = 0.95)

Constructor.

Parameters
• m: ParameterCollection to be trained
• learning_rate: Initial learning rate
• eps: Bias parameter $$\epsilon$$ in the adagrad formula
• rho: Update parameter for the moving average (rho = 0 is equivalent to using Adagrad)

struct dynet::AdamTrainer
#include <training.h>

The Adam optimizer is similar to RMSProp but uses unbiased estimates of the first and second moments of the gradient.
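For a single scalar parameter, the update described in the Adam paper can be sketched as below. This is standalone C++ following the paper's formulation, not DyNet's source; `AdamState` and `adam_step` are hypothetical names, and the defaults simply mirror the constructor defaults documented on this page.

```cpp
#include <cmath>

// Running state for one scalar parameter (hypothetical helper type).
struct AdamState {
  double m = 0.0;   // moving average of the gradient (first moment)
  double v = 0.0;   // moving average of the squared gradient (second moment)
  unsigned t = 0;   // number of updates performed so far
};

// One Adam step; returns the updated parameter value.
double adam_step(double theta, double grad, AdamState& s,
                 double learning_rate = 0.001, double beta_1 = 0.9,
                 double beta_2 = 0.999, double eps = 1e-8) {
  s.t += 1;
  s.m = beta_1 * s.m + (1.0 - beta_1) * grad;
  s.v = beta_2 * s.v + (1.0 - beta_2) * grad * grad;
  // Bias correction: both averages start at zero, so rescale them.
  const double m_hat = s.m / (1.0 - std::pow(beta_1, static_cast<double>(s.t)));
  const double v_hat = s.v / (1.0 - std::pow(beta_2, static_cast<double>(s.t)));
  return theta - learning_rate * m_hat / (std::sqrt(v_hat) + eps);
}
```

With a constant gradient the bias-corrected ratio is close to 1 in magnitude, so each step moves the parameter by roughly `learning_rate`.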

Reference : Adam: A Method for Stochastic Optimization

Inherits from dynet::Trainer

Public Functions

dynet::AdamTrainer::AdamTrainer(ParameterCollection & m, float learning_rate = 0.001, float beta_1 = 0.9, float beta_2 = 0.999, float eps = 1e-8)

Constructor.

Parameters
• m: ParameterCollection to be trained
• learning_rate: Initial learning rate
• beta_1: Moving average parameter for the mean
• beta_2: Moving average parameter for the variance
• eps: Bias parameter $$\epsilon$$

struct dynet::AmsgradTrainer
#include <training.h>

The AMSGrad optimizer is similar to Adam, which uses unbiased estimates of the first and second moments of the gradient. However, AMSGrad keeps the maximum of all second-moment estimates seen so far and uses that instead of the current second-moment estimate.
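The difference from Adam comes down to one extra piece of state, the running maximum of the second-moment average. A scalar sketch in standalone C++ follows; `AmsgradState` and `amsgrad_step` are hypothetical names, and applying Adam-style bias correction to the maximum is an assumption made here for symmetry with the description above, not a statement about DyNet's code.

```cpp
#include <algorithm>
#include <cmath>

// Running state for one scalar parameter (hypothetical helper type).
struct AmsgradState {
  double m = 0.0;      // moving average of the gradient
  double v = 0.0;      // moving average of the squared gradient
  double v_max = 0.0;  // largest value of v seen so far
  unsigned t = 0;      // number of updates performed so far
};

// One AMSGrad step; returns the updated parameter value.
double amsgrad_step(double theta, double grad, AmsgradState& s,
                    double learning_rate = 0.001, double beta_1 = 0.9,
                    double beta_2 = 0.999, double eps = 1e-8) {
  s.t += 1;
  s.m = beta_1 * s.m + (1.0 - beta_1) * grad;
  s.v = beta_2 * s.v + (1.0 - beta_2) * grad * grad;
  // Key difference from Adam: never let the denominator term decrease.
  s.v_max = std::max(s.v_max, s.v);
  // Assumed bias correction, applied as in the Adam sketch.
  const double m_hat = s.m / (1.0 - std::pow(beta_1, static_cast<double>(s.t)));
  const double v_hat = s.v_max / (1.0 - std::pow(beta_2, static_cast<double>(s.t)));
  return theta - learning_rate * m_hat / (std::sqrt(v_hat) + eps);
}
```

After a large gradient followed by a zero gradient, `v` decays but `v_max` stays put, which is the behavior that distinguishes AMSGrad from Adam.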

Reference : On the Convergence of Adam and Beyond

Inherits from dynet::Trainer

Public Functions

dynet::AmsgradTrainer::AmsgradTrainer(ParameterCollection & m, float learning_rate = 0.001, float beta_1 = 0.9, float beta_2 = 0.999, float eps = 1e-8)

Constructor.

Parameters
• m: ParameterCollection to be trained
• learning_rate: Initial learning rate
• beta_1: Moving average parameter for the mean
• beta_2: Moving average parameter for the variance
• eps: Bias parameter $$\epsilon$$

struct dynet::EGTrainer
#include <training.h>

Exponentiated gradient optimizer with momentum and cyclical learning rate.

FIXME

Reference : FIXME

Inherits from dynet::Trainer

struct dynet::Trainer
#include <training.h>

General trainer struct.

Public Functions

dynet::Trainer::Trainer(ParameterCollection & m, real learning_rate)

General constructor for a Trainer.

Parameters
• m: ParameterCollection to be trained
• learning_rate: Initial learning rate

void dynet::Trainer::update()

Update parameters.

Update the parameters according to the appropriate update rule

void dynet::Trainer::update(const std::vector<unsigned> & updated_params, const std::vector<unsigned> & updated_lookup_params)

Update subset of parameters.

Update some but not all of the parameters included in the model. This is the update_subset() function in the Python bindings. The parameters to be updated are specified by index, which can be found for Parameter and LookupParameter objects through the “index” variable (or the get_index() function in the Python bindings).

Parameters
• updated_params: The parameter indices to be updated
• updated_lookup_params: The lookup parameter indices to be updated

virtual void dynet::Trainer::restart() = 0

Restarts the optimizer.

Clears all momentum values and assimilate (if applicable)

void dynet::Trainer::restart(real lr)

Restarts the optimizer with a new learning rate.

Clears all momentum values and assimilate (if applicable) and resets the learning rate

Parameters
• lr: New learning rate

float dynet::Trainer::clip_gradients()

bool dynet::Trainer::sparse_updates_enabled