Optimizers

The various optimizers that you can use to tune your parameters

struct dynet::SimpleSGDTrainer
#include <training.h>

This trainer performs stochastic gradient descent, the go-to optimization procedure for neural networks. In the standard setting, the learning rate at epoch $$t$$ is $$\eta_t=\frac{\eta_0}{1+\eta_{\mathrm{decay}}t}$$

Reference : reference needed

Inherits from dynet::Trainer

Public Functions

SimpleSGDTrainer(Model &m, real e0 = 0.1, real edecay = 0.0)

Constructor.

Parameters
• m: Model to be trained
• e0: Initial learning rate
• edecay: Learning rate decay parameter.
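The decay schedule and update rule above can be sketched in plain Python (an illustrative scalar helper only, not part of the DyNet API; the name `sgd_step` is mine):

```python
def sgd_step(theta, grad, t, e0=0.1, edecay=0.0):
    """One vanilla SGD step: theta <- theta - eta_t * grad,
    with eta_t = e0 / (1 + edecay * t)."""
    eta_t = e0 / (1.0 + edecay * t)
    return theta - eta_t * grad

# With edecay = 0 the learning rate stays at e0 for every epoch:
theta = sgd_step(1.0, 2.0, t=0, e0=0.1)  # 1.0 - 0.1 * 2.0 = 0.8
```

With a nonzero `edecay`, the step size shrinks hyperbolically with the epoch count, which is the schedule given in the formula above.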

struct dynet::MomentumSGDTrainer
#include <training.h>

This is a modified version of the SGD algorithm with momentum to stabilize the gradient trajectory. The modified gradient is $$v_{t+1}=\mu v_{t}+\nabla_{t+1}$$ where $$\mu$$ is the momentum and $$v$$ is the velocity used in place of the raw gradient.

Reference : reference needed

Inherits from dynet::Trainer

Public Functions

MomentumSGDTrainer(Model &m, real e0 = 0.01, real mom = 0.9, real edecay = 0.0)

Constructor.

Parameters
• m: Model to be trained
• e0: Initial learning rate
• mom: Momentum
• edecay: Learning rate decay parameter
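A plain-Python sketch of the momentum update described above (illustrative only; `momentum_step` is not a DyNet function, and the velocity is assumed to start at zero):

```python
def momentum_step(theta, v, grad, e0=0.01, mom=0.9):
    """Momentum SGD: v <- mom * v + grad, then theta <- theta - e0 * v."""
    v = mom * v + grad
    theta = theta - e0 * v
    return theta, v

# Repeated steps in the same gradient direction build up velocity:
theta, v = momentum_step(1.0, 0.0, 1.0, e0=0.1)   # v = 1.0
theta, v = momentum_step(theta, v, 1.0, e0=0.1)   # v = 1.9, a larger step
```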

struct dynet::AdagradTrainer
#include <training.h>

The adagrad algorithm assigns a different learning rate to each parameter according to the following formula: $$\delta_\theta^{(t)}=-\frac{\eta_0}{\epsilon+\sqrt{\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}\nabla_\theta^{(t)}$$

Reference : Duchi et al., 2011

Inherits from dynet::Trainer

Public Functions

AdagradTrainer(Model &m, real e0 = 0.1, real eps = 1e-20, real edecay = 0.0)

Constructor.

Parameters
• m: Model to be trained
• e0: Initial learning rate
• eps: Bias parameter $$\epsilon$$ in the adagrad formula
• edecay: Learning rate decay parameter
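The per-parameter scaling can be sketched as follows (an illustrative scalar helper, not the DyNet API; it follows the formula above with $$\epsilon$$ added outside the square root):

```python
import math

def adagrad_step(theta, g2_sum, grad, e0=0.1, eps=1e-20):
    """Adagrad: accumulate squared gradients per parameter and divide
    the step by the root of that running sum."""
    g2_sum = g2_sum + grad * grad
    theta = theta - e0 / (eps + math.sqrt(g2_sum)) * grad
    return theta, g2_sum
```

Parameters that have received large gradients so far take smaller steps, which is the point of the per-parameter learning rate.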

struct dynet::AdadeltaTrainer
#include <training.h>

The AdaDelta optimizer is a variant of Adagrad where $$\frac{\eta_0}{\sqrt{\epsilon+\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}$$ is replaced by $$\frac{\sqrt{\epsilon+\sum_{i=0}^{t-1}\rho^{t-i-1}(1-\rho)(\delta_\theta^{(i)})^2}}{\sqrt{\epsilon+\sum_{i=0}^{t-1}\rho^{t-i-1}(1-\rho)(\nabla_\theta^{(i)})^2}}$$, hence eliminating the need for an initial learning rate.

Inherits from dynet::Trainer

Public Functions

AdadeltaTrainer(Model &m, real eps = 1e-6, real rho = 0.95, real edecay = 0.0)

Constructor.

Parameters
• m: Model to be trained
• eps: Bias parameter $$\epsilon$$ in the AdaDelta formula
• rho: Update parameter for the moving average of updates in the numerator
• edecay: Learning rate decay parameter
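A plain-Python sketch of the AdaDelta update, assuming the exponentially weighted sums above are kept as two moving averages (of squared gradients and of squared updates); `adadelta_step` is my name, not a DyNet function:

```python
import math

def adadelta_step(theta, g2_avg, d2_avg, grad, rho=0.95, eps=1e-6):
    """AdaDelta: the step size is the RMS of past updates divided by
    the RMS of past gradients, so no initial learning rate is needed."""
    g2_avg = rho * g2_avg + (1 - rho) * grad * grad
    delta = -math.sqrt(eps + d2_avg) / math.sqrt(eps + g2_avg) * grad
    d2_avg = rho * d2_avg + (1 - rho) * delta * delta
    return theta + delta, g2_avg, d2_avg
```

Note how `eps` in the numerator is what gets the very first update off the ground, since the update average starts at zero.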

struct dynet::RmsPropTrainer
#include <training.h>

RMSProp optimizer.

The RMSProp optimizer is a variant of Adagrad where the squared sum of previous gradients is replaced with a moving average with parameter $$\rho$$.

Reference : reference needed

Inherits from dynet::Trainer

Public Functions

RmsPropTrainer(Model &m, real e0 = 0.1, real eps = 1e-20, real rho = 0.95, real edecay = 0.0)

Constructor.

Parameters
• m: Model to be trained
• e0: Initial learning rate
• eps: Bias parameter $$\epsilon$$ in the RMSProp formula
• rho: Update parameter for the moving average (rho = 0 is equivalent to using Adagrad)
• edecay: Learning rate decay parameter
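The moving-average variant of Adagrad described above can be sketched like this (illustrative scalar helper, not the DyNet API):

```python
import math

def rmsprop_step(theta, g2_avg, grad, e0=0.1, eps=1e-20, rho=0.95):
    """RMSProp: like Adagrad, but the accumulated squared gradient is
    an exponential moving average with decay parameter rho."""
    g2_avg = rho * g2_avg + (1 - rho) * grad * grad
    theta = theta - e0 / (eps + math.sqrt(g2_avg)) * grad
    return theta, g2_avg
```

Because old squared gradients decay away at rate `rho`, the effective learning rate can recover after a burst of large gradients, unlike Adagrad's monotonically shrinking rate.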

struct dynet::AdamTrainer
#include <training.h>

The Adam optimizer is similar to RMSProp but uses unbiased estimates of the first and second moments of the gradient.

Reference : Adam: A Method for Stochastic Optimization

Inherits from dynet::Trainer

Public Functions

AdamTrainer(Model &m, float e0 = 0.001, float beta_1 = 0.9, float beta_2 = 0.999, float eps = 1e-8, real edecay = 0.0)

Constructor.

Parameters
• m: Model to be trained
• e0: Initial learning rate
• beta_1: Moving average parameter for the mean
• beta_2: Moving average parameter for the variance
• eps: Bias parameter $$\epsilon$$
• edecay: Learning rate decay parameter
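A plain-Python sketch of one Adam step with the standard bias correction (illustrative only, not the DyNet API; `t` is assumed to be the 1-based step count):

```python
import math

def adam_step(theta, m, v, grad, t,
              e0=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    """Adam: moving averages of the gradient (m) and its square (v),
    each divided by 1 - beta**t to remove the zero-initialization bias."""
    m = beta_1 * m + (1 - beta_1) * grad
    v = beta_2 * v + (1 - beta_2) * grad * grad
    m_hat = m / (1 - beta_1 ** t)            # unbiased first moment
    v_hat = v / (1 - beta_2 ** t)            # unbiased second moment
    theta = theta - e0 * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

At `t = 1` the bias correction exactly cancels the `(1 - beta)` factors, so the very first step has magnitude close to `e0`.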

struct dynet::Trainer
#include <training.h>

General trainer struct.

Public Functions

Trainer(Model &m, real e0, real edecay = 0.0)

General constructor for a Trainer.

Parameters
• m: Model to be trained
• e0: Initial learning rate
• edecay: Learning rate decay

void update(real scale = 1.0)

Update parameters.

Update the parameters according to the appropriate update rule

Parameters
• scale: The scaling factor for the gradients

void update(const std::vector<unsigned> &updated_params, const std::vector<unsigned> &updated_lookup_params, real scale = 1.0)

Update subset of parameters.

Update some but not all of the parameters included in the model. This is the update_subset() function in the Python bindings. The parameters to be updated are specified by index, which can be found for Parameter and LookupParameter objects through the “index” variable (or the get_index() function in the Python bindings).

Parameters
• updated_params: The parameter indices to be updated
• updated_lookup_params: The lookup parameter indices to be updated
• scale: The scaling factor for the gradients

float clip_gradients(real scale)

If clipping is enabled and the gradient is too big, return the amount to scale the gradient by (otherwise 1)

Return
The appropriate scaling factor
Parameters
• scale: The clipping limit
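The behavior described above can be sketched in plain Python (a sketch of norm-based clipping under my reading of the description; the names `clip_scale` and `clip_threshold` are mine, and DyNet's exact internals may differ):

```python
import math

def clip_scale(grads, clip_threshold=5.0, scale=1.0, clipping_enabled=True):
    """Return the factor to multiply gradients by: if the gradient norm
    exceeds clip_threshold * scale, shrink it back to that limit; else 1."""
    if not clipping_enabled:
        return 1.0
    gnorm = math.sqrt(sum(g * g for g in grads))
    if gnorm > clip_threshold * scale:
        return clip_threshold * scale / gnorm
    return 1.0
```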

Public Members

bool sparse_updates_enabled