Optimizers

The various optimizers that you can use to tune your parameters.
struct SimpleSGDTrainer : public dynet::Trainer
#include <training.h>

Stochastic gradient descent trainer.

This trainer performs stochastic gradient descent, the go-to optimization procedure for neural networks. In the standard setting, the learning rate at epoch \(t\) is \(\eta_t=\frac{\eta_0}{1+\eta_{\mathrm{decay}}t}\).

Reference: reference needed

Public Functions

SimpleSGDTrainer(ParameterCollection &m, real learning_rate = 0.1)
Constructor.

Parameters:
- m: ParameterCollection to be trained
- learning_rate: Initial learning rate
struct CyclicalSGDTrainer : public dynet::Trainer
#include <training.h>

Cyclical learning rate SGD.

This trainer performs stochastic gradient descent with a cyclical learning rate, as proposed in Smith, 2015. It uses a triangular schedule with optional exponential decay. More specifically, at each update the learning rate \(\eta\) is updated according to:

\( \begin{split} \text{cycle} &= \left\lfloor 1 + \frac{\texttt{it}}{2 \times\texttt{step_size}} \right\rfloor\\ x &= \left\vert \frac{\texttt{it}}{\texttt{step_size}} - 2 \times \text{cycle} + 1\right\vert\\ \eta &= \eta_{\text{min}} + (\eta_{\text{max}} - \eta_{\text{min}}) \times \max(0, 1 - x) \times \gamma^{\texttt{it}} \end{split} \)

Reference: Cyclical Learning Rates for Training Neural Networks

Public Functions

CyclicalSGDTrainer(ParameterCollection &m, float learning_rate_min = 0.01, float learning_rate_max = 0.1, float step_size = 2000, float gamma = 1.0, float edecay = 0.0)
Constructor.

Parameters:
- m: ParameterCollection to be trained
- learning_rate_min: Lower learning rate
- learning_rate_max: Upper learning rate
- step_size: Half of the cycle length of the triangular function, in number of iterations (not epochs). According to the original paper, this should be set to around 2-8 times the number of training iterations per epoch
- gamma: Learning rate upper bound decay parameter
- edecay: Learning rate decay parameter. Ideally you shouldn't use this with a cyclical learning rate, since decay is already handled by \(\gamma\)
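The triangular schedule above can be sketched in a few lines; this is an illustration of the formula, not DyNet's actual implementation, and the function name `cyclical_lr` is invented for this example.

```cpp
#include <algorithm>
#include <cmath>

// Illustrative sketch of the triangular cyclical schedule: the rate rises
// linearly from eta_min to eta_max over step_size iterations, falls back,
// and repeats; gamma < 1 shrinks the triangle's height exponentially.
float cyclical_lr(unsigned it, float eta_min, float eta_max,
                  float step_size, float gamma) {
    float cycle = std::floor(1.0f + it / (2.0f * step_size));
    float x = std::fabs(it / step_size - 2.0f * cycle + 1.0f);
    return eta_min + (eta_max - eta_min) * std::max(0.0f, 1.0f - x)
                   * std::pow(gamma, static_cast<float>(it));
}
```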
struct MomentumSGDTrainer : public dynet::Trainer
#include <training.h>

Stochastic gradient descent with momentum.

This is a modified version of the SGD algorithm with momentum to stabilize the gradient trajectory. The modified gradient is \(v_{t+1}=\mu v_{t}+\nabla_{t+1}\), where \(\mu\) is the momentum.

Reference: reference needed

Public Functions

MomentumSGDTrainer(ParameterCollection &m, real learning_rate = 0.01, real mom = 0.9)
Constructor.

Parameters:
- m: ParameterCollection to be trained
- learning_rate: Initial learning rate
- mom: Momentum
struct AdagradTrainer : public dynet::Trainer
#include <training.h>

Adagrad optimizer.

The Adagrad algorithm assigns a different learning rate to each parameter according to the following formula: \(\delta_\theta^{(t)}=-\frac{\eta_0}{\sqrt{\epsilon+\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}\nabla_\theta^{(t)}\)

Reference: Duchi et al., 2011

Public Functions

AdagradTrainer(ParameterCollection &m, real learning_rate = 0.1, real eps = 1e-20)
Constructor.

Parameters:
- m: ParameterCollection to be trained
- learning_rate: Initial learning rate
- eps: Bias parameter \(\epsilon\) in the Adagrad formula
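A minimal standalone sketch of the per-parameter scaling, not DyNet's actual implementation; the helper name `adagrad_update` is invented for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of the Adagrad update: each parameter's step is
// scaled by its own accumulated squared gradients, so frequently-updated
// parameters get smaller effective learning rates.
void adagrad_update(std::vector<float>& theta, std::vector<float>& g2,
                    const std::vector<float>& grad, float lr, float eps) {
    for (std::size_t i = 0; i < theta.size(); ++i) {
        g2[i] += grad[i] * grad[i];                        // accumulate (grad)^2
        theta[i] -= lr / std::sqrt(eps + g2[i]) * grad[i]; // per-parameter rate
    }
}
```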
struct AdadeltaTrainer : public dynet::Trainer
#include <training.h>

AdaDelta optimizer.

The AdaDelta optimizer is a variant of Adagrad where \(\frac{\eta_0}{\sqrt{\epsilon+\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}\) is replaced by \(\frac{\sqrt{\epsilon+\sum_{i=0}^{t-1}\rho^{t-i-1}(1-\rho)(\delta_\theta^{(i)})^2}}{\sqrt{\epsilon+\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}\), hence eliminating the need for an initial learning rate.

Reference: ADADELTA: An Adaptive Learning Rate Method

Public Functions

AdadeltaTrainer(ParameterCollection &m, real eps = 1e-6, real rho = 0.95)
Constructor.

Parameters:
- m: ParameterCollection to be trained
- eps: Bias parameter \(\epsilon\) in the AdaDelta formula
- rho: Update parameter for the moving average of updates in the numerator
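A minimal standalone sketch of an AdaDelta-style step, assuming exponential moving averages for both the squared gradients and the squared updates; this is an illustration, not DyNet's actual implementation, and `adadelta_update` is an invented name.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of the AdaDelta update: the ratio of the RMS of past
// updates to the RMS of past gradients replaces the global learning rate.
void adadelta_update(std::vector<float>& theta, std::vector<float>& avg_g2,
                     std::vector<float>& avg_d2, const std::vector<float>& grad,
                     float rho, float eps) {
    for (std::size_t i = 0; i < theta.size(); ++i) {
        avg_g2[i] = rho * avg_g2[i] + (1.0f - rho) * grad[i] * grad[i];
        float delta = -std::sqrt(eps + avg_d2[i])
                     / std::sqrt(eps + avg_g2[i]) * grad[i];
        avg_d2[i] = rho * avg_d2[i] + (1.0f - rho) * delta * delta;
        theta[i] += delta;
    }
}
```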
struct RMSPropTrainer : public dynet::Trainer
#include <training.h>

RMSProp optimizer.

The RMSProp optimizer is a variant of Adagrad where the squared sum of previous gradients is replaced with a moving average with parameter \(\rho\).

Reference: reference needed

Public Functions

RMSPropTrainer(ParameterCollection &m, real learning_rate = 0.1, real eps = 1e-20, real rho = 0.95)
Constructor.

Parameters:
- m: ParameterCollection to be trained
- learning_rate: Initial learning rate
- eps: Bias parameter \(\epsilon\) in the RMSProp formula
- rho: Update parameter for the moving average (rho = 0 is equivalent to using Adagrad)
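A minimal standalone sketch of the RMSProp step, not DyNet's actual implementation; `rmsprop_update` is an invented name, and the moving average is assumed to be an exponential one with weight \(\rho\).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of the RMSProp update: like Adagrad, but the
// squared-gradient sum is replaced by an exponential moving average.
void rmsprop_update(std::vector<float>& theta, std::vector<float>& avg_g2,
                    const std::vector<float>& grad, float lr, float rho, float eps) {
    for (std::size_t i = 0; i < theta.size(); ++i) {
        avg_g2[i] = rho * avg_g2[i] + (1.0f - rho) * grad[i] * grad[i];
        theta[i] -= lr / std::sqrt(eps + avg_g2[i]) * grad[i];
    }
}
```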
struct AdamTrainer : public dynet::Trainer
#include <training.h>

Adam optimizer.

The Adam optimizer is similar to RMSProp, but uses unbiased estimates of the first and second moments of the gradient.

Reference: Adam: A Method for Stochastic Optimization

Public Functions

AdamTrainer(ParameterCollection &m, float learning_rate = 0.001, float beta_1 = 0.9, float beta_2 = 0.999, float eps = 1e-8)
Constructor.

Parameters:
- m: ParameterCollection to be trained
- learning_rate: Initial learning rate
- beta_1: Moving average parameter for the mean
- beta_2: Moving average parameter for the variance
- eps: Bias parameter \(\epsilon\)
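A minimal standalone sketch of the Adam step with bias correction, not DyNet's actual implementation; `adam_update` is an invented name.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of the Adam update: exponential moving averages of the
// gradient (mean, m) and squared gradient (variance, v), each divided by
// (1 - beta^t) to remove the bias toward zero at early steps.
void adam_update(std::vector<float>& theta, std::vector<float>& m,
                 std::vector<float>& v, const std::vector<float>& grad,
                 unsigned t /* 1-based step count */, float lr,
                 float beta1, float beta2, float eps) {
    for (std::size_t i = 0; i < theta.size(); ++i) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
        float m_hat = m[i] / (1.0f - std::pow(beta1, static_cast<float>(t)));
        float v_hat = v[i] / (1.0f - std::pow(beta2, static_cast<float>(t)));
        theta[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
    }
}
```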
struct AmsgradTrainer : public dynet::Trainer
#include <training.h>

AMSGrad optimizer.

The AMSGrad optimizer is similar to Adam, which uses unbiased estimates of the first and second moments of the gradient; however, AMSGrad keeps the maximum of all past second moments and uses that in place of the current second moment.

Reference: On the Convergence of Adam and Beyond

Public Functions

AmsgradTrainer(ParameterCollection &m, float learning_rate = 0.001, float beta_1 = 0.9, float beta_2 = 0.999, float eps = 1e-8)
Constructor.

Parameters:
- m: ParameterCollection to be trained
- learning_rate: Initial learning rate
- beta_1: Moving average parameter for the mean
- beta_2: Moving average parameter for the variance
- eps: Bias parameter \(\epsilon\)
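A minimal standalone sketch of the distinguishing AMSGrad step, not DyNet's actual implementation; `amsgrad_update` is an invented name, and bias correction is omitted here for brevity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of the AMSGrad update: identical in spirit to Adam,
// except the denominator uses the running *maximum* of the second-moment
// estimate, so the effective learning rate can never increase.
void amsgrad_update(std::vector<float>& theta, std::vector<float>& m,
                    std::vector<float>& v, std::vector<float>& v_max,
                    const std::vector<float>& grad, float lr,
                    float beta1, float beta2, float eps) {
    for (std::size_t i = 0; i < theta.size(); ++i) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
        v_max[i] = std::max(v_max[i], v[i]);  // keep the largest second moment
        theta[i] -= lr * m[i] / (std::sqrt(v_max[i]) + eps);
    }
}
```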
struct EGTrainer : public dynet::Trainer
#include <training.h>

Exponentiated gradient optimizer with momentum and cyclical learning rate.

FIXME

Reference: FIXME
struct Trainer
#include <training.h>

General trainer struct.

Subclassed by dynet::AdadeltaTrainer, dynet::AdagradTrainer, dynet::AdamTrainer, dynet::AmsgradTrainer, dynet::CyclicalSGDTrainer, dynet::EGTrainer, dynet::MomentumSGDTrainer, dynet::RMSPropTrainer, dynet::SimpleSGDTrainer

Public Functions
Trainer(ParameterCollection &m, real learning_rate)
General constructor for a Trainer.

Parameters:
- m: ParameterCollection to be trained
- learning_rate: Initial learning rate
void update()
Update parameters.

Update the parameters according to the appropriate update rule.
void update(const std::vector<unsigned> &updated_params, const std::vector<unsigned> &updated_lookup_params)
Update a subset of parameters.

Update some but not all of the parameters included in the model. This is the update_subset() function in the Python bindings. The parameters to be updated are specified by index, which can be found for Parameter and LookupParameter objects through the "index" variable (or the get_index() function in the Python bindings).

Parameters:
- updated_params: The parameter indices to be updated
- updated_lookup_params: The lookup parameter indices to be updated
virtual void restart() = 0
Restart the optimizer.

Clears all momentum values and the like (if applicable). This method does not update the current hyperparameters (for example, the bias parameter of the AdadeltaTrainer is left unchanged).
void restart(real lr)
Restart the optimizer with a new learning rate.

Clears all momentum values and the like (if applicable) and resets the learning rate. This method does not update the other hyperparameters (for example, the bias parameter of the AdadeltaTrainer is left unchanged).

Parameters:
- lr: New learning rate
void save(std::ostream &os)
Save the optimizer state.

Writes all hyperparameters, momentum values and the like (if applicable) to the stream. If the parameters have been swapped with their moving averages, only the latter are saved.

Parameters:
- os: Output stream
void populate(std::istream &is)
Load the optimizer state.

Reads all hyperparameters, momentum values and the like (if applicable) from the stream.

Parameters:
- is: Input stream
void populate(std::istream &is, real lr)
Load the optimizer state with a new learning rate.

Reads all hyperparameters, momentum values and the like (if applicable) from the stream, then sets the learning rate.

Parameters:
- is: Input stream
- lr: New learning rate
float clip_gradients()
Clip gradients.

If clipping is enabled and the gradient is too big, returns the factor by which to scale the gradient (otherwise 1).

Returns: The appropriate scaling factor
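Norm-based gradient clipping of this kind can be sketched as follows; this is an illustration of the general technique, not DyNet's actual implementation, and `clip_scale` is an invented name.

```cpp
#include <cmath>
#include <vector>

// Illustrative sketch of gradient-norm clipping: if the global L2 norm of
// the gradient exceeds a threshold, return the factor that scales it back
// down to the threshold; otherwise return 1 (no rescaling).
float clip_scale(const std::vector<float>& grad, float threshold) {
    float sq = 0.0f;
    for (float g : grad) sq += g * g;
    float norm = std::sqrt(sq);
    return (norm > threshold) ? threshold / norm : 1.0f;
}
```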
MovingAverage moving_average()
Whether the trainer is storing a moving average of the parameters.

Returns: The moving average mode
void exponential_moving_average(float beta, unsigned update_freq = 1u)
Enable the computation of the exponential moving average of the parameters.

This function must be called before any update.

Parameters:
- beta: The degree of weighting decrease
- update_freq: Frequency of update of the EMA
void cumulative_moving_average(unsigned update_freq = 1u)
Enable the computation of the cumulative moving average of the parameters.

This function must be called before any update.

Parameters:
- update_freq: Frequency of update of the moving average
void swap_params_to_moving_average(bool save_weights = true, bool bias_correction = false)
Set the network parameters to their moving average.

If the current weights are not saved, the optimizer cannot be used anymore (e.g. the update() function will throw an exception).

Parameters:
- save_weights: Whether to save the current weights
- bias_correction: Whether to apply bias correction (used for the exponential moving average only)
void swap_params_to_weights()
Restore the parameters of the model if they are set to their moving average.
Public Members

bool sparse_updates_enabled
Whether to perform sparse updates.

DyNet trainers support two types of updates for lookup parameters: sparse and dense. Sparse updates are the default. They have the potential to be faster, as they only touch the parameters that have non-zero gradients. However, they may not always be faster (particularly on GPU with mini-batch training), and are not precisely numerically correct for some update rules such as MomentumTrainer and AdamTrainer. If you set this variable to false, the trainer will perform dense updates, which are guaranteed to be exactly correct, and may sometimes be faster as well.