RNNs tutorial¶
In [1]:
# we assume that we have the dynet module in your path.
# OUTDATED: we also assume that LD_LIBRARY_PATH includes a pointer to where libcnn_shared.so is.
import dynet as dy
An LSTM/RNN overview:¶
An (1-layer) RNN can be thought of as a sequence of cells, \(h_1,...,h_k\), where \(h_i\) indicates the time dimenstion.
Each cell \(h_i\) has an input \(x_i\) and an output \(r_i\). In addition to \(x_i\), cell \(h_i\) receives as input also \(r_{i-1}\).
In a deep (multi-layer) RNN, we don’t have a sequence, but a grid. That is we have several layers of sequences:
- \(h_1^3,...,h_k^3\)
- \(h_1^2,...,h_k^2\)
- \(h_1^1,...h_k^1\),
Let \(r_i^j\) be the output of cell \(h_i^j\). Then:
The input to \(h_i^1\) is \(x_i\) and \(r_{i-1}^1\).
The input to \(h_i^2\) is \(r_i^1\) and \(r_{i-1}^2\), and so on.
The LSTM (RNN) Interface¶
RNN / LSTM / GRU follow the same interface. We have a “builder” which is in charge of creating definining the parameters for the sequence.
In [2]:
pc = dy.ParameterCollection()
NUM_LAYERS=2
INPUT_DIM=50
HIDDEN_DIM=10
builder = dy.LSTMBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
# or:
# builder = dy.SimpleRNNBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
Note that when we create the builder, it adds the internal RNN
parameters to the ParameterCollection
. We do not need to care about
them, but they will be optimized together with the rest of the network’s
parameters.
In [3]:
s0 = builder.initial_state()
In [4]:
x1 = dy.vecInput(INPUT_DIM)
In [5]:
s1=s0.add_input(x1)
y1 = s1.output()
# here, we add x1 to the RNN, and the output we get from the top is y (a HIDEN_DIM-dim vector)
In [6]:
y1.npvalue().shape
Out[6]:
(10,)
In [7]:
s2=s1.add_input(x1) # we can add another input
y2=s2.output()
If our LSTM/RNN was one layer deep, y2 would be equal to the hidden state. However, since it is 2 layers deep, y2 is only the hidden state (= output) of the last layer.
If we were to want access to the all the hidden state (the output of
both the first and the last layers), we could use the .h()
method,
which returns a list of expressions, one for each layer:
In [8]:
print s2.h()
(exprssion 54/0, exprssion 66/0)
The same interface that we saw until now for the LSTM, holds also for the Simple RNN:
In [9]:
# create a simple rnn builder
rnnbuilder=dy.SimpleRNNBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
# initialize a new graph, and a new sequence
rs0 = rnnbuilder.initial_state()
# add inputs
rs1 = rs0.add_input(x1)
ry1 = rs1.output()
print "all layers:", s1.h()
all layers: (exprssion 32/0, exprssion 42/0)
In [10]:
print s1.s()
(exprssion 28/0, exprssion 38/0, exprssion 32/0, exprssion 42/0)
To summarize, when calling .add_input(x)
on an RNNState
what
happens is that the state creates a new RNN/LSTM column, passing it: 1.
the state of the current RNN column 2. the input x
The state is then returned, and we can call it’s output()
method to
get the output y
, which is the output at the top of the column. We
can access the outputs of all the layers (not only the last one) using
the .h()
method of the state.
``.s()`` The internal state of the RNN may be more involved than
just the outputs \(h\). This is the case for the LSTM, that keeps an
extra “memory” cell, that is used when calculating \(h\), and which
is also passed to the next column. To access the entire hidden state, we
use the .s()
method.
The output of .s()
differs by the type of RNN being used. For the
simple-RNN, it is the same as .h()
. For the LSTM, it is more
involved.
In [11]:
rnn_h = rs1.h()
rnn_s = rs1.s()
print "RNN h:", rnn_h
print "RNN s:", rnn_s
lstm_h = s1.h()
lstm_s = s1.s()
print "LSTM h:", lstm_h
print "LSTM s:", lstm_s
RNN h: (exprssion 74/0, exprssion 76/0)
RNN s: (exprssion 74/0, exprssion 76/0)
LSTM h: (exprssion 32/0, exprssion 42/0)
LSTM s: (exprssion 28/0, exprssion 38/0, exprssion 32/0, exprssion 42/0)
As we can see, the LSTM has two extra state expressions (one for each hidden layer) before the outputs h.
Extra options in the RNN/LSTM interface¶
Stack LSTM The RNN’s are shaped as a stack: we can remove the top
and continue from the previous state. This is done either by remembering
the previous state and continuing it with a new .add_input()
, or
using we can access the previous state of a given state using the
.prev()
method of state.
Initializing a new sequence with a given state When we call
builder.initial_state()
, we are assuming the state has random /0
initialization. If we want, we can specify a list of expressions that
will serve as the initial state. The expected format is the same as the
results of a call to .final_s()
. TODO: this is not supported yet.
In [12]:
s2=s1.add_input(x1)
s3=s2.add_input(x1)
s4=s3.add_input(x1)
# let's continue s3 with a new input.
s5=s3.add_input(x1)
# we now have two different sequences:
# s0,s1,s2,s3,s4
# s0,s1,s2,s3,s5
# the two sequences share parameters.
assert(s5.prev() == s3)
assert(s4.prev() == s3)
s6=s3.prev().add_input(x1)
# we now have an additional sequence:
# s0,s1,s2,s6
In [13]:
s6.h()
Out[13]:
(exprssion 184/0, exprssion 196/0)
In [14]:
s6.s()
Out[14]:
(exprssion 180/0, exprssion 192/0, exprssion 184/0, exprssion 196/0)
Aside: memory efficient transduction¶
The RNNState
interface is convenient, and allows for incremental
input construction. However, sometimes we know the sequence of inputs in
advance, and care only about the sequence of output expressions. In this
case, we can use the add_inputs(xs)
method, where xs
is a list
of Expression.
In [15]:
state = rnnbuilder.initial_state()
xs = [x1,x1,x1]
states = state.add_inputs(xs)
outputs = [s.output() for s in states]
hs = [s.h() for s in states]
print outputs, hs
[exprssion 200/0, exprssion 206/0, exprssion 212/0] [(exprssion 198/0, exprssion 200/0), (exprssion 203/0, exprssion 206/0), (exprssion 209/0, exprssion 212/0)]
This is convenient.
What if we do not care about .s()
and .h()
, and do not need to
access the previous vectors? In such cases we can use the
transduce(xs)
method instead of add_inputs(xs)
. transduce
takes in a sequence of Expression
s, and returns a sequence of
Expression
s. As a consequence of not returning RNNState
s,
trnasduce
is much more memory efficient than add_inputs
or a
series of calls to add_input
.
In [16]:
state = rnnbuilder.initial_state()
xs = [x1,x1,x1]
outputs = state.transduce(xs)
print outputs
[exprssion 216/0, exprssion 222/0, exprssion 228/0]
Character-level LSTM¶
Now that we know the basics of RNNs, let’s build a character-level LSTM language-model. We have a sequence LSTM that, at each step, gets as input a character, and needs to predict the next character.
In [17]:
import random
from collections import defaultdict
from itertools import count
import sys
LAYERS = 2
INPUT_DIM = 50
HIDDEN_DIM = 50
characters = list("abcdefghijklmnopqrstuvwxyz ")
characters.append("<EOS>")
int2char = list(characters)
char2int = {c:i for i,c in enumerate(characters)}
VOCAB_SIZE = len(characters)
In [18]:
pc = dy.ParameterCollection()
srnn = dy.SimpleRNNBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
lstm = dy.LSTMBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
params = {}
params["lookup"] = pc.add_lookup_parameters((VOCAB_SIZE, INPUT_DIM))
params["R"] = pc.add_parameters((VOCAB_SIZE, HIDDEN_DIM))
params["bias"] = pc.add_parameters((VOCAB_SIZE))
# return compute loss of RNN for one sentence
def do_one_sentence(rnn, sentence):
# setup the sentence
dy.renew_cg()
s0 = rnn.initial_state()
R = dy.parameter(params["R"])
bias = dy.parameter(params["bias"])
lookup = params["lookup"]
sentence = ["<EOS>"] + list(sentence) + ["<EOS>"]
sentence = [char2int[c] for c in sentence]
s = s0
loss = []
for char,next_char in zip(sentence,sentence[1:]):
s = s.add_input(lookup[char])
probs = dy.softmax(R*s.output() + bias)
loss.append( -dy.log(dy.pick(probs,next_char)) )
loss = dy.esum(loss)
return loss
# generate from model:
def generate(rnn):
def sample(probs):
rnd = random.random()
for i,p in enumerate(probs):
rnd -= p
if rnd <= 0: break
return i
# setup the sentence
dy.renew_cg()
s0 = rnn.initial_state()
R = dy.parameter(params["R"])
bias = dy.parameter(params["bias"])
lookup = params["lookup"]
s = s0.add_input(lookup[char2int["<EOS>"]])
out=[]
while True:
probs = dy.softmax(R*s.output() + bias)
probs = probs.vec_value()
next_char = sample(probs)
out.append(int2char[next_char])
if out[-1] == "<EOS>": break
s = s.add_input(lookup[next_char])
return "".join(out[:-1]) # strip the <EOS>
# train, and generate every 5 samples
def train(rnn, sentence):
trainer = dy.SimpleSGDTrainer(pc)
for i in xrange(200):
loss = do_one_sentence(rnn, sentence)
loss_value = loss.value()
loss.backward()
trainer.update()
if i % 5 == 0:
print loss_value,
print generate(rnn)
Notice that: 1. We pass the same rnn-builder to do_one_sentence
over
and over again. We must re-use the same rnn-builder, as this is where
the shared parameters are kept. 2. We dy.renew_cg()
before each
sentence – because we want to have a new graph (new network) for this
sentence. The parameters will be shared through the model and the shared
rnn-builder.
In [19]:
sentence = "a quick brown fox jumped over the lazy dog"
train(srnn, sentence)
142.737915039 lvawhaevbxulc yxg esuh vkyb gymj dzcnwgq dcjzzk
84.1147460938 woifoa odp jpt gxjofkaattj
44.212223053 a q io uoopr ouxducmwi jfxa j
23.4485988617 p tctflr
9.73490333557 w
3.23773050308 yaqzteu pux oa rntd bxumu yyvvfalejuyhed over the lazy dog
1.06309330463 a quick browe fow jumped over the lazy dog
0.671298980713 a quick broyn ox jumped over the lazy dog
0.490513861179 a quick brown fox jumped over the lazy dog
0.386095941067 a quick brown fox jumped over the lazy dog
0.318082690239 a quick brown fox jumped over the lazy dog
0.270276993513 a quick brown fox jumped over the lazy dog
0.234851941466 a quick brown foz jumped over the lazy dog
0.207555636764 a quick brown fox jumped over the lazy dog
0.185884565115 a quick brown fox jumped over the lazy dog
0.168265148997 a quiuk brown fox jumped over jhe lazy dog
0.153665527701 a quick brown fox jumped over the lazy dog
0.141367897391 a quick brown fox jumped over the lazy dog
0.130873680115 a quick brown fox jumped over the lazy dog
0.121810980141 a quick brown fox jumped over the lazy dog
0.113908931613 a quick brown fox jumped over the lazy dog
0.106958284974 a quick brown fox jumped over the lazy dog
0.100796818733 a quick brown fox jumped over the lazy dog
0.0953008085489 a quick brown fox jumped over the lazy dog
0.090367347002 a zuick brown for jumped over the lazy dog
0.0859087407589 a quick brown fox jumped over the lazy dog
0.0818664133549 a quick brown fox jumped over the lazy dog
0.0781841799617 a quick brown fox jumped over the lazy dog
0.0748091414571 a quick brown fox jumped over the lazy dog
0.0717144161463 a quick brown fox jumped over the lazy dog
0.0688648074865 a quick brown fox jumped over the lazy dog
0.0662328600883 a quick brown fox jumped over the lazy dog
0.0637853741646 a quick brown fox jumped over the lazy dog
0.0615109689534 a quick brown fox jumped over the lazy dog
0.0593910999596 a quick brown fox jumped over the lazy dog
0.0574130378664 a quick brown fox jumped over the lazy dog
0.0555621087551 a quick brown fox jumped over the lazy dog
0.0538215488195 a quick brown fox jumped over the lazy dog
0.0521896965802 a quick brown fox jumped over the lazy dog
0.0506477579474 a quick brown fox jumped over the lazy dog
In [20]:
sentence = "a quick brown fox jumped over the lazy dog"
train(lstm, sentence)
141.891098022 aoyekppy mocalmz xk atc jlg oaddk
128.925964355 hempeyud ki
121.445785522 qpveti fyobec ztmr eioknnueh ehecdvabxmc ydpmdm
110.670722961 z buws lmy vvrw
93.5055999756 vueoa cprlnkrd o ocazk nb olegiep o fftr t
82.1586227417 zj rvsr oej c toz bnarreow fffj
67.430847168 rzfik qoyc ohe hqe oea uitet ou udjkpme oak kdk oe fbu kcz fox dfoprl too o rxat luurnfowrrtj rbtram to url xlj okrr ooe otm hcy roab llsg doy ifzw rrbow rbowwb oke jxpee
54.9477920532 ba uiy doge she ueeze oejv
43.3301696777 qquc crgibbroej oxne ove rr
34.4687461853 uqckk owrbfo og uouk doge l
25.5408306122 reuk lfr own fox juamd ov
18.9417610168 qojn doo broww boan jover txe zacy moen crlw numk fox joge overwa trez quqk browx ox ruor oro fow j uoez kon fror bowe luccmd ogwr foy jodmoed ox
13.1646575928 qucy dov
9.46595668793 wiuuik brttxl laed over tre lazy dog
5.6522898674 rukc irown fox juaped over the lazy dov
3.38144731522 a quick brown fox jumver the lazy dog
1.80010521412 a bfoin fox jumped ovk fox luick brown fox jumped over the lazy dog
1.30616080761 a quic brownn fox jumped over the lazy dog
1.02201879025 a quick brown fox jumped over the lazy dog
0.83735615015 qucck brown fox jcmped over the lazy dog
0.708056390285 a quickz brown fox jumped over the lazy dog
0.612650871277 a quick brown fox jumped over the lazy dog
0.539469838142 a quick brown fox jumped over thel lazy dog
0.481610894203 va quick brown fox jumped over the lazy dog
0.434762001038 a quuck dovtbown fox jumped over the lazy dog
0.396079242229 a quick brown fox jumped over the lazy dog
0.363606244326 a quick brown fox jumped over the laza dog
0.335973978043 a quick brown fox jumped over the lazy dog
0.312186658382 a quick brown fox jumped over the lazy dog
0.291498303413 a quick brown fox qu
0.273335546255 a quick brown fox jumped ove
0.257278442383 a quick brown fox jumped over the lazy dog
0.242971763015 a quick brown fox jumped over the lazy dog
0.230153128505 a quick brown fox jumped over the lazy dog
0.218599274755 a quick brown fox jumped over the lazy dog
0.208135351539 a quick brown fox jumped over the lazy dog
0.198613137007 a quick brown fox jumped over tie lazy dog
0.189909905195 a quick brown fox jumped over the lazy dog
0.181928783655 a quick brown fox jumped over the lazy dog
0.174587100744 a quick brown fox jumped over the lazy dog
The model seem to learn the sentence quite well.
Somewhat surprisingly, the Simple-RNN model learn quicker than the LSTM!
How can that be?
The answer is that we are cheating a bit. The sentence we are trying to learn has each letter-bigram exactly once. This means a simple trigram model can memorize it very well.
Try it out with more complex sequences.
In [21]:
train(srnn, "these pretzels are making me thirsty")
332.651580811 a quick brown fox jumped over the lazy dog
133.209350586 a quick brown fox jumped over the lazy doe hu yum xd the
65.0720596313 azquick brown fox jumped over ohe iog
31.5592880249 a quick brown fox jumpedrovtretpede pretzelz are makink ma tui idmilt
13.2322559357 theve prwtumpede mhxtjaypny mreticv
1.87829053402 thele pretzelb mre laki loet dre za tuiri mtoina ma qui irwt ere sa taetsdaca qamtuioe ma ick mrolnn mhetsirstyyza qa luijuoethetsepsaaya quirk brmtze ehersjlyaa aumu orkrbtoeqz lrea quijk jrowza quiquihi sakiny mr tui ss thels theqetursy famtzi maethehe iretza lamqzd zretsels area qhirk browna yhetza quirkt rxkwn mox ja isi mq thirsty
0.680327475071 these pretzels are makind me thirsty
0.176128521562 these pretzels are making me thirsty
0.126334354281 these pretzels are making me thirsty
0.10075186193 these pretzels are making me thirsty
0.0846510156989 these pretzels are making me thirsty
0.0734022557735 these pretzels are making me thirsty
0.0650328546762 these pretzels are making me thirsty
0.0585154108703 these pretzels are making me thirsty
0.0532807298005 these pretzels are making me thirsty
0.0489665567875 these pretzels are making me thirsty
0.0453444086015 these pretzels are making me thirsty
0.0422535128891 these pretzels are making me thirsty
0.0395833179355 these pretzels are making me thirsty
0.0372485220432 these mretzels are making me thirsty
0.0351839251816 these pretzels are making me thirsty
0.0333509668708 these pretzels are making me thirsty
0.0317104011774 these pretzels are making me thirsty
0.0302277039737 these pretzels are making me thirsty
0.0288887582719 these pretzels are making me thirsty
0.0276643745601 these pretzels are making me thirsty
0.0265435613692 these pretzels are making me thirsty
0.0255212895572 these pretzels are making me thirsty
0.0245705824345 these pretzels are making me thirsty
0.0236932244152 these pretzels are making me thirsty
0.0228785891086 these pretzels are making me thirsty
0.0221205893904 these pretzels are making me thirsty
0.0214090794325 these pretzels are making me thirsty
0.0207556784153 these pretzels are making me thirsty
0.0201329570264 these pretzels are making me thirsty
0.0195484217256 these pretzels are making me thirsty
0.0190003421158 these pretzels are making me thirsty
0.0184785164893 these pretzels are making me thirsty
0.0179911740124 these pretzels are making me thirsty
0.0175334792584 these pretzels are making me thirsty