I’ve been thinking recently about a novel way to do Machine Translation. Here are a few of the papers I’ve explored to see if it’s been done before.
This paper develops a complete RNN encoder - decoder for translating from English to French. It uses this system to rescore hypotheses produced by a phrase-based system.
While the author has gone on to develop more advanced systems (see below), there are two major takeaways from this paper. The first is how well the RNN encoder-decoder generates a continuous-space representation of phrases. The 1000-dimensional encodings it produces capture both semantic and syntactic structures helpful in translating between languages.
The second major takeaway is the introduction of the Gated Recurrent Unit (GRU). This is similar to the Long Short Term Memory (LSTM) unit but much simpler and seemingly comparable2. It has a reset gate and an update gate. After non-linearities are applied, the hidden state is a linear combination of the those two gates and the input.
This famous paper references the above work by Cho as well as the work below by Bahdanau. I might be mistaken, but think that this is the first [successful] result for MT using a deep RNN encoder-decoder. They obtained a really good BLEU score (34.8) using an ensemble of five deep models utilizing 384M parameters and an 8,000 dimensional state.
The main takeaway that I had from this paper was the idea of reversing the source string. They found that this was extremely valuable. It makes intuitive sense because if the source string is
The dog is hereand the target string is
Le chien est ici, then it’s highly likely that we would want to associate
Le. By encoding that
Thelast, it has the highest contribution to the encoding. The decoder is then more likely to associate the two correctly.
Their hypothesis for this phenomenon is on page 4:
… Normally, when we concatenate a source sentence with a target sentence, each word in the source sentence is far from its corresponding word in the target sentence. As a result, the problem has a large “minimal time lag”4. By reversing the words in the source sentence, the average distance between corresponding words in the source and target language is unchanged. However, the first few words in the source language are now very close to the first few words in the target language, so the problem’s minimal time lag is greatly reduced. Thus, backpropagation has an easier time “establishing communication” between the source sentence and the target sentence…
A couple of other notes were that deep LSTM models worked much better than shallow ones. This isn’t surprising and theirs used four layers. The training time, however, was surprising. They had to parallelize the algorithm over an 8-GPU architecture, with the four LSTM layers each having their own GPU. The remaining four were used to run softmax calculations. This resulted in a speed of 6,300 words per second for a training time of ten days.
This paper was birthed from the work above by Cho (and he’s a co-author on this one). They conjectured that a bottleneck in the system is that the encoding is a fixed length vector. The input, a sentence, is of variable length. So is the output (another sentence). To get from one to the other, most systems encode the variable-length sequence of words into a fixed length vector and then decode them into the variable-length output. They hypothesized that this caused earlier systems to underperform on longer sentences6.
Their solution is an attention model. This has tremendous success jointly learning to translate and align the words correctly.
It works by having a bi-directional RNN for the encoder. The forward direction computes a sentence vector as usual (using GRUs). There is another backwards direction that does the same but in reverse. However, there isn’t just one of each - there is a pair for every word in the sentence. For an
N-length sentence, we’ll have
Npairs of these vectors. The
jthforward direction vector is the result of the running the forward RNN up to and including the
jthword. Similarly, the corresponding
jthbackward direction vector is the result of running the backward RNN up to and including the
jthword. Each of these pairs are concatenated together and called an Annotation. Think of them as being a summary of the preceding and following words, focused on the nearby words.
Now the decoder takes these Annotations as input. It then needs to decide which Annotation to pay attention to. It does this by computing a probability for each Annotation where the input is influenced by an alignment model. That model scores how well the inputs around each position match with the output at the specified position. It’s a feed-forward neural network that is jointly trained with the whole system.
Cho et al, Universite de Montreal ↩
Sutskever et al, Google ↩
Bahdanau et al, Jacobs University Bremen ↩