One of the recent breakthroughs in machine translation is the application of the transformer. Compared with a sequence-to-sequence model based on recurrent neural networks (RNNs), the attention-based transformer model makes use of the encoder output at every time step, which further improves machine translation performance. In this project, I trained both models and show that the attention-based model gained 6.8% in BLEU score on an English-French translation dataset from Tatoeba.
Background 🔗
Recently, machine translation has gained much success thanks to a series of efforts to represent sentences using a sequence-to-sequence (seq-to-seq) model (please click here for my demo of the seq-to-seq based machine translation model). In this structure, two sequence models are put together: the first one, the encoder, encodes the information of the source sentence (the sentence to be translated when the structure is applied to machine translation); the second one, the decoder, produces the sentence in the target language word by word based on the encoding (embedding) of the source sentence. This avoids the problem of word-by-word translation, since languages vary in their syntax. For example, the French sentence “Je ne l’aime pas” can be translated into English as “I don’t like him”. “Ne…pas” in French is equivalent to “do not” in English. Due to the word-order difference between the two languages, word-by-word translation is difficult. The encoder-decoder structure addresses this issue and achieves better performance on machine translation tasks.
Dataset 🔗
The source of the dataset used in this project is Tatoeba, a website where people can upload sentences in any language and contribute their own versions of translations in other languages. The English-French dataset used here was downloaded from https://www.manythings.org/anki/, where they preprocessed the Tatoeba dataset so that it became a text file of English French sentence pairs.
Preprocessing 🔗
The dataset is ingested, and the following preprocessing steps are applied before building the machine translation model:
- Texts are transformed to lower case and punctuation is removed.
- Word-index dictionaries are created that map every word appearing in the dataset to a unique number before the sentences are fed into the model (see the sketch after this list).
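To make these steps concrete, here is a minimal Python sketch of the normalization and word-index mapping. The helper names (`normalize`, `build_vocab`) and the `<sos>`/`<eos>` special tokens are my own choices for illustration, not the project's actual code.

```python
import re
import unicodedata

def normalize(text):
    # Strip accents, lower-case, and replace punctuation, keeping only letters and spaces.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = re.sub(r"[^a-z ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def build_vocab(sentences):
    # Map every word appearing in the dataset to a unique integer index.
    word2index = {"<sos>": 0, "<eos>": 1}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word2index:
                word2index[word] = len(word2index)
    return word2index

# Tiny usage example with one English-French pair in the style of the dataset.
pairs = [("Thanks for your hard work.", "Merci d'avoir travaillé si durement.")]
eng_vocab = build_vocab(normalize(e) for e, f in pairs)
fra_vocab = build_vocab(normalize(f) for e, f in pairs)
```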
Model Setup 🔗
- Encoder: The first sequence model in the seq-to-seq structure, the encoder, encodes the information of the sentence in the source language. The encoder outputs a fixed-length embedding, which becomes the input of the second sequence model, the decoder, which predicts the words of the target language. Several types of layers from the sequence-model family can serve as the encoder: for example, an RNN (recurrent neural network), a GRU (gated recurrent unit), or an LSTM (long short-term memory) are all candidates.
- Decoder: The second sequence model in the seq-to-seq structure, the decoder, takes the final hidden state of the encoder as input and generates sentences in the target language in a word-by-word fashion.
- Decoder with attention: Instead of taking only the last output of the encoder, a decoder with attention multiplies its state at each step with all encoder outputs and learns to ‘focus attention’ on different encoder outputs, which leads to better performance. A minimal code sketch of these modules follows this list.
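The following is a rough PyTorch sketch of a GRU encoder and a dot-product attention decoder along these lines; the class names, layer sizes, and the exact attention scoring are my own simplifications and not necessarily the configuration used in this project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderRNN(nn.Module):
    """Encode the source sentence with a GRU, keeping the output at every time step."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len) word indices
        embedded = self.embedding(src)             # (batch, src_len, hidden)
        outputs, hidden = self.gru(embedded)       # outputs: (batch, src_len, hidden)
        return outputs, hidden

class AttnDecoderRNN(nn.Module):
    """Decode one target word per step, attending over all encoder outputs."""
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size * 2, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, token, hidden, encoder_outputs):
        # token: (batch, 1), hidden: (1, batch, hidden), encoder_outputs: (batch, src_len, hidden)
        embedded = self.embedding(token)                       # (batch, 1, hidden)
        query = hidden[-1].unsqueeze(2)                        # (batch, hidden, 1)
        scores = torch.bmm(encoder_outputs, query)             # dot product with every encoder output
        weights = F.softmax(scores, dim=1)                     # attention weights over source positions
        context = (weights * encoder_outputs).sum(dim=1, keepdim=True)  # (batch, 1, hidden)
        output, hidden = self.gru(torch.cat([embedded, context], dim=2), hidden)
        logits = self.out(output.squeeze(1))                   # scores over the target vocabulary
        return logits, hidden, weights
```

In this sketch, the decoder's current hidden state is multiplied (a batched dot product) against every encoder output; the softmax-weighted sum of encoder outputs is then concatenated with the previous word's embedding before the GRU step.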
Model Output 🔗
After training for 60,000 iterations, we show some examples of model-generated sentences in the target language from a test set unseen by the model.
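For context, the generation loop at test time can be sketched as greedy word-by-word decoding. This assumes the hypothetical encoder/decoder modules sketched above and `<sos>`/`<eos>` index conventions chosen purely for illustration.

```python
import torch

def greedy_translate(encoder, decoder, src_indices, index2word,
                     sos_index=0, eos_index=1, max_len=20):
    """Greedily decode a translation, feeding the previous prediction back in at each step."""
    with torch.no_grad():
        src = torch.tensor([src_indices])                 # (1, src_len)
        encoder_outputs, hidden = encoder(src)
        token = torch.tensor([[sos_index]])               # start-of-sentence token
        words = []
        for _ in range(max_len):
            logits, hidden, _ = decoder(token, hidden, encoder_outputs)
            next_index = logits.argmax(dim=1).item()
            if next_index == eos_index:                    # stop at end-of-sentence
                break
            words.append(index2word[next_index])
            token = torch.tensor([[next_index]])
    return " ".join(words)
```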
Example Output: SeqToSeq model without transformer 🔗
Almost Successful Example:
(Eng) thanks for your hard work. - (French) merci d avoir travaille si durement.
Model generated translation from Eng: merci pour ton travail.
Failed Example:
(Eng) we re not family. - (French) nous ne sommes pas de la meme famille.
Model generated translation from Eng: nous ne sommes pas.
Example Output: SeqToSeq model with transformer 🔗
Almost Successful Example:
(Eng) don t leave them alone. - (French) ne les laissez pas seuls.
Model generated translation from Eng: ne te laisse pas seule!
Failed Example:
(Eng) you re the best singer i know. - (French) vous etes le meilleur chanteur que je connaisse .
Model generated translation from Eng: vous etes la meilleure . . .
Measure Model Performance: the BLEU Score 🔗
It is difficult to compare the performance of machine translation models by randomly sampling model outputs. One widely used metric for translation quality is the BLEU score.
The BLEU score, simply put, is the number of tokens in the model-generated sentence that also appear in the human reference translation, divided by the total number of tokens in the generated sentence, ignoring word order. The closer the BLEU score is to 1, the better the generated translation. Consider the following ground-truth sentence pair:
(Eng) How are you doing - (French) Comment ca va
Based on the English sentence, the model generates the following sentence:
Tout va bien
If we count the 1-grams in the generated sentence that also appear in the ground truth, the only match is “va”, which gives a BLEU score of 1 / 3 ≈ 0.33.
If we count the 2-grams in the generated sentence (“tout va” and “va bien”) that appear in the ground truth (“comment ca” and “ca va”), there are no matches, which gives a BLEU score of 0 / 2 = 0.
From this example, we can see a problem with the BLEU score. The number of overlapping n-grams is not a perfect measure of translation quality, because “comment ca va” and “tout va bien” can both be correct translations of “how are you doing”. However, the BLEU score is still a widely used measure for machine translation tasks involving large amounts of data.
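As a sanity check of the numbers above, this simplified n-gram overlap can be computed in a few lines of Python (a full BLEU implementation additionally combines several n-gram orders and applies a brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of n-grams in the candidate that also appear in the reference (clipped counts)."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    matches = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matches / max(sum(cand_ngrams.values()), 1)

print(ngram_precision("tout va bien", "comment ca va", 1))  # 1 / 3 ≈ 0.33
print(ngram_precision("tout va bien", "comment ca va", 2))  # 0 / 2 = 0.0
```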
Result and Next Steps 🔗
The 1-gram BLEU score of the seq-to-seq model with transformer on the test dataset is 0.47, higher than that of the model without the transformer (0.44) in the demonstrated experiment.
Next Step: model enhancement.
References 🔗
- Cho, K., van Merrienboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP 2014.
- Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS 2014).
- Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002.