Training Word Embeddings from Scratch


Word embeddings are an effective way to represent words as fixed-length numeric vectors that carry semantic and syntactic information. In a seminal paper, Mikolov et al. (2013) introduced the Word2vec technique for training word embeddings, including the continuous bag-of-words (CBOW) and skip-gram models. Words with similar neighbors (i.e., contexts) end up with similar embeddings. The technique has inspired many later works on training word embeddings, which can be applied to analogy and other downstream tasks. In this project, I used NumPy to train word embeddings with the CBOW model, to demonstrate the structure and math of the model, and applied the resulting embeddings to an analogy task.
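
As a concrete illustration of the analogy task, the classic example from the literature is that vector arithmetic on trained embeddings can recover relationships such as king − man + woman ≈ queen. The sketch below is a minimal version of this idea, assuming a dictionary `embeddings` mapping words to NumPy vectors; the variable and function names are illustrative, not taken from the project code.

```python
import numpy as np

def analogy(a, b, c, embeddings, top_k=1):
    """Solve 'a is to b as c is to ?' by vector arithmetic on word embeddings.

    `embeddings` is assumed to be a dict mapping word -> 1D NumPy array.
    """
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)

    scores = []
    for word, vec in embeddings.items():
        if word in (a, b, c):          # exclude the query words themselves
            continue
        cos = vec @ target / np.linalg.norm(vec)
        scores.append((cos, word))

    return [w for _, w in sorted(scores, reverse=True)[:top_k]]

# e.g. analogy("man", "king", "woman", embeddings) -> ["queen"] with good embeddings
```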

Quick Summary

In this project, I used an early backup of Wikipedia content to train word embeddings with the method above. The resulting word embeddings were visualized in a 2D space, and the distances between them were calculated.
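
The post does not state which projection method was used for the 2D visualization; a common choice is PCA (t-SNE is another option). Below is a minimal PCA-based sketch, assuming `vectors` is an (N, d) NumPy array of trained embeddings and `words` is the corresponding list of tokens; both names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_embeddings_2d(vectors, words):
    """Project embedding vectors to 2D with PCA and scatter-plot them."""
    centered = vectors - vectors.mean(axis=0)           # center before PCA
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T                        # first two principal components

    plt.figure(figsize=(8, 8))
    plt.scatter(coords[:, 0], coords[:, 1], s=5)
    for (x, y), word in zip(coords, words):
        plt.annotate(word, (x, y), fontsize=8)
    plt.show()
```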

Dataset

The dataset used in this project was an early backup of Wikipedia content (circa 2006) containing roughly 100M tokens. This choice was made due to computational resource constraints.

Background

A word embedding represents a word as a vector of predefined dimension whose components can be interpreted as carrying semantic or syntactic information. In their well-known paper, Mikolov et al. (2013) proposed the continuous bag-of-words model, which uses a window of surrounding tokens (the context) to predict the center token. Thus, words with similar neighbors end up with similar weights in the neural network, i.e., similar embeddings. Words with similar embeddings (e.g., measured by cosine similarity or Euclidean distance) tend to be semantically or syntactically similar.
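
To make the windowing concrete, here is a minimal sketch of how (context, target) training pairs might be built from a token sequence. The window size and function name are illustrative assumptions, not taken from the project code.

```python
def make_cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for CBOW training.

    Each target token is predicted from up to `window` tokens on each side.
    """
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:                       # skip tokens that have no context at all
            yield context, target

# e.g. list(make_cbow_pairs(["the", "cat", "sat", "on", "the", "mat"], window=2))
# -> [(["cat", "sat"], "the"), (["the", "sat", "on"], "cat"), ...]
```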

Model Architecture

This project used a two-layer neural network to train word embeddings on the Wikipedia dataset described above, with a softmax activation on the output layer. The resulting word embeddings were projected onto a 2D space for visualization.
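
To make the two-layer structure concrete, below is a minimal NumPy sketch of a CBOW forward pass with a softmax output. The hidden size, weight names (`W_in`, `W_out`), and example indices are illustrative assumptions; the actual project code may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10_000, 100            # illustrative sizes

# Layer 1: input word embeddings; layer 2: projection back to the vocabulary.
W_in = rng.normal(scale=0.01, size=(vocab_size, embed_dim))
W_out = rng.normal(scale=0.01, size=(embed_dim, vocab_size))

def softmax(z):
    z = z - z.max()                            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cbow_forward(context_ids):
    """Predict a probability distribution over the vocabulary for the target word."""
    h = W_in[context_ids].mean(axis=0)         # average the context embeddings (hidden layer)
    scores = h @ W_out                         # project to vocabulary scores
    return softmax(scores), h

probs, h = cbow_forward(np.array([12, 7, 345, 2048]))   # hypothetical context word indices
```

After training, the rows of `W_in` serve as the word embeddings used for the visualization and analogy experiments above.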

Reference

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.
