Train Word Embeddings with CBOW Model


Word embeddings are an effective way to represent relationships between words: words with similar neighbors (i.e., contexts) have similar embeddings. In a now well-known paper, Mikolov et al. (2013) demonstrated how to train word embeddings with the continuous bag-of-words (CBOW) model and apply them to analogy and other downstream tasks. This project implements the CBOW model from scratch in NumPy, trains word embeddings with it, and applies them to an analogy task.

Quick Summary

In this project, I used an early backup of Wikipedia content to train word embeddings with the method described above. The resulting embeddings were visualized in a 2D space and the distances between them were computed.

Dataset

The dataset was an early backup of Wikipedia content (circa 2006) containing roughly 100M tokens; this relatively small corpus was chosen due to computational resource constraints.

Background

A word embedding represents a word as a vector of fixed dimension whose components can be interpreted as capturing semantic or syntactic relationships. In their well-known paper, Mikolov et al. (2013) proposed the continuous bag-of-words (CBOW) model, which uses a window of surrounding tokens (the context) to predict the center token. As a result, words that appear with similar neighbors end up with similar weights in the neural network, i.e., similar embeddings. Words whose embeddings are close (e.g., under cosine similarity or Euclidean distance) tend to be related semantically or syntactically.
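To illustrate the similarity measure, here is a minimal NumPy sketch of cosine similarity between embedding vectors. The words and vector values below are toy assumptions, not outputs of the trained model.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional embeddings; real embeddings have many more dimensions.
king  = np.array([0.8, 0.1, 0.4, 0.3])
queen = np.array([0.7, 0.2, 0.5, 0.3])
apple = np.array([0.1, 0.9, 0.0, 0.2])

print(cosine_similarity(king, queen))  # high: the words share similar contexts
print(cosine_similarity(king, apple))  # lower: the words appear in different contexts
```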

Model Architecture

This project used a two-layer neural network to train word embeddings on the Wikipedia dataset described above, with a softmax activation producing a probability distribution over the vocabulary. The resulting embeddings were projected onto a 2D space for visualization.
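For concreteness, the following is a minimal NumPy sketch of one CBOW training step under this two-layer setup. The vocabulary size, embedding dimension, learning rate, and word ids are illustrative assumptions rather than the project's actual values.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cbow_step(W_in, W_out, context_ids, center_id, lr=0.05):
    """One SGD step: average the context embeddings, then predict the center word."""
    h = W_in[context_ids].mean(axis=0)   # hidden layer: mean of context vectors
    probs = softmax(h @ W_out)           # softmax over the whole vocabulary
    grad_out = probs.copy()
    grad_out[center_id] -= 1.0           # gradient of cross-entropy w.r.t. the logits
    grad_h = W_out @ grad_out            # backpropagate into the hidden layer
    W_out -= lr * np.outer(h, grad_out)  # update output weights in place
    W_in[context_ids] -= lr * grad_h / len(context_ids)  # update context embeddings
    return -np.log(probs[center_id])     # cross-entropy loss, useful for monitoring

# Toy setup: vocabulary of 5,000 words, 100-dimensional embeddings.
V, D = 5000, 100
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(V, D))    # rows are the word embeddings
W_out = rng.normal(scale=0.01, size=(D, V))   # output weights

# Example: context word ids [12, 7, 91, 3] predicting center word id 42.
loss = cbow_step(W_in, W_out, np.array([12, 7, 91, 3]), 42)
```

After many such steps over the corpus, the rows of `W_in` serve as the learned word embeddings.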

Reference

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In ICLR.
