Word Embeddings – Word2Vec, GloVe & FastText Explained in Machine Learning
Traditional NLP techniques like Bag of Words and TF-IDF treat words as independent symbols. However, language is contextual and semantic. Word embeddings revolutionized NLP by representing words as dense numerical vectors that capture meaning and relationships.
This tutorial explains how Word2Vec, GloVe, and FastText transformed text representation in modern NLP systems.
1. Why Do We Need Word Embeddings?
Bag of Words has limitations:
- No semantic understanding
- High-dimensional, sparse vectors
- No similarity relationships
Example:
King and Queen should be related. Car and Apple should not.
BoW cannot capture this relationship. Embeddings can.
2. What Is a Word Embedding?
A word embedding is a dense vector representation of a word in continuous space.
Example:
King → [0.25, -0.18, 0.93, ...]
Queen → [0.27, -0.15, 0.90, ...]
Similar words have vectors close to each other in multi-dimensional space.
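"Close to each other" is usually measured with cosine similarity. Here is a minimal NumPy sketch using the hypothetical 3-dimensional vectors from the example above (real embeddings typically have 100–300 dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings, for illustration only
king  = np.array([0.25, -0.18, 0.93])
queen = np.array([0.27, -0.15, 0.90])
car   = np.array([-0.80, 0.40, 0.10])

print(cosine_similarity(king, queen))  # close to 1.0 (similar words)
print(cosine_similarity(king, car))    # much lower (unrelated words)
```

A similarity near 1 means the vectors point in nearly the same direction; values near 0 or below indicate unrelated words.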
3. The Idea of Distributional Semantics
The fundamental idea:
"Words that appear in similar contexts have similar meanings."
If "doctor" and "physician" appear in similar sentences, their embeddings should be similar.
4. Word2Vec – Learning Word Representations
Developed by Google in 2013, Word2Vec introduced efficient neural models for embedding learning.
Two architectures:
- CBOW (Continuous Bag of Words)
- Skip-Gram
5. CBOW (Continuous Bag of Words)
Predicts the target word from surrounding context.
Example:
Context: "The cat sat on the ___"
Target: mat
CBOW averages context embeddings and predicts the missing word.
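The averaging-and-predicting step can be sketched in a few lines of NumPy. This toy model uses a made-up 5-word vocabulary and untrained random weights, so it only illustrates the data flow, not a learned model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary (assumption)
word2id = {w: i for i, w in enumerate(vocab)}
dim = 8                                       # embedding dimension

# Input and output embedding matrices; random here, learned in practice
W_in = rng.normal(size=(len(vocab), dim))
W_out = rng.normal(size=(len(vocab), dim))

def cbow_predict(context_words):
    """Average the context embeddings, then softmax over the vocabulary."""
    ctx = np.mean([W_in[word2id[w]] for w in context_words], axis=0)
    logits = W_out @ ctx
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

probs = cbow_predict(["the", "cat", "sat", "on", "the"])
print(probs.shape)  # one probability per vocabulary word
```

Training adjusts W_in and W_out so the probability of the true target word ("mat") increases; W_in then becomes the embedding table.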
6. Skip-Gram Model
Predicts surrounding context given a target word.
Target: cat
Predict: The, sat, on
Skip-Gram represents rare words better than CBOW, at the cost of slower training.
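Skip-Gram turns a sentence into (target, context) training pairs by sliding a window over the tokens. A small self-contained sketch (the window size of 2 is an illustrative choice):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for Skip-Gram."""
    pairs = []
    for i, target in enumerate(tokens):
        # every word within `window` positions of the target is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split())
print(pairs[:4])
```

Each pair becomes one training example: given the target word, the model learns to assign high probability to its observed context words.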
7. Training Mechanism
Word2Vec trains a shallow neural network:
- Input layer
- Hidden embedding layer
- Output layer (softmax)
Optimization uses negative sampling to improve efficiency.
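Negative sampling replaces the expensive full-vocabulary softmax with a handful of binary classifications: push the true (target, context) pair together, and push the target away from a few randomly sampled "negative" words. A minimal sketch of the loss (the vectors here are random placeholders for untrained embeddings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(target, context, negatives):
    """Negative-sampling loss for one (target, context) pair.

    Maximizes sigmoid(t . c) for the observed pair and sigmoid(-t . n)
    for each of the k sampled negative words n."""
    pos = -np.log(sigmoid(target @ context))
    neg = -sum(np.log(sigmoid(-(target @ n))) for n in negatives)
    return float(pos + neg)

rng = np.random.default_rng(42)
t, c = rng.normal(size=8), rng.normal(size=8)
negatives = [rng.normal(size=8) for _ in range(5)]  # k = 5 negatives
loss = neg_sampling_loss(t, c, negatives)
print(loss)
```

With, say, 5 negatives per pair, each update touches 6 output vectors instead of the entire vocabulary, which is what makes Word2Vec training fast.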
8. Vector Arithmetic Magic
Embeddings capture relationships:
King - Man + Woman ≈ Queen
This property shows embeddings encode semantic structure.
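The analogy can be reproduced as a nearest-neighbor search in vector space. The toy vectors below are hand-picked so the relationship holds exactly; real embeddings only satisfy it approximately:

```python
import numpy as np

# Hypothetical toy vectors chosen so the analogy holds exactly
vecs = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
}

query = vecs["king"] - vecs["man"] + vecs["woman"]
# Nearest word to the query vector by Euclidean distance
best = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - query))
print(best)  # queen
```

Intuitively, the "royalty" direction and the "gender" direction are encoded as roughly independent components of the vectors.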
9. Limitations of Word2Vec
- Ignores global corpus statistics
- Single embedding per word (no context awareness)
- Struggles with rare words
10. GloVe – Global Vectors for Word Representation
Developed at Stanford, GloVe combines:
- Global word co-occurrence statistics
- Matrix factorization techniques
Unlike Word2Vec, GloVe leverages entire corpus statistics rather than local context windows only.
11. GloVe Training Intuition
It builds a co-occurrence matrix:
X_ij = the number of times word j appears in the context of word i.
GloVe then learns embeddings whose dot products approximate log X_ij, using a weighted least-squares objective.
This captures global semantic relationships.
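Building the co-occurrence matrix is simple window counting. A minimal sketch using a Counter keyed by word pairs (window size 2 is an illustrative choice; GloVe also applies distance-based weighting, omitted here):

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """Count X_ij: how often word j appears within `window` positions of word i."""
    X = Counter()
    for i, wi in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                X[(wi, tokens[j])] += 1
    return X

X = cooccurrence("the cat sat on the mat".split())
print(X[("cat", "sat")])  # 1
```

These counts are the only input GloVe needs; the raw corpus can be discarded once X is built, which is what "global statistics" means in practice.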
12. FastText – Subword Embeddings
Developed by Facebook AI, FastText improves embeddings by considering character n-grams.
Example (character 3-grams with boundary markers):
running → <ru, run, unn, nni, nin, ing, ng>
Instead of learning one vector per word, FastText learns vectors for subword units.
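Extracting those subword units is straightforward. A sketch of FastText-style n-gram extraction with `<` and `>` boundary markers (FastText defaults to n from 3 to 6; the word itself is also kept as its own unit):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams with boundary markers < and >."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)  # the full word is included as a special unit
    return grams

grams = char_ngrams("run", 3, 3)
print(sorted(grams))
```

A word's embedding is the sum of its n-gram vectors, so an unseen word like "runing" still gets a sensible vector from the n-grams it shares with known words.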
13. Why FastText Matters
- Handles rare words better
- Works well for morphologically rich languages
- Handles misspellings
14. Comparing Word2Vec, GloVe & FastText
- Word2Vec → Local context prediction
- GloVe → Global statistical factorization
- FastText → Subword modeling
15. Embeddings in Modern NLP
These static embeddings were foundational. Today:
- Contextual embeddings (BERT, GPT)
- Transformer-based embeddings
- Sentence embeddings
But Word2Vec and GloVe remain important educational building blocks.
16. Enterprise Use Cases
- Search ranking
- Recommendation engines
- Semantic similarity
- Clustering documents
- Chatbot understanding
17. Final Summary
Word embeddings marked a major shift in NLP by moving from symbolic word counts to semantic vector representations. Word2Vec introduced predictive embeddings, GloVe leveraged global corpus statistics, and FastText enhanced embeddings using subword information. These techniques form the backbone of modern NLP systems and paved the way for contextual transformer-based language models.

