
Understanding Embeddings: From Words to Vectors - A Beginner's Guide

Have you ever wondered how computers understand the meaning of words? How does a machine know that "cat" and "dog" are more similar than "cat" and "car"? The answer lies in a fascinating concept called embeddings - and once you understand them, you'll see why they're the foundation of modern AI.

What Exactly is an Embedding?

Think of an embedding as a way to give every word (or image, or any piece of data) a unique "address" in a mathematical space. Just like your house has coordinates that tell GPS where you live, each word gets coordinates that tell the computer where it "lives" in the world of meaning.

An embedding converts discrete objects like words into vectors - which are simply lists of numbers that represent a point in space. Instead of storing "cat" as the letters c-a-t, we store it as something like [0.7, 0.2, -0.3].

Why is this powerful? Because now we can measure how "close" words are to each other mathematically! Words with similar meanings end up with similar coordinates, clustering together in this mathematical space.

Understanding Vectors: Your GPS for Meaning

Before we dive deeper, let's understand what a vector really is. A vector is just a list of numbers that represents a position in space - like super-precise GPS coordinates.

Simple Examples:
  • 2D location: [40.7128, -74.0060] (latitude, longitude of New York)
  • RGB color: [255, 0, 0] (pure red)
  • Word embedding: [0.2, -0.1, 0.8] (the "coordinates" of a word's meaning)

The magic happens when we realize that the distance between vectors tells us how similar things are. If two word vectors are close together, those words have similar meanings. If they're far apart, the words are very different.
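To make this concrete, here is a tiny sketch using made-up 3-number "coordinates" for three words (the values are hypothetical, chosen only to illustrate the idea):

```python
import math

# Made-up 3-D "meaning coordinates" - illustrative values, not learned ones
cat = [0.7, 0.2, -0.3]
dog = [0.6, 0.3, -0.2]
car = [-0.5, 0.9, 0.8]

def euclidean(a, b):
    # Straight-line distance between two points in space
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(f"cat vs dog: {euclidean(cat, dog):.3f}")  # small distance = similar
print(f"cat vs car: {euclidean(cat, car):.3f}")  # large distance = different
```

Smaller distance means more similar meaning - that single idea is what everything below builds on.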

The Problem: How Do We Get From Words to Numbers?

Here's where it gets interesting. How do we transform text like "I love cats" into meaningful numbers? Let's walk through the process step by step.

Step 1: Building the Word-Document Matrix

First, we need to understand what words appear in what documents. Let's say we have this tiny collection of sentences:

"I love cats"
"I love dogs"  
"Cats are cute"
"Dogs are friendly"
"Pets are wonderful"

We create something called a word-document matrix - a giant table where:

  • Each row represents a document (sentence)
  • Each column represents a unique word
  • Each cell contains the count of how many times that word appears in that document

Here's what it looks like:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love cats",
    "I love dogs", 
    "Cats are cute",
    "Dogs are friendly",
    "Pets are wonderful"
]

# Create the word-document matrix (keep single-letter words like "i",
# which CountVectorizer's default token pattern would silently drop)
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
word_doc_matrix = vectorizer.fit_transform(corpus)

print("Matrix shape:", word_doc_matrix.shape)  # (5 documents, 9 unique words)
print("Vocabulary:", vectorizer.get_feature_names_out())

This creates a matrix that looks like:

           are cats cute dogs friendly i love pets wonderful
Doc0:      [0,  1,   0,   0,   0,      1, 1,   0,   0     ]  # "I love cats"
Doc1:      [0,  0,   0,   1,   0,      1, 1,   0,   0     ]  # "I love dogs"
Doc2:      [1,  1,   1,   0,   0,      0, 0,   0,   0     ]  # "Cats are cute"
Doc3:      [1,  0,   0,   1,   1,      0, 0,   0,   0     ]  # "Dogs are friendly"
Doc4:      [1,  0,   0,   0,   0,      0, 0,   1,   1     ]  # "Pets are wonderful"

The Sparsity Problem

Notice something? Most values are zero! In our tiny example, 30 of the 45 cells (about 67%) are zeros. In real applications with vocabularies of thousands of words, it gets much worse - matrices that are over 99% zeros are common.

This creates two problems:
1. Inefficiency: Storing and processing mostly-zero matrices is wasteful
2. No relationships: "cats" and "dogs" look completely unrelated in this representation
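You can verify the sparsity yourself with a hand-rolled count - a pure-Python sketch of what CountVectorizer computes, where lowercasing and whitespace splitting stand in for its real tokenizer:

```python
corpus = [
    "I love cats",
    "I love dogs",
    "Cats are cute",
    "Dogs are friendly",
    "Pets are wonderful",
]

# Vocabulary: every unique lowercased word, in sorted order
vocab = sorted({w for doc in corpus for w in doc.lower().split()})

# Word-document matrix: one row per document, one count per vocabulary word
matrix = [[doc.lower().split().count(w) for w in vocab] for doc in corpus]

zeros = sum(row.count(0) for row in matrix)
total = len(matrix) * len(vocab)
print(f"{len(vocab)} words, {zeros} of {total} cells are zero ({zeros / total:.1%})")
# -> 9 words, 30 of 45 cells are zero (66.7%)
```

Each three-word sentence fills only 3 of its 9 cells - and adding more varied sentences makes the zero fraction grow, not shrink.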

Step 2: The Magic of fit_transform()

The vectorizer.fit_transform(corpus) line does something clever - it combines two operations:

  1. fit(): Learns the vocabulary from your text
    • Scans all documents to find unique words
    • Creates a mapping like {"cats": 1, "dogs": 3, "love": 6}
  2. transform(): Converts text into numerical vectors
    • Uses the learned vocabulary to convert each document into a vector
    • "I love cats" becomes [0, 1, 0, 0, 0, 1, 1, 0, 0]

This separation is crucial in machine learning - you learn the "rules" from training data, then apply those same rules to new data.
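To see that separation in action, here is a minimal toy vectorizer - a stripped-down sketch, not the real sklearn class - that makes the two phases explicit:

```python
class TinyVectorizer:
    """A stripped-down sketch of CountVectorizer's fit/transform split."""

    def fit(self, corpus):
        # Learn the vocabulary: each unique word gets a column index
        words = sorted({w for doc in corpus for w in doc.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, corpus):
        # Apply the learned rules; words never seen during fit() are ignored
        rows = []
        for doc in corpus:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

vec = TinyVectorizer().fit([
    "I love cats", "I love dogs", "Cats are cute",
    "Dogs are friendly", "Pets are wonderful",
])
print(vec.transform(["I love cats"])[0])      # [0, 1, 0, 0, 0, 1, 1, 0, 0]
print(vec.transform(["I love hamsters"])[0])  # "hamsters" was never learned
```

Note how "hamsters" simply vanishes from the second vector: transform() can only express what fit() learned, which is exactly why you apply the same fitted vectorizer to training and new data alike.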

Creating Dense Embeddings with SVD

Now comes the really cool part. We use a technique called Singular Value Decomposition (SVD) to compress our sparse, high-dimensional matrix into dense, low-dimensional embeddings.

from sklearn.decomposition import TruncatedSVD

# Create embeddings using SVD
svd = TruncatedSVD(n_components=3)  # Reduce to 3 dimensions
word_embeddings = svd.fit_transform(word_doc_matrix.T)  # Note the .T!

Why the Transpose (.T)?

The .T (transpose) is crucial - it flips our matrix so that:

Before transposing:
  • Each row = a document
  • Each column = a word

After transposing:
  • Each row = a word
  • Each column = a document

We want word embeddings, so we need words as rows! After transposing, SVD will create embeddings for each word.
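A quick shape check makes the flip visible (using a zero matrix as a stand-in for the real counts, since only the shape matters here):

```python
import numpy as np

# Stand-in for the real count matrix: 5 documents x 9 unique words
word_doc_matrix = np.zeros((5, 9))

print(word_doc_matrix.shape)    # (5, 9) - rows are documents
print(word_doc_matrix.T.shape)  # (9, 5) - rows are words, ready for SVD
```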

What SVD Actually Does

SVD is like a master detective that finds hidden patterns in your data. It looks at the word-document matrix and asks: "What are the most important patterns here?"

Before SVD (sparse, 5 dimensions - one per document):
  cats: [1, 0, 1, 0, 0]
  dogs: [0, 1, 0, 1, 0]

After SVD (dense, 3 dimensions):
  cats: [0.71, 0.23, -0.31]
  dogs: [0.69, 0.19, -0.28]

SVD discovered that "cats" and "dogs" are similar because they appear in similar contexts - both appear with "I love" and both are described with positive adjectives.
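You can reproduce this pattern-finding with plain NumPy instead of scikit-learn's TruncatedSVD. The matrix below is the earlier table with words as rows; the exact numbers SVD produces will differ from the illustrative ones above:

```python
import numpy as np

words = ["are", "cats", "cute", "dogs", "friendly", "i", "love", "pets", "wonderful"]

# Word-document matrix with words as rows (the transposed table from earlier)
M = np.array([
    [0, 0, 1, 1, 1],  # are
    [1, 0, 1, 0, 0],  # cats
    [0, 0, 1, 0, 0],  # cute
    [0, 1, 0, 1, 0],  # dogs
    [0, 0, 0, 1, 0],  # friendly
    [1, 1, 0, 0, 0],  # i
    [1, 1, 0, 0, 0],  # love
    [0, 0, 0, 0, 1],  # pets
    [0, 0, 0, 0, 1],  # wonderful
], dtype=float)

# Full SVD factors M = U @ diag(s) @ Vt; keeping only the 3 largest
# singular values keeps the 3 strongest patterns in the data
U, s, Vt = np.linalg.svd(M, full_matrices=False)
embeddings = U[:, :3] * s[:3]  # one dense 3-D vector per word

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cats = embeddings[words.index("cats")]
dogs = embeddings[words.index("dogs")]
print("embedding shape:", embeddings.shape)  # (9, 3)
print(f"cats vs dogs similarity: {cosine(cats, dogs):.3f}")
```

A built-in sanity check: "i" and "love" have identical rows in M, so their embeddings come out identical (cosine similarity 1.0) - the compression preserves co-occurrence structure.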

The Complete Code Example

Here's the full pipeline that creates embeddings:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Our text corpus
corpus = [
    "I love cats",
    "I love dogs", 
    "Cats are cute",
    "Dogs are friendly",
    "Pets are wonderful"
]

# Step 1: Create word-document matrix (keep single-letter words like "i",
# which CountVectorizer's default token pattern would silently drop)
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
word_doc_matrix = vectorizer.fit_transform(corpus)

print("Original matrix shape:", word_doc_matrix.shape)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Step 2: Create embeddings using SVD
svd = TruncatedSVD(n_components=3)
word_embeddings = svd.fit_transform(word_doc_matrix.T)

print("Embedding shape:", word_embeddings.shape)

# Step 3: Show the embeddings
words = vectorizer.get_feature_names_out()
print("\nWord embeddings:")
for word, embedding in zip(words, word_embeddings):
    print(f"{word:9}: [{embedding[0]:6.3f}, {embedding[1]:6.3f}, {embedding[2]:6.3f}]")

# Step 4: Calculate similarities
similarities = cosine_similarity(word_embeddings)

print("\nWord similarities:")
cats_idx = list(words).index('cats')
dogs_idx = list(words).index('dogs')
print(f"'cats' vs 'dogs': {similarities[cats_idx][dogs_idx]:.3f}")

Why This Matters: The Real-World Impact

These simple concepts power some of the most advanced AI systems today:

  • Search Engines: When you search for "puppy," Google knows to also show results about "dog" because their embeddings are similar.
  • Recommendation Systems: Netflix suggests movies by finding films with similar embeddings to what you've watched.
  • Language Models: ChatGPT and similar AI systems use embeddings to understand that "The cat sat on the mat" and "A feline rested on the rug" mean similar things.
  • Translation: Google Translate uses embeddings to find equivalent words and phrases across languages.

From Simple to Sophisticated

Our example used a tiny vocabulary and simple SVD, but the same principles apply to modern systems:

  • Word2Vec and GloVe create embeddings from massive text corpora
  • BERT and GPT create contextual embeddings that change based on surrounding words
  • Image embeddings work similarly but start with pixel values instead of word counts

Key Takeaways

  1. Embeddings convert discrete objects into continuous vectors - turning words into lists of numbers
  2. Similar objects get similar vectors - words with similar meanings cluster together
  3. The process involves learning patterns from data - we analyze how words co-occur to understand relationships
  4. Dimensionality reduction is crucial - we compress sparse, high-dimensional data into dense, low-dimensional representations
  5. Distance in vector space represents similarity - closer vectors mean more similar meanings

The Bottom Line

Embeddings are essentially a way to teach computers about relationships and similarities. By converting words, images, or any data into mathematical vectors, we enable machines to understand that "cat" and "dog" are more similar than "cat" and "car" - not because we programmed this knowledge, but because the machine learned it from patterns in the data.

This fundamental concept has revolutionized artificial intelligence, making it possible for machines to understand context, meaning, and relationships in ways that seemed impossible just a few decades ago. And it all starts with the simple idea of representing things as points in mathematical space.

Next time you use a search engine, get a recommendation, or chat with an AI, remember - it's all just vectors finding their neighbors in high-dimensional space!
