Welcome to Generative Magic. Every few weeks, we highlight one project, research concept, or product in a way that is digestible for anyone without a Ph.D. in computer science.
Whether you're an academic researcher, industry practitioner, or simply fascinated with AI's creative potential, Generative Magic cuts through the hype and noise to bring you actionable ideas and advice.
Machine learning may seem complicated, but it's built on basic building blocks anyone can grasp.
The most fundamental of these is the embedding. Luckily, we think you can learn how embeddings work in the next ten minutes.
Embeddings sit at the core of the most exciting workflows and enable:
Fast semantic search: Finding results that match the meaning and intent behind user queries, not just the keywords.
Realistic image generation: Synthesizing new images by interpolating between embeddings of real images. Midjourney, anyone?
Grounding large language models in truth: Anchoring the output of gigantic language models like GPT-4 to real-world knowledge by using text embeddings to retrieve relevant facts.
Embeddings are the Swiss army knife of modern machine learning. Let’s get started.
What’s in an embedding?
An embedding is …(drumroll please)… just a short list of numbers (specifically floating point numbers, like 1.0, 4.2, 182.92, etc.) that represents a higher-dimensional concept. The list usually has no more than 1,024 entries.
In math, such a list is called a vector. The size of a vector is its dimension. An encoder takes an input from a high-dimensional space (such as a sentence, image, or snippet of a song) and compresses it into a lower-dimensional vector while retaining as much of its original meaning as possible.
In the machine learning world, the compressed vector is called an embedding or latent representation of the original input (sometimes just latent for short).
For example, let’s say we have a few pictures of animals that we use as an input to an encoder. The encoder calculates two numeric features, such as:
number of legs: animals can have 2 or 4 legs
color: we will assign white to 1, brown to 2, and grey to 3
Let’s just assume that our encoder is fully capable of detecting these two features, but nothing more.
Some examples:
Given an input llama, our imaginary encoder will return a vector [4, 1]. A llama has 4 legs and has white fur (hence a 1 in the second dimension).
Given an input kangaroo, our imaginary encoder will return a vector [2, 2].
What will we get for rabbit? [2, 1].
These vectors sit in two-dimensional space. Now here’s a slightly contrived question: is a rabbit more like a llama or a kangaroo?
You might have some preconceived notions that come to mind:
a kangaroo and a rabbit both have two legs
a llama and a rabbit both have white fur
a rabbit might be closer in size to a kangaroo than a llama
If we only care about the first two features we accounted for with our encoder, we can mathematically calculate the “correct” answer to this question by finding the distance between the vector for rabbit and the other two vectors.
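Using the Pythagorean theorem, the distance from rabbit [2, 1] to llama [4, 1] is √((4 − 2)² + (1 − 1)²) = 2, while the distance from rabbit [2, 1] to kangaroo [2, 2] is √((2 − 2)² + (1 − 2)²) = 1.
Based on the embeddings generated by our encoder, a rabbit is closer to a kangaroo than a llama [1]. We just computed our first semantic search based on a nearest neighbor approach.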
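If you'd rather see the distance calculation as code, here is a minimal sketch. The lookup table is a hypothetical stand-in for our imaginary encoder, and the function names are made up for this example.

```python
import math

# Hypothetical lookup table standing in for the imaginary encoder:
# each animal maps to [number of legs, color code].
TOY_EMBEDDINGS = {
    "llama": [4, 1],
    "kangaroo": [2, 2],
    "rabbit": [2, 1],
}


def euclidean_distance(a, b):
    """Straight-line (Pythagorean) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, candidates):
    """Return the candidate whose vector is closest to the query's vector."""
    return min(
        candidates,
        key=lambda name: euclidean_distance(
            TOY_EMBEDDINGS[query], TOY_EMBEDDINGS[name]
        ),
    )


print(euclidean_distance(TOY_EMBEDDINGS["rabbit"], TOY_EMBEDDINGS["llama"]))     # 2.0
print(euclidean_distance(TOY_EMBEDDINGS["rabbit"], TOY_EMBEDDINGS["kangaroo"]))  # 1.0
print(nearest_neighbor("rabbit", ["llama", "kangaroo"]))                         # kangaroo
```

This brute-force search compares the query against every candidate, which is fine for three animals but is exactly the step that approximate methods speed up on large collections.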
The same idea can be scaled to building a large-scale semantic recommendation engine. There are many approximate nearest neighbors algorithms, such as HNSW, that enable searching through millions of large embeddings.
Embeddings are simply an efficient way to represent complex information as a series of numbers. They capture relationships between concepts, images, documents, or whatever data you're working with. Similar items get vectors that are close together; dissimilar items get vectors that are far apart.
In practice, your encoder will likely compute features that are more nuanced than the simplistic ones we described above, and the features will generally be calculated by a neural network. Therefore, the first element in each embedding might not map cleanly to “how many legs does this animal have?”, but to something more abstract.
The dimensionality will also be much higher, in the hundreds or even thousands. But the core idea remains the same: condense your data into a numeric vector space where distance corresponds to similarity.
Word Embeddings
Word embeddings encode words as vectors so that similar words like "cat" and "dog" are close to each other. Models like Word2Vec and GloVe produce word embeddings by analyzing word usage in large datasets.
Researchers discovered that by training a neural network to predict words from their contexts, they obtained vector representations of words that captured their semantic meanings.
Word2Vec works by training a shallow neural network on a large corpus of text. The model is shown a word along with its context words within a sentence, and learns to predict the central word from its context.
In the process of training, the model assigns each word a vector representation such that words with similar contexts end up with similar vectors. In the end, Word2Vec generates high-quality word embeddings by exploiting word co-occurrence statistics (how often does X occur with Y in the same window of a few words?) in datasets of up to hundreds of billions of words.
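To make the idea of a context window concrete, here is a small sketch that counts co-occurrences directly. Keep in mind that this is only an illustration: Word2Vec learns from (word, context) pairs with a neural network instead of building an explicit count table, and the function name and window size here are our own choices.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count ordered (word, context_word) pairs that fall within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                counts[(word, tokens[j])] += 1
    return counts


tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)

print(counts[("cat", "sat")])  # 1: "sat" is right next to "cat"
print(counts[("cat", "mat")])  # 0: too far apart for a window of 1
```

Words that show up next to the same neighbors again and again are the ones that end up with similar vectors.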
Recall that you can do basic arithmetic with vectors. For the first time, embeddings enabled results like "king - man + woman = queen".
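Here is a toy version of that arithmetic. The vectors below are hand-made for illustration and are not real Word2Vec output: we simply pretend the first dimension means "royalty" and the second means "gender". Real embeddings have hundreds of dimensions and no such tidy labels.

```python
import math

# Hand-made 2-D vectors for illustration only (NOT real Word2Vec output).
# Dimension 0 loosely means "royalty", dimension 1 loosely means "gender".
TOY_WORD_VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [
        x - y + z
        for x, y, z in zip(
            TOY_WORD_VECTORS[a], TOY_WORD_VECTORS[b], TOY_WORD_VECTORS[c]
        )
    ]
    candidates = [w for w in TOY_WORD_VECTORS if w not in (a, b, c)]
    return max(
        candidates, key=lambda w: cosine_similarity(target, TOY_WORD_VECTORS[w])
    )


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is a common convention for analogy queries, since the nearest vector to the result is often one of the inputs.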
If you want to play with text embeddings, I highly recommend checking out SentenceTransformers, and specifically the all-MiniLM-L6-v2 model. Some of the encoder models available in this package are free, fast, and on par with OpenAI’s text-embedding-ada-002 model.
Another wonderful upshot of embeddings is that they can easily be visualized. You can use algorithms like t-SNE or PCA to project high-dimensional embeddings to 2D or 3D space. Here’s a quick project I worked on recently that used SentenceTransformers and t-SNE to visualize every Stanford course taught this past year.
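PCA is the simpler of the two projection methods mentioned above, so here is a dependency-free sketch of it. In practice you would reach for a library such as scikit-learn; the power-iteration approach, the function name, and the tiny 4-D dataset below are all assumptions made to keep the example self-contained.

```python
import math
import random


def pca_project(vectors, n_components=2, n_iter=200, seed=0):
    """Project vectors onto their top principal components via power iteration."""
    rng = random.Random(seed)
    n, d = len(vectors), len(vectors[0])
    # Center the data so the components describe variance around the mean.
    mean = [sum(v[k] for v in vectors) / n for k in range(d)]
    centered = [[v[k] - mean[k] for k in range(d)] for v in vectors]
    components = []
    for _ in range(n_components):
        w = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(n_iter):
            # Multiply w by the (unnormalized) covariance matrix: X^T (X w).
            scores = [sum(x[k] * w[k] for k in range(d)) for x in centered]
            w = [sum(scores[i] * centered[i][k] for i in range(n)) for k in range(d)]
            # Remove directions already found (deflation), then normalize.
            for c in components:
                dot = sum(w[k] * c[k] for k in range(d))
                w = [w[k] - dot * c[k] for k in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            w = [x / norm for x in w]
        components.append(w)
    return [[sum(x[k] * c[k] for k in range(d)) for c in components] for x in centered]


# Two made-up clusters of 4-D "embeddings".
points = [
    [0.0, 0.1, 0.0, 0.2],
    [0.1, 0.0, 0.2, 0.0],
    [0.2, 0.1, 0.1, 0.1],
    [5.0, 5.1, 5.0, 5.2],
    [5.1, 5.0, 5.2, 5.0],
    [5.2, 5.1, 5.1, 5.1],
]
projected = pca_project(points)
for original, flat in zip(points, projected):
    print(original, "->", [round(x, 2) for x in flat])
```

The two clusters land on opposite sides of the first axis, which is the kind of structure you hope to see when you plot real embeddings this way.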
Image Embeddings Power Diffusion Models
Take a couple minutes and play around on same.energy.
When you click on an image, the engine looks for similar images relatively quickly. Under the hood, same.energy uses (surprise) image embeddings, specifically OpenAI’s CLIP encoder.
CLIP was trained on a massive dataset of 400 million image-text pairs to predict if an image and text complemented each other. During training, CLIP learned a shared embedding space for natural images and English text through contrastive learning. In this space, embeddings for visually similar images and semantically similar text are close together, enabling zero-shot transfer between modalities.
Fun fact: CLIP stands for Contrastive Language-Image Pre-Training.
Contrastive learning works by maximizing agreement between embeddings of complementary image-text pairs while minimizing agreement between non-matching pairs. This pushes embeddings of related concepts together and unrelated concepts apart, allowing CLIP to discover rich relationships between embedded visual and linguistic concepts implicitly.
In other words: CLIP embeds text and images in the same embedding space, enabling you to compare images to words and vice-versa.
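For the curious, the contrastive objective described above can be sketched in plain Python. This is a simplified illustration on tiny hand-made vectors, not CLIP's actual training code: the function names, the fixed temperature value, and the 2-D embeddings are all assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


images = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[0.9, 0.1], [0.1, 0.9]]   # text i agrees with image i
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]   # text i agrees with the wrong image

print(contrastive_loss(images, matching_texts))  # small
print(contrastive_loss(images, shuffled_texts))  # much larger
```

Training nudges the encoders so that the loss on real image-text pairs looks like the first case: matching pairs score high, everything else in the batch scores low.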
The text "a colorful sunset over the ocean" may have an embedding near images of striking sunsets. Though CLIP was not explicitly trained on attributes like colors, objects, or scenes, its huge dataset and contrastive loss allowed it to learn these implicit relationships between embedded concepts.
The key insight of CLIP was that by tying image and text embeddings together, visual and linguistic information could mutually reinforce each other. CLIP embeddings capture a rich, multi-modal understanding of the world that generalizes beyond its training categories.
CLIP enabled new forms of image generation like text-driven image synthesis. By providing a shared space in which text and images can be compared, CLIP's representations could be used to guide image diffusion models like DALL·E 2 and Stable Diffusion as they generate new images from scratch.
Here is a simplified picture of how CLIP can guide image generation:
1. Start with a text prompt, like "a colorful sunset over the ocean".
2. Pass the text through CLIP's text encoder to get its embedding in the shared image-text space. This text embedding represents the attributes and semantics of the prompt. This is the target embedding.
3. Start with an initial random noise image and pass it through CLIP's vision encoder to get its embedding. This first embedding will be far from the target text embedding.
4. Apply a small random step that slightly perturbs the image, and get a new embedding from CLIP. Measure the distance between this new image embedding and the target text embedding. The diffusion step is randomly sampled but guided by the gradient towards the text embedding. Subtract some noise from the image as well to gradually refine details, much like a photo coming into focus.
5. Repeat step 4, selecting diffusion steps that minimize the distance between the current image embedding and target text embedding. This will "drift" the image towards resembling the prompt.
6. Once the image embedding and text embedding are close enough, stop. The current image is the output: CLIP only encodes and scores images, so there is nothing to decode. The result is an image that matches the semantics of the initial prompt.
This process of "diffusing" an image by iteratively applying small changes while measuring and optimizing embedding similarity allows models to generate realistic images from scratch using only text as a guide. CLIP enables diffusion models to hallucinate new images guided by nothing more than text and the visual knowledge within its embedding space.
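The loop above can be mimicked with a toy example. To be clear about the assumptions: the "encoder" here is a fixed random linear map standing in for CLIP's vision encoder, the "image" is just a list of 16 numbers, and the search keeps random steps that help instead of following a gradient or using a trained denoising network. It shows the shape of the idea, not how DALL·E 2 or Stable Diffusion are implemented.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_encoder(image, weights):
    """Hypothetical stand-in for a vision encoder: a fixed linear map."""
    return [sum(w * p for w, p in zip(row, image)) for row in weights]


def guided_generation(target_emb, weights, n_pixels, steps=2000, step_size=0.1, seed=0):
    """Random search that keeps only steps moving the image embedding toward the target."""
    rng = random.Random(seed)
    image = [rng.gauss(0, 1) for _ in range(n_pixels)]  # start from pure noise
    score = cosine_similarity(toy_encoder(image, weights), target_emb)
    history = [score]
    for _ in range(steps):
        candidate = [p + rng.gauss(0, step_size) for p in image]
        cand_score = cosine_similarity(toy_encoder(candidate, weights), target_emb)
        if cand_score > score:  # accept only steps that improve the match
            image, score = candidate, cand_score
        history.append(score)
    return image, history


setup_rng = random.Random(42)
weights = [[setup_rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]
target_emb = [setup_rng.gauss(0, 1) for _ in range(4)]  # pretend text embedding

image, history = guided_generation(target_emb, weights, n_pixels=16)
print("similarity at start:", round(history[0], 3))
print("similarity at end:  ", round(history[-1], 3))
```

The similarity can only go up over the run, which is the "drift" toward the prompt in miniature.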
While hallucination is great when you want to create new images based on a text prompt, there is one setting where hallucination is quite bad.
Preventing Hallucination in Large Language Models
Large language models like GPT-4 are prone to generating plausible but untrue or nonsensical text — they hallucinate. Hallucination occurs when these models have learned statistical patterns in language that do not reflect reality.
Vector retrieval is one approach to mitigating hallucination in these models. It works by using text embeddings to ground the model in factual knowledge.
When a prompt could lead the model to hallucinate, the prompt is embedded and the stored passages whose embeddings are closest to it are retrieved from a vector database, which may be hosted (like Pinecone) or run in memory (like Chroma). The model then generates a response with those passages in front of it, which makes a grounded, truthful answer more likely.
For example, if responding to "What is the largest city in the United States?", GPT-4 might hallucinate and say "San Francisco".
But with vector retrieval, we can embed the query and find its k-nearest neighbors in the embedding space. These data points can then be injected into the prompt.
If our embedded data has facts about the population of different cities in the U.S., we can expect GPT-4 to respond with "New York" instead. Vector retrieval anchors the model to the fact that New York is the most populous U.S. city.
The prompt may now look like:
1: New York City: Largest city in the US with a population of over 8.4 million people.
2: Los Angeles: Second-largest city in the US, known for its thriving entertainment industry.
3: Chicago: Third-largest city in the US and a major hub for finance, commerce, and transportation.
Based on these three retrieved facts, what is the largest city in the United States?
The key is that by starting generation from a prompt that already contains the relevant facts, the model is guided towards a reliable, grounded response. The retrieved knowledge acts as a primer, steering the model away from superficial language patterns that could lead to hallucination.
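Here is a minimal sketch of that retrieve-then-prompt flow. The big assumption is the embedding function: a bag-of-words counter stands in for a real text encoder so the example runs anywhere, and a plain Python list stands in for the vector database. The function names and the extra llama fact are made up for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real text encoder: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=3):
    """Return the k documents whose embeddings are most similar to the query's."""
    q = toy_embed(query)
    ranked = sorted(
        documents, key=lambda d: cosine_similarity(q, toy_embed(d)), reverse=True
    )
    return ranked[:k]


def build_prompt(query, documents, k=3):
    """Number the retrieved facts and append the user's question."""
    facts = retrieve(query, documents, k)
    lines = [f"{i}: {fact}" for i, fact in enumerate(facts, start=1)]
    lines.append(
        f"Based on these {len(facts)} retrieved facts, {query[0].lower()}{query[1:]}"
    )
    return "\n".join(lines)


documents = [
    "New York City: Largest city in the US with a population of over 8.4 million people.",
    "Los Angeles: Second-largest city in the US, known for its thriving entertainment industry.",
    "Chicago: Third-largest city in the US and a major hub for finance, commerce, and transportation.",
    "Llamas have four legs and are often covered in white fur.",
]

print(build_prompt("What is the largest city in the United States?", documents))
```

The llama fact shares almost no words with the question, so it is the one left out of the prompt. With a real encoder the ranking would reflect meaning instead of word overlap, but the flow is the same.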
Conclusion
Embeddings are a simple yet powerful idea that sits at the core of modern machine learning. Though just lists of numbers, embeddings provide models a flexible and universal representation that captures the richness of complex data. They enable algorithms to explore relationships and meanings that would otherwise be obscured by surface form alone.
We explored how embeddings power technologies like semantic search, image generation, and grounding language models in truth. They represent a breakthrough in machine reasoning: a means of encoding information that retains vital relationships in a form mathematical models can navigate.
This is only a short (but hopefully instructive) explanation, so here are some links to learn more:
Check out SentenceTransformers. It’s a fantastic Python package to get your feet wet with using text embeddings.
On the image embedding side, you can run some experiments on CLIP. The model is available on Hugging Face with a good bit of code to get started.
OpenAI has a great guide on using embeddings with some code as well.
I’ve given a workshop on grounding LLMs in truth a couple times now. You can find all the materials (including code) here.
Thanks for reading! If you enjoyed this piece, be sure to subscribe to get the next one.
If you have questions, ideas for a future post, or just want to chat, please reach out to vnshenoy@stanford.edu or team@generativemagic.com.
Twitter: @generativemagic
[1] You can also define the distance function to be something other than the Euclidean distance, such as Manhattan distance. In the literature, Euclidean and Manhattan distance are often referred to as the L2 and L1 distance or norm, respectively. In practice, cosine similarity is most commonly used for embeddings.
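As a quick sketch, here are the three measures side by side on the toy animal vectors from earlier. The function names are our own, and the vectors are the made-up [legs, color] embeddings, so treat the numbers as illustration only.

```python
import math


def euclidean_distance(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """L1 distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rabbit, llama, kangaroo = [2, 1], [4, 1], [2, 2]

for name, other in [("llama", llama), ("kangaroo", kangaroo)]:
    print(
        name,
        euclidean_distance(rabbit, other),
        manhattan_distance(rabbit, other),
        round(cosine_similarity(rabbit, other), 3),
    )
```

One thing worth noticing: Euclidean and Manhattan distance both put the rabbit closer to the kangaroo, but cosine similarity is higher for the llama, because it compares the direction of the vectors and ignores their length. The choice of measure can change which neighbor counts as nearest.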