Embeddings are the knowledge that lives in the “brain” of an LLM system. What an embedding model does is simple: it takes a piece of text as input and produces a vector for that text.
E.g., it takes the text “Hi, my name is Fotis” and produces a vector like [0.2, 0.1, …, -0.3].
All these vectors are then stored in a multidimensional space.
In the example below from OpenAI, all the green dots are vectors (embeddings) for “athlete”. It uses 3 dimensions to help us visualize them in space.
Each vector corresponds to a piece of text.
Embeddings help an application understand how different pieces of information are grouped together, and how similar two pieces of information are.
Of course, we are not limited to text. We can do the same trick with videos, images, speech and other objects. This is how a search engine finds similar images, or how Shazam identifies a song from an audio clip.
How to Create Embeddings
OpenAI provides a few models that turn text into embeddings, e.g. the ada model. You make an API call where you send the text, and the model replies with the “vector”, i.e. the embedding for it.
Where do We Store Embeddings?
Since embeddings are vectors, we need to store them in a vector database. A vector database stores similar information close together. E.g., the vectors for “Fotis is a human”, “Fotis is Greek” and “Fotis is an engineer” will be stored close to each other in the database.
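“Close to each other” is usually measured with cosine similarity. Here is a minimal sketch of the idea with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the numbers below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    # Result is close to 1.0 for similar vectors, lower for dissimilar ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings:
human = [0.2, 0.1, -0.3]      # "Fotis is a human"
greek = [0.25, 0.08, -0.28]   # "Fotis is Greek"
weather = [-0.4, 0.9, 0.1]    # "It is raining today"

print(cosine_similarity(human, greek))    # high: the two texts are related
print(cosine_similarity(human, weather))  # low: the texts are unrelated
```

This is the comparison that a vector database performs for you at scale.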
Unfortunately, OpenAI doesn’t provide a way to store the embeddings in a vector database. You can store them in Cosmos DB on Azure. Another option is the SingleStore vector database, which also runs on Azure (https://www.singlestore.com/built-in-vector-database/).
Embeddings Use Cases
Text Search
This is similar to what a search engine does. You enter a query, the query becomes an embedding (vector), and then this vector is compared to all the other vectors stored in the database. The text search functionality tries to find the stored vectors which are closest to your query vector.
Code Search
It’s like text search, but for code. If a developer asks “where is the code that adds up two numbers?”, the question becomes an embedding (vector), and the search tries to find the closest/most similar vectors (embeddings), which in this case represent code snippets.
Text Similarity
In this use case we measure how similar two different pieces of text are. Similar pieces of text are stored close to each other. Here is an example (source: )
What is the Max Size of text for which I can create an embedding?
The ada model in OpenAI supports up to 8191 input tokens. In the Greek language, one token is roughly one character, so you can create an embedding for a text of up to ~8k characters. If an average word is 4 characters, that is about 2000 words, or around 8 Word document pages of text.
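The estimate above can be worked out directly, using the article’s assumptions of 1 token per character (for Greek) and 4 characters per word:

```python
# Rough capacity estimate for the ada embedding model.
MAX_TOKENS = 8191
CHARS_PER_TOKEN = 1   # assumption for Greek text, per the article
CHARS_PER_WORD = 4    # assumed average word length

max_chars = MAX_TOKENS * CHARS_PER_TOKEN
max_words = max_chars // CHARS_PER_WORD
print(max_chars, max_words)  # ~8k characters, ~2000 words
```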
How to Create your First Embedding
Let’s create our first embedding.
Step 1: Go to Postman and create an account.
Step 2: Create a new workspace, go to “Authorization” and paste your OpenAI API key. Then select “POST”.
Step 3: The POST request URL, from the OpenAI documentation, is: https://api.openai.com/v1/embeddings
Paste it in the URL field and select “Body” –> “raw” and “JSON”.
There are only two parameters you need to use:
- input: here you put your text, e.g. “Hi, I am Fotis”
- model: here you specify which embedding model to use, e.g. text-embedding-3-small
Here is the request body:
{
  "input": "Hi, I am Fotis",
  "model": "text-embedding-3-small"
}
And as you can see in the response, we just created our first embedding for this text.
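If you prefer code over Postman, the same request can be built in Python. This is a sketch: the payload matches the body above, the API key is a placeholder, and the actual network call (e.g. with the `requests` library) is left as a comment so you can plug in your own key:

```python
import json

# Build the same request body we sent from Postman.
payload = {
    "input": "Hi, I am Fotis",
    "model": "text-embedding-3-small",
}
body = json.dumps(payload)
print(body)

# To actually send it, POST to the embeddings endpoint with your key:
#
#   import requests
#   headers = {"Authorization": "Bearer YOUR_OPENAI_API_KEY",  # placeholder
#              "Content-Type": "application/json"}
#   response = requests.post("https://api.openai.com/v1/embeddings",
#                            headers=headers, data=body)
#   vector = response.json()["data"][0]["embedding"]
```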
How Much Do Embeddings Cost?
The OpenAI embedding pricing for ada v2 is $0.00010 / 1K tokens. Let’s see an example:
- Let’s suppose you have 100 docs of 50 pages each in Greek.
- Each page is 350 words, with an average of 4 letters per word.
- In Greek, each letter (character) is one token. Capital letters are 2 tokens, but let’s not take this under consideration.
- This means that you will have a total of 100 * 50 * 350 * 4 = 7M tokens.
- Your cost to create embeddings for all the text will be: (0.0001 / 1000) * 7,000,000 = 0.7 USD
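The calculation above, spelled out step by step:

```python
# Reproduce the article's cost estimate for embedding 100 Greek documents.
docs = 100
pages_per_doc = 50
words_per_page = 350
letters_per_word = 4          # 1 letter ~= 1 token in Greek, per the article

total_tokens = docs * pages_per_doc * words_per_page * letters_per_word
price_per_1k_tokens = 0.0001  # ada v2 pricing, USD

cost = (total_tokens / 1000) * price_per_1k_tokens
print(total_tokens, cost)  # 7000000 tokens, 0.7 USD
```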
Azure OpenAI on Your Data
There is a new service from Microsoft, called “Azure OpenAI on your Data”. You can learn more about it here.
What it does, in simple terms: you give it your files (Word docs/PDFs/text etc.), and it automatically breaks the text into chunks, creates the embeddings and stores them in a vector database. It integrates the search functionality, and it can also launch a web application, e.g. if you want a chatbot that answers questions about your data.
So, it hides a lot of the complexity of using embeddings behind out-of-the-box functionality.
This allows you to easily create an application based on your own data and fine-tune it using the options provided. You will have to experiment with the different types of search it provides (semantic/vector/hybrid) to find out which one works best for you.
Obviously, you don’t have control over how the embeddings are made. E.g., you may want to create one embedding per 1000 words, or split your text into embeddings based on chapters. In that case you will have to create the embeddings based on your own rationale.
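If you do want that control, the custom chunking is straightforward to do yourself. Here is a minimal sketch of splitting a document into fixed-size word chunks before embedding each one (the chunk size of 1000 words is just the example from the text, not a recommendation):

```python
def chunk_by_words(text, words_per_chunk=1000):
    # Split a long text into chunks of at most `words_per_chunk` words,
    # so each chunk can be sent for embedding separately.
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

sample = "word " * 2500  # a fake 2500-word document
chunks = chunk_by_words(sample, words_per_chunk=1000)
print(len(chunks))  # 3 chunks: 1000 + 1000 + 500 words
```

Each chunk would then go through the embeddings API call shown earlier, and the resulting vectors would be stored in your vector database of choice.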