Following Google's announcement of Gemini this week, and the negative buzz around the tech giant's partially staged demo, I'm nonetheless incredibly excited about the dawn of the multimodal era (kicked off by GPT-4 Vision, among others) in the world of foundation models, meaning AI models capable of performing tasks across a very broad spectrum.
I can't wait for Gemini to become available so I can form my own opinion of its capabilities! Whether Google's PR was botched isn't the point; all major tech companies, researchers, and the open-source community are engaged in a frantic race toward AGI (Artificial General Intelligence), which frightens some and thrills others (I'm in the second camp). The new Holy Grail is multimodality.
So, whether the model's capabilities are a resounding success or not, the approach of creating a model that is multimodal in its architecture is, in my view, a precursor to extraordinary inventions. As I kept telling my colleagues this week, I feel like a contemporary of Edison or Tesla, who during the era of early electrical innovations discovered, with wide-eyed amazement, the dawn of a technological miracle that would change the face of the world.
That said, you might be getting tired of the term "multimodal," repeated in every paragraph since the beginning of this article, so let's start there!
A modality, for an AI model, simply refers to the type of data fed as input to the model. For example, GPT-4 was initially unimodal, meaning it was "only" capable of understanding text entered by the user in order to generate its output. Since then, a "Vision" component has been added and the model has become multimodal: data of several types (here, text and images) can now be fed to it. This is what enables GPT-4's photo descriptions today.
That's why, as you'll notice in the title of this article, I'm not talking about LLM (Large Language Model) but rather LMM (Large Multimodal Model). What makes me think that Gemini's release could well be a landmark date in AI's evolution is that Google is already offering a multimodal embeddings feature accessible programmatically via GCP APIs.
Let's start with the term embedding: it refers to a way of representing data in vector form. If you're like me and you slept through math class in high school, a vector is simply a way of representing points and the direction they take in an n-dimensional space.
For example, let's consider a vector on a two-dimensional plane: any pair of coordinates (x, y) can be written as a column vector, and a transformation is simply a matrix multiplied by that vector.
This allows you to apply, for example, transformations to geometric shapes using their vector representation.
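Here is a minimal numpy sketch of that idea (the point (3, 2) and the matrix are just illustrative values):

```python
import numpy as np

# A point (x, y) written as a column vector.
point = np.array([3, 2])

# A transformation matrix: here the identity matrix, which leaves the point unchanged.
transformation = np.array([[1, 0],
                           [0, 1]])

# Applying the transformation is just a matrix-vector multiplication.
print(transformation @ point)  # [3 2] -> same point, nothing moved
```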
In the transformation above, which actually changes nothing (it's the identity transformation), the 1st column of the matrix answers the question "what transformation do we want to apply to the horizontal axis?" and the second answers "what transformation do we want to apply to the vertical axis?" ... But, Yacine, what does this have to do with LMMs, you might ask?
Well, in very broad strokes, this is how an LLM sees the world: every piece of data, of any kind, is transformed into vectors of numbers across hundreds of dimensions (not just 2D or 3D). For example, the BERT model represents text in 768 dimensions! So for each word, image, etc. (depending on the modality), we can establish a magnitude and a direction for the item in question.
The high number of dimensions in these vectors makes it possible to capture nuances between concepts: this is what enables semantic relationships in a textual context. TL;DR: vectors derived from text data (embeddings) enable fine-grained encoding of semantic information; in other words, a unit of meaning.
If I had to represent the idea "I walked 3 kilometers northeast" as a vector of my movement, the magnitude would be the distance traveled and the direction would be "northeast," i.e., an angle. As you'll see, this matters.
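To make the analogy concrete, here's a tiny numpy sketch using the numbers from that example:

```python
import numpy as np

# "I walked 3 kilometers northeast": equal displacement east and north.
displacement = np.array([3 * np.cos(np.pi / 4), 3 * np.sin(np.pi / 4)])

magnitude = np.linalg.norm(displacement)                              # 3.0 km traveled
direction = np.degrees(np.arctan2(displacement[1], displacement[0]))  # 45 degrees, i.e. northeast

print(f"magnitude: {magnitude:.2f} km, direction: {direction:.0f}°")
```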
We're now going to show an example of generating embeddings from textual content. What you need to remember: data, of any kind, fed to an AI, is transformed into vectors of real numbers.
Below are code snippets that you can run at home to test the concept. In the provided example, I'm testing Google's AI APIs (Vertex AI) to generate embeddings, i.e., vector representations of data (in this case, text).
In this example, I send three sentences (the first two related to each other) to the Vertex AI embeddings generation endpoint; you'll notice that the last one has absolutely nothing to do with the first two, and that's intentional 😉.
I then display the first five values of the embedding obtained for each input text.
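Here is a minimal sketch of that call with the Vertex AI Python SDK; the project ID and the three sentences are placeholders of my own, and the model name may have evolved since:

```python
import vertexai
from vertexai.language_models import TextEmbeddingModel

# Replace with your own GCP project and region.
vertexai.init(project="my-gcp-project", location="us-central1")

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

sentences = [
    "To be, or not to be, that is the question.",           # theater / Shakespeare
    "Shakespeare wrote most of his plays for the stage.",    # related to the first sentence
    "My bike has a flat tire again.",                         # intentionally unrelated
]

embeddings = model.get_embeddings(sentences)

for sentence, embedding in zip(sentences, embeddings):
    vector = embedding.values
    print(f"{len(vector)} dimensions; first five values: {vector[:5]}")
```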
This is where the "magic" of LLMs (and now LMMs) happens: a similarity calculation is performed between these vectors to determine whether they are semantically close (remember, they were originally words). For this, mathematics and its teachers (to whom I apologize for the little attention I paid them) have provided us with a toolkit that notably includes:
Euclidean distance
cosine similarity
It really isn't "rocket science," as one of my AI squad colleagues likes to say, and I encourage you to brush up on your math, if it's still fuzzy for you, on Khan Academy's website, which is excellent for this purpose.
These two tools take different approaches to assessing similarity between two vectors: the first (Euclidean) focuses on the distance between the points, while the other focuses on the angle between the vectors (their direction, rather than their magnitude).
In practice, in machine learning, cosine similarity is preferred because the vector's magnitude carries less weight in this calculation than in the Euclidean distance => whether the text is longer or shorter has less impact on the semantic relationships that can be drawn from it, which directly affects the performance of the model in question!
I tried to demonstrate this by continuing the example started above, computing the Euclidean distance and then the cosine similarity between the embeddings we just generated.
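A minimal sketch, reusing the embeddings variable from the previous snippet and assuming scipy is installed:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

# The three vectors returned by the previous snippet.
vec_1, vec_2, vec_3 = (np.array(e.values) for e in embeddings)

# Euclidean distance: straight-line distance between the two points (smaller = closer).
print("euclidean(1, 2):", euclidean(vec_1, vec_2))
print("euclidean(1, 3):", euclidean(vec_1, vec_3))

# Cosine similarity: cosine of the angle between the two vectors
# (1 = same direction, 0 = orthogonal, -1 = opposite directions).
# scipy's `cosine` returns the cosine *distance* (1 - similarity), hence the subtraction.
print("cosine similarity(1, 2):", 1 - cosine(vec_1, vec_2))
print("cosine similarity(1, 3):", 1 - cosine(vec_1, vec_3))
```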
In the example using Euclidean distance, we can see that the distance between the first two texts is smaller than between the 1st and 3rd texts. With cosine similarity, the comparison reads the other way around: it returns a number between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions; so the more similar pair is the one with the higher score.
There's nothing magical about it; it's the training data of the models (an immense corpus for the latest models) and the way they were labeled, how the training was conducted, etc. that established these relationships.
This inspires admiration and, a brief aside, it also compels us to be genuinely wary of the potentially negative impact that biases can have during this crucial training phase. For example, a xenophobic AI could have been trained, through its input data, to systematically associate a negative value (in our understanding) whenever community A or B is mentioned. These are risks that every AI user and solution designer must be aware of.
These phenomena of reinforcing preconceptions already exist naturally in human societies, but the industrialized, automatic, and scalable nature of their reinforcement is, I believe, one of the challenges of our century.
In the somewhat less serious case of our Shakespearean example, the embeddings related to theater and Shakespeare return the expected result because the underlying model (here PaLM 2) most likely saw Shakespeare's texts and plays in its training corpus. It is therefore able, based on the input text, to identify a similarity between what you're telling it and these domains.
Until now, the best-known commercial application of embeddings has been OpenAI's, with ChatGPT Plus and its API endpoints for generating embeddings.
However, at present, even though you can obtain embeddings programmatically with the OpenAI API, you cannot create multimodal embeddings, since OpenAI's embeddings endpoint only accepts text.
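For comparison, a minimal text-only sketch with the OpenAI Python SDK (the model name is given as an example, and it assumes OPENAI_API_KEY is set in your environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="To be, or not to be, that is the question.",
)

vector = response.data[0].embedding
print(len(vector), vector[:5])  # a vector of 1536 real numbers for this model
```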
Where Google reclaims center stage in the AI race, failed demo or not, is that you can already create this type of embeddings with Vertex AI.
https://www.youtube.com/watch?v=2M43pIOo77Y
Note that this video has nothing to do with Gemini, the multimodal model; it's "simply" a feature that makes it very easy to calculate similarity between data of completely different types. And that is absolutely mind-blowing!
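A minimal sketch of the multimodal embeddings call with the Vertex AI Python SDK; the project ID, image file, and text are placeholders, and model names may change over time:

```python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Replace with your own GCP project and region.
vertexai.init(project="my-gcp-project", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# One call, two modalities: the image and the text are projected into the same vector space.
embeddings = model.get_embeddings(
    image=Image.load_from_file("road_bike.jpg"),
    contextual_text="Lightweight carbon road bike, size M",
)

print(len(embeddings.image_embedding))  # image vector
print(len(embeddings.text_embedding))   # text vector, in the same space
```

Because both vectors live in the same space, you can compare an image directly to a piece of text using the same cosine similarity we used earlier.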
Being able to establish similarity relationships between vectors derived from textual data, or from images, was already helping to accomplish several tasks whose execution, once primarily algorithmic, has been profoundly transformed since the industrialization of neural networks:
semantic search, the best-known use case => search engines
recommendation systems => "I watched content x, so I'm suggested content y based on its semantic similarity"
question/answer systems => a response is generated by identifying the knowledge domain contained within the user's question
targeted advertising
image recognition
etc.
Being able to now send both an image and text, for example, to an embeddings API endpoint to retrieve these number vectors makes it possible, for instance, to search for a product using just a photo 🤯!
How does it work?
you send the information "product A," with its description, etc. to generate a standard text embedding
you also send one or more images associated with that product, which also return embeddings
Where previously you had to manage the relationship between two completely different embedding spaces yourself, the technical achievement lies in the fact that these similarities are now natively supported between data of completely different types. This significantly increases the performance of establishing similarity links between text and image, image and image, etc. And this foreshadows vastly enhanced abstraction capabilities from models, whether Gemini or others, in their understanding of the world.
In the product search example, this means there's no longer a need to explicitly associate an image with a label, the number of labels being inherently limited, and then query a database to establish a similarity relationship between the image's label (and not the image itself) and what you typed in the search bar. We find ourselves at an entirely different level of nuance: by sending an image as input, we can obtain, with much finer granularity, more relevant results across two modalities at once, text and image. This eliminates the need to worry about image labeling when creating your product dataset, for example, letting you focus solely on "which image goes with which product?"
A concrete example: I run an e-commerce store and I want my users to be able to very easily find a product exactly, but also find similar products that match as closely as possible with their query. In 2023, I would have assembled my dataset, with the product title, its description on one hand; and a set of images on the other. I would have had to design, or fine-tune, two different AIs to enable the flexibility of a search without a precise identifier on my platform:
an LLM trained on or with access to my product database
an image classification model trained to recognize bikes, helmets, etc.
The result would not always have been accurate because the user might upload a photo of bike x and get bike y on the first page of results, which isn't at all the same product, all heavily dependent on the data labeling work (and you don't always have hundreds of bikes of the same model to feed as input to an AI to differentiate them during training).
In 2024, to accomplish the same task:
you send your text content about the product along with one or several images to an embeddings endpoint (Vertex AI and soon other providers)
you store these embeddings, for each product, in a vector database
... and that's it (a minimal sketch follows just below)
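Here's what that could look like end to end, again with the Vertex AI Python SDK; the product catalog, file names, and the in-memory dictionary standing in for a real vector database are all illustrative assumptions:

```python
import numpy as np
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Assumes vertexai.init(...) was called as in the earlier snippet.
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# 1) Indexing time: embed each product's description and photo.
#    A plain dict stands in for a proper vector database here.
catalog = {
    "bike-x": ("Lightweight carbon road bike, size M", "bike_x.jpg"),
    "bike-y": ("Sturdy city bike with luggage rack", "bike_y.jpg"),
}
index = {}
for product_id, (description, photo) in catalog.items():
    emb = model.get_embeddings(
        image=Image.load_from_file(photo),
        contextual_text=description,
    )
    index[product_id] = np.array(emb.image_embedding)

# 2) Query time: the user uploads a photo instead of typing a product reference.
query = model.get_embeddings(image=Image.load_from_file("user_photo.jpg"))
query_vec = np.array(query.image_embedding)

# 3) Rank products by cosine similarity between the query and each stored vector.
def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

results = sorted(index.items(), key=lambda kv: cosine_sim(query_vec, kv[1]), reverse=True)
print("closest match:", results[0][0])
```

The same index can answer text queries too: embed the user's sentence (as contextual_text) and compare it to the stored vectors in exactly the same way, since text and images share the same space.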
Now, when the user no longer has the product reference, for example, they'll send an image and the exact product will be instantly found in the store. Similarly, if they want a product but want to describe it in their own words, they just type an approximate text in the search bar and the results will be much closer to reality than those produced by necessarily limited data labeling.
The scalability and speed of establishing relationships between structurally different data are dramatically improved. This is what the precision of encoding data into vectors of real numbers enables! And this is just a glimpse of what can be done... I encourage you to watch the Google Next video included in this article to explore the possibilities offered by the "embed everything" approach used by a company invited to the event to discuss their use case.
Finally, the ability to:
store multimodal embeddings
quickly search them for similarity
... will make it possible to store complete user experiences in databases over many years: the hyper-personalization of product UX is only just beginning!
There you have it. In this article, I've attempted a simplified explanation of multimodality and embeddings in neural networks. Don't hesitate to suggest edits to this text or point out anything that seems unclear in the comments!