Google announced Gemini's release this week, and despite the bad buzz around the tech giant's semi-faked demo, I'm incredibly excited about the dawn of the multimodal era (kicked off, among others, by GPT-4 Vision) in the world of foundation models, meaning AI models capable of performing tasks across a very broad spectrum.
I can't wait for Gemini to become available so I can form my own opinion about its capabilities! Whether Google's communications were botched isn't the point. All major tech companies, research institutions, and the open-source community are engaged in a frantic race toward AGI (Artificial General Intelligence), which terrifies some and thrills others (I'm in the second camp). The new Holy Grail is multimodality.
So whether the model's capabilities are a smashing success or not, the approach of building a model that is multimodal by architecture is the precursor to extraordinary inventions, in my opinion. As I kept telling my colleagues this week, I feel like a contemporary of Edison or Tesla, who, at the time of the first electrical innovations, discovered with wide eyes a technological miracle that was about to change the face of the world.
That said, you might be getting tired of the term "multimodal" repeated in every paragraph since the beginning of this article, so let's start there!
A modality, for an AI model, simply refers to the type of data passed as input to the model. For example, GPT-4 was initially unimodal, meaning it was "only" capable of understanding the text entered by the user in order to generate its output. Since then, a "Vision" component has been added and the model has become multimodal, meaning data of multiple types (here, text and images) can be fed to it: that's what enables the photo descriptions GPT-4 produces today.
That's why, as you may have noticed in the title of this article, I don't refer to LLMs (Large Language Models) but rather to LMMs (Large Multimodal Models). What makes me think that the release of Gemini could be a landmark date in AI's evolution is that Google is already offering a multimodal embedding feature accessible programmatically through GCP APIs.
Let's start with the term embedding: it refers to a way of representing data in vector form. If you're like me and you slept through math class in high school, a vector is simply a way of representing a magnitude and a direction in a space of n dimensions.
For example, let's place ourselves on a two-dimensional plane: any pair of coordinates (x, y) can be represented as a column vector, that is, a small matrix with a single column:
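$$\begin{pmatrix} x \\ y \end{pmatrix}$$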
This allows us to apply, for example, transformations to geometric shapes using their vector representation.
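Here is, for instance, what the simplest possible transformation, one that changes nothing at all, looks like in this notation:

$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix}$$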
In the transformation above, which actually applies none (identity transformation), the first column of the matrix answers the question "what transformation do we want to apply to the horizontal axis?" and the second column answers "what transformation do we want to apply to the vertical axis?" ... But, Yacine, what does this have to do with LMMs, you might ask?
Well, put very simply, this is how an LLM sees the world: every piece of data, of any nature, is transformed into number vectors across hundreds of dimensions (not just 2D or 3D). For example, the BERT model represents text in 768 dimensions! So for each word, image, etc. (referring to the nature of the modality), we can establish a magnitude and a direction for the item in question.
The high number of dimensions in these vectors allows capturing nuances between concepts: this is what establishes semantic relationships in a textual context. TL;DR: vectors derived from text data (embeddings) allow for the fine encoding of semantic information, a unit of meaning.
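To see those dimensions with your own eyes, here is a minimal sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (my choices for illustration, not something used in the original example):

```python
# Minimal sketch: BERT maps each token of a sentence to a 768-dimensional vector.
# Assumes the transformers and torch packages are installed.
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I walked 3 kilometers northeast", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch, number_of_tokens, 768) -> one 768-dimensional vector per token
print(outputs.last_hidden_state.shape)
```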
If I had to represent the idea "I walked 3 kilometers northeast" as a vector of my movement, the magnitude would be the distance traveled and the direction would be "northeast," that is, an angle. As you'll see, this matters.
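To make that concrete, if we take east as the horizontal axis and north as the vertical axis, "northeast" is a 45° angle, and the movement can be written as:

$$\vec{v} = 3\,(\cos 45^{\circ},\ \sin 45^{\circ}) \approx (2.12,\ 2.12), \qquad \lVert \vec{v} \rVert = 3$$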
We're now going to show an example of generating embeddings from textual content. What you need to remember: data of any kind, when passed to an AI, is transformed into vectors of numbers.
Below are code snippets you can run at home to test the concept. In the provided example, I test Google's AI APIs (Vertex AI) to generate embeddings, that is, vector representations of data (here, text).
In this example, I send three sentences, in their context, to Vertex AI's embedding generation endpoint: the first two are related, and you'll notice that the last one has absolutely nothing to do with them. That's intentional.
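The original snippet isn't reproduced here, but a minimal sketch with the Vertex AI Python SDK could look like this (the project ID, model version, and sentences are illustrative assumptions):

```python
# Minimal sketch, assuming the google-cloud-aiplatform package is installed and a
# GCP project is configured. The sentences are illustrative, not the original ones.
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-gcp-project", location="us-central1")
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

sentences = [
    "To be or not to be, that is the question.",                           # Shakespeare
    "All the world's a stage, and all the men and women merely players.",  # theater
    "The invoice for the server migration is due next Tuesday.",           # unrelated
]

# One embedding (a long vector of floats) per input sentence
embeddings = [e.values for e in model.get_embeddings(sentences)]
for vector in embeddings:
    print(vector[:5])  # first five values of each embedding
```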
I then display the first five values of these embeddings for each input text; this gives us three comma-separated arrays of floating-point numbers.
This is where the "magic" of LLMs (and now LMMs) happens: a similarity calculation is performed between these vectors to determine whether they are semantically close (remember, they started as words). For this, mathematicians and their professors (to whom I apologize for the lack of attention I gave them) have provided us with a toolkit that includes:
Euclidean distance
cosine similarity
This really isn't "rocket science," as one of my AI Squad colleagues likes to say, and I encourage you to brush up on your math, if this is still fuzzy, on Khan Academy's website, which is very educational in this regard.
These two tools take different approaches to assessing similarity between two vectors: the first (Euclidean) focuses on the distance between points, while the other focuses on the angle (the direction, versus the magnitude) between those points.
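For two vectors u and v of n dimensions, the two measures are:

$$d(u, v) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2} \qquad \text{and} \qquad \cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$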
In practice, in machine learning, cosine similarity is preferred because vector magnitude matters less in this calculation than it does in the Euclidean distance calculation: whether the text is longer or shorter will have less impact on the semantic relationships that can be drawn from it, which directly affects the performance of the model in question!
I tried to demonstrate this by continuing the example started above, computing both the Euclidean distance and the cosine similarity between the embeddings.
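Again, a minimal sketch rather than the original snippet, assuming numpy and scipy are available and that `embeddings` is the list of three vectors obtained above:

```python
# Minimal sketch: compare the three embeddings with both measures.
import numpy as np
from scipy.spatial.distance import cosine, euclidean

e1, e2, e3 = (np.array(v) for v in embeddings)

# Euclidean distance: the smaller the number, the closer the vectors
print("euclidean(1, 2):", euclidean(e1, e2))
print("euclidean(1, 3):", euclidean(e1, e3))

# Cosine similarity: the closer to 1, the more similar the vectors
# (scipy's `cosine` returns the cosine *distance*, hence the "1 -")
print("cosine_sim(1, 2):", 1 - cosine(e1, e2))
print("cosine_sim(1, 3):", 1 - cosine(e1, e3))
```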
In the example using Euclidean distance, we can see that the distance between the first two texts is shorter than between the first and third texts. In the cosine example, it's the larger number that indicates closeness: cosine similarity returns a value between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions.
There's nothing magical about it: it's the models' training data (an immense corpus for the latest models), the way it was labeled, the way the training was conducted, and so on, that established these relationships.
This inspires admiration, but it also calls for a brief aside: we need real caution about the potentially negative impact of biases during this crucial training phase. For example, a xenophobic AI could have been trained, through its input data, to systematically associate a negative value (in our understanding) with any mention of community A or B. These are risks that every AI user and solution designer must be aware of.
Such reinforcement of existing conceptions already occurs naturally in human societies, but its industrialized, automatic, and scalable nature is, I believe, one of our century's great challenges.
In the somewhat less serious case of our Shakespearean example, the embeddings related to theater and Shakespeare return the expected result because the training corpus of the model (here PaLM 2) likely included Shakespeare's texts and plays. The model is therefore able, based on the text given as input, to locate a similarity between what you're telling it and these domains.
Until now, the most popular commercial application of embeddings has been OpenAI's, with ChatGPT Plus on the consumer side and API endpoints for generating embeddings on the developer side.
But currently, even though you can obtain embeddings programmatically through OpenAI's API, you can't create multimodal embeddings, as OpenAI only accepts text.
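For reference, here is what that text-only call looks like, as a minimal sketch assuming the openai Python package (v1 client), an API key in the environment, and one of OpenAI's text embedding models:

```python
# Minimal sketch of OpenAI's text-only embedding endpoint. Assumes the `openai`
# package (v1 client) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="To be or not to be, that is the question.",
)
print(response.data[0].embedding[:5])  # first five values; text in, vector out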
Where Google returns to the forefront of the AI race, failed demo or not, is that you can already create this type of embedding with Vertex AI.
https://www.youtube.com/watch?v=2M43pIOo77Y
Note: this video has nothing to do with Gemini, the multimodal model. It's "simply" a feature that makes it very easy to compute similarity between data of completely different types. And that is absolutely mind-blowing!
The ability to establish similarity relationships between vectors derived from text or images was already helping accomplish several tasks, whose execution, once primarily algorithmic, has been profoundly changed since the industrialization of neural networks:
semantic search, the most well-known case => search engines
recommendation systems => "I watched content x, so I'm recommended content y based on its semantic similarity"
question/answer systems => a response is generated by locating the domain of knowledge contained in the user's question
targeted advertising
image recognition
etc.
Being able to now send both an image and text to an embedding API endpoint and retrieve these vectors of numbers makes it possible, for instance, to search for a product using just a photo!
How does it work?
you send the information "product A," with its description, etc. to generate a classic text embedding
you also send one or more images associated with this product, which also come back as embeddings (see the sketch just after this list)
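Here is a minimal sketch of those two steps with the Vertex AI SDK (the model name, file path, and product description are illustrative assumptions):

```python
# Minimal sketch, assuming the google-cloud-aiplatform package. The multimodal
# endpoint returns a text embedding and an image embedding in the same vector space.
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="your-gcp-project", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

embeddings = model.get_embeddings(
    image=Image.load_from_file("product_a.jpg"),
    contextual_text="Product A: lightweight road bike, carbon frame, 54 cm",
)
print(len(embeddings.text_embedding))   # dimensionality of the text embedding
print(len(embeddings.image_embedding))  # same dimensionality for the image embedding
```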
Where previously you had to manage the relationship between two completely different embedding spaces yourself, the technical achievement lies in the fact that these similarities are now natively supported between data of completely different types. This significantly improves the performance of establishing similarity links between text and image, image and image, etc. And it foreshadows dramatically increased abstraction capabilities from models, whether Gemini or others, in their understanding of the world.
In the product search example, this means there's no longer a need to explicitly associate an image with a label (the number of labels being necessarily limited), then query a database to establish a similarity relationship between the image's label (not the image itself) and what you typed in the search bar. We're at a completely different level of nuance where by sending an image as input, you can get much more precise and relevant results across two modalities at once: text and image. This means you don't have to worry about labeling images when creating your product dataset, for example, and can just focus on "which image goes with which product?"
A concrete example: I run an e-commerce store and I want my users to easily find an exact product, but also the similar products that best match their query. In 2023, I would have built my dataset with the product title and description on one hand, and a set of images on the other. I would have been forced to design or fine-tune two different AIs to enable flexible search without a specific identifier on my platform:
an LLM trained on or with access to my product database
an image classification model trained to recognize bikes, helmets, etc.
The result wouldn't always have been precise because the user could upload a photo of bike x and get bike y on the first page of results, which is not at all the right product. The whole thing would depend heavily on the data labeling work (and you don't always have hundreds of bikes of the same model to feed to an AI to differentiate them during training).
In 2024, to accomplish the same task:
you send to an embedding endpoint (Vertex AI and soon other providers) your text content about the product along with one or more images
you store these embeddings, for each product, in a vector database
... and that's it (a minimal search sketch follows below)
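To make the search step concrete, here is a minimal sketch in which a plain numpy array stands in for the vector database; the product names and vectors are made up, and a real setup would of course use a dedicated vector store:

```python
# Minimal sketch: cosine similarity search over stored multimodal embeddings.
# A numpy array stands in for the vector database; products and vectors are made up.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One stored embedding per product, as returned by the multimodal endpoint
product_ids = ["bike-x", "bike-y", "helmet-z"]
product_vectors = np.random.rand(3, 1408)  # placeholder vectors

# Embedding of the user's query (a photo or a free-text description)
query_vector = np.random.rand(1408)

scores = [cosine_similarity(query_vector, v) for v in product_vectors]
print("Closest product:", product_ids[int(np.argmax(scores))])
```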
Now, when users no longer have the product reference, for example, they can send an image and the exact product will be found instantly in the store. Similarly, if they want a product but want to describe it in their own words, they just type an approximate text in the search bar and the results will be much closer to reality than those generated by necessarily limited data labeling.
The scalability and speed of establishing relationships between structurally different data are dramatically improved. This is what the precision of encoding data into vectors of real numbers enables! And this is just a preview of what can be done... I invite you to watch the Google Next video included in this article to explore the possibilities offered by the "embed everything" approach used by a company invited to the event to discuss their use case.
Finally, the ability to:
store multimodal embeddings
quickly search them for similarity
... will make it possible to store complete user experiences in databases over several years: hyper-personalization of a product's UX is only just beginning!
There you go. In this article, I attempted an accessible explanation of multimodality and embeddings in neural networks. Don't hesitate to suggest edits to this text or flag any points that seem unclear in the comments!
CTO of the scale-up LAMALO, Yacine is a fullstack developer who can't sit still: JavaScript, Node.js, Python, LLMs, voice UX... Always keeping watch, he turns the latest innovations into concrete solutions!