Using Multimodal LLMs
Multimodal Large Language Models (LLMs) can process and generate responses from multiple input types, including text and images. aiXplain provides an interface for interacting with multimodal models through JSON-formatted inputs. This guide demonstrates how to use multimodal LLMs to process images alongside textual prompts.
Below is an example of how to retrieve a multimodal model and send a query that includes an image.
Example 1: Basic Image Query
The following example asks the Gemini 1.5 Pro model to describe the contents of an image.
from aixplain.factories import ModelFactory

# Retrieve the multimodal model by its aiXplain model ID
model = ModelFactory.get("66ef42e56eb56335ca302621")

# Define the input with a text prompt and an image URL
result = model.run({
    "data": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://4.img-dpreview.com/files/p/E~TS590x0~articles/3925134721/0266554465.jpeg"
                    }
                }
            ]
        }
    ]
})
print(result)
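The run call returns a response object rather than a bare string. A minimal sketch of extracting the generated text, assuming the response exposes it via a data attribute (field names may vary across SDK versions, so check your version's response schema):

# Minimal sketch: assumes the generated text is available on `result.data`;
# the exact attribute name may differ across aiXplain SDK versions.
answer = result.data
print(answer)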
Example 2: Multi-turn Conversation
You can also hold a multi-turn conversation with the model by including the previous turns as context in each request.
result = model.run({"data": [{
"role": "user",
"content": [
{"type": "text", "text": "What’s in this image?"},
{
"type": "image_url",
"image_url": {
"url": "https://4.img-dpreview.com/files/p/E~TS590x0~articles/3925134721/0266554465.jpeg"
}
}
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "It’s a bird"},
]
},
{
"role": "user",
"content": [
{"type": "text", "text": "What kind of bird"}
]
}]
})
print(result)
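In practice you will usually build the conversation list incrementally, appending the model's reply before asking the follow-up question. A minimal sketch of that loop, assuming the reply text is available on the response's data attribute (an assumption, not a documented guarantee) and using the placeholder IMAGE_URL_HERE:

# Hedged sketch: maintain the message list yourself and append each turn.
# `result.data` as the reply text is an assumption; check your SDK's schema.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "IMAGE_URL_HERE"}}
    ]
}]

result = model.run({"data": messages})

# Record the assistant's reply so the next question has context
messages.append({
    "role": "assistant",
    "content": [{"type": "text", "text": result.data}]
})
messages.append({
    "role": "user",
    "content": [{"type": "text", "text": "What kind of bird?"}]
})

result = model.run({"data": messages})
print(result)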
JSON Structure Explanation
The input JSON follows a structured format:
- role: Defines whether the sender is a user or an assistant.
- content: A list of content elements, which can be:
  - text: Standard textual input.
  - image_url: A dictionary containing a url key pointing to an image.
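If you build these payloads often, small helper functions keep the structure consistent. The helpers below are hypothetical conveniences for illustration, not part of the aiXplain SDK:

# Hypothetical helpers for building messages in the structure described
# above; these are not part of the aiXplain SDK.
def text_part(text: str) -> dict:
    return {"type": "text", "text": text}

def image_part(url: str) -> dict:
    return {"type": "image_url", "image_url": {"url": url}}

def message(role: str, *parts: dict) -> dict:
    return {"role": role, "content": list(parts)}

payload = {"data": [
    message("user", text_part("What's in this image?"), image_part("IMAGE_URL_HERE"))
]}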
Sample JSON Input
{
  "data": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "IMAGE_URL_HERE"}}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "It's a cat"}
      ]
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What breed of cat?"}
      ]
    }
  ]
}
Multimodal LLMs on aiXplain let you combine text and images in a single request. Try different inputs and expand the possibilities of AI-driven applications!