
Using Multimodal LLMs

Multimodal Large Language Models (LLMs) can process and generate responses based on multiple types of input, including text and images. aiXplain provides an interface for interacting with multimodal models using JSON-formatted inputs. This guide demonstrates how to use multimodal LLMs to process images alongside textual prompts.

Below is an example of how to retrieve a multimodal model and send a query that includes an image.

Example 1: Basic Image Query

The following example demonstrates how to ask the model about the contents of an image using the Gemini 1.5 Pro model.

from aixplain.factories import ModelFactory

# Retrieve the model
model = ModelFactory.get("66ef42e56eb56335ca302621")

# Define the input with text and image
result = model.run({
    "data": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://4.img-dpreview.com/files/p/E~TS590x0~articles/3925134721/0266554465.jpeg"
                    }
                }
            ]
        }
    ]
})

print(result)

Example 2: Multi-turn Conversation

You can also engage in a multi-turn conversation with the model, providing additional context based on previous interactions.

result = model.run({
    "data": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://4.img-dpreview.com/files/p/E~TS590x0~articles/3925134721/0266554465.jpeg"
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "It's a bird"}
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What kind of bird?"}
            ]
        }
    ]
})

print(result)
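A convenient pattern for multi-turn use is to keep a running history list and append each turn before the next call. The sketch below only builds the payload in the message format shown above; the `user_turn` and `assistant_turn` helper names are illustrative, not part of the aiXplain SDK, and the `model.run` calls are commented out because they require the model retrieved in the examples above.

```python
def user_turn(text, image_url=None):
    """Build a user message, optionally attaching an image."""
    content = [{"type": "text", "text": text}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"role": "user", "content": content}


def assistant_turn(text):
    """Record the model's reply so it becomes context for the next query."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


# Start the conversation with a text + image question.
history = [user_turn("What's in this image?",
                     "https://4.img-dpreview.com/files/p/E~TS590x0~articles/3925134721/0266554465.jpeg")]
# result = model.run({"data": history})

# Append the model's reply and a follow-up question, then query again.
history.append(assistant_turn("It's a bird"))
history.append(user_turn("What kind of bird?"))
# result = model.run({"data": history})

print(len(history))  # 3 messages ready to send
```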

JSON Structure Explanation

The input JSON follows a structured format:

  • role: Defines whether the sender is a user or assistant.
  • content: A list of content elements, which can be:
    • text: Standard textual input.
    • image_url: A dictionary containing a url key pointing to an image.

Sample JSON Input

{
    "data": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "IMAGE_URL_HERE"}}
            ]
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "It's a cat"}
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What breed of cat?"}
            ]
        }
    ]
}
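If you assemble a payload like this by hand, you can round-trip it through Python's standard `json` module to catch structural mistakes (unserializable values, mismatched brackets in a template) before sending it. This is plain standard-library Python, not an aiXplain feature:

```python
import json

# The same structure as the sample input above.
payload = {
    "data": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "IMAGE_URL_HERE"}}
            ]
        }
    ]
}

# Serialize and parse again; any non-JSON value would raise here.
restored = json.loads(json.dumps(payload))
assert restored == payload
print(restored["data"][0]["content"][0]["text"])  # What's in this image?
```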

Using multimodal LLMs with aiXplain allows seamless integration of text and images for a more interactive AI experience. Try different inputs and expand the possibilities of AI-driven applications!