How to benchmark a model

This guide explains how to benchmark models using the aiXplain SDK. You'll learn how to select datasets, models, and metrics, and how to create and run a benchmark for a given task. For more information, refer to this guide.

Generic Example (Template)

from aixplain.factories import (
    BenchmarkFactory,
    DatasetFactory,
    MetricFactory,
    ModelFactory,
)

datasets = DatasetFactory.list("...")["results"]  # specify Data ID
metrics = MetricFactory.list("...")["results"]  # specify Metric ID
models = ModelFactory.list("...")["results"]  # specify Model ID
benchmark = BenchmarkFactory.create(
    "benchmark_name", dataset_list=datasets, model_list=models, metric_list=metrics
)

benchmark_job = benchmark.start()
status = benchmark_job.check_status()

results_path = benchmark_job.download_results_as_csv()
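
Because download_results_as_csv returns a path to a CSV file, the results can be inspected with the standard library. The following is a minimal sketch, not part of the SDK: the helper load_results and the column names Model and Score are hypothetical, and a small stand-in file replaces the real download so the example runs offline.

```python
import csv
import tempfile
from pathlib import Path


def load_results(results_path):
    """Read a results CSV into a list of dicts, one dict per row."""
    with open(results_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Stand-in for the file returned by download_results_as_csv().
# The column names are hypothetical and chosen only for illustration.
sample_path = Path(tempfile.mkdtemp()) / "benchmark_results.csv"
sample_path.write_text("Model,Score\nmodel-a,0.81\nmodel-b,0.77\n", encoding="utf-8")

rows = load_results(sample_path)

# Pick the row with the highest (hypothetical) score.
best = max(rows, key=lambda row: float(row["Score"]))
print(best["Model"])
```

With a real job, pass results_path to load_results and adjust the column names to match the header of the downloaded file.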

Benchmark Examples

The following examples apply benchmarking to text generation, translation, and speech recognition. Each example uses a different approach to selecting models, datasets, and metrics.

Imports

from aixplain.factories import BenchmarkFactory, DatasetFactory, MetricFactory, ModelFactory
from aixplain.enums import Function, Language, Supplier  # for search

Select Models, Datasets & Metrics

Note: Datasets are currently private, so you must first onboard the datasets used in the examples below (or similar ones) to follow along. See our guide on How to upload a dataset.

The examples below cover three tasks in this order: text generation, translation, and speech recognition.

Text generation

Models

# Choose 'one or more' models
model_list = ModelFactory.list(function=Function.TEXT_GENERATION)["results"]
for model in model_list:
  print(model.__dict__)

model_gpt4 = ModelFactory.get("6414bd3cd09663e9225130e8")
model_llama3_70b = ModelFactory.get("6626a3a8c8f1d089790cf5a2")
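
Printing __dict__ for every asset can be hard to scan. As a hedged alternative, the sketch below prints only an ID and a name per asset. It assumes each asset exposes id and name attributes (this guide uses models[0].id and metric.name elsewhere); the helper summarize is hypothetical, and SimpleNamespace objects stand in for real SDK results.

```python
from types import SimpleNamespace


def summarize(assets):
    """Return (id, name) pairs for a list of assets, using None when missing."""
    return [(getattr(a, "id", None), getattr(a, "name", None)) for a in assets]


# Stand-ins for SDK objects. Real results would come from
# ModelFactory.list(function=Function.TEXT_GENERATION)["results"].
fake_models = [
    SimpleNamespace(id="model-1", name="Example Model A"),
    SimpleNamespace(id="model-2", name="Example Model B"),
]

for asset_id, name in summarize(fake_models):
    print(f"{asset_id}  {name}")
```

The same helper works for datasets and metrics as long as they expose the same two attributes.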

Datasets

# Choose 'exactly one' dataset
dataset_list = DatasetFactory.list(function=Function.TEXT_GENERATION)["results"]
for dataset in dataset_list:
  print(dataset.__dict__)

dataset_pubmed_test = DatasetFactory.get("65e92213763f9f09ec1cf529")
dataset_pubmed_test.__dict__

Metrics

# Choose 'one or more' metrics
metric_list = MetricFactory.list()["results"]
for metric in metric_list:
  print(metric.__dict__)

metric_wer = MetricFactory.get("646d371caec2a04700e61945")
metric_bleu = MetricFactory.get("639874ab506c987b1ae1acc6")

Translation

Models

# Choose 'one or more' models
model_list = ModelFactory.list(
  function=Function.TRANSLATION,
  source_languages=Language.English,
  target_languages=Language.Spanish,
  suppliers=[Supplier.AWS, Supplier.GOOGLE, Supplier.MICROSOFT],
)['results']

for model in model_list:
  print(model.__dict__)

# Fetch the selected models by ID
models = [
    ModelFactory.get(model_id)
    for model_id in [
        "60ddefd98d38c51c58860ad6",
        "617048f83a3ab842ec0804c2",
        "61b097551efecf30109d3316",
    ]
]

Datasets

# Choose 'exactly one' dataset
dataset_list = DatasetFactory.list("opus")["results"]
for dataset in dataset_list:
  print(dataset.__dict__)

dataset_opus100_en_es_200 = DatasetFactory.get("651886916fd4ecd622045542")
dataset_opus100_en_es_200.__dict__

Metrics

# Choose 'one or more' metrics that are supported
metrics_list = MetricFactory.list()['results']

metrics = [metric for metric in metrics_list if "COMET" in metric.name]

metrics
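
The list comprehension above matches "COMET" exactly, including case. A minimal sketch of a case-insensitive variant follows. The helper filter_by_name is hypothetical, it assumes each metric has a name attribute as in the code above, and the metric names used here are made up for illustration.

```python
from types import SimpleNamespace


def filter_by_name(assets, keyword):
    """Keep assets whose name contains the keyword, ignoring case."""
    needle = keyword.casefold()
    return [asset for asset in assets if needle in asset.name.casefold()]


# Stand-ins for MetricFactory.list()["results"]; the names are illustrative only.
fake_metrics = [
    SimpleNamespace(id="metric-1", name="Comet Example"),
    SimpleNamespace(id="metric-2", name="BLEU Example"),
    SimpleNamespace(id="metric-3", name="COMET Example QE"),
]

selected = filter_by_name(fake_metrics, "comet")
print([metric.id for metric in selected])
```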

Speech recognition

Models

# Choose 'one or more' models
models = ModelFactory.list(
  function=Function.SPEECH_RECOGNITION,
  source_languages=Language.English_UNITED_STATES,
  page_size=2
)['results']

models

Datasets

# Choose 'exactly one' dataset
datasets = DatasetFactory.list(
  function=Function.SPEECH_RECOGNITION,
  source_languages=Language.English_UNITED_STATES,
  page_size=1
)['results']

datasets

Metrics

# Choose 'one or more' metrics that are supported
metrics = MetricFactory.list(
  model_id=models[0].id, # filter for metrics compatible with the first model
  page_size=2
)['results']

metrics
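
The query above filters metrics for the first model only. When a benchmark includes several models, one possible approach is to keep only the metrics that every model supports. The sketch below is an assumption about how that could be done, not an SDK feature: metrics_supported_by_all is a hypothetical helper, and the per-model lookup is injected as a function so the example runs without network access.

```python
from types import SimpleNamespace


def metrics_supported_by_all(models, list_metrics_for_model):
    """Return the metrics whose IDs appear in the metric list of every model."""
    common_ids = None
    metric_by_id = {}
    for model in models:
        metrics = list_metrics_for_model(model)
        metric_by_id.update({metric.id: metric for metric in metrics})
        ids = {metric.id for metric in metrics}
        common_ids = ids if common_ids is None else common_ids & ids
    return [metric_by_id[metric_id] for metric_id in sorted(common_ids or ())]


# Stand-ins for SDK objects; IDs and names are illustrative only.
wer = SimpleNamespace(id="metric-wer", name="WER")
bleu = SimpleNamespace(id="metric-bleu", name="BLEU")
fake_models = [SimpleNamespace(id="model-1"), SimpleNamespace(id="model-2")]
fake_support = {"model-1": [wer, bleu], "model-2": [wer]}

shared = metrics_supported_by_all(fake_models, lambda model: fake_support[model.id])
print([metric.name for metric in shared])
```

With the SDK, the injected function could be lambda model: MetricFactory.list(model_id=model.id)['results'], which reuses the call shown above.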

Creating a Benchmark

The three calls below create a benchmark for text generation, translation, and speech recognition, in that order.

# Text generation
benchmark = BenchmarkFactory.create(
  "UNIQUE_BENCHMARK_NAME",
  dataset_list=[dataset_pubmed_test],
  model_list=[model_gpt4, model_llama3_70b],
  metric_list=[metric_wer, metric_bleu]
)

benchmark.__dict__
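
The placeholder "UNIQUE_BENCHMARK_NAME" indicates that each benchmark needs its own name. One way to avoid collisions is to append a timestamp and a short random suffix, as in this sketch. The helper unique_benchmark_name is hypothetical, and which characters the platform accepts in a name is an assumption that should be checked.

```python
import uuid
from datetime import datetime, timezone


def unique_benchmark_name(prefix):
    """Build a name from a prefix, a UTC timestamp, and a short random suffix."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:8]  # random part, so same-second names still differ
    return f"{prefix}-{stamp}-{suffix}"


name = unique_benchmark_name("pubmed-benchmark")
print(name)
```

The returned string can be passed as the first argument to BenchmarkFactory.create.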

# Translation
benchmark = BenchmarkFactory.create(
  "UNIQUE_BENCHMARK_NAME",
  dataset_list=[dataset_opus100_en_es_200],
  model_list=models,
  metric_list=metrics
)

benchmark.__dict__

# Speech recognition
benchmark = BenchmarkFactory.create(
  "UNIQUE_BENCHMARK_NAME",
  dataset_list=datasets,
  model_list=models,
  metric_list=metrics
)

benchmark.__dict__
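
The selection comments in this guide ask for exactly one dataset, one or more models, and one or more metrics. A small local check before calling BenchmarkFactory.create can catch mistakes early. This is a sketch only: validate_benchmark_inputs is a hypothetical helper, and the SDK or platform may perform its own validation.

```python
def validate_benchmark_inputs(datasets, models, metrics):
    """Raise ValueError unless there is one dataset and at least one model and metric."""
    if len(datasets) != 1:
        raise ValueError(f"expected exactly one dataset, got {len(datasets)}")
    if not models:
        raise ValueError("expected one or more models, got none")
    if not metrics:
        raise ValueError("expected one or more metrics, got none")


# Plain strings stand in for SDK objects in this example.
validate_benchmark_inputs(["dataset"], ["model-a", "model-b"], ["metric"])

try:
    validate_benchmark_inputs(["dataset-1", "dataset-2"], ["model-a"], ["metric"])
except ValueError as error:
    print(f"Invalid selection: {error}")
```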

Starting a Benchmark

Call the start method to begin benchmarking; it returns a benchmark job. You can check the job's status by calling its check_status method, or download the current results as a CSV (even while the job is still in progress).

benchmark_job = benchmark.start()

status = benchmark_job.check_status()

Tip: You can start multiple jobs on a single Benchmark.

Tip: You can use a loop to check the status.

import time

while status != "completed":
  time.sleep(10)  # wait before polling again
  status = benchmark_job.check_status()
  print(f"Current status: {status}")

Once your Benchmark is ready, you can view it by visiting the Platform and locating it in your Benchmark assets, or by calling the download_results_as_csv method.

results_path = benchmark_job.download_results_as_csv()
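
A polling loop that only exits on "completed" never ends if the job stalls or stops for another reason. The sketch below adds a timeout. It is an illustration under stated assumptions: wait_for_status is a hypothetical helper, the status check is injected as a callable so the example runs without the SDK, and "in_progress" is a made-up status string used only for the demo.

```python
import time


def wait_for_status(check_status, done_statuses=("completed",), interval=10.0,
                    timeout=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll check_status() until it returns a done status or the timeout expires."""
    deadline = clock() + timeout
    while True:
        status = check_status()
        if status in done_statuses:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout} seconds")
        sleep(interval)


# Fake status source standing in for a real job's status method.
fake_statuses = iter(["in_progress", "in_progress", "completed"])
final_status = wait_for_status(lambda: next(fake_statuses), interval=0, timeout=5)
print(final_status)
```

With a real job, the bound status method of the job object can be passed directly as check_status, and done_statuses can be extended if the platform reports other terminal states.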

Normalization

Normalization options are preprocessing methods for text data. Some are general, and others are tailored to the characteristics of a specific language. Normalization transforms raw text into a standardized format, which enables a fair and exact performance comparison across diverse models. Examples include 'removing numbers' and 'lowercase text'. The supported options depend on the metric and model used in the benchmark, so pass both when listing them:

supported_options = BenchmarkFactory.list_normalization_options(metric, model)

Note: These options can differ between metrics in the same benchmark.
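
To make the idea concrete, here is a local illustration of what options such as 'lowercase text' and 'removing numbers' do to a string before scoring. It is a sketch in plain Python, not the platform's implementation: every function name below is hypothetical, and the real options may behave differently.

```python
import re


def lowercase_text(text):
    """Convert the text to lowercase."""
    return text.lower()


def remove_numbers(text):
    """Delete all runs of digits."""
    return re.sub(r"\d+", "", text)


def collapse_whitespace(text):
    """Replace runs of whitespace with single spaces and trim the ends."""
    return " ".join(text.split())


def normalize(text, steps):
    """Apply each normalization step in order."""
    for step in steps:
        text = step(text)
    return text


steps = [lowercase_text, remove_numbers, collapse_whitespace]
reference = "The 3 Models Ran 12 Tests"
hypothesis = "the models ran tests"

# After normalization the two strings are identical, so an exact-match
# comparison would no longer penalize casing or digits.
print(normalize(reference, steps) == normalize(hypothesis, steps))
```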

You can choose multiple normalization options for each performance metric, and you can use the same metric several times with different options. This gives a more thorough comparison of model performance.

selected_options = [<option 1>, ..., <option N>]
metric.add_normalization_options(selected_options)

You can also add multiple configurations to the same metric:

selected_options_config_1 = [<option 1>, <option 2>, <option 3>]
selected_options_config_2 = [<option 3>, <option 4>]
metric.add_normalization_options(selected_options_config_1)
metric.add_normalization_options(selected_options_config_2)

After this, create the benchmark as usual, passing the metrics you configured:

benchmark = BenchmarkFactory.create(
    "UNIQUE_NAME_OF_BENCHMARK",
    dataset_list=datasets,
    model_list=models,
    metric_list=metrics_with_normalization,  # metrics with normalization options added, e.g. [metric]
)

By following this guide, you can benchmark models to assess their performance across various tasks. Use the results to optimize your agents based on reliable evaluation metrics.
