How to benchmark a model

This guide explains how to benchmark models using the aiXplain SDK. You'll learn how to select datasets, models, and metrics, and how to create and run benchmarks for your tasks.

Generic Example (Template)

from aixplain.factories import (
    BenchmarkFactory,
    DatasetFactory,
    MetricFactory,
    ModelFactory,
)

datasets = DatasetFactory.list("...") # specify Dataset ID
metrics = MetricFactory.list("...") # specify Metric ID
models = ModelFactory.list("...") # specify Model ID
benchmark = BenchmarkFactory.create(
    "benchmark_name", dataset_list=datasets, model_list=models, metric_list=metrics
)

benchmark_job = benchmark.start()
status = benchmark_job.check_status()

results_path = benchmark_job.download_results_as_csv()

Benchmark Examples

The following examples show benchmarking applied to Text Generation, Translation, and Speech Recognition using different approaches.

Imports

from aixplain.factories import BenchmarkFactory, DatasetFactory, MetricFactory, ModelFactory
from aixplain.enums import Function, Language # for search

Select Models, Datasets & Metrics

info

Datasets are currently private, so you must first onboard the datasets in the examples below (or similar) to follow along.
See our guide on How to upload a dataset.

Models

# Choose 'one or more' models
model_list = ModelFactory.list(function=Function.TEXT_GENERATION)["results"]
for model in model_list:
    print(model.__dict__)
model_gpt4 = ModelFactory.get("6414bd3cd09663e9225130e8")
model_llama3_70b = ModelFactory.get("6626a3a8c8f1d089790cf5a2")

Datasets

# Choose 'exactly one' dataset
dataset_list = DatasetFactory.list(function=Function.TEXT_GENERATION)["results"]
for dataset in dataset_list:
    print(dataset.__dict__)
dataset_pubmed_test = DatasetFactory.get("65e92213763f9f09ec1cf529")
dataset_pubmed_test.__dict__

Metrics

# Choose 'one or more' metrics
metric_list = MetricFactory.list()["results"]
for metric in metric_list:
    print(metric.__dict__)
metric_wer = MetricFactory.get("646d371caec2a04700e61945")
metric_bleu = MetricFactory.get("639874ab506c987b1ae1acc6")

Creating a Benchmark

benchmark = BenchmarkFactory.create(
    "UNIQUE_BENCHMARK_NAME",
    dataset_list=[dataset_pubmed_test],
    model_list=[model_gpt4, model_llama3_70b],
    metric_list=[metric_wer, metric_bleu],
)

benchmark.__dict__

Starting a Benchmark

Call the start method to begin benchmarking. You can check the status by calling the check_benchmark_status method or by downloading the current results as a CSV, even while the benchmarking job is still in progress.

benchmark_job = benchmark.start()
status = benchmark_job.check_benchmark_status()
tip

You can start multiple jobs on a single Benchmark.
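For example, a minimal sketch (each call to start is assumed to return an independent job object):

# Illustration only: two independent jobs started from the same Benchmark
benchmark_job_a = benchmark.start()
benchmark_job_b = benchmark.start()
for job in (benchmark_job_a, benchmark_job_b):
    print(job.check_benchmark_status())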

tip

You can use a loop to check the status.

import time

while status != "completed":
    status = benchmark_job.check_benchmark_status()
    print(f"Current status: {status}")
    time.sleep(10)

You can view your Benchmark once it's ready by

  1. visiting the Platform and locating it in your Benchmark assets, or
  2. calling the download_results_as_csv method.
results_path = benchmark_job.download_results_as_csv()
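To take a quick look at the downloaded results, you could load the CSV with pandas (a hedged example; pandas is not part of the SDK, and the column names depend on your benchmark):

import pandas as pd

results_df = pd.read_csv(results_path)  # results_path comes from download_results_as_csv
print(results_df.head())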

Normalization

We provide methods that specialize in handling text data from various languages, offering both general preprocessing techniques and ones tailored to each language's unique characteristics. These are called normalization options. Normalization transforms raw text into a standardized format, enabling fair and accurate performance evaluation across diverse models. Examples include removing numbers and lowercasing text. To get the list of supported normalization options, you need the metric and the model you will use in the benchmark.

supported_options = BenchmarkFactory.list_normalization_options(metric, model)

Note: These options can differ for each metric in the same benchmark.
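For instance, a minimal sketch that inspects the supported options for every metric and model pair from the example above (the name attribute is assumed here for printing):

# Illustration: list supported normalization options per metric/model pair
for metric in [metric_wer, metric_bleu]:
    for model in [model_gpt4, model_llama3_70b]:
        options = BenchmarkFactory.list_normalization_options(metric, model)
        print(metric.name, model.name, options)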

You can choose multiple normalization options for each performance metric, and you can also include the same metric several times with different normalization options. This flexibility allows for a thorough comparison of model performance.

selected_options = [<option 1>, ..., <option N>]
metric.add_normalization_options(selected_options)

You can even select multiple configurations for the same metric.

selected_options_config_1 = [<option 1>, <option 2>, <option 3>]
selected_options_config_2 = [<option 3>, <option 4>]
metric.add_normalization_options(selected_options_config_1)
metric.add_normalization_options(selected_options_config_2)

After this, you can create the benchmark as usual.
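Here, metrics_with_normalization is simply the list of metric objects configured above; for example (names taken from the earlier example, shown for illustration):

metrics_with_normalization = [metric_wer, metric_bleu]  # these metrics now carry their normalization options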

benchmark = BenchmarkFactory.create(
    "<UNIQUE_NAME_OF_BENCHMARK>",
    dataset_list=datasets,
    model_list=models,
    metric_list=metrics_with_normalization,
)

By following this guide, you can effectively benchmark models and assess their performance across various tasks. Use the results to optimize your agents based on reliable evaluation metrics.