
Model benchmarking


How to benchmark a model

This guide explains how to benchmark models using the aiXplain SDK. You'll learn to select datasets, models, and metrics, and create benchmarks for tasks.

Create a Benchmark Job

from aixplain.factories import (
    BenchmarkFactory,
    DatasetFactory,
    MetricFactory,
    ModelFactory,
)
from aixplain.enums import Function, Language

datasets = DatasetFactory.list(function=Function.TEXT_GENERATION)["results"]
metrics = MetricFactory.list()["results"]
models = ModelFactory.list(function=Function.TEXT_GENERATION)["results"]
benchmark = BenchmarkFactory.create(
    "UNIQUE_BENCHMARK_NAME", dataset_list=datasets, model_list=models, metric_list=metrics
)

benchmark_job = benchmark.start()
status = benchmark_job.check_benchmark_status()

results_path = benchmark_job.download_results_as_csv()

Benchmark Examples

The following examples show benchmarking applied to Text Generation, Translation, and Speech Recognition using different approaches.

Imports

from aixplain.factories import BenchmarkFactory, DatasetFactory, MetricFactory, ModelFactory
from aixplain.enums import Function, Language # for search

Select Models, Datasets & Metrics

Info: Datasets are currently private, so you must first onboard the datasets in the examples below (or similar) to follow along. See our guide on How to upload a dataset.

Models

# Choose 'one or more' models
model_list = ModelFactory.list(function=Function.TEXT_GENERATION)["results"]
for model in model_list:
    print(model.__dict__)
model_gpt4 = ModelFactory.get("6414bd3cd09663e9225130e8")
model_llama3_70b = ModelFactory.get("6626a3a8c8f1d089790cf5a2")

Datasets

# Choose 'exactly one' dataset
dataset_list = DatasetFactory.list(function=Function.TEXT_GENERATION)["results"]
for dataset in dataset_list:
    print(dataset.__dict__)
dataset_pubmed_test = DatasetFactory.get("65e92213763f9f09ec1cf529")
dataset_pubmed_test.__dict__

Metrics

# Choose 'one or more' metrics
metric_list = MetricFactory.list()["results"]
for metric in metric_list:
    print(metric.__dict__)
metric_wer = MetricFactory.get("646d371caec2a04700e61945")
metric_bleu = MetricFactory.get("639874ab506c987b1ae1acc6")

Creating a Benchmark

benchmark = BenchmarkFactory.create(
    "UNIQUE_BENCHMARK_NAME",
    dataset_list=[dataset_pubmed_test],
    model_list=[model_gpt4, model_llama3_70b],
    metric_list=[metric_wer, metric_bleu]
)

benchmark.__dict__

Starting a Benchmark Job

Call the start method to begin benchmarking. You can check the status by calling the check_benchmark_status method, or download the current results as a CSV even while the job is still in progress.

benchmark_job = benchmark.start()
status = benchmark_job.check_benchmark_status()
Tip: You can start multiple jobs on a single Benchmark.

Tip: You can use a loop to check the status.

import time

while status != "completed":
    status = benchmark_job.check_benchmark_status()
    print(f"Current status: {status}")
    time.sleep(10)

You can view your Benchmark once it's ready by

  1. visiting the Platform and locating it in your Benchmark assets, or
  2. calling the download_results_as_csv method.

results_path = benchmark_job.download_results_as_csv()
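The downloaded CSV can be inspected with standard tooling such as pandas. A minimal sketch, using a fabricated results file with hypothetical column names (Model, Metric, Score); the real columns depend on your benchmark export:

```python
import io

import pandas as pd

# Fabricated results CSV standing in for the downloaded file;
# real column names depend on the benchmark.
csv_text = """Model,Metric,Score
gpt-4,BLEU,0.42
llama3-70b,BLEU,0.38
"""

results = pd.read_csv(io.StringIO(csv_text))

# Average score per model, sorted best-first.
summary = results.groupby("Model")["Score"].mean().sort_values(ascending=False)
print(summary.idxmax())  # model with the highest mean score
```

For a real run, pass `results_path` to `pd.read_csv` instead of the in-memory string.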

Normalization

We provide methods that specialize in handling text data from various languages, offering both general and language-specific preprocessing techniques. These are called normalization options. Normalization transforms raw text into a standardized format, enabling a fair and accurate performance evaluation across diverse models. Examples include 'removing numbers' and 'lowercase text'. To get the list of supported normalization options, we need the metric and model that will be used in the benchmark.

metric = MetricFactory.get("639874ab506c987b1ae1acc6")  # BLEU
model = ModelFactory.get("61b097551efecf30109d32da") # Sample Model

supported_options = BenchmarkFactory.list_normalization_options(metric, model)
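To illustrate what options like 'lowercase text' and 'removing numbers' do, here is a plain-Python sketch of the idea. The platform applies its own implementations server-side; this is only illustrative:

```python
import re

def lowercase_text(text: str) -> str:
    # 'lowercase text': fold casing so "Cat" and "cat" compare equal.
    return text.lower()

def remove_numbers(text: str) -> str:
    # 'removing numbers': strip digit runs, then collapse leftover spaces.
    return re.sub(r"\s+", " ", re.sub(r"\d+", "", text)).strip()

raw = "The 3 Cats sat on 12 mats"
normalized = remove_numbers(lowercase_text(raw))
print(normalized)  # "the cats sat on mats"
```

Applying the same normalization to both the model output and the reference before scoring is what keeps the comparison fair.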

Note: These options can differ for each metric within the same benchmark.

You have the flexibility to choose multiple normalization options for each performance metric. You can also opt for the same metric with varying normalization options. This adaptability provides a thorough way to compare model performance.

selected_options = ["option 1", ..., "option N"]
metric.add_normalization_options(selected_options)

You can even select multiple configurations for the same metric

selected_options_config_1 = ["option 1", "option 2", "option 3"]
selected_options_config_2 = ["option 3", "option 4"]
metric.add_normalization_options(selected_options_config_1)
metric.add_normalization_options(selected_options_config_2)

After this, you can create the benchmark as usual:

benchmark = BenchmarkFactory.create(
    "UNIQUE_NAME_OF_BENCHMARK",
    dataset_list=datasets,
    model_list=models,
    metric_list=metrics_with_normalization
)

Prompt Benchmarking

Prompt benchmarking lets you test how different prompts affect model performance using the same dataset and metric.

Example - Poetry Sentiment Classification

The example below uses Google's poem sentiment dataset, which categorizes the emotion of each line as positive, negative, or neutral (no impact).

Select Dataset, Metric, and Base Model

Each variant uses the same underlying model with a different prompt configuration.

from aixplain.factories import DatasetFactory, MetricFactory, ModelFactory, BenchmarkFactory

datasets = [DatasetFactory.get("67eebc80ff3b3998834d0023")]

model1 = ModelFactory.get("669a63646eb56306647e1091")
model2 = ModelFactory.get("669a63646eb56306647e1091")
model3 = ModelFactory.get("669a63646eb56306647e1091")
model4 = ModelFactory.get("669a63646eb56306647e1091")

model1.add_additional_info_for_benchmark(display_name="No Prompt", configuration={"prompt": ""})
model2.add_additional_info_for_benchmark(display_name="Simple Prompt", configuration={"prompt": "Analyze the sentiment of the following text:"})
model3.add_additional_info_for_benchmark(display_name="Specific Prompt", configuration={"prompt": "Classify the text into 'no_impact', 'negative', or 'positive':"})
model4.add_additional_info_for_benchmark(display_name="Specific Prompt with Output Format", configuration={"prompt": "Classify the text into 'no_impact', 'negative', or 'positive'. Only output the answer and nothing else:"})
models = [model1, model2, model3, model4]

metrics = [MetricFactory.get("65e1d5ac95487dea3023a0b8")]

Create and Run Benchmark

benchmark = BenchmarkFactory.create(name="Poem Sentiment Prompt Run", dataset_list=datasets, model_list=models, metric_list=metrics)

benchmark_job = benchmark.start()

Analyze Results

# Simplified score output
scores = benchmark_job.get_scores()
print(scores)

# Detailed results
results_df = benchmark_job.download_results_as_csv(return_dataframe=True)

# View metric comparison by prompt
results_df.groupby("DisplayName")["ROUGE by HuggingFace"].describe()

# Preview input and reference pairs
results_df[["Input", "Reference 0"]].head()
Note: Clear and well-structured prompts can significantly impact model accuracy and consistency across tasks.
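The groupby comparison above works on any DataFrame with a DisplayName column and a metric column. A minimal sketch with fabricated scores; the column names mirror those used above but may differ in your export:

```python
import pandas as pd

# Fabricated per-sample scores for two prompt variants.
df = pd.DataFrame({
    "DisplayName": ["No Prompt", "No Prompt", "Specific Prompt", "Specific Prompt"],
    "ROUGE by HuggingFace": [0.30, 0.34, 0.55, 0.61],
})

# Per-prompt summary statistics (count, mean, std, quartiles, ...).
comparison = df.groupby("DisplayName")["ROUGE by HuggingFace"].describe()
print(comparison[["mean", "std"]])
```

A higher mean with a lower standard deviation indicates a prompt that is both more accurate and more consistent.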