Model benchmarking
How to benchmark a model
This guide explains how to benchmark models using the aiXplain SDK. You'll learn how to select datasets, models, and metrics, and how to create and run benchmark jobs for tasks such as text generation, translation, and speech recognition.
Create a Benchmark Job
from aixplain.factories import (
    BenchmarkFactory,
    DatasetFactory,
    MetricFactory,
    ModelFactory,
)
from aixplain.enums import Function, Language
datasets = DatasetFactory.list(function=Function.TEXT_GENERATION)["results"]
metrics = MetricFactory.list()["results"]
models = ModelFactory.list(function=Function.TEXT_GENERATION)["results"]
benchmark = BenchmarkFactory.create(
    "UNIQUE_BENCHMARK_NAME", dataset_list=datasets, model_list=models, metric_list=metrics
)
benchmark_job = benchmark.start()
status = benchmark_job.check_status()
results_path = benchmark_job.download_results_as_csv()
Benchmark Examples
The following examples show benchmarking applied to Text Generation, Translation, and Speech Recognition using different approaches.
Imports
from aixplain.factories import BenchmarkFactory, DatasetFactory, MetricFactory, ModelFactory
from aixplain.enums import Function, Language, Supplier  # for search and filtering
Select Models, Datasets & Metrics
Datasets are currently private, so you must first onboard the datasets in the examples below (or similar) to follow along.
See our guide on How to upload a dataset.
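If your datasets are not onboarded yet, the rough sketch below illustrates the general shape of dataset onboarding with DatasetFactory.create and MetaData. It is only an assumption-laden outline (the file path, column names, and exact parameter values are placeholders); follow the upload guide above for the authoritative steps.
from aixplain.factories import DatasetFactory
from aixplain.enums import DataType, Function, Language, License
from aixplain.modules import MetaData
# Hypothetical schema: one text input column and one text reference column.
input_meta = MetaData(name="source_text", dtype=DataType.TEXT, languages=[Language.English])
output_meta = MetaData(name="reference_text", dtype=DataType.TEXT, languages=[Language.English])
dataset = DatasetFactory.create(
    name="my_benchmark_dataset",          # placeholder name
    description="Example dataset for benchmarking",
    license=License.MIT,                  # placeholder license
    function=Function.TEXT_GENERATION,
    content_path="path/to/data.csv",      # placeholder local file
    input_schema=[input_meta],
    output_schema=[output_meta],
)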
- Text generation
- Translation
- Speech recognition
Models
# Choose 'one or more' models
model_list = ModelFactory.list(function=Function.TEXT_GENERATION)["results"]
for model in model_list:
    print(model.__dict__)
model_gpt4 = ModelFactory.get("6414bd3cd09663e9225130e8")
model_llama3_70b = ModelFactory.get("6626a3a8c8f1d089790cf5a2")
Datasets
# Choose 'exactly one' dataset
dataset_list = DatasetFactory.list(function=Function.TEXT_GENERATION)["results"]
for dataset in dataset_list:
    print(dataset.__dict__)
dataset_pubmed_test = DatasetFactory.get("65e92213763f9f09ec1cf529")
dataset_pubmed_test.__dict__
Metrics
# Choose 'one or more' metrics
metric_list = MetricFactory.list()["results"]
for metric in metric_list:
    print(metric.__dict__)
metric_wer = MetricFactory.get("646d371caec2a04700e61945")
metric_bleu = MetricFactory.get("639874ab506c987b1ae1acc6")
Models
# Choose 'one or more' models
model_list = ModelFactory.list(
    function=Function.TRANSLATION,
    source_languages=Language.English,
    target_languages=Language.Spanish,
    suppliers=[Supplier.AWS, Supplier.GOOGLE, Supplier.MICROSOFT],
)['results']
for model in model_list:
    print(model.__dict__)
models = [
    ModelFactory.get(id)
    for id in [
        "60ddefd98d38c51c58860ad6",
        "617048f83a3ab842ec0804c2",
        "61b097551efecf30109d3316",
    ]
]
Datasets
# Choose 'exactly one' dataset
dataset_list = DatasetFactory.list("opus")["results"]
for dataset in dataset_list:
    print(dataset.__dict__)
dataset_opus100_en_es_200 = DatasetFactory.get("651886916fd4ecd622045542")
dataset_opus100_en_es_200.__dict__
Metrics
# Choose 'one or more' metrics that are supported
metrics_list = MetricFactory.list()['results']
metrics = [metric for metric in metrics_list if "COMET" in metric.name]
metrics
Models
# Choose 'one or more' models
models = ModelFactory.list(
    function=Function.SPEECH_RECOGNITION,
    source_languages=Language.English_UNITED_STATES,
    page_size=2
)['results']
models
Datasets
# Choose 'exactly one' dataset
datasets = DatasetFactory.list(
    function=Function.SPEECH_RECOGNITION,
    source_languages=Language.English_UNITED_STATES,
    page_size=1
)['results']
datasets
Metrics
# Choose 'one or more' metrics that are supported
metrics = MetricFactory.list(
    model_id=models[0].id,  # filter for metrics compatible with the first model
    page_size=2
)['results']
metrics
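Before creating the benchmark, it can help to sanity-check the selections. The sketch below simply prints the names of the chosen assets; it assumes each asset exposes a name attribute, as the metric filtering above suggests.
# Print the names of the selected assets before creating the benchmark.
for label, assets in [("Models", models), ("Datasets", datasets), ("Metrics", metrics)]:
    print(label, [asset.name for asset in assets])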
Creating a Benchmark
- Text generation
- Translation
- Speech recognition
benchmark = BenchmarkFactory.create(
    "UNIQUE_BENCHMARK_NAME",
    dataset_list=[dataset_pubmed_test],
    model_list=[model_gpt4, model_llama3_70b],
    metric_list=[metric_wer, metric_bleu]
)
benchmark.__dict__
benchmark = BenchmarkFactory.create(
    "UNIQUE_BENCHMARK_NAME",
    dataset_list=[dataset_opus100_en_es_200],
    model_list=models,
    metric_list=metrics
)
benchmark.__dict__
benchmark = BenchmarkFactory.create(
    "UNIQUE_BENCHMARK_NAME",
    dataset_list=datasets,
    model_list=models,
    metric_list=metrics
)
benchmark.__dict__
Starting a Benchmark Job
Call the start method to begin benchmarking. You can check the status by calling the check_benchmark_status method, or by downloading the current results as a CSV (this works even for an in-progress benchmarking job).
benchmark_job = benchmark.start()
status = benchmark_job.check_benchmark_status()
You can start multiple jobs on a single Benchmark.
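For example, a second run of the same Benchmark can be kicked off by calling start again (the job variable name below is arbitrary):
# Start a second job on the same Benchmark configuration.
benchmark_job_2 = benchmark.start()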
You can use a loop to check the status.
import time
while status != "completed":
    status = benchmark_job.check_benchmark_status()
    print(f"Current status: {status}")
    time.sleep(10)
You can view your Benchmark once it's ready by
- visiting the Platform and locating it in your Benchmark assets, or
- calling the download_results_as_csv method.
results_path = benchmark_job.download_results_as_csv()
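To inspect the downloaded file programmatically, you could load it with pandas. This is a small sketch; it assumes results_path points to a local CSV file, as the method name suggests.
import pandas as pd
# Load the downloaded benchmark results for inspection.
results_df = pd.read_csv(results_path)
print(results_df.head())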
Normalization
We provide methods that specialize in handling text data from various languages, offering both general preprocessing techniques and ones tailored to each language's unique characteristics. These are called normalization options. The normalization process transforms raw text data into a standardized format, enabling a fair and exact performance evaluation across diverse models. A few examples are removing numbers and lowercasing text. To get the list of supported normalization options, you need the metric and the model that you will use in benchmarking.
metric = MetricFactory.get("639874ab506c987b1ae1acc6") # BLEU
model = ModelFactory.get("61b097551efecf30109d32da") # Sample Model
supported_options = BenchmarkFactory.list_normalization_options(metric, model)
Note: These options can differ for each metric within the same benchmark.
You have the flexibility to choose multiple normalization options for each performance metric. You can also opt for the same metric with varying normalization options. This adaptability provides a thorough way to compare model performance.
selected_options = ["option 1", ..., "option N"]
metric.add_normalization_options(selected_options)
You can even select multiple configurations for the same metric:
selected_options_config_1 = ["option 1", "option 2", "option 3"]
selected_options_config_2 = ["option 3", "option 4"]
metric.add_normalization_options(selected_options_config_1)
metric.add_normalization_options(selected_options_config_2)
After this, you can create the benchmark as usual, passing the metrics to which you added normalization options (metrics_with_normalization below).
benchmark = BenchmarkFactory.create(
    "UNIQUE_NAME_OF_BENCHMARK",
    dataset_list=datasets,
    model_list=models,
    metric_list=metrics_with_normalization
)
Prompt Benchmarking
Prompt benchmarking lets you test how different prompts affect model performance using the same dataset and metric.
Example - Poetry Sentiment Classification
The example below uses Google's poem sentiment dataset, which categorizes each line's emotion as positive, negative, or neutral (no impact).
Select Dataset, Metric, and Base Model
Each variant uses the same underlying model with a different prompt configuration.
from aixplain.factories import DatasetFactory, MetricFactory, ModelFactory, BenchmarkFactory
datasets = [DatasetFactory.get("67eebc80ff3b3998834d0023")]
model1 = ModelFactory.get("669a63646eb56306647e1091")
model2 = ModelFactory.get("669a63646eb56306647e1091")
model3 = ModelFactory.get("669a63646eb56306647e1091")
model4 = ModelFactory.get("669a63646eb56306647e1091")
model1.add_additional_info_for_benchmark(display_name="No Prompt", configuration={"prompt": ""})
model2.add_additional_info_for_benchmark(display_name="Simple Prompt", configuration={"prompt": "Analyze the sentiment of the following text:"})
model3.add_additional_info_for_benchmark(display_name="Specific Prompt", configuration={"prompt": "Classify the text into 'no_impact', 'negative', or 'positive':"})
model4.add_additional_info_for_benchmark(display_name="Specific Prompt with Output Format", configuration={"prompt": "Classify the text into 'no_impact', 'negative', or 'positive'. Only output the answer and nothing else:"})
models = [model1, model2, model3, model4]
metrics = [MetricFactory.get("65e1d5ac95487dea3023a0b8")]
Create and Run Benchmark
benchmark = BenchmarkFactory.create(name="Poem Sentiment Prompt Run", dataset_list=datasets, model_list=models, metric_list=metrics)
benchmark_job = benchmark.start()
Analyze Results
# Simplified score output
scores = benchmark_job.get_scores()
print(scores)
# Detailed results
results_df = benchmark_job.download_results_as_csv(return_dataframe=True)
# View metric comparison by prompt
results_df.groupby("DisplayName")["ROUGE by HuggingFace"].describe()
# Preview input and reference pairs
results_df[["Input", "Reference 0"]].head()
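As a final step, you could rank the prompt variants by their average metric score. This is a small pandas sketch that reuses the column names shown above.
# Rank prompt variants by their mean ROUGE score (column names taken from the example above).
mean_scores = (
    results_df.groupby("DisplayName")["ROUGE by HuggingFace"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_scores)
print(f"Highest-scoring prompt variant: {mean_scores.index[0]}")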
Clear and well-structured prompts can significantly impact model accuracy and consistency across tasks.